Moves Made Easy
Deep Learning-based Reduction of Human Motor Control Efforts Leveraging Categorical Perceptual Constraint

by
Pramit Saha
B.E., Jadavpur University, 2016

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in The Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)
March 2021
© Pramit Saha, 2021

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Moves Made Easy: Deep Learning-based Reduction of Human Motor Control Efforts Leveraging Categorical Perceptual Constraint

submitted by Pramit Saha in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:
Sidney Fels, Electrical and Computer Engineering Department (Supervisor)
Bryan Gick, Department of Linguistics (Supervisory Committee Member)

Abstract

The human speech motor control system takes advantage of the constraints of the categorical speech perception space to reduce the index of difficulty of articulatory tasks. Taking inspiration from this, we introduce a perceptual mapping from speech-like, complex, multiple degree-of-freedom (DOF) movements of the hand to a controllable formant space, which allows us to leverage categorical perceptual constraints to reduce the difficulty level of hand motor control tasks. The perceptual network is modeled using long short-term memory (LSTM) networks trained to optimize a connectionist temporal classification (CTC) loss function. Our motor control mapping network consists of a graph convolutional neural network (GCNN) combined with an LSTM encoder-decoder network, and is regularized with the help of the trained perception model. The mapping allows the user's hand to generate continuous kinematic trajectories with reduced effort by altering the complexity of the task space. This is a human-in-the-loop system in which the user plays the role of an expert assessor, evaluating the degree to which the network is able to reduce the complexity of the task space. Further, we quantitatively formulate the index of difficulty of the task space and the throughput of the user, and demonstrate that our model performs consistently well in generating trajectories with considerably reduced effort using mouse- and data glove-based input devices.

Lay Summary

During fast and spontaneous speech production, the speech articulators may have too little time to move into place and therefore often undershoot their targets. In most cases, however, our auditory perceptual system fills in the gaps, so fast and compressed speech still sounds quite normal and remains easily understandable. We investigate this phenomenon by building computational models that mimic the perceptual mapping, and we formulate the complexity of movement tasks through quantitative metrics in hand movement space. We start by developing a non-linear mapping between hand movement and a controllable 2D space (equivalent to the vowel quadrilateral). We then propose a perceptual model that roughly imitates human categorical perception. We quantitatively show that, by leveraging the proposed models, movement tasks can indeed be made easier. Our study suggests that it is possible for the human motor control system to utilize perceptual constraints to make speech articulation easier and faster.

Preface

This thesis was a part of the Brain2Speech (B2S) project.
A part of the work contained in this thesis has been presented and published elsewhere. Content for the chapters, as well as many of the figures, has been reproduced with permission, as detailed below.

Ethics applications for conducting research related to ultrasound-based speech synthesis (No. H19-01359), the Sound Stream interface (No. H07-03063), glove-to-formant mapping (No. H19-01359), and EEG-based explorations (No. H18-00411) have been approved by the Research Ethics Board of the University of British Columbia (UBC).

The following is a list of publications resulting from the work described in this dissertation.

Journal Publications
[J1] P. Saha and S. Fels, "Your Hands Can Talk: Perceptually-Aware Mapping of Hand Gesture Trajectories to Vowel Sequences", under review for publication.

Conference Proceedings Publications
[C1] P. Saha and S. Fels, "Learning joint articulatory-acoustic representations with normalizing flows", Proc. Interspeech 2020, pp. 3196–3200, 2020.
[C2] P. Saha, Y. Liu, B. Gick, and S. Fels, "Ultra2speech - a deep learning framework for formant frequency estimation and tracking from ultrasound tongue images", International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020, pp. 473–482, Springer, Cham, 2020. [MICCAI Society Young Scientist Award 2020]
[C3] P. Saha, M. Abdul-Mageed, and S. Fels, "Speak your mind! Towards imagined speech recognition with hierarchical deep learning", Proc. Interspeech 2019, pp. 141–145, 2019.
[C4] P. Saha, S. Fels, and M. Abdul-Mageed, "Deep learning the EEG manifold for phonological categorization from active thoughts", ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 2762–2766.
[C5] P. Saha, P. Srungarapu, and S. Fels, "Towards automatic speech identification from vocal tract shape dynamics in real-time MRI", Proc. Interspeech 2018, pp. 1249–1253, 2018.

Peer-reviewed Conference Abstracts and Presentations
[A1] Y. Liu, P. Saha, and B. Gick, "Visual Feedback and Self-monitoring in Speech Learning via Hand Movement", ASA Meeting 2020.
[A2] P. Saha, Y. Liu, B. Gick, and S. Fels, "Ultra-Arti-Synth - Articulatory Vowel Synthesis from Ultrasound Tongue", ISSP 2020.
[A3] V. P. Srungarapu, P. Saha, and S. Fels, "Speed-Accuracy Trade-off in Speech Production", ISSP 2020.
[A4] P. Saha, D. R. Mohapatra, and S. Fels, "Speak with Your Hands - Using Continuous Hand Gestures to Control an Articulatory Speech Synthesizer", ISSP 2020.
[A5] Y. Liu, P. Saha, B. Gick, and S. Fels, "Deep learning based continuous vowel space mapping from hand gestures", Acoustics Week in Canada 2019.
[A6] Y. Liu, P. Saha, A. Shamei, B. Gick, and S. Fels, "Mapping a Continuous Vowel Space to Hand Gestures", Canadian Acoustics, 48(1), 2020.
[A7] H. Goyal, P. Saha, B. Gick, and S. Fels, "EEG-to-f0: Establishing an artificial neuro-muscular pathway for kinematics-based fundamental frequency control", Canadian Acoustics, vol. 47, no. 3, pp. 112–113, 2019.
[A8] P. Saha and S. Fels, "Hierarchical deep feature learning for decoding imagined speech from EEG", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 10019–10020. [Among the top 10 finalists in the Three-Minute Presentation Contest]
[A9] P. Saha, D. R. Mohapatra, V. P. Srungarapu, and S. Fels, "Sound Stream: Towards vocal sound synthesis via dual-handed simultaneous control of articulatory parameters", The Journal of the Acoustical Society of America, vol. 144, no. 3, pp. 1907–1907, 2018.
[A10] P. Saha, D. R. Mohapatra, S. Praneeth, and S. Fels, "Sound-stream II: Towards real-time gesture-controlled articulatory sound synthesis", Canadian Acoustics, vol. 46, no. 4, pp. 58–59, 2018.

Peer-reviewed Book Chapters
[B1] A. H. Abdi, P. Saha, V. P. Srungarapu, and S. Fels, "Muscle excitation estimation in biomechanical simulation using NAF reinforcement learning", in M. P. Nash, P. M. Nielsen, A. Wittek, K. Miller, and G. R. Joldes, editors, Computational Biomechanics for Medicine, pages 133–141, Cham, 2020, Springer International Publishing.

Author's Contributions

In [J1], I developed the perceptual mapping between hand kinematics and acoustic space, and formulated the index of difficulty in the task space. Dr. Fels supervised this research.

In [C1], I developed the method for connecting the articulatory representation (Pink Trombone) and the acoustic representation (mel-spectrogram). I was guided by Dr. Fels.

For [C2], my contribution was the development of the deep learning model for estimating formant frequencies from ultrasound tongue images. Yadong Liu aided in the data collection and pre-processing. I was guided by Dr. Fels and Dr. Gick. This work received the Young Scientist Award at MICCAI 2020.

In [C3] and [C4], I developed the deep learning model for imagined speech recognition from EEG signals. In both works, I was jointly guided by Dr. Abdul-Mageed and Dr. Fels.

For [C5], my main contributions included the application of deep learning techniques to vocal tract dynamic MRI sequences and preparing the manuscript. P. Srungarapu aided in the implementation and in running experiments. Dr. Fels acted in a supervisory role.

In [A1], [A5] and [A6], I implemented the mapping between hand kinematics and acoustic space. In [A6], A. Shamei helped in formulating the initial mapping technique. Yadong Liu ran the user study and led the analysis in [A1]. Dr. Gick supervised the research in [A1]. The other two works were jointly supervised by Dr. Fels and Dr. Gick.

In [A2], I performed the area function computation, the articulatory sound synthesis in ArtiSynth, and the performance analysis. Yadong Liu aided in the data collection as well as the tongue tracking and palatal tracing from ultrasound images. I was jointly guided by Dr. Fels and Dr. Gick.

In [A3], P. Srungarapu led and contributed most of the work related to the reinforcement learning-based implementation of muscle excitation estimation. I helped in the experimental planning, interpretation of results, and manuscript drafting. Dr. Fels supervised this research.

In [A4], I developed the hand gesture-based control strategy for Pink Trombone using a data glove. D. R. Mohapatra helped in connecting the glove output to the Pink Trombone. Dr. Fels played a supervisory role.

In [A7], H. Goyal led and contributed most of the work on mapping brain signals to fundamental frequency via a hand movement kinematic pathway. I helped in the experimental planning, interpretation of results, and manuscript drafting. Dr. Fels and Dr. Gick jointly supervised this research.

In [A8], I developed the deep learning model for imagined speech recognition from EEG signals. Dr. Fels supervised this research.

In [A9] and [A10], I planned, designed and executed the experiments. D. R. Mohapatra and P. Srungarapu helped in connecting the Arduino output to the ArtiSynth tongue. Dr. Fels supervised this research.

For [B1], Dr. Abdi led and contributed most of the work related to the development of the reinforcement learning techniques. I aided with running experiments and with manuscript drafting and editing.
P. Srungarapu also aided in running experiments and drafting the manuscript. Dr. Fels played a supervisory role.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
1 Introduction
  1.1 Motivation
    1.1.1 Sensory pathway, Motor pathway, and Perception
    1.1.2 Effort of Motor Control
    1.1.3 An example: Buzz Wire game
    1.1.4 Speech Perception and Production
    1.1.5 Quantal Nature of Speech
  1.2 Experiments with hand gestures and movements
  1.3 Why hand motion in articulation?
  1.4 Research Questions and Contributions
  1.5 Thesis Outline
2 Background and Previous Works
  2.1 Gesture-to-speech interfaces
  2.2 Index of difficulty and performance
  2.3 Fitts' law in speech production
  2.4 Vowel formant frequency space
  2.5 Formants and auditory perception
  2.6 Conclusion
3 Task Difficulty Computation
  3.1 Related works
    3.1.1 Fitts' original 1D experiment and its shortcomings
    3.1.2 Close variants of Fitts' law
    3.1.3 Steering law in trajectory task
  3.2 Role of trajectory curvature
  3.3 Formulation of difficulty of Task I
    3.3.1 Task Space
    3.3.2 Contributing factors
    3.3.3 Formulation and Evaluation
  3.4 Index of difficulty with 2D target
  3.5 Formulation of difficulty of Task II
    3.5.1 Task Space
    3.5.2 Contributing factors
    3.5.3 Formulation and Evaluation
  3.6 Discussions and Limitations
  3.7 Chapter summary and Contribution
4 Perceptually-aware Gesture-to-Formant Mapping
  4.1 Problem Formulation
  4.2 Choice of models
  4.3 Related Works
    4.3.1 Introduction to Deep Learning
    4.3.2 Graph Convolutional Neural Network
    4.3.3 Long Short Term Memory Network
    4.3.4 Connectionist Temporal Classification
  4.4 Proposed model
    4.4.1 Perception Block
    4.4.2 Mapping Block
    4.4.3 Loss Function Regularization
    4.4.4 Contribution of network components
  4.5 Chapter Summary
5 Experiments and Results
  5.1 Training Procedure
    5.1.1 Data generation and collection
    5.1.2 Implementation details
    5.1.3 Determination of the best perceptual network
    5.1.4 Determination of the best mapping model
    5.1.5 Model performance analysis
  5.2 User evaluation
    5.2.1 Data acquisition devices
    5.2.2 Procedure
    5.2.3 Motor control paradigms with glove
  5.3 Index of difficulty calculation
  5.4 Index of performance calculation
    5.4.1 Without perceptual constraints
    5.4.2 With perceptual constraints
  5.5 Performance analysis and Discussion
  5.6 Chapter summary and Contribution
6 Exploration of interfaces and mappings
  6.1 Kinematic control of interfaces
    6.1.1 Pink Trombone
    6.1.2 VT Demo
    6.1.3 Discussion and Future Studies
  6.2 Mechanical interface control
    6.2.1 Interface design
    6.2.2 Acoustic System Design
    6.2.3 Experiments and results
    6.2.4 Results
    6.2.5 Interpretation of results
    6.2.6 Qualitative analysis and design issues
    6.2.7 Summary and Future Direction
  6.3 ArtiSynth Tongue muscle control
    6.3.1 Proposed methodology
    6.3.2 Detailed Mechanism
    6.3.3 Summary
  6.4 Mapping Tongue Motion to Speech
    6.4.1 MRI-based speech recognition
    6.4.2 Ultrasound-to-formant estimation
    6.4.3 Pink Trombone VT and acoustics
    6.4.4 Section Summary
  6.5 Mapping Active Thoughts to Speech
    6.5.1 EEG-based direct recognition of vowels and words
    6.5.2 EEG-based phonological categorization
    6.5.3 EEG-based word/phoneme recognition via phonological categorization
    6.5.4 Section Summary
  6.6 Summary and Future Directions
7 Conclusions
  7.1 Summary of Contributions
  7.2 Limitations and Future Directions
    7.2.1 Main limitations and possible improvements
    7.2.2 Investigation on throughput improvement
    7.2.3 Investigation of physical and biomechanical constraints
    7.2.4 Investigation in force space
    7.2.5 Investigation with vocal tract movements
    7.2.6 Investigation in active thought space
  7.3 Concluding Remarks
References
A Appendix A: Index of Difficulty in toy examples
  A.1 Task Space I
  A.2 Task Space II
B Appendix B: Index of Difficulty in formant space
  B.1 Three Vowels
  B.2 Four Vowels
  B.3 Five Vowels
  B.4 Six Vowels

List of Tables

Table 2.1 Formant frequencies of nine cardinal vowels
Table 3.1 Effect of categorical constraint on index of difficulty of the movement task (a = 4 cm, W = 0.4 cm, L = 3.2 cm)
Table 5.1 Perceptual categorization accuracy in % ('K' represents the length of the vowel sequence, 'L' the number of layers, and 'N' the number of nodes per layer)
Table 5.2 Performance evaluation (using MSE) for the mapping network
Table 5.3 Index of difficulty (ID) in uniform and partitioned space (in bits)
Table 5.4 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Mouse-based control
Table 5.5 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 2D control
Table 5.6 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 1D+1D control
Table 5.7 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Mouse-based control
Table 5.8 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 2D control
Table 5.9 Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 1D+1D control
Table 6.1 Variance analysis of Hypothesis 3
Table 6.2 Speech identification performance
Table 6.3 Performance comparison with baseline methods
Table 6.4 Ablation experiments: removal of spatial, temporal and shuffling blocks
Table 6.5 Classification accuracy on long words
Table 6.6 Selected parameter sets
Table 6.7 Results in accuracy on 10% test data in the first study
Table 6.8 Comparison of classification accuracy
Table 6.9 Selected parameter sets
Table 6.10 Results in accuracy on 10% test data for phonological prediction (C-L-D: CNN + LSTM + DAE)
Table 6.11 Classification performance metrics on 10% test data in the phonological prediction task
Table 6.12 Comparison of accuracy on 10% test data for the speech token prediction task

List of Figures

Figure 1.1 Overview of the sensorimotor pathway. Motor commands change the states of our body and the external environment. Our sensory system transduces these states and measures the sensory consequences of motor commands, but at a delay. Our nervous system also predicts the sensory consequences of motor commands via an internal model known as the "forward model". The predicted and observed sensory consequences are combined to form a belief about the states of our body and environment, which is a reflection of both our predictions and our observations
Figure 1.2 The Buzz Wire game
Figure 1.3 Variation of difficulty levels of the buzz wire game. The more control points and the sharper the turns, the higher the difficulty level
Figure 1.4 Decrease in difficulty levels of the buzz wire game with an increase in loop size
Figure 1.5 Variation of difficulty in an augmented version of the buzz wire game
Figure 1.6 Non-linear variation of the acoustic parameter with respect to the articulatory parameter, divided into three regions
Figure 2.1 Fitts' serial tapping task experiment and its data analysis
Figure 2.2 The vowel quadrilateral including the cardinal vowel locations and their relationship with tongue positions
Figure 3.1 Fitts' reciprocal task
Figure 3.2 Tunnel task for the general formulation of the Steering Law
Figure 3.3 Variation of difficulty of the steering task within tunnels (each of length 'L' and path width 'w') having different curvatures
Figure 3.4 Variation of difficulty of the steering task within discrete tunnels
Figure 3.5 Visualization of Task space I of square shape with side length 'a'
Figure 3.6 Visualization of six different tasks (Case I - VI) in Task space I
Figure 3.7 Alternative way of area computation for a 2D target shape. (a) shows a pure 1D point-to-point movement. (b) shows the target point as the centroid of the target structure, where the target boundaries dictate the tolerance; in the present study, we simply compute the target area as the measure of tolerance. (c) shows the computation of equivalent width along the line of motion and equivalent height perpendicular to the line of motion. (d) shows the equivalent rectangular area parameter of the transformed geometry
Figure 3.8 Visualization of Task space II of square shape with side length 'a' divided into 5 parts
Figure 3.9 Visualization of the first two tasks (Case I - II) in Task space II. Here I(a) and II(a) represent the paths and the targets for the two cases, whereas I(b) and II(b) show the gridlines representing the tunnel width
Figure 3.10 Visualization of the next two tasks (Case III - IV) in Task space II. Here III(a) and IV(a) represent the paths and the targets for the two cases, whereas III(b) and IV(b) show the gridlines representing the tunnel width
Figure 3.11 Visualization of the last two tasks (Case V - VI) in Task space II. Here V(a) and VI(a) represent the paths and the targets for the two cases, whereas V(b) and VI(b) show the gridlines representing the tunnel width
Figure 4.1 2D Convolution versus Graph Convolution
Figure 4.2 Overview of a two-layered graph convolutional network with first-order filters
Figure 4.3 Unrolling Recurrent Neural Networks
Figure 4.4 The repeating module in RNNs and LSTMs
Figure 4.5 Encoder-Decoder or seq2seq LSTMs
Figure 4.6 An example illustrating the working of CTC coupled with LSTM
Figure 4.7 Overview of the perception block
Figure 4.8 Overview of the mapping block
Figure 4.9 Overview of the Proposed Model
Figure 4.10 An example illustrating the function of the perceptual block
Figure 4.11 Examples illustrating how the perceptual block responds to speed changes. (a) is detected as /ʌ/-/æ/-/ɛ/-/i/-/u/-/a/; (b) is detected as /ʌ/-/æ/-/i/-/u/-/o/-/a/. Both trajectories have similar coordinates to the previous example shown in Fig. 4.10, but vary only in terms of velocity profiles
Figure 4.12 Examples illustrating how the perceptual block responds to curvature changes. (a) is detected as /ʌ/-/æ/-/ɛ/-/i/-/u/-/a/; (b) is detected as /ʌ/-/æ/-/i/-/u/-/o/-/a/. Both trajectories have small deviations from the example shown in Fig. 4.10
Figure 4.13 Examples illustrating the function of the mapping block. (a), (c), and (e) show the initial user trajectories, whereas (b), (d), and (f) show the final formant trajectories after passing through the mapping block
Figure 5.1 Augmented versions of a synthetic data sample corresponding to the vowel sequence /æ/-/e/-/u/-/a/. Solid lines represent the trajectories, while the dashed and dotted lines represent their augmented versions resulting from horizontal and vertical sliding respectively. (a) shows the incorporation of one control knot (colored red), (b) shows another augmented version like (a) but with a changed location of the control knot, (c) shows the incorporation of noise on (a), (d) shows the transformation of the trajectories due to shifting the start and end points, (e) shows the incorporation of three control knots, and (f) shows the effect of adding noise as well as additionally shifting the start and end points of (e)
Figure 5.2 The quantal formant space with perceptual network-driven decision boundaries
Figure 5.3 Illustration of a potential success and an adversarial case in the quantal formant space. (a) represents the vowel sequence /i/-/e/-/ɛ/ or /ɛ/-/e/-/i/ and can be easily identified by the network. (b) represents the vowel sequence /i/-/e/-/ɛ/-/æ/ or /æ/-/ɛ/-/e/-/i/. With minimal change in velocity and curvature, the network can be fooled into outputting wrong sequences such as /i/-/e/-/æ/, /i/-/ɛ/-/æ/ or /æ/-/e/-/i/, /æ/-/ɛ/-/i/ respectively
Figure 5.4 Sample trajectories showing possible curves for the vowel sequences /i/-/u/-/o/ and /ɛ/-/æ/-/a/-/ɔ/ overlaid on the quantal formant space
Figure 5.5 Data glove-based control of formant frequencies. (a) shows three selected instants involving different hand gestures (flexion and extension of finger joints) in the continuous joint control of formant frequencies corresponding to /i/-/a/-/u/. (b) shows the side and front views of the hand gestures at three selected instants involving continuous, independent 1D+1D control of formant frequencies (flexion and extension of the wrist, abduction and adduction of the fingers) corresponding to the same vowels
Figure 5.6 Experimental setup with the gloves, laptop and mouse. (a) and (b) show two different views of the mouse-based data collection. (c) and (d) show glove-based 2D and 1D+1D data collection
Figure 5.7 Hand gestures corresponding to glove-based 1D+1D control for producing the vowel sequence /u/-/i/-/æ/-/a/. The upper row shows finger adduction and abduction while the lower row shows wrist flexion and extension. The movement starts with the vowel /u/, which has low F1 and F2, characterized by wrist extension and finger adduction. The next vowel is /i/, another cardinal vowel with low F1 but high F2; to reach /i/ from /u/, the fingers are fully abducted while the wrist remains at almost the same angle of extension. This is followed by /æ/, which has higher F1 and lower F2 than /i/, achieved by wrist flexion and a lesser degree of finger abduction (i.e., increased adduction). The last vowel is /a/, with higher F1 and lower F2, achieved by further decreasing the degree of abduction while slightly increasing the wrist flexion. These sequential movements produce the given vowel sequence in acoustic space
Figure 5.8 Hand gestures corresponding to glove-based 2D control for producing the vowel sequence /u/-/i/-/æ/-/a/
Figure 5.9 Trajectory generation experiment with mouse and glove, showing user-generated trajectories in formant space through mouse movements and hand gestures. Blue dots represent the spatial locations of 9 cardinal vowels. The dashed green line indicates the target vowel trajectory (for the network) joining /ʌ/, /u/, /i/, /æ/, /a/ and /ɛ/. Solid and dotted lines represent the user's denoised trajectories
Figure 5.10 Visualization of the non-quantal formant space with nine vowels and paths connecting some of them
Figure 5.11 Sample trajectory tasks for vowel sequence length K = 3, viz., /a/-/i/-/u/, /e/-/u/-/o/, /u/-/i/-/æ/, /ɔ/-/æ/-/i/, and /u/-/a/-/ɛ/. (a), (b), and (c) show each of these tasks in Rounds 1, 2, and 3 respectively
Figure 5.12 Sample user trajectories corresponding to vowel sequences of length K = 3. The first three rows represent data from Round 1 (tunnel task in uniform space), the next three rows represent data from Round 2 (with implicit perceptual constraint), and the last three rows represent data from Round 3 (with explicit perceptual constraint). The first, fourth and seventh rows represent mouse-based control; the second, fifth and eighth rows represent glove-based joint 2D control; and the third, sixth and ninth rows represent glove-based independent 1D+1D control
Figure 5.13 The mean index of difficulty for different vowel sequence lengths before (Round 1) and after (Round 2) utilizing the perceptual mapping
Figure 5.14 The average movement times with mouse-based control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.15 The average movement times with glove-based joint 2D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.16 The average movement times with glove-based independent 1D+1D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.17 The mean information rates with mouse-based control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.18 The mean information rates with glove-based joint 2D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.19 The mean information rates with glove-based independent 1D+1D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)
Figure 5.20 Variation of movement time with index of difficulty without any perceptual mapping (Round 1)
Figure 5.21 Variation of information rate with index of difficulty without any perceptual mapping (Round 1)
Figure 5.22 Variation of movement time with index of difficulty with implicit perceptual constraint (Round 2)
Figure 5.23 Variation of information rate with index of difficulty with implicit perceptual constraint (Round 2)
Figure 5.24 Variation of movement time with index of difficulty with explicit perceptual constraint (Round 3)
Figure 5.25 Variation of information rate with index of difficulty with explicit perceptual constraint (Round 3)
Figure 5.26 Variation of movement time with index of difficulty for mouse-based control with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 5.27 Variation of movement time with index of difficulty for mouse-based control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 5.28 Variation of movement time with index of difficulty for glove-based joint 2D control with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 5.29 Variation of movement time with index of difficulty for glove-based joint 2D control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 5.30 Variation of movement time with index of difficulty for glove-based independent 1D+1D control with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 5.31 Variation of movement time with index of difficulty for glove-based independent 1D+1D control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints
Figure 6.1 Different sound control interfaces discussed in Sections 6.1, 6.2 and 6.3: (a) Pink Trombone, (b) VTDemo, (c) Sound Stream, (d) Sound Stream II
Figure 6.2 Different mappings discussed in Sections 6.4 and 6.5: (a) MRI-based speech recognition, (b) US-based speech synthesis, (c) EEG-based speech recognition
Figure 6.3 Vocal tract configurations in Pink Trombone
Figure 6.4 VTDemo interface
Figure 6.5 The proposed mechanical interface: SOUND STREAM
Figure 6.6 3 DOF slider control scheme
Figure 6.7 2 DOF movement of the controlling block
Figure 6.8 Dual-handed simultaneous control (Scheme 1)
Figure 6.9 Dual-handed simultaneous control (Scheme 2)
Figure 6.10 1D area function computation
Figure 6.11 Comparative user ratings on: (a) the suitability of varying tongue shape, (b) the effectiveness of joint control of articulatory and acoustic parameters, (c) the availability of effective control actions
Figure 6.12 The proposed SOUND STREAM II hand gesture-to-sound control pathway
Figure 6.13 ArtiSynth tongue control
Figure 6.14 Overview of the LRCN model
Figure 6.15 Frames of rtMRI videos for speaker F1 producing [asa]. Time progresses from left to right
Figure 6.16 The Top-1 accuracy for vowel identification
Figure 6.17 The Top-1 accuracy for VCV identification
Figure 6.18 Overview of the proposed Ultra2Speech. The arrows indicate the data flow
Figure 6.19 Architecture of the proposed Ultra2Formant (U2F) Net
Figure 6.20 (a) Time-varying formants (red indicates target and blue indicates predicted trajectories), (b) original speech signal, (c) synthesized speech signal
Figure 6.21 Saliency maps from U2F showing internal tongue contour localization
Figure 6.22 The self-attention module in the convolutional autoencoder architecture
Figure 6.23 The proposed articulatory-acoustic forward and inverse mapping
Figure 6.24 (a) and (c) respectively show the mel-spectrograms corresponding to the original vowels /a/ and /u/; (b) and (d) respectively show their versions synthesized from VT geometry
Figure 6.25 The synthesized Pink Trombone images corresponding to VT configurations for /a/, /æ/, /i/ and /u/ (left to right)
Figure 6.26 Overview of the proposed approach
Figure 6.27 Performance comparison for all subjects on vowels (left) and short words (right)
Figure 6.28 Overview of the proposed approach
Figure 6.29 Cross-covariance matrices: rows correspond to two different subjects; columns (from left to right) correspond to sample examples for bilabial, nasal, vowel, /uw/, and /iy/
Figure 6.30 t-SNE feature visualization for ±nasal (left) and V/C classification (right). Red and green colours indicate the distributions of two different types of features
Figure 6.31 Kappa coefficient values for above-chance accuracy based on Table 6.7
Figure 6.32 Overall framework of the proposed approach
Figure 6.33 Overview of phonological prediction with our novel architecture
Figure 6.34 Variation of performance accuracy of phonological prediction with varying training-validation-test data ratio
Figure 6.35 Inter-subject confusion matrix for speech token prediction with covariance data (left) and with phonological feature data (right)
Figure 6.36 Precision and recall metrics corresponding to each speech token on 10% train data
Figure 6.37 Variation of performance accuracy of speech token prediction for the top 4 algorithms with varying training-validation-test data ratio
List of Abbreviations

R2  R-squared
1D  One-dimensional
2D  Two-dimensional
3D  Three-dimensional
ALS  Amyotrophic Lateral Sclerosis
AMB  Ambidextrous
ANOVA  Analysis of Variance
BCI  Brain Computer Interface
CCV  Channel Covariance
CIHR  Canadian Institutes of Health Research
CNN  Convolutional Neural Network
CT  Computerized Tomography
CTC  Connectionist Temporal Classification
DAE  Deep Autoencoder
DBN  Deep Belief Network
DL  Deep Learning
DNN  Deep Neural Network
DOF  Degrees of Freedom
ECoG  Electrocorticography
EEG  Electroencephalography
ELBO  Evidence Lower Bound
EMA  Electromagnetic Articulograph
EMG  Electromyography
F1  First formant frequency
F2  Second formant frequency
FCN  Fully Connected Network
FDTD  Finite Difference Time Domain
FX  Fundamental Frequency
GA  Glottal Area
GCNN  Graph Convolutional Neural Network
GLOW  Generative Flow
GMM  Gaussian Mixture Model
GPU  Graphics Processing Unit
GRU  Gated Recurrent Unit
HCI  Human Computer Interfaces
HCT  Human Communication Technologies
ID  Index of Difficulty
ILSVRC  ImageNet Large Scale Visual Recognition Challenge
IP  Index of Performance
IR  Information Rate
JASS  Java Audio Synthesis System
JW  Jaw Height
KL  Kelly-Lochbaum
LA  Lip Area
LH  Larynx Height
LP  Lip Protrusion
LPC  Linear Predictive Coding
LRCN  Long-term Recurrent Convolutional Network
LSF  Line Spectral Frequency
LSTM  Long Short-Term Memory
M  Mean
MAE  Mean Absolute Error
MASc  Master of Applied Science
MEG  Magnetoencephalography
MGC-LSP  Mel-Generalized Cepstrum-based Line Spectral Pair
MICCAI  Medical Image Computing and Computer Assisted Intervention
MLP  Multi-layer Perceptron
MRI  Magnetic Resonance Imaging
MSE  Mean Square Error
MT  Movement Time
NS  Velo-pharyngeal Port
NSERC  Natural Sciences and Engineering Research Council of Canada
PC  Personal Computer
PDE  Partial Differential Equation
PT  Pink Trombone
RBF  Radial Basis Function
RBFN  Radial Basis Function Network
ReLU  Rectified Linear Unit
ResNet  Residual Network
RNN  Recurrent Neural Network
rtMRI  Real-time Magnetic Resonance Imaging
SD  Standard Deviation
seq2seq  Sequence-to-sequence
SNR  Signal-to-noise Ratio
SSI  Silent Speech Interface
SVM  Support Vector Machine
t-SNE  t-distributed Stochastic Neighbor Embedding
TA  Tongue Apex
TCNN  Temporal Convolutional Neural Network
TP  Throughput
TS  Tongue Shape
U2F  Ultrasound-to-Formant
U2S  Ultrasound-to-Speech
UBC  University of British Columbia
US  Ultrasound
VCV  Vowel-Consonant-Vowel
VT  Vocal Tract

Acknowledgements

This multi-year, cross-departmental work would not have been possible without the constant support and assistance of many individuals, research teams and departments, whose contributions are stated below.

Much of this work has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Institutes of Health Research (CIHR). I would also like to thank UBC for the Faculty of Applied Science Graduate Award and the MICCAI Society for the Young Scientist Award.

The great memories I take from this degree are due, in part, to my fellow colleagues at the Human Communication Technologies (HCT) Lab, including Venkata Praneeth Srungarapu, Debasish Ray Mohapatra, C. Antonio Sanchez, and A. H. Abdi, as well as the Brain2Speech group, including Yadong Liu. I thank them wholeheartedly for all their help with my research, for the great conversations and for their invaluable company throughout the last few years.

I want to use this opportunity to thank my MASc research supervisor, Prof. Sidney Fels, and my other research guides, Prof. Bryan Gick and Prof. Muhammad Abdul-Mageed. Their invaluable support and guidance have made this work possible.
I express my deepest gratitude to my supervisor, as this thesis would have remained just an idea without his guidance, mentorship and, above all, patience. His brainstorming ideas directing my research focus, his belief in my ability, and the academic freedom he provided are the aspects of my masters I will always cherish. Thank you for allowing me the freedom to take these projects in my own directions, and for steering me back when I drifted too far. As a graduate student, I could not have wished for any better supervision.

Dedication

All glories to Sri Guru and Sri Gauranga

To Sri Hari-Guru-Vaishnavas, who were always there for me, when no one else was or could be...

To my beloved parents, who did everything they could for me

1 Introduction

Speech is the most common daily form of communication. But how hard is it to speak? Theoretically, speech production is one of the most complex processes within the human motor repertoire [113, 158], requiring precise coordination of several muscles for every uttered word. Such refined spatio-temporal control of the articulators should, on the face of it, be quite difficult to master. But do we really find speech articulation so hard? If not, why do we find it easy, given that speech production requires intricate control of several articulators, including the tongue? Do we exploit some tricks to make the speaking task easier? These are difficult questions that have not been clearly addressed in the speech literature. Currently, we do not have a way to quantify the level of difficulty of a speech task, or to determine whether some words or phonemes are more difficult to pronounce than others. In this work, we attempt to address this question by formulating a kinematic, trajectory-based task difficulty metric and utilizing a perceptual mapping to reduce the user's effort in performing a given task.

Speech is a motor control task that involves the ability to control and coordinate articulators such as the lips, tongue, jaw, velum, pharyngeal wall, and glottis. The answer to the questions raised above therefore presumably lies in the general theory of human motor control, particularly in the investigation of the movement space and the information capacity of the motor control system. To answer quantitatively how difficult a motor task is, in absolute or relative terms, we need to quantify the information content, or complexity level, of the task in bits. The control of the many degrees of freedom of the hand can be used as a potentially useful model to explore the difficulty level of different hand motions and the information capacity of the underlying motor pathway. More interestingly, if hand movements are used to control an acoustic vowel (formant frequency) space with some added perceptual constraints, then we can establish a closer connection between these movements and speech. We could then ask: how difficult is it to move in the vowel space with the hands? Can the addition of favorable constraints known to exist in speech, together with a non-linear mapping between hand kinematics and speech, reduce the difficulty of the movement task? In other words, can the hand motor control system utilize a given set of speech perceptual criteria to make the movement task easier and faster?

Our present study shows that it is indeed possible to reduce the difficulty level of the movement task by incorporating some perceptually relevant targets in the control space.
In addition, we demonstrate that with the proposed non-linear mapping, it is possible for the human motor control system to take advantage of given constraints in order to reduce the effort of the movement task in the speech acoustic space. In this thesis, we show how the topology of the formant space and the incorporated constraints impact the difficulty level of a speech-related movement task, and how much information capacity is needed in terms of hand throughput to perform such a task. These fundamental pieces of evidence from hand motion strengthen the possibility that our articulators utilize the non-linear quantal effects in speech to make the speaking task easier.

Reaching intermediate and terminal points in a control space with hand motion is a complex action. Multiple degrees of freedom need to be controlled and coordinated in order to follow the desired path. The selection, planning, and control of such motor actions is a central problem in motor neuroscience. The motor system is responsible for computing desired trajectories as well as implementing the intermediary control scheme connecting the motor commands to the behavioral goal, i.e., reaching spatial targets. Human reaching movements tend to follow relatively smooth paths with bell-shaped (roughly Gaussian) speed profiles. But how the nervous system selects a limb trajectory among the infinitely many possible routes to a specified goal is still an unsolved question in motor control. In this work, we revisit the underlying mechanism of how the motor system might learn to compute a desired optimal trajectory while performing a reaching task given discrete perceptual feedback (related to the success or failure of the task). In one phase of the current study, the user implicitly tries to determine the perceptual boundaries by trial and error in order to optimize their throughput for the given trajectory task. The current investigation reveals a new approach to breaking down the contributions of the action and perception models with the help of a single, integrated task-complexity performance score in a hand-to-speech-like mapping task. The findings reveal some evidence of how the perceptual feedback system contributes to making motor control easier and faster.
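As an illustrative aside (not a model used in this thesis), the smooth reaching behavior described above is often summarized by the minimum-jerk model, whose straight spatial path and bell-shaped speed profile can be generated in a few lines; the start and end positions below are arbitrary placeholders.

```python
import numpy as np

def minimum_jerk_reach(start, end, duration=1.0, steps=200):
    """Illustrative minimum-jerk reach between two 2D hand positions.

    Returns the sampled positions and the corresponding speed profile,
    which is smooth and bell-shaped, peaking near the movement midpoint.
    """
    start, end = np.asarray(start, float), np.asarray(end, float)
    t = np.linspace(0.0, 1.0, steps)                    # normalized time
    s = 10 * t**3 - 15 * t**4 + 6 * t**5                # minimum-jerk time scaling
    positions = start + np.outer(s, end - start)        # straight spatial path
    velocities = np.gradient(positions, duration / (steps - 1), axis=0)
    speed = np.linalg.norm(velocities, axis=1)          # bell-shaped profile
    return positions, speed

if __name__ == "__main__":
    _, speed = minimum_jerk_reach(start=(0.0, 0.0), end=(10.0, 5.0))
    print("speed peaks at normalized time:", int(np.argmax(speed)) / len(speed))  # ~0.5
```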
1.1 Motivation

Speech has a supposedly non-linear relationship between its distinct action (articulatory) and perception (auditory) spaces, much of which is not yet fully understood or computationally modeled. The role of perception in motor action, and the motor control costs related to such action, are still unclear. This inspires us to quantitatively investigate some of the ideas and assumptions prevalent in speech-related motor control from an information-theoretic point of view. In the following subsections, we go through these concepts one after another.

Figure 1.1: Overview of the sensorimotor pathway. Motor commands change the states of our body and the external environment. Our sensory system transduces these states and measures the sensory consequences of motor commands, but at a delay. Our nervous system also predicts the sensory consequences of motor commands via an internal model known as the "forward model". The predicted and observed sensory consequences are combined to form a belief about the states of our body and environment, which is a reflection of both our predictions and our observations.

1.1.1 Sensory pathway, Motor pathway, and Perception

Sensorimotor skills are the basic foundations of human learning and involve close coordination of motor control with sensory feedback from the environment. Our brain plans and generates motor commands to perform movements, including manipulation and locomotion, based on the sensory inputs it receives from the sense organs, viz., the eyes, ears, nose, tongue, and skin. This essentially constitutes two pathways: the motor control pathway and the sensory feedback pathway. The motor control pathway is responsible for the regulation of movements via muscle fiber contractions; it includes the motor control centers in the brain, motor neural circuits, and innervated muscle fibers. The sensory pathway, on the other hand, is responsible for our vision, hearing, smell, taste, and touch, and includes sensory receptor cells, neural circuits, and the centers of the brain involved in sensory perception. Our nervous system integrates the multimodal sensory information acquired through the sensory pathway to manipulate muscle excitations and thereby control human movements. The nervous system also predicts the sensory consequences of the motor commands and acts based on these predictions. This forms the basis of our perception [156]. Perception is the ability to estimate the state of the external world as well as of our body. Perception is built upon two streams of information: one dealing with the prediction of sensory consequences arising from the motor system, and the other dealing with the actual sensory information that was sensed, as shown in Fig 1.1. If the predictions are unbiased, then our perceptions and consequent decisions will be more accurate and reliable.

Perception is crucial in motor control, as it provides vital information about the body and environment that is used for organizing and executing movements. Therefore, perception and action planning are recognized to be highly correlated. How perception impacts our motor control is a field of ongoing research. Motivated by the relationship between action and perception, in the current work we train an easily interpretable perception model and utilize it to investigate its impact on human decision making and motor control behavior for a given trajectory task.

The development of perception involves a selection of the environmental and bodily information relevant to a goal-directed action, while ignoring or rejecting other aspects of the state. Once a percept is formed, this information flows from the perception center to the action center in the brain, where it influences action control. In our study, we incorporate this selection step as a categorization mechanism: we partition the 2D control space into a number of discrete contiguous regions and consider only the categories as the influential factors responsible for the adaptation of new action behavior.
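As a concrete, simplified illustration of this categorization step (the actual category boundaries in this thesis are learned by the perceptual network, and the vowel centre coordinates below are placeholder values), a continuous 2D trajectory can be collapsed into a sequence of categories by assigning each point to its nearest category centre:

```python
import numpy as np

# Hypothetical category centres in a 2D (F1, F2)-like control space, in Hz.
# These are illustrative values, not the vowel targets used in the thesis.
CATEGORY_CENTRES = {
    "i": (280.0, 2250.0),
    "a": (710.0, 1100.0),
    "u": (310.0, 870.0),
}

def categorize(point):
    """Assign a continuous 2D point to its nearest category centre
    (a Voronoi-style partition of the space into contiguous regions)."""
    p = np.asarray(point, dtype=float)
    return min(CATEGORY_CENTRES,
               key=lambda name: np.linalg.norm(p - np.asarray(CATEGORY_CENTRES[name])))

def categorize_trajectory(points):
    """Collapse a continuous trajectory into its sequence of distinct categories,
    discarding the exact coordinates, as the categorical view does."""
    labels = [categorize(p) for p in points]
    return [lab for i, lab in enumerate(labels) if i == 0 or lab != labels[i - 1]]

if __name__ == "__main__":
    trajectory = [(300, 2200), (400, 1800), (650, 1200), (350, 900)]
    print(categorize_trajectory(trajectory))  # -> ['i', 'a', 'u']
```

In this view, only the resulting category sequence, rather than the detailed coordinates of the trajectory, is treated as perceptually relevant.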
Withvariation in the duration of movements, length, and curvature of the path, movements targeted toreach fixed terminal points may vary in different ways while taking advantage of the redundancyof the motor control system. All these variations bring changes in the effort level of the task. Thehigher the length or curvature, the greater is the effort required from the motor control pathway.Given a number of possible movement trajectories, the human motor control system tendsto choose the more rewarding and least effortful option to complete a given task. Therefore, thecomputation of motor control efforts is central to the investigation of human motor skills. However,it is extremely difficult to objectively determine the effort level as it cannot be directly assessed andwe have very little understanding regarding how the brain represents these motor costs or efforts.4One way of estimating the perceived effort is to follow the energy-metrics. The metabolic energyconsumption related to a movement is known to be interrelated to the movement efforts. However,the measurement of the energy-related costs and their relationship with different movement pa-rameters is not straightforward. Besides, the energy calculations tend to consider only the physicalefforts avoiding the cost of mental computations incurred in the motor control information pathway.Furthermore, the fine motor skills generally end up varying the metabolic energy negligibly withrespect to the actual efforts needed to perform the fine-grained task.The other way of predicting the effort associated with a motor-control task is to computethe control costs centering around the notion of the difficulty level of the task itself. This is theinformation-theoretic view of motor control proposed by Fitts [53, 54], that takes into considerationthe effort of the underlying motor control pathway through the computation of throughput orinformation rate. For goal-directed movements, farther targets require movements of longer distanceand duration, i.e, of presumably greater effort. Similarly, wider targets require movements of lesserduration and therefore less effort. But, the question arises regarding how to formulate the effortof user’s movement using the information theory in a partitioned quantal space, where the task isto move from one particular discrete block or region to another such region. Besides, we have noprevalent robust formulation of the index of difficulty if the task is a trajectory task instead of atarget-reaching task, where the user movement has to abide by the given trajectory constraints. Theanswers to these questions will help to extend the prevalent works on the formulation of the indexof difficulty as a better, reliable measure of effort in different task spaces and has been addressedin this work. The following toy example will also make it clearer as to why viewing effort from theperspective of the task complexity seems more acceptable.1.1.3 An example: Buzz Wire gameBuzz Wire or Wire Loop game is a children’s toy that involves traversing a wire using a loop andwand from a definite start to an endpoint as shown in Fig 1.2. If the loop controlled by the userthrough the wand touches the wire, the circuit is closed and it makes a buzzing sound. So the motortask here is to guide the metal loop along the serpentine length of the wire without touching theloop to the wire. 
The difficulty level of any particular version of the toy depends on the shape or curvature of the twisted wire, the length of the wire, and the size of the loop. The performance of the user depends on their fine hand motor control skills. Besides being used as a toy, this apparatus is often used to supplement rehabilitation of autistic and stroke patients by helping them to regain control of their hand movements with repeated exercise [23, 24]. Studies showed that the most difficult part of the game is traversing the curved wires accurately, which requires rotating the hand (mainly the wrist) to the correct degree.

Figure 1.2: The Buzz Wire game

Let us consider a multi-level version of this game with an increasing level of complexity. Intuitively, the number and degree of bends in the wire influence the difficulty, which is in turn determined by the number and relative location of the control points in the wire, as shown in Fig 1.3. The harder the task, the more effort is required, i.e., the greater the demand on the motor control pathway. This example makes it much clearer why we need to consider a subjective description of the task space to formulate the corresponding effort metric.

In the above analysis, we ignored an important factor - the size of the loop attached to the wand. The larger the loop, the lower the difficulty level of the task. This is because it reduces the risk of touching the wire and the necessity of taking sharp bends by avoiding collisions between the loop circumference and the wire. This is explained in Fig 1.4 with the last example of Fig 1.3. Although seemingly apparent, it indicates a very important aspect of the motor control task related to the reduction of effort. If the user is given the advantage of selecting the size of the loop, the user will obviously choose the loop with the largest circumference to make the task easier. This suggests that, given any opportunity to reduce our efforts, our decision making and motor control system will want to take advantage of it. We argue that this is the case with speech motor control too. The speech motor control system learns to leverage the perceptual, physical, biomechanical and any other constraints to choose the easiest way of performing a speech task.

Figure 1.3: Variation of difficulty levels of the buzz wire game. The greater the number of control points and the sharper the turns, the higher the difficulty level

Figure 1.4: Decrease in difficulty levels of the buzz wire game with an increase in loop size

Figure 1.5: Variation of difficulty in an augmented version of the buzz wire game

Now, to understand the connection of this example with (speech) motor control constraints better, let us consider augmented variants of the same buzz wire game with two wires and a pointed wand (without the loop) as shown in Fig 1.5. The new task is to steer the pointed wand (on a 2D plane) between the two wires in such a way that it does not touch either of them in its journey from the initial to the terminal point. It is intuitive that the task becomes easier when the distance between the wires increases, thereby reducing the effort required from the user. This is equivalent to the tunnel task described as a general formulation of Fitts' task [3, 4, 53, 54]. The wider the tunnel, the lower the difficulty level of the task.
The user can therefore learn to takeadvantage of reducing the mean curvature as well as the length of the trajectories by smootheningout the bends.The game changes when the upper wire has a few gaps (or cuts) as shown in the last arrangementof Fig 1.5 as a result of which, the user learns to take advantage of the gaps in order to avoid thehard parts, i.e., the bends. In this last example, the user utilizes his visual perception to identify thegaps and accordingly adapts his trajectory by smoothening it out, thereby leading to a significantdecrease in the efforts. The extent to which the complexity of the task is reduced depends on thenumber and location of the gaps as well as any imposed success criterion. Unlike the previouscases where the throughput of the user was only dependent on the length, curvature, and distancebetween the wires, here, the throughput is additionally dependent on the user’s ability to identifyand take benefit of the gaps in the wire. This is equivalent to the idea of how perception helps8to adjust our movement trajectories by allowing shortcuts to the destination. And this providesvaluable insights regarding the movement of the articulators and the underlying coordination ofmuscles in speech motor control. If we consider the gaps in the upper wire to be dynamicallyvarying location based on the path chosen, then that means the gaps are constantly influencingthe user’s trajectory and the extent to which he can simplify his movements. This is analogous tothe speaker tuning his articulators to the perceived sounds in the auditory areas of the brain andchanging his effort level for moving the articulators based on the auditory perception.Speech is an incredibly complex motor control task but the perceptual and biomechanical con-straints help to make the task easier, in a fundamentally similar way as described above. Thearticulators and the motor control pathways behind the articulatory movements have learned wellto transform the piecewise target reaching tasks to a different task altogether through the uti-lization of a perceptually aware non-linear mapping. In the modified task space, the articulatorscan therefore take shortcuts avoiding hard parts of the movement while still creating the same se-quences of sounds in the auditory perception center. We therefore now turn to a brief introductionof the speech production and perception mechanism to set the background of the proposed researchquestion.1.1.4 Speech Perception and ProductionSpeech communication requires accurate control of the articulatory gestures and precise catego-rization of the acoustic signals [122]. Several pieces of evidence demonstrate that the regions ofpremotor and primary motor cortex involved in speech production also participate in speech per-ception [43, 137, 140, 191, 193]. This suggests that speech perception and production are tightlycoupled and partly rely on the same neural mechanism. While listening to speech, the highlyvariable acoustic signals are sorted into different discrete phoneme categories determined by thelanguage. The phoneme category boundaries divide the entire acoustic space into contiguous butqualitatively discrete regions such that the sounds drawn from the same side of boundaries areperceived to be similar, whereas the sounds sampled from opposite sounds of the boundaries areperceived to be different. In other words, a listener will perceive one phoneme or the other insteadof something intermediate. 
This aspect of speech perception is known as categorical perceptionand plays a significant role in speech production.Articulatory speech production [6] involves instantaneous dynamic shaping of the vocal tractin such a way that the resultant time-varying acoustic patterns allow the listeners in identifyingthe originally articulated speech sounds. It is one of the most complex processes with the ne-cessity of a refined spatio-temporal control of articulators which is apparently quite difficult tomaster. However, it is astonishing how we can perform such complex articulation spontaneously9without considerable effort. In the continuous speech, the brain has to deal with the challengingtask of rapid and accurate coordination of a set of redundant and interacting articulators, whichrequires the multi-dimensional control of multiple articulators at a dauntingly high rate. Theneuro-computational bases behind such control are still not well understood and how the humancontrol system, comprised of the brain and central nervous system, manages to perform it accu-rately and spontaneously, is an open question in the domain of motor control [66, 67, 175]. In thiswork, we endeavor to address this question by investigating the effect of perceptual feedback-basedregulations for the control of vowel sounds.The estimation and implementation of continuous change in vocal tract geometry take timeand hence, the quality of the generated sound heavily depends on the time duration available forthe utterance. During the fast and spontaneous speech, the articulators might have too little timeto move in place and thereby end up undershooting the targets [99]. However, in most cases, theauditory perceptual system fills in the gaps, as a result of which, fast and compressed speech stillsounds quite normal as well as easily understandable. Our auditory feedback or self-perception ofspeech is said to play a crucial role in speech motor control [56]. The auditory perceptual networkprovides rich afferent information to the brain during speech production that can be utilized forbetter control and sequencing of the articulatory movements. Besides, it has been experimentallyshown that perturbing auditory feedback leads to disruption of speech production and increasedspeech errors [8, 93]. Despite several studies claiming the importance of auditory feedback inadjusting articulatory processes, there has not been any significant investigation on the underlyingmechanism demonstrating how the perceptual system reduces the difficulty of the speech motortask. In this work, we develop a kinematic-to-acoustic mapping and put forth a plausible speech-related action-perception model depicting how an individual can learn to leverage the sensoryfeedback in order to reduce the complexity of the action space.1.1.5 Quantal Nature of SpeechThe word ‘quantal’ is typically used to indicate the non-linear effects in speech, specifically theconnection of articulatory movements with the acoustics as well as the auditory perception. Thismeans that a ‘quantal’ articulatory space consists of stable regions in the vocal tract space inwhich a large variation or movement will produce little response in the acoustic or perceptual spacewhereas a little movement in the unstable region separating the stable regions will result in drasticchanges in the acoustic or perceptual space. 
Mathematically speaking, let X denote an articulatory parameter (say, position) and Y denote a consequent perceptual parameter (related to the perceived formant frequency), with Y = f(X) being the non-linear mapping between the two spaces. The quantal theory of speech suggests that f(X) has regions of low slope (i.e., where |df/dX| is small) as well as regions of high slope (i.e., where |df/dX| is large). The low-slope regions, where a large change in X causes a minute change in Y, are considered to be stable, whereas the high-slope regions, where a small perturbation in X causes a striking change in Y, are considered to be unstable.

Figure 1.6: Non-linear variation of the acoustic parameter with respect to the articulatory parameter, divided into three regions

A hypothetical relationship between an articulatory and an acoustic parameter based on the above discussion is schematically represented in Fig 1.6. Region II designates the unstable region or partition that results in large changes in acoustics with small shifts in articulation. Regions I and III represent plateaus or stable regions of the curve where the acoustic variation with respect to articulation shifts is negligible. The difference in the acoustic parameter is large between regions I and III, which implies a significant acoustic contrast between the two regions segregated by the intermediate region II, where the acoustic parameter changes abruptly. The acoustic attribute undergoes a qualitative change, resulting in a change of auditory perception, as the articulatory parameter proceeds through region II.

For example, let us consider the alveolar versus palatal sounds. By moving the tongue tip a small distance before or behind the alveolar ridge (the extension of the maxilla right behind the teeth), we observe a dramatic change in the acoustic spectrum, leading to the distinction between 'sip' and 'ship'. Therefore, the alveolar ridge in this case can be regarded as an unstable region or partition. Similarly, our vowel distinctions in perceptual space are also created by some unstable regions in the articulatory space. In general, listeners cannot distinguish between two vowel sounds if their formant frequencies in the vowel spectrum are close enough and not separated by such an unstable region.

This tendency of quantal relations between the vocal tract configurations and the acoustic space, or between the acoustic space and the auditory parameters, is suggested to be a principal factor determining articulatory gestures and movements. Our biomechanical system underlying these vocal tract configurations is presumably aware of these quantal relationships or perceptual constraints and has learned to take advantage of them to reduce the degrees of freedom of the human vocal tract, which is endowed with seemingly innumerable degrees of freedom. Motivated by the theoretical studies on the quantal nature of speech and the studies on the reduction of degrees of freedom of speech motor control (e.g., bracing), we attempt to devise an experiment with computational models of the perception block as well as non-linear mappings between goal-directed movements and formant categories. The experiment should be able to demonstrate whether a user is able to utilize the perceptual constraints in a way similar to the propositions of the quantal theory in order to make the given task easier.

1.2 Experiments with hand gestures and movements

The human hand has an extraordinary capability of performing a wide range of voluntary movements.
Such movements are adaptable with response to modification in task demands, utilizing theavailable degrees of freedom. The small movements and actions occurring in the wrist, hand, andfingers are all parts of fine motor skills. Numerous daily tasks including typing, manipulation, etc.require frequent use of such fine hand motor skills. These skills involve coordinated movements offingers and wrist constrained by the musculoskeletal biomechanical parameters. The contributionof different constraints of hand motor control depends on the specific task requirements. However,such constraints do not impair hand and finger movements.Such movement information can be well collected by data glove sensors or pointing devices. Thedata glove captures the degree of flexion/extension, abduction/adduction of the fingers as well asthe flexion/extension of the wrist. A pointing device like the mouse captures the dragging motionof the hand as a combination of the lateral wrist movement and finger flexion/extension. Thesedevices can therefore be used to collect kinematic measurements, i.e., hand motion coordinates fortrajectory tasks.In this thesis, our intention is to investigate, from an information-theoretic aspect, the behav-ioral characteristics of hand movements for a set of trajectory tasks with perceptually relevant,intermediate, and terminal control points located within an enclosed 2D space. For this, we firstexplore both individual (disentangled) and simultaneous (joint) control of human movements in acontinuous space via glove. In the first paradigm of motor control, we will consider the simultane-ous joint control of both the dimensions (abscissa and ordinates of the points in trajectory) of thespace. Different hand gestures will be continuously created by bending the phalangeal, carpal, andmetacarpal joints to appropriate degrees. Such joint flexion and extension will result in the fingertip12reaching different target points in a vertical 2D plane. The second paradigm of motor control willinvolve an independent control of two different dimensions. The wrist extension and flexion willbe used to increase and decrease the ordinates while the finger abduction and adduction will beused to increase and decrease the abscissae respectively. Control of these two dimensions together,but in an independent way, will bring about continuous changes in the coordinates. Lastly, in thethird paradigm of motor control, we will perform a mouse-based experiment, where the user willbe navigating the cursor in a 2D horizontal plane in order to make the desired trajectories. We willthen repeat these experiments in a transformed 2D space partitioned by pre-decided perceptualboundaries and investigate if the perceptual categories have an influence on their trajectories andthereby their effort level.1.3 Why hand motion in articulation?The speech motor control can be seen as the result of a chain of information flow from the centralnervous system, through the intermediate activation of muscle synergies. However, it is difficultto concretely conclude about different aspects of the tongue motor skills in terms of coordination,speed, range of motions, etc. from the resultant articulatory movements. This is because the vocaltract, being a confined space, is not easily measurable for real-time applications and it is incrediblychallenging to quantify the tongue movements without hampering or interfering with spontaneousnatural speech movements. 
Besides, the oral motor skills are already fully developed in adults andhence, it is extremely challenging to control the constraints in order to observe their effects on theindex of difficulty (ID). In computational terms, this means that our speech production process is aperfectly trained system and therefore we lack the vital information about the learning phase of thesystem required to draw fundamental conclusions about the trade-offs in articulatory movements.An alternative way of exploring the same motor control problem is to use our hand movementor gesture space for learning to control the vowel space, analogous to learning to play a newmusical instrument. The continuous hand movements or gestures can therefore be thought of asthe continuously controlled tongue movements within the oral tract generating variations of speechsounds. Since the hand gestures are not generally used to match acoustic targets in our daily life, itgives us an opportunity to study how humans coordinate muscle groups and adapt to learn a newmotor control scheme when required. This idea of using the hand movements and gestures (flexion,extension, abduction, and adduction) will thus not only allow us to develop better silent speechinterfaces but also let us investigate the difficulty levels of different motor skills during the trainingas well as post-learned phase. More importantly, this space allows the measurement and control ofthe independent variables using hand gestures and movements, that can produce continuous speechsounds, as an effective methodology to study how the perceptual feedback affects motor skills and13ID trade-offs. Therefore, it will eventually provide us with a better understanding regarding whythe speech motor control in the native language appears to be a rather simple and easy task forus to perform even though it involves incredibly complex tongue movements, brought about by thecoordination of 11 major muscle groups.1.4 Research Questions and ContributionsThis thesis targets a specific lacuna in the field of human motor control, particularly, speech motorcontrol. The primary research questions that we intend to address are summed up as follows:1. How to quantify the difficulty level of a given kinematic trajectory task in a uniform non-quantal as well as a quantal formant frequency space? Is there a difference between thedifficulty levels of the same overall task in these two different spaces?2. Is it possible for the human motor control system to take advantage of a non-linear categoricalperceptual mapping to reduce the effort level, i.e., improve the throughput?We try to answer these two fundamental questions related to the information theory of speechmotor control through a hand kinematics-to-formant mapping as discussed previously. The pri-mary contributions contained within this dissertation involve the demonstration of the reduction indifficulty or effort level of human motor control leveraging categorical perceptual constraints. Weintroduced suitable indices of difficulty that can quantify the complexity of tasks in vowel spacewith respect to the length and curvature of the trajectory as well as the width of the perceptualcategories. 
With the help of a validation study involving a user and different control paradigms, weshow that it is possible to reduce the task complexity of the space as well as increase the through-put of the user in performing trajectory tasks by taking advantage of proposed deep-learning-basednon-linear mapping.Additionally, we present the following pivotal contributions in the investigation of (speech)motor control:1. We found that Graph Convolutional Neural Networks (CNN) with Gaussian Radial BasisFunction (RBF) adjacency kernels are particularly suitable for extracting the kinematic fea-tures from our input space. They can effectively capture the relative distance informationof the user’s instantaneous position with respect to the location of different cardinal vowelsthrough the graph structure.2. We found that recurrent neural network models (eg: LSTMs) coupled with ConnectionistTemporal Classifier (CTC) can be used to model the quantal vowel formant space for analysisof categorical perception.143. Our analysis in terms of quantitative difficulty index among different control paradigms usingglove and mouse for trajectory task in a 2D formant space showed that simultaneous jointtwo-dimensional glove control with dataglove was comparable to the mouse-based controlin terms of indices of difficulty and performance. Controlling vowel trajectories with theindependent 1D+1D control using the glove was observed to be the hardest.4. Through our proposed model, we provided an intuition that articulators possibly convertthe continuous vocal tract space into discrete contiguous regions, thereby changing the ar-ticulatory task altogether for making speech easier. This provides a plausible quantitativeexplanation behind the quantal theory of speech.5. We found the feasibility of web-based (Pink Trombone [171]) and PC sound interfaces (VT-Demo [1] ) as well as proposed mechanical sound interfaces (’Sound Stream’ and ’SoundStream II’) in investigating the motor control problem in articulatory and muscle space asan extension of the currently used acoustic space. Well-designed experiments with these in-terfaces in future can help us to take a step towards determining the amount of informationneeded to control the articulation task.Besides, we also contributed towards connecting the acoustic space with the active thought space(via EEG) and articulatory motions (via medical imaging modalities and synthetic images) whichwill facilitate further research in speech motor control and the development of silent speech in-terfaces. While our primary contributions, discussed previously, attempt to find the informationrequired to perform a speech-related task, these works aim to find whether this requisite informa-tion can be extracted from the noisy brain signals or images. We derived the insight that the directrecognition of speech tokens with large vocabulary is incredibly challenging as it possibly demandsmore information from the input devices than that is available. This explains why the scalability ofthe recognition-based silent speech devices to a large vocabulary space is poor. 
Our investigationsalso give us the insight that the development of a robust silent speech interface needs to leverageconstraints in a way fundamentally similar to the articulatory process, in order to make it easierfor the neural network-based mappings to detect and process the minimal information necessary togenerate the speech tokens.The secondary contributions of the thesis related to the silent speech interfaces are summarizedas follows:1. Our MRI-based automatic speech recognition system demonstrated satisfactory performancefor vowels and consonants but failed to recognize vowel-consonant-vowel transitions accuratelyfrom the articulator movements.152. For the first time, we established a successful end-to-end mapping between the ultrasoundtongue images and formant frequencies, that bridges the gap in silent speech interfaces andopens a new dimension for articulatory speech research. We also provided evidence that ournetwork has the ability to model an internal representation of the tongue by optimizing anon-image-based loss function and without requiring tedious manual annotation for tonguetracing. This mapping can be deployed to explore the connection between continuous tonguemovement and the changes in formant frequency.3. In order to allow both-sided control of articulatory and acoustic domains, we developed ajoint latent encoding between the articulatory and acoustic representations of vowel soundsvia convolutional autoencoders and normalizing flow-based invertible mappings, while simul-taneously preserving the respective domain-specific features.4. Our hierarchical deep neural networks composed of parallel spatio-temporal CNN and a deepautoencoder achieved success phonological and speech token prediction from imagined speechEEG data. Our work suggested the existence of a brain imagery footprint for underlyingarticulatory movements representing speech tokens. However, it was concluded that withthe higher number of speech token categories, the classification performance deterioratesdrastically thereby suggesting the limitation of information extraction capacity of the EEG-based data acquisition systems coupled to proposed methods.1.5 Thesis OutlineThe remainder of this thesis is structured as follows:In Chapter 2, we present some contextual information on gesture-to-speech interfaces, the no-tion of the index of difficulty and its implication in speech production, vowel formant space, andauditory perception. This background information is deemed necessary to understand the rest ofthe manuscript. The main contributions of this dissertation are then presented in Chapters 3, 4,and 5.Chapter 3 presents the possible ways of calculating the difficulty level of the aforementionedhand movement task with toy examples, centered around the idea of Fitts’ index of difficulty thatwould reflect the efforts of motor control.Chapter 4 then uses a deep learning-based perception network and gesture-to-formant map-ping network to develop an efficient computational model for illustrating the impact of categoricalperception in hand-to-formant motor control.In Chapter 5, we discuss the experimental evaluations of the proposed networks and the baselinearchitectures for glove and mouse-based control paradigms. 
We also analyze the indices of difficulty16and performance for different trajectory tasks and draw inferences based on the results.Next, in Chapter 6, we present our exploration of different sound control interfaces that canact as future direction for investigating motor control in articulatory space. We also report ourattempts in connecting brain signals and articulatory movements to speech.Finally, in Chapter 7, we end the thesis by summarizing our work, followed by descriptions ofpotential future directions and concluding remarks.17CHAPTER 2Background and Previous WorksIn this work, we primarily explore the hand movements in a 2D formant frequency space given somecategorical perceptual constraints. Therefore, in this chapter, we first review the other attemptsat connecting hand movement to speech in the form of gesture-to-speech interfaces. Next, weintroduce the existing measures of difficulty and performance in motor tasks. Furthermore, we alsopresent a brief review of the past works investigating speed-accuracy trade-off behavior directlyin speech production. In addition, to familiarize the users with the acoustic space, we provide anoverview of the formant frequencies and their importance in the study of vowel sounds. We endthe chapter by discussing how auditory perception is related to these formant frequencies.2.1 Gesture-to-speech interfacesOur starting point is the previous work on Glove-Talk II by Fels and Hinton [45, 46, 48–50], whichtranslated hand gestures to a set of speech formant parameters through data glove based adaptivespeech interface. It used Radial Basis Function Networks (RBFNs) [130] to perform gesture-to-vowel mapping based on instantaneous X-Y co-ordinates of Cyberglove [82]. However, despitebeing a continuous mapping, it solely relied on discrete point-to-point transformation withoutany encoding of the temporal information. Hence, it appeared considerably difficult for a userto conveniently control the vowel transitions in formant space, thereby limiting its utility in real-world scenario. The applications and investigations of the aforementioned work have been furtherreported in [27, 47, 51, 52, 55, 96, 106, 127, 136, 188].Another hand-to-vowel conversion system proposed by Kunikoshi et al. [92] transformed a setof five discrete hand gestures to five Japanese vowels using the Gaussian Mixture Model (GMM)based mapping function. However, their method had a limited vowel space that cannot be easilyextended. It also used distinct hand gestures corresponding to individual vowels, thereby makingit difficult to make continuous vowel transitions through continuous hand movements. Two similarother works by Ogata et al. [128, 129] focused on training users to produce different patterns ofthree consecutive vowels with five and three fingers respectively. However, it was based on thedirect calibration of the data glove to formant space and did not address the mapping aspect ofthe problem. Another recent work [172] used a touchscreen-based vowel formant synthesizer to18investigate learning and adaptation in speech production independent of the vocal tract. Thisstudy primarily dealt with the users learning an internal audio-motor mapping that enabled themto generate desired speech sounds with the touchscreen. 
The paper did not intend to aid the learning process of the individuals in controlling their hands or fingers and therefore, the spatial dimensions of the touch screen were linearly mapped to the formant frequencies.

The most recent works on these glove-based adaptive speech interfaces with human-in-the-loop control were done by Liu et al. [101, 102]. These works contribute towards disentangling the 2D input formant control space into two independent 1D control spaces in order to make it linguistically more intuitive. But the fundamental questions of finding a convenient sequence-to-sequence mapping and quantifying the difficulty level of the tasks still remain unaddressed. Therefore, while there have been quite a few works on improving the interface aspect and manipulating the input space of the system, little research has been devoted so far towards addressing the actual mapping problem, particularly the reduction of the difficulty level of the continuous control of a multi-degree-of-freedom system. To this end, we build our proposed model as an extension of Glove-Talk II by Fels and Hinton [45, 46, 48–50].

2.2 Index of difficulty and performance

One of the most important phenomena in human motor control is the trade-off between movement speed and spatial accuracy, also known as the speed-accuracy trade-off. The speed-accuracy trade-off in the human motor system is usually formulated using Fitts' law [53, 54, 107], a well-known behavioral model which quantifies the capacity of the human motor system to perform discrete targeted motor actions like pointing and reaching tasks [53], foot movements [39], human-computer interaction [10], balance and posture [40], aimed wrist movements [118], rotary hand turning [79], etc.

It is an information-theoretic view of motor control behavior motivated by Shannon's theorem in communication theory. Fitts demonstrated that the time required to rapidly move to a target follows a linear relationship with the difficulty of the task. He defined the task difficulty (in bits), known as the index of difficulty (Id), based on an information analogy, as a logarithmic function of the ratio of the movement distance (A) and the error bound (W) in reaching the target. He further defined the index of performance or throughput (TP) of the motor task (in bits/s) as the average rate of information generated by a series of movements and quantified it as the ID divided by the movement time (MT). This is shown in Fig 2.1 and can be expressed by the following equation:

MT = a + b \log_2\left(\frac{2A}{W}\right) = a + b\,I_d    (2.1)

where a is the intercept and b is the slope of the linear equation. a and b are the regression coefficients obtained using MT as the objective variable and Id as the explanatory variable. These coefficients are used to quantify a measure of performance of the user's motor control pathway in carrying out a given task and depict how quickly the information is processed.

Figure 2.1: Fitts' serial tapping task experiment and its data analysis

This emphasizes that a decrease in the error bound, i.e., an increase in the desired spatial accuracy at a given distance, increases the movement duration, and vice versa. Therefore, Fitts' law yields the expression of a linear speed-accuracy trade-off function using the features a and b. It is to be noted that Fitts' law is a behavioral empirical model rather than a model based on human physical dynamics. Human motor control is fundamentally modeled by the laws of biomechanics, kinematics, and dynamics.
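Before turning to the dynamical perspective, a brief computational sketch may help make the regression in Eq. 2.1 concrete. The snippet below is an illustrative sketch only, not analysis code from this thesis; the data values are hypothetical. It fits the intercept a and slope b by ordinary least squares and derives the throughput as the reciprocal of the slope.

```python
import numpy as np

# Hypothetical per-condition observations: index of difficulty (bits)
# and mean movement time (seconds), e.g. averaged over repeated trials.
id_bits = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
mt_sec  = np.array([0.12, 0.21, 0.30, 0.41, 0.49, 0.61])

# Ordinary least squares fit of MT = a + b * Id (Eq. 2.1);
# np.polyfit returns the slope first, then the intercept.
b, a = np.polyfit(id_bits, mt_sec, deg=1)

throughput = 1.0 / b  # index of performance, in bits per second
print(f"a = {a * 1000:.0f} ms, b = {b * 1000:.0f} ms/bit, "
      f"throughput = {throughput:.1f} bits/s")
```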
From a dynamic modeling perspective, it has been shown that the time durationrequired to move different arm components including forearms, wrists, fingers, etc. is dependenton the moment of inertia and the muscle torque length of those components. However, Fitts’ lawcan explain human arm movements quite well in most conditions. This is because Fitts’ law has astrong connection with human arm dynamics. In a recent study [170], it has been shown that thecoefficients of Fitts’ law have a deeper meaning in the context of human motor performance andcan be expressed by joint forward and inverse arm dynamics considering signal-dependent noiseparameters in generalized multi-joint arm movements.In this work, we use the fundamental concept of Index of difficulty from Fitts’ law and itsvariants including Steering law [3] and Welford formulation [192] to quantify the complexity of20different given tasks and to reflect upon how our proposed model reduces the difficulty of the tasksthereby reducing the information rate demand i.e. the effort required from the user’s motor controlpathway in the context of 2D kinematic hand-to-formant control.2.3 Fitts’ law in speech productionRecently, Fitts’ law has also been investigated for speech-related motor tasks. The first work in thisarea was performed by Lammert et al. [94] using real time magnetic resonance imaging (MRI) dataof 5 male and 5 female native English speakers from a reading task of the USC-TIMIT database.The movement amplitudes and the articulatory targets were defined as elements of approximately50-dimensional vector space. The correlation strengths of the linearity between movement timeand index of difficulty were reported to be low (between 0.03 and 0.52). The study revealed somemethodological challenges in defining and evaluating the Fitts’ law variables from the image-basedarticulatory data. It did not contain explicit information about the values of independent variables(i.e., movement amplitude and target width) and the temporal pressure (i.e., speaking rates), thatare the prerequisites for Fitts’-like experiments. Nevertheless, the work was a first attempt in Fitts’law based-investigations in articulatory movements and acknowledged the challenges arising fromthe methodological limitations in speech tasks.Further, in a theoretical work [161], Sorensen et al. established a connection between the taskdynamic model of speech production and the Fitts’ law. They concluded that the throughput of thespeech task is the square root of the ratio of gestural stiffness and mass, which are the parameters ofthe exponential movement trajectory of the Task Dynamics model explaining speech motor control.This derivation was found to be more general and accurate than that of Lammert et al. [94]. Anextension of the work by Sorensen et al. [159] dealt with further empirical investigations usinga total of 54 electromagnetic articulography recordings of 9 participants, who repeated a pair ofwords (eg: ”top-top”) with varied speaking rates. They reported an information throughput ofabout 32 bits/s at the edge of prosodic domain whereas a throughput of around 39 bits/s insidethe prosodic domain. In other words, their investigations demonstrated steeper slope of Fitts’ lawat prosodic boundaries than inside prosodic domain. Besides, they also found steeper slope formovements in syllable onset position than for movements in syllable coda position. 
In general, theyconcluded that Fitts’ law was found to be applicable for broad range of articulatory movements forboth tongue and lips.Kuberski and Gafos [90] investigated the applicability of the Fitts’ law in repetitive speechmovements using a metronome-driven speech elicitation paradigm. They recorded articulatorydata from 6 adult speakers corresponding to consonant-vowel transitions (repeated [ta] or [ka]sequences) spoken at 8 different rates (ranging from extremely slow, i.e., 350 bpm to extremely21fast, i.e., 570 bpm) via electromagnetic articulometry. The movement amplitude was determinedas the 3D Euclidean distance between the starting and end points of the movements. The effectivespatial articulatory target width was computed using the trivariate deviation of the end pointsaround the centroid following the work of Wobbrock et al. [194] individually for each speaker andeach metronome rate. The study demonstrated the relevance of Fitts’ law only for faster rates,especially beyond a participant-specific critical speaking rate. No clear evidence for Fitts’ law wasfound at the slowest metronome rate, which goes with the basic assumption of the original Fitts’experiment for tapping task. The authors also investigated the applicability of another celebratedlaw of motor control, known as speed-curvature power law in [91] that examined the relationshipbetween the kinematic property of speed and geometric property of curvature in the context ofarticulatory movements, using similar dataset. Their initial study revealed that the rate andgeometric configuration of articulatory movements potentially follow a speed-curvature relationshipsimilar to the one stated in the law, but the law’s exponent values are highly dependent on thespeech rate.2.4 Vowel formant frequency spaceSpeech is characterized by speedy changes in articulation and its acoustic response. The frequencyresponse of the vocal tract is generally characterized by the frequency of its resonances, also calledformant frequencies. The resonant frequencies change based on the variation in geometrical struc-ture of the vocal tract while we speak and as a result of that, we get a continuously varying acousticfrequency response. F1 and F2 formant values are highly associated with their point locations inthe articulatory diagram of the International Phonetic Alphabet and can be used to constructacoustic working space that is strongly correlated with the articulatory space. The acoustic spacecan then be used to infer the articulatory working space and articulatory kinematics. For theconstruction of the acoustic space, the general practice is to acoustically represent the vowels assingle points in a 2D plane defined by the first and second formant frequencies as listed in Table2.1 and shown in Fig 2.2. The most dominant articulator that brings about the change in vocaltract configuration is the tongue. The position and shape of the tongue are most influential indetermining the non-nasal resonant frequencies during articulation. The correlation between thetongue position and formant frequencies becomes clearer while analyzing the cardinal vowel quadri-lateral in formant space, where the horizontal dimension represents the first formant frequency (F1)and the vertical dimension represents the second formant frequency (F2). It is observed that theheight of the tongue in the mouth is inversely related to the first formant frequency, i.e., low (high)tongue body implies high (low) first formant frequency. 
Similarly, the second formant frequency is dominated by the frontness or backness of the tongue body, i.e., a front (back) tongue body position implies a high (low) second formant frequency. Therefore, by moving the tongue body around, it is possible to manipulate the position of the formants continuously. In this study, we manipulate the formant frequencies using hand movements instead of tongue movements to explore a hand-to-formant mapping paradigm.

Figure 2.2: The vowel quadrilateral including the cardinal vowel locations and their relationship with tongue positions

Table 2.1: Formant frequencies of nine cardinal vowels

Vowels   First formant (Hz)   Second formant (Hz)
i        300                  2350
e        425                  2150
E        580                  1850
æ        770                  1790
@        600                  1300
a        800                  1170
O        540                  840
o        400                  700
u        300                  750

2.5 Formants and auditory perception

The vowels are known to be recognized in the peripheral auditory system based on the spatial patterns of excitation [18, 135, 168]. The formant frequency space has been shown to play a crucial role in identifying the phonetic identity of acoustic signals and has been investigated widely in the past to construct perceptual models based on auditory patterns in different languages [11–13, 19–21, 32, 44, 133, 152–155, 167, 168, 178, 179, 204]. Vowel perception is highly sensitive to changes in the formant ratio, which is considered to be a dominant factor in auditory perception. Studies further demonstrate evidence of a critical distance or a critical band in formant space around different vowel point locations that drives our categorical auditory perception [11–13, 19–21, 152–155, 168, 178, 179]. Earlier quantitative investigations in this direction involved empirical transformations of the formant space, followed by classification of the resultant patterns based on some well-known criterion of distance thresholds [13, 133, 153–155, 167, 168, 179, 204]. However, the empirical transformations and classification criteria were not robust enough and were found to vary from person to person. In this work, we revisit this idea of determining vowel perception from the formant frequencies and replace the empirically selected transformation and classification steps with powerful deep-learning-based temporal feature extractors and classifiers.

2.6 Conclusion

In this chapter, we reviewed the basic ideas of formant frequencies and their connections with the articulatory and auditory perceptual spaces. The essentials of the information-theoretic view of motor control were also covered. A more detailed discussion of these concepts will be presented in the next chapters, where we formulate our difficulty metric for the computation of movement task complexity in the 2D formant space and use deep learning-based methods to develop a categorical perceptual mapping that allows the reduction of this complexity.

CHAPTER 3
Task Difficulty Computation

A reliable estimation of the complexity of a movement task space is crucial to the analysis of motor control behaviour. Fitts' law [53] is a model of human neuromotor control behaviour derived from Shannon's information theory that models human movement analogous to the transmission of information in a communication channel [108]. In Fitts' formulation, each movement task is assigned a specific index of difficulty based on the parameters controlling the task space and is expressed in bits of information. It states that the number of bits corresponding to the movement represents the number of bits of information transmitted by the human motor system to perform the movement.
The number of bits divided by the time required to perform the movement then denotes the rate of information transmission or throughput of the user's motor control system and is expressed in bits/second. The law deals with the measurement of human motor channel capacity, and one of its distinctive features is that it disentangles the complexity of the task space from the movement performance. In other words, Fitts' formulation tends to determine the inherent complexity of the given task in a way that is independent of the user's performance. The movement time can additionally be recorded to calculate the throughput or the user's index of performance separately. Consequently, the law gives us an opportunity to compare varieties of movement strategies and task mappings. It also enables us to evaluate a similar movement strategy over time and across different users for the same given task. This is because, based on Fitts' index of difficulty formulation, the task difficulty is fixed in terms of the information (in bits) required to perform it and does not vary based on the users' movements.

In this chapter, we first elaborate, in Section 3.1, on the Fitts index of difficulty used in different tasks as discussed in the literature. In Sections 3.2 and 3.4, we point out the importance of additional factors like curvature and 2D target width respectively and explain why they need to be included within the index of difficulty formula. Furthermore, in Sections 3.3 and 3.5, we explain how we adapt the metrics individually to a uniform space and a quantal 2D space in the context of our current problem. In Section 3.6, we discuss the significance and limitations of the proposed method of task difficulty computation. Finally, Section 3.7 presents a brief overview of the chapter and points out its main contributions.

3.1 Related works

Our proposed index of difficulty metric for motor tasks in formant space is mostly based on Fitts' law and the Steering Law. In this section, therefore, we review Fitts' original experiment, explain the need for alternative formulations and mention its closest variants, including the Welford and MacKenzie formulations. We then discuss the Steering Law, which was proposed as an extension of Fitts' law to trajectory tasks.

3.1.1 Fitts' original 1D experiment and its shortcomings

Fitts' original experiment prepared the basis for formulating the movement task as being limited by the information processing capacity of the motor control system, and hence it is essential to briefly discuss his study in this context. Fitts designed three movement tasks, viz., the reciprocal tapping, the disc transfer, and the pin transfer. The reciprocal tapping task involved moving a stylus back and forth between two plates as fast as possible, accompanied by tapping the plates at their centers. The disc transfer task dealt with transferring eight washers one at a time from the right pin to the left pin, while the pin transfer task aimed at transferring eight pins one at a time from one set of holes to the other.

The reciprocal task has been most widely discussed in the literature and has formed the basis of different other motor control experiments related to kinematic tasks, so we will only discuss the reciprocal tapping experiment here. For further details, the reader is encouraged to refer to [53, 54]. Before starting the experimental descriptions, we first quickly revise the concepts of the index of difficulty and index of performance as proposed by Fitts.
Based on Shannon's formulation, Fitts defined the index of difficulty of a motor task as:

ID = \log_2\left(\frac{2A}{W}\right)    (3.1)

Consequently, the index of performance or throughput of the user's motor system was defined as:

IP = \frac{ID}{MT}    (3.2)

However, this formulation has some limitations, which will be addressed as part of MacKenzie's reformulation in the next subsection.

In the main experiment, the target width (W) and target amplitude (A), as shown in Fig 3.1, were each varied across four values, i.e., W = 0.25 in, 0.50 in, 1.00 in, and 2.00 in, whereas A = 2 in, 4 in, 8 in, and 16 in. This resulted in index of difficulty values from 1 to 7 bits. Average movement times (each obtained from 600 observations) ranged from 180 ms to 731 ms. For each experimental condition, the index of performance was calculated directly by dividing the ID by MT. A striking observation was that the rate of information processing was found to be constant across the range of task difficulties. Fitts concluded that the mean IP of 10.10 bits/s (SD 1.33 bits/s) was presumably the average information processing rate of the human motor control system. The relationship between MT and ID was observed to be MT = 12.8 + 94.7 ID (with MT in ms), and the reciprocal of the slope yielded the reported information processing rate.

Figure 3.1: Fitts' reciprocal task

Although a high correlation was noticed between ID and mean MT, some issues were noted. For lower values of ID, scatter plots revealed an upward curvature of the MT away from the regression line, suggesting some non-linearity or departure from the proposed linear relationship. Besides, the relative contribution of the movement distance and the target width in Fitts' law is often questioned. The effects of these two factors are generally considered to be equal but inverse. However, doubling the amplitude factor leads to adding 1 bit (log2 2) to the index of difficulty and thereby increases the predicted movement time, whereas it was shown that a reduction in the target width can result in a disproportionate increase in movement time compared to a similar increase in the movement distance.

3.1.2 Close variants of Fitts' law

The two closest alternative formulations of Fitts' law were proposed separately by Welford and MacKenzie. Welford's law separates the movement amplitude and the target width components with different multipliers. The Welford formulation states:

T = a + b_1 \log_2(A) + b_2 \log_2\left(\frac{1}{W}\right)    (3.3)

It can also be reformulated by adding an exponent k to Fitts' formulation, which is known as Kopper's variation of Welford's law:

T = a + b \log_2\left(\frac{A}{W^k}\right)    (3.4)

There is another formulation close to Fitts' original formulation that states:

T = a + b \log_2\left(\frac{A}{W} + 0.5\right)    (3.5)

This modification reduces the curve in the best-fitting line between the movement time (T) and the index of difficulty (ID), thereby maintaining the linearity. The formula extends the movement amplitude by 0.5W, i.e., \log_2\left(\frac{A}{W} + 0.5\right) = \log_2\left(\frac{A + 0.5W}{W}\right), to consider the movement till the far edge of the target.

MacKenzie's formula is:

T = a + b \log_2\left(\frac{A}{W} + 1\right)    (3.6)

MacKenzie's formulation also ignores the factor of 2 but adds 1 instead of 0.5 to the A/W ratio to guarantee positive values for the ID. This formula is arguably the most used Fitts' law variant these days.
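To make the behaviour of these variants tangible, the following short Python sketch evaluates the index of difficulty under the original Fitts formulation (Eq. 3.1) and the Shannon/MacKenzie formulation (Eq. 3.6) for the amplitude-width combinations of Fitts' reciprocal tapping conditions. It is an illustrative sketch only, not analysis code from this thesis.

```python
import math

def id_fitts(A, W):
    """Original Fitts index of difficulty, Eq. 3.1: log2(2A/W)."""
    return math.log2(2 * A / W)

def id_mackenzie(A, W):
    """Shannon/MacKenzie index of difficulty, Eq. 3.6: log2(A/W + 1)."""
    return math.log2(A / W + 1)

# Amplitudes and target widths (inches) from Fitts' reciprocal tapping conditions.
for A in (2, 4, 8, 16):
    for W in (0.25, 0.5, 1.0, 2.0):
        print(f"A={A:5.2f} in, W={W:4.2f} in: "
              f"Fitts ID = {id_fitts(A, W):4.2f} bits, "
              f"MacKenzie ID = {id_mackenzie(A, W):4.2f} bits")
```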
Dropping the scaling factor, however, does not influence the value of the constant b; it only adds a component to a:

T = a + b\log_2\left(\frac{2A}{W}\right) = a + b\left(\log_2\frac{A}{W} + \log_2 2\right) = a + b + b\log_2\frac{A}{W} = a' + b\log_2\frac{A}{W}    (3.7)

where a' = a + b.

MacKenzie's refinement is generally preferred because it provides a marginally better fit with observations and precisely mimics the Shannon information theory underlying Fitts' formulation. Besides, as the term A/W approaches 0, the ID approaches 0 instead of negative infinity, which is more rational as it shows that the user has already reached the target.

Another group of variants of Fitts' law relates MT to ID using power functions as follows:

MT = a A^b W^c    (3.8)

MT = a\left(\frac{A}{W}\right)^b    (3.9)

MT = a + b\left(\frac{A}{W}\right)^{0.5}    (3.10)

These formulations find less usage in related applications nowadays. Nevertheless, each of them shows promising potential in modeling the speed-accuracy trade-off in specific varieties of movement tasks.

3.1.3 Steering law in trajectory task

Fitts' law in its original form is only applicable to pointing tasks and their close variants. However, it cannot be used for trajectory tasks like drawing, writing, etc., as the index of difficulty of a trajectory task should depend on the path representation rather than just the target representation and the linear distance between the targets. The mechanics of such 2D steering tasks has been extensively analysed by Accot and Zhai [3–5]. In their works, Fitts' law was extended to a Steering law in the context of trajectory task performance. The main idea motivating the law is that it is intuitively more difficult to steer a vehicle through a narrow tunnel than through a wide tunnel. Instead of the target width, they considered the path width as a determining factor of the task complexity. Let W be the fixed width, A the length of the path, T the movement duration, and a and b the regression coefficients as discussed in the previous section. Then the performance equation can be written as:

T = a + b\left(\frac{A}{W}\right)    (3.11)

Here A/W represents the index of difficulty of the curved path and 1/b represents the index of performance.

In most practical cases, however, the permissible width of the path is not constant, as shown in Fig 3.2. Therefore, the general formulation of the steering law for any curvilinear tunnel is expressed as:

T = a + b\int_C \frac{ds}{W(s)}    (3.12)

where W(s) is the width of the path C at the point s of infinitesimal length ds. Here the index of difficulty of the trajectory along the curved path is calculated as \int_C \frac{ds}{W(s)} (in bits) and the index of performance is computed as 1/b (in bits/second).

Figure 3.2: Tunnel task for the general formulation of the Steering Law

3.2 Role of trajectory curvature

In layman's terms, Fitts' law reveals an intuitive speed-accuracy trade-off in the context of pointing/target selection tasks. This means that the law essentially addresses only one type of movement and is not suitable for general motor control tasks. The Steering law forms an acceptable model for trajectory tasks, but we argue that it is not an adequate model for quantifying the complexity of trajectory-based tasks. This can be understood by simply considering the example of the augmented version of the buzz-wire game in Fig 1.4. The number and angle of bends clearly have an effect on the task difficulty, but this is not entirely reflected by the path length or the width parameter. This can be further elaborated with the help of an example.

Let us consider four tunnels with the same fixed length and width as shown in Fig 3.3.
3.2 Role of trajectory curvature

In layman's terms, Fitts' law reveals an intuitive speed-accuracy trade-off in the context of pointing/target-selection tasks. This means that the law essentially addresses only one type of movement and is not suitable for general motor control tasks. The steering law forms an acceptable model for trajectory tasks, but we argue that it is not an adequate model for quantifying the complexity of trajectory-based tasks. This can be understood simply by considering the example of the augmented version of the buzz-wire game in Fig 1.4. The number and angle of bends clearly have an effect on the task difficulty, but this is not fully reflected by the path length or width parameters. It can be further elaborated with the help of an example.

Let us consider four tunnels with the same fixed length and width, as shown in Fig 3.3. The tunnel at the top is a straight line, while the other three are curvilinear tunnels with varying bends. Based on the formulation of the steering law, the difficulty levels of all four tasks are equal. However, it is intuitively clear that the task complexity increases from top to bottom. The tunnel in the second row has one wide bend, whereas that in the third row has a narrower bend. The last one has two sharper bends (i.e., bends with higher degrees of curvature) and hence demands a higher effort level from the user's hand motor system to navigate the tunnel. This demonstrates that the inclusion of a curvature parameter is crucial for investigating the difficulty level of trajectory navigation tasks.

Figure 3.3: Variation of difficulty of the steering task within tunnels (each of length L and path width w) having different curvatures

The curvature of a trajectory represents how sharply the curve bends. In 2D, it is a measure of how much a curve deviates from being a straight line. For a nearly linear path, the average absolute curvature (\kappa_{mean}) is negligible. The greater the average curvature, or sharpness, of the path, the more effort is required from the user to adapt to the changes in the trajectory (assuming there are no physical constraints), and hence the greater the information demand on the motor control pathway. The curvature at any point (x, y) of a curve is calculated as:

    \kappa = \frac{\left|\frac{d^2y}{dx^2}\right|}{\left(1 + \left(\frac{dy}{dx}\right)^2\right)^{3/2}}    (3.13)

If the derivatives of x and y with respect to time are known, the curvature can also be written as:

    \kappa = \frac{\left|\frac{d^2x}{dt^2}\frac{dy}{dt} - \frac{d^2y}{dt^2}\frac{dx}{dt}\right|}{\left(\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2\right)^{3/2}}    (3.14)

Discretizing this equation and averaging over all points on the trajectory, we obtain the average curvature:

    \kappa_{mean} = \frac{1}{T}\sum_t \frac{\left|(x_t - 2x_{t-1} + x_{t-2})(y_t - y_{t-1}) - (y_t - 2y_{t-1} + y_{t-2})(x_t - x_{t-1})\right|}{\left((x_t - x_{t-1})^2 + (y_t - y_{t-1})^2\right)^{3/2}}    (3.15)

Circles are used as canonical examples of curves: a circle's curvature is exactly the reciprocal of its radius, so the smaller the circle, the larger the curvature. For any other curve, the curvature at a point is defined as the curvature of its osculating circle, i.e., the circle that best approximates the curve near that point. The degree of curvature is often used as a measure of curvature of a circular arc and is defined as the central angle subtended by the ends of the curve.
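The average-curvature estimator of Eq. (3.15) can be sanity-checked on a sampled circle, whose true curvature is 1/r everywhere. The following NumPy implementation is an illustrative sketch, not code from the thesis.

```python
import numpy as np

def mean_curvature(x, y):
    """Discrete average absolute curvature of a sampled trajectory, following Eq. (3.15).
    x and y are 1D arrays of equal length (illustrative sketch only)."""
    ddx = x[2:] - 2 * x[1:-1] + x[:-2]     # second differences
    ddy = y[2:] - 2 * y[1:-1] + y[:-2]
    dx = x[2:] - x[1:-1]                   # first differences at the same points
    dy = y[2:] - y[1:-1]
    num = np.abs(ddx * dy - ddy * dx)
    den = (dx ** 2 + dy ** 2) ** 1.5
    return float(np.mean(num / den))

# Sanity check on a circle of radius r: the true curvature is 1/r everywhere.
r, theta = 5.0, np.linspace(0, np.pi, 2000)
kappa = mean_curvature(r * np.cos(theta), r * np.sin(theta))
print(kappa, 1 / r)   # both approximately 0.2
```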
However, all these measures are valid only for continuous, differentiable curves. In practical scenarios, the trajectories may involve discrete, non-differentiable sharp bends, as shown in Fig 3.4, at which points these measures of curvature do not apply. At such sharp bends, the degree of curvature can instead be approximated by the angle subtended by the bend.

Figure 3.4: Variation of difficulty of the steering task within discrete tunnels

The example in Fig 3.4 re-emphasizes how a change of route via bends or turns impacts the difficulty of the problem. Although both cases have the same length L and fixed path width W, the number of 90 degree bends in the bottom one is four while the top one has only one. Based on the steering law formulation, however, the index of difficulty yields a value of L/w in both cases, which is clearly not an accurate representation of the task difficulty, as the bottom task is evidently harder than the top one. Therefore, we need a metric that quantifies this factor for a better indication of trajectory difficulty. Moreover, the wider the bend angle, the smaller the amount of turning required from the user's motor system, and hence the easier the task, and vice versa. Therefore, in this case we quantify the discrete curvature factor as the sum of the reciprocals of the bend angles (\theta):

    K_{discrete} = \sum_{i=1}^{N} \frac{1}{\theta_i}    (3.16)

The larger the number of discrete bends and/or the smaller the bend angle magnitudes, the larger the value of K_{discrete}. This also ensures that the K_{discrete} incurred by several narrow angles exceeds the K_{discrete} incurred by one wide angle equal to the sum of those narrow angles. This is evident from the following inequality relating the arithmetic and harmonic means of the angles:

    \frac{1}{N}\sum_{i=1}^{N}\theta_i \geq \frac{N}{\sum_{i=1}^{N}\frac{1}{\theta_i}}    (3.17)

which implies

    \sum_{i=1}^{N}\frac{1}{\theta_i} \geq \frac{N^2}{\sum_{i=1}^{N}\theta_i} \geq \frac{1}{\sum_{i=1}^{N}\theta_i}    (3.18)

For example, K_{discrete} for two 45 degree bends is 2.55 (with angles taken in radians), whereas K_{discrete} for a single 90 degree bend is 0.64. We have thus shown that the proposed discrete curvature component incorporates both the number of bends and the bend angles, and can potentially be utilized within the index-of-difficulty component of trajectory tasks. Note that the curvature factor fails when \theta = 0; this can be tackled by adding a small constant to the denominator, following convention. However, it is still unclear how difficult a reversal of movement is with respect to a forward movement, so we ignore the case \theta = 0 in the current study; it requires further careful investigation and will be addressed in a future extension of this work.
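A direct reading of Eq. (3.16), with the numerical check quoted above (angles in radians), can be sketched as follows; the small `eps` guard for the zero-angle case is the convention mentioned in the text, not a value used in the study.

```python
import math

def k_discrete(bend_angles_deg, eps=0.0):
    """Discrete curvature factor of Eq. (3.16): the sum of reciprocals of the
    bend angles, taken in radians.  `eps` optionally guards against a zero
    (reverse-movement) angle.  Illustrative sketch only."""
    return sum(1.0 / (math.radians(a) + eps) for a in bend_angles_deg)

print(k_discrete([45, 45]))   # ~2.55 -- two narrow bends
print(k_discrete([90]))       # ~0.64 -- one wide bend with the same total turn
```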
3.3 Formulation of difficulty of Task I

In the previous section, we explained the impact of trajectory curvature on the difficulty level of trajectory tasks, which is not covered by the steering law even though it is considered the extension of Fitts' law to trajectory tasks. In order to further demonstrate the contribution of the curvature factor with the help of some toy examples, we design two different task spaces: (i) a uniform, continuous 2D square-shaped space and (ii) a quantal 2D square-shaped space divided into discrete partitioned regions.

In this section, we formulate the index of difficulty for the first task, which will later be used to compute the index of difficulty without the perceptual constraint in Chapter 5. The following toy example sets the background for further analysis and shows how the proposed difficulty metric represents the influence of the movement parameters on the tasks.

3.3.1 Task Space

Let us consider a 2D space with five finite-width circles (of width W) marked 1 to 5, as shown in Fig 3.5. For simplicity of calculation, we consider the quadrilateral to be a square of side a and assume the circles 1-4 are situated at a distance c from the sides. Circle 5 is located at the center of the square. Also assume that each of the distances 1-2, 2-3, 3-4, 4-1 is equal to L and has a permissible path width W; therefore L = a − 2c. Further, let us assume W ≈ 0.1a, i.e., the square side length is 10 times the path width. The task is to navigate from one circle to another following a specified route within the tunnels. Let the desired path be along the tunnel joining the circles, as shown in Fig 3.5. So, in order to follow path 1-2-3, one has to follow route II as shown in Fig 3.6 with a fixed path tolerance W.

Figure 3.5: Visualization of Task space I of square shape with side length a

3.3.2 Contributing factors

We observe that in a 2D plane with finite-sized targets and constant-width tunnels, the difficulty of the movement task depends on the length of the tunnels (or the sum of the distances between neighbouring points in the path), the width of the tunnel, and the bends of the path at certain points. Therefore, the index of difficulty of the task increases with increasing path length and number of bends, whereas it decreases with increasing target width and bend angle magnitude. The angle condition implies that the easiest task is the one with an angle of 180 degrees, i.e., a straight-line path.

3.3.3 Formulation and Evaluation

Extending the steering law formula, we define our index of difficulty for the given task as:

    ID = \alpha \frac{l}{w} + \beta \log_2\left(\sum_i \frac{\pi}{\theta_i}\right)    (3.19)

where l is the total length travelled, w is the constant tunnel width, i indexes the bends, and \theta is the bend angle magnitude. \alpha and \beta are the scaling coefficients of the two factors. For simplicity of notation, we drop the base of the logarithm henceforth.

Now, we evaluate the proposed ID metric with a few examples, as shown in Fig 3.6, to gain some insight into how the metric captures the variation of task complexity. The length-tunnel-width component and the curvature component are denoted in blue and green respectively to facilitate easy interpretation. Here, we consider the simplest version of the equation, with both scaling factors set to 1. However, empirical evaluation with a rigorous user study is essential to determine the relative scaling of the factors based on their individual contribution, or importance in terms of bits, in the final ID formulation.

Figure 3.6: Visualization of six different tasks (Cases I-VI) in Task space I

We compare the difficulty levels of the tasks below. A detailed description and derivation of the ID values for each of the following cases can be found in Appendix A.

• Case I: ID = \frac{L}{W} = \frac{a-2c}{W} = X (say)
• Case II: ID = \frac{2L}{W} + \log\left(\frac{\pi}{\pi/2}\right) = 1 + \frac{2L}{W} = 1 + 2\frac{a-2c}{W} = 1 + 2X
• Case III: ID = \frac{3L}{W} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/2}\right] = 2 + \frac{3L}{W} = 2 + 3\frac{a-2c}{W} = 2 + 3X
• Case IV: ID = \frac{4L}{W} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/2} + \frac{\pi}{\pi/2}\right] = 2.58 + \frac{4L}{W} = 2.58 + 4\frac{a-2c}{W} = 2.58 + 4X
• Case V: ID = \frac{2.71L}{W} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/4}\right] = 2.58 + \frac{2.71L}{W} = 2.58 + 2.71\frac{a-2c}{W} = 2.58 + 2.71X
• Case VI: ID = \frac{3.42L}{W} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/4} + \frac{\pi}{\pi/2}\right] = 3 + \frac{3.42L}{W} = 3 + 3.42\frac{a-2c}{W} = 3 + 3.42X
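As a quick numerical check of Eq. (3.19), the sketch below evaluates a few of the cases above using the a = 4 cm, W = 0.4 cm, L = 3.2 cm setting that Table 3.1 will later use; both scaling coefficients are left at 1, as in the text. This is an illustrative reimplementation, not code from the thesis.

```python
import math

def id_task1(path_length, tunnel_width, bend_angles_rad, alpha=1.0, beta=1.0):
    """Index of difficulty of Eq. (3.19): alpha * l / w + beta * log2(sum_i pi / theta_i).
    An empty bend list contributes no curvature term (straight path)."""
    steering_term = alpha * path_length / tunnel_width
    curvature_term = (beta * math.log2(sum(math.pi / t for t in bend_angles_rad))
                      if bend_angles_rad else 0.0)
    return steering_term + curvature_term

a, W, L = 4.0, 0.4, 3.2                                    # X = L / W = 8 bits
print(id_task1(L, W, []))                                  # Case I : X = 8 bits
print(id_task1(2 * L, W, [math.pi / 2]))                   # Case II: 1 + 2X = 17 bits
print(id_task1(2.71 * L, W, [math.pi / 2, math.pi / 4]))   # Case V : 2.58 + 2.71X ~ 24.3 bits
```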
3.4 Index of difficulty with a 2D target

Fitts' experiments validated human performance in horizontal motion towards and away from a target, where the movement amplitude and the target width were measured along the same axis. This implies that the model is purely one-dimensional. But how does the index of difficulty formulation change in a 2D target acquisition task? It was found that the original 1D formulation results in unrealistically low ratings for rectangular targets. To accommodate the 2D nature of such tasks, another extension of Fitts' law was proposed by MacKenzie and Buxton [109]. The 2D version takes into consideration the roles of both the width and the height of the given target. It provides ways to substitute the original target width with four alternative quantities related to the target width and height: (i) the sum of width and height (W + H), (ii) the area (W × H), (iii) the smaller of height and width (min(W, H)), and (iv) the projected width along the line of approach. Based on user studies, all these substitutions were found to be acceptable and statistically significant.

The aforementioned 'area' metric appears particularly intuitive because of the natural extension of 1D parameters such as length or width to area in 2D space. However, the area was defined as the product of height and width because only rectangular target shapes were considered in [109]. For application to irregular shapes, the target area parameter needs to be modified accordingly to better account for the geometry of the target. Therefore, in our study we simply use the actual area of the target object, computed via integration. Another way of approaching the same problem might be to approximate the geometrical shape with a rectangle whose width and height lie along and perpendicular to the line of approach and then use the W × H formula, as illustrated in Fig 3.7. But this leads to additional computational overhead, and hence we restrict our index to the computed area of the 2D region for simplicity. It should be noted that empirical work is needed to determine more precisely how the target shape impacts the index of difficulty.

Figure 3.7: Alternative ways of area computation for a 2D target shape. (a) shows a pure 1D point-to-point movement. (b) shows the target point as the centroid of the target structure, where the target boundaries dictate the tolerance; in the present study, we simply compute the target area as the measure of tolerance. (c) shows the computation of the equivalent width along the line of motion and the equivalent height perpendicular to the line of motion. (d) shows the equivalent rectangular area parameter of the transformed geometry

3.5 Formulation of difficulty of Task II

In the previous task, we considered a continuous, uniform 2D space with no partitions. Once we partition the continuous 2D task space into discrete categories, however, the fundamental movement characteristics of the user, in the context of our target-reaching task, change. This notion is particularly useful for exploring the quantal effects in speech or any other categorized task space. In this section, therefore, we define a categorized version of the previous task space and propose an index of difficulty applicable to this new task space.

Figure 3.8: Visualization of Task space II of square shape with side length a divided into 5 parts

3.5.1 Task Space

Let us consider the previous task in a 2D space (with five circular targets marked 1 to 5), but this time with the points contained within partitioned 2D regions, as shown in Fig 3.8. These regions define the new width of the circular targets as well as modify the width of the tunnels joining them. Let the sides of the square region be of length a, and assume the side length of the inner square surrounding the central target is a/3. The partitions between the regions bisect the inner and outer square boundaries perpendicularly. All other measurements remain the same as in the previous task. In this new task, instead of having to reach the circular targets 1-5, the user needs to reach the target 2D regions that contain them. As a result, the circular target diameter no longer matters.

3.5.2 Contributing factors

In the partitioned task space, the difficulty of the movement task of navigating from a starting point to a target region through a given route depends on: (i) the length of the trajectory joining the start and end points via the intermediate points, (ii) the width of the tunnel defined by the parallel boundaries along the path, (iii) the curvature or bends of the trajectory, and (iv) the 2D version of the width of the target region. Just as in the initial case, the movement difficulty increases with increasing trajectory length, decreasing tunnel width, increasing number of bends, and decreasing bend angle. Additionally, the new index of difficulty should decrease with increasing effective 2D width of the target region.
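Since the target "width" in factor (iv) is taken as the computed area of the region, an irregular polygonal region can be handled with the shoelace formula, as in the sketch below. This is illustrative only: the L-shaped vertex list is an assumed reconstruction of a region like regions 1-4 of Fig 3.8 (whose area works out to 2a²/9), not geometry taken from the thesis, which computes the area by integration.

```python
import numpy as np

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon given as an (N, 2) array of
    ordered vertices.  Used only to illustrate how the 2D target tolerance (the 'area'
    term) could be computed for an irregular region."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Assumed L-shaped region of area 2 a^2 / 9, e.g. with a = 3:
a = 3.0
L_shape = np.array([[0, 0], [a/2, 0], [a/2, a/3], [a/3, a/3], [a/3, a/2], [0, a/2]])
print(polygon_area(L_shape), 2 * a**2 / 9)   # both 2.0
```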
With this background, let us formulate the modified index of difficulty and compare the difficulty of movement tasks in the quantal space with respect to the continuous one.

3.5.3 Formulation and Evaluation

Modifying our previous formulation, we define the index of difficulty for the modified task as:

    ID = \alpha\left(\sum_k \frac{\Delta s}{w_k}\right) + \beta \log_2\left(\frac{1}{A}\right) + \gamma \log_2\left(\sum_i \frac{\pi}{\theta_i}\right)    (3.20)

where \sum_k \Delta s = l is the total length travelled, w_k is the variable tunnel width at the k-th location, A is the area of the target region, i indexes the bends, and \theta is the bend angle. k = 0 corresponds to the starting point and k = N to the terminal point; \Delta s is the sampling interval of k. Let a ≈ 10w, where w was the tunnel width in the previous case. Here, for simplicity, we consider all the weighting factors or scaling coefficients (\alpha, \beta, \gamma) to be 1. However, further empirical evaluation with a rigorous user study is essential to determine the relative scaling of the factors based on their individual contribution, or importance in terms of bits, in the final ID formulation.

The area of each of the regions 1, 2, 3 and 4 shown in Figures 3.9, 3.10 and 3.11 can be computed as:

    \left(\frac{a}{2}\right)^2 - \left(\frac{a}{6}\right)^2 = \frac{2a^2}{9} = 0.22a^2    (3.21)

or alternatively as

    \left(\frac{a}{3}\times\frac{a}{6}\right) + \left(\frac{a}{3}\times\frac{a}{2}\right) = \frac{2a^2}{9} = 0.22a^2    (3.22)

The area of region 5, on the other hand, is \frac{a^2}{9} = 0.11a^2.

The length and varying-width component of each branch 1-2, 2-3, 3-4, 4-5 is calculated as:

    \sum_{k=1}^{N}\frac{\Delta s}{w_k} = \Delta s\sum_{k=1}^{N}\frac{1}{w_k} = \frac{L}{N}\sum_{k=1}^{N}\frac{1}{w_k} \approx \frac{L}{N}\left(\frac{N-2}{a/3} + \frac{2}{0.47a}\right) = L\left(\left(1-\frac{2}{N}\right)\frac{3}{a} + \frac{2}{N}\cdot\frac{1}{0.47a}\right)    (3.23)

where N is the number of samples. As N → ∞, 2/N → 0 and \sum_{k=1}^{N}\frac{\Delta s}{w_k} → \frac{l}{a/3} = \frac{3L}{a}. For example, with N = 20 the expression takes the value:

    \sum_{k=1}^{N}\frac{\Delta s}{w_k} = L\left(0.9\cdot\frac{3}{a} + 0.1\cdot\frac{1}{0.47a}\right) = \frac{2.91L}{a} \approx \frac{3L}{a}    (3.24)

Now we evaluate the proposed ID metric in the new quantal space with a few examples, as shown in Figures 3.9, 3.10 and 3.11. In each of the figures, part (a) indicates the path trajectory (orange line) and the target area (yellow region), whereas part (b) shows the tunnel widths (multi-colour grids). Following the previous convention, the length-width component of the tunnel and the curvature component in the following equations are denoted in blue and green respectively; the additional component, the 2D target width (i.e., the area component), is denoted in red. Writing the length-width term 3L/a as A and \log(0.22a^2) as B, so that Y = A − B, the difficulty levels of the tasks are analyzed below. A detailed description and derivation of the ID values for each of the following cases can be found in Appendix A.

Figure 3.9: Visualization of the first two tasks (Cases I-II) in Task space II. Here I(a) and II(a) represent the paths and the targets for the two cases, whereas I(b) and II(b) show the gridlines representing the tunnel width

Figure 3.10: Visualization of the next two tasks (Cases III-IV) in Task space II. Here III(a) and IV(a) represent the paths and the targets for the two cases, whereas III(b) and IV(b) show the gridlines representing the tunnel width

• Case I: ID = \frac{3L}{a} - \log(0.22a^2) = A - B = Y (\ll X)
• Case II: ID = \frac{2\times 3L}{a} + \log\left(\frac{\pi}{\pi/2}\right) - \log(0.22a^2) = \frac{6L}{a} - \log(0.11a^2) = 1 + 2Y + B
• Case III: ID = \frac{3\times 3L}{a} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/2}\right] - \log(0.22a^2) = \frac{9L}{a} - \log(0.06a^2) = 2 + 3Y + 2B
• Case IV: ID = \frac{4\times 3L}{a} + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/2} + \frac{\pi}{\pi/2}\right] - \log(0.22a^2) = \frac{12L}{a} - \log(0.04a^2) = 2.58 + 4Y + 3B
• Case V: ID = \left(\frac{2\times 3L}{a} + \frac{1.27L}{a}\right) + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/4}\right] - \log(0.11a^2) = \frac{7.27L}{a} - \log(0.02a^2) = 3.58 + 2.42Y + 1.42B
• Case VI: ID = \left(\frac{2\times 3L}{a} + \frac{2\times 1.27L}{a}\right) + \log\left[\frac{\pi}{\pi/2} + \frac{\pi}{\pi/4} + \frac{\pi}{\pi/2}\right] - \log(0.22a^2) = \frac{8.54L}{a} - \log(0.03a^2) = 3 + 2.85Y + 1.85B

Figure 3.11: Visualization of the last two tasks (Cases V-VI) in Task space II. Here V(a) and VI(a) represent the paths and the targets for the two cases, whereas V(b) and VI(b) show the gridlines representing the tunnel width

All these cases show how our proposed metric reflects the changes in the movement parameters that are crucial for an information-theoretic understanding of a trajectory movement task. Simultaneously, a quick glance at Table 3.1 demonstrates that the categorization of the task space essentially reduces the task complexity, thereby providing quantitative evidence for the plausibility of speech-related complex motor control paradigms utilizing the quantal nature of speech to reduce the speech task complexity.

Table 3.1: Effect of the categorical constraint on the index of difficulty of the movement task (a = 4 cm, W = 0.4 cm, L = 3.2 cm)

Examples        ID-I (bits)   ID-II (bits)   Reduction factor
1-2             8.0           0.6            13.3
1-5-2           12.4          1.2            10.3
1-2-3           17.0          4.0            4.3
1-2-3-4         26.0          7.4            3.5
1-2-3-5         24.3          6.6            3.7
1-2-5-3         22.0          5.2            4.2
1-2-3-4-1       34.6          10.4           3.3
1-2-3-4-5       32.7          10.4           3.1
1-2-3-5-4       30.4          8.0            3.8
1-2-3-4-5-1     38.7          10.7           3.6
1-2-3-5-4-1     39.0          11.0           3.5
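The sketch below reproduces one row of Table 3.1 (the 1-2-3 example) by evaluating Eq. (3.20) for the quantal space and Eq. (3.19) for the continuous one; the geometry values follow the table caption, and the code is an illustrative reimplementation rather than the thesis implementation.

```python
import math

def id_task2(path_over_width, bend_angles_rad, target_area,
             alpha=1.0, beta=1.0, gamma=1.0):
    """Index of difficulty of Eq. (3.20):
       alpha * sum_k(ds / w_k) + beta * log2(1 / A) + gamma * log2(sum_i pi / theta_i).
    `path_over_width` is the already-summed length/width term (e.g. 3L/a per branch)."""
    curvature = (gamma * math.log2(sum(math.pi / t for t in bend_angles_rad))
                 if bend_angles_rad else 0.0)
    return alpha * path_over_width + beta * math.log2(1.0 / target_area) + curvature

# Example 1-2-3 with a = 4 cm, W = 0.4 cm, L = 3.2 cm: one 90-degree bend, target area 0.22 a^2.
a, L, W = 4.0, 3.2, 0.4
id2 = id_task2(2 * 3 * L / a, [math.pi / 2], 0.22 * a * a)   # ~4.0 bits (ID-II)
id1 = 2 * L / W + math.log2(2)                               # ~17.0 bits (ID-I, Eq. 3.19)
print(id2, id1, id1 / id2)                                   # reduction factor ~4.3
```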
3.6 Discussions and Limitations

Formulating indices of difficulty for a trajectory task is a challenging problem. The indices have to be carefully designed in order to capture the complexity of the task space with appropriate weights on the different complexity components. In this chapter, we have derived the indices intuitively, incorporating some basic mathematical tricks. But our formulation also includes some basic assumptions and hypotheses that need to be rigorously verified by running more user studies. Different possible formulations of the indices of difficulty can then be computed and correlated with the movement duration to arrive at the one with the best correlation and acceptable statistical significance. In what follows, we discuss some of the factors in our proposed metric that need to be further addressed via this method of validation.

Firstly, we used the initial Fitts formulation as the starting point of our investigation. However, it remains to be verified whether the Welford formulation, Scott MacKenzie's formulation, or the power formulation is a better baseline from which to derive the metrics.

Secondly, we used the component \log_2\left(\sum_i \frac{\pi}{\theta_i}\right) to represent the effect of changes in angles or bends in the trajectory on the task difficulty. We used \pi as the scaling factor and avoided trigonometric functions to keep the formulation straightforward. The formulation is reasonable because when the angle is 180 degrees (i.e., for straight-line paths), we get \log_2\frac{\pi}{\pi} = 0, which adds no new contribution to the index of difficulty. However, one could investigate better angle or curvature components.

Thirdly, we devised the angular metric based on the assumption that the sharper the angle, the more difficult the task. This holds in a number of motor control as well as other physical scenarios. One way to intuitively understand the rationale behind this assumption is vehicular movement, where the angle of steer at road corners dictates the difficulty of the driving task to some extent. However, the procedure for evaluating the contribution of the angle component often depends on the motor control paradigm. For example, under independent or disentangled control, a 90 degree transition may appear easier than a 60 degree transition. On the contrary, in joint-control scenarios, biomechanical constraints such as the organization of joints may act as the primary factor in determining the correct formulation of the angles; in some scenarios, a 30 degree angular transition indeed appears easier than a perpendicular transition. In Fitts' task space, however, the index of difficulty is defined independently of the user's motor control. This issue therefore needs to be addressed thoroughly in future work by developing a controller-independent, robust metric and by finding better manifold spaces.

Fourthly, one particular caveat of the proposed angle component is that it cannot be applied to immediately reversed movements, i.e., where \theta = 0, in which case the index of difficulty becomes infinite. Therefore, in the current study we exclude such cases from further investigation. This can be solved by adding a small fixed component to the denominator or by deriving a better angle component altogether.

Fifthly, our task space has two fundamental differences from the original experiment defined by Fitts. The first difference is that Fitts' index was primarily defined for repetitive tasks; this condition ensures that the average movement time is free from outliers. For this experiment, however, since we train the user carefully and allow him to practice by trial and error before running the actual study, we assume that the user data is reasonably free from such outliers and hence perform a one-time trajectory movement for each given sequence instead of repeated trajectory movements. We also follow this protocol because speech is not a reciprocal task, and Fitts' law is known to be valid even for non-reciprocal tasks. The second difference is that, in Fitts' original tapping experiment, the users were allowed to lift off the plane between the initial and final points. In this study we assume that the movement stays on a 2D plane for all control paradigms and that this does not drastically alter the basic experimental criteria.

Lastly, as previously mentioned, we set the scaling coefficient of each factor in the index of difficulty formula to 1 for simplicity. However, empirical research is required to conclusively determine the individual coefficients.

3.7 Chapter summary and Contribution

In this chapter, we have identified a major lacuna in the area of information-theoretic explanations of motor control and human-computer interfaces, and accordingly presented plausible ways of computing indices of difficulty in different scenarios of 2D trajectory tasks. We started by reviewing past work on movement difficulty metrics and emphasized the necessity of finding better indices for quantifying the difficulty of 2D trajectory tasks in Section 3.1. Sections 3.2 and 3.4 identified and introduced new components to be included within the 2D task difficulty measure, with intuitive examples.
We then defined two different task spaces, having continuous and quantal regions, relevant to our current problem in Subsections 3.3.1 and 3.5.1 respectively. We also enumerated the factors essential for formulating appropriate difficulty metrics in such tasks, including the path length and width, the trajectory curvature, and the 2D target width, in Subsections 3.3.2 and 3.5.2. Based on these criteria, we formulated and evaluated two novel index-of-difficulty metrics suitable for the two task spaces in the context of our problem in Subsections 3.3.3 and 3.5.3. These two parts contain the most significant contributions of the chapter: we proposed two toy problems fundamentally similar to the problem of movement difficulty computation in the formant space and analytically computed the different constituent elements of the difficulty metrics in a number of representative cases. In Section 3.6, we discussed the implications and limitations of the computed indices of difficulty and suggested means of reaching better indices that would represent the task complexity more closely.

We pointed out the contribution of each component of the proposed difficulty metrics. We also demonstrated how variation in the movement parameters correspondingly alters the information demand of the motor control pathway in bits, and examined how a change of trajectory path can modify the information demand of the task despite having the same start and end points. Last but not least, we systematically showed why a categorized task space requires strikingly less information than the corresponding continuous task space, which is at the core of our research question. In other words, we showed that one can utilize a categorized task space, as opposed to a continuous one, to effectively reduce the complexity of the motor task. This analysis strengthens our basic assumption that speech motor control might also have learned to take advantage of categorical perception and quantal effects in a similar way.

CHAPTER 4

Perceptually-aware Gesture-to-Formant Mapping

One of the most important components of this thesis is the development of a perceptual mapping from speech-like, multi-degree-of-freedom movement of the hand to a controllable speech formant space, so that the user can leverage the mapping to reduce the difficulty level of the control task, given some constraints. This chapter deals with this hand-to-formant mapping with perceptual categorization. Our first task is to project the user's time-varying hand trajectories onto the target formant trajectories. To this end, we use a coupled Graph CNN and LSTM mapping that performs a non-linear, regression-based mapping from the hand kinematics to the formant frequency space. We also know, from the quantal nature of speech, that the formant-based perception space is highly non-uniform, or non-linear, with unstable boundaries partitioning stable, plateau-like regions surrounding the cardinal vowels. This implies that our formant space needs to be partitioned into a number of stable areas, each representing a cardinal vowel. This perceptual categorization can be implemented with a multi-class classification model and needs to be coupled to the mapping network.
With this coupled network, we can achieve a spatial classification of the formant space based on the temporal segmentation of the time-varying hand trajectories projected onto the formant frequency space.

In this chapter, we begin by formulating the problem at hand in Section 4.1, which briefly discusses the task space and the experimental considerations. We then justify the choice of models for tackling the challenges related to the hand-to-formant mapping in Section 4.2, and provide an introduction to deep learning together with a detailed description of the mechanisms of the chosen neural network models in Section 4.3. After that, in Section 4.4, we illustrate the application of these models in the context of the current problem, along with the introduction of a novel loss function regularization trick. Finally, we explain the expected contribution of each component of the mapping network with intuitive examples and conclude the chapter with a brief summary.

4.1 Problem Formulation

Our task space is the acoustic vowel space consisting of the first two formant frequencies of N canonical vowels \{(f_{1i}, f_{2i}) \mid \forall i \in \{1, 2, ..., N\}\} as landmarks. For a given target vowel sequence S = \{\cup_i s_i \mid \forall i \in \{1, 2, ..., K\}, K \leq N\}, the user tries to follow a given trajectory in the formant space while moving as fast as possible, by changing the hand position/gesture coordinates (x_t, y_t), t \in \{1, 2, ..., T\}, where T is the total time required to complete the trajectory in the formant space. The trajectories are formed by joining the cardinal vowel formant frequencies F: \{(f_{1i}, f_{2i}) \mid \forall i \in \{1, 2, ..., K\}\} consecutively. The user's intention is to complete the task while investing minimal effort. To do so, the user looks for available facilities and constraints in the task space that can be leveraged to reduce that effort.

In this work, motivated by the quantal theory of speech production and perception, we consider using a categorical perception constraint in the formant space. In other words, we partition the 2D formant space into areas of stable perceptual categories separated by unstable boundary lines. Implementing this, first of all, necessitates the development of a many-to-one perceptual mapping \rho : F \rightarrow S that connects the temporal formant trajectories to the perceptual vowel category sequences. As a result of this arrangement, the user, through repeated trials, will learn to perform the task with the minimal effort required to achieve the target sequence, essentially by reducing the movement time of the given task while leveraging the perceptual feedback. In other words, the user will keep minimizing the time required to complete a given vowel sequence task by varying his hand trajectories, as long as the output perceptual vowel sequence matches the target sequence.

The difficulty level of the task space can be manipulated from the experiment designer's side by changing the vowel sequence lengths and combinations. For instance, increasing the length of the vowel sequence (from /aiu/ to /aiuai/) increases the difficulty level of the task, as a result of which the user will start figuring out a suitable way to reduce the movement time while still achieving the target output. To assist the user in achieving better perceptual accuracy while substantially reducing the time, we develop an intermediate non-linear gesture-to-formant mapping between the user's input and the formant trajectories.
The central objective of this study is to validate the plausibility of the proposed paradigm, by analyzing the index of difficulty and movement time, as an information-theoretic explanation of the underlying mechanism behind articulatory movements in fast, spontaneous natural speech. More details regarding the problem space are provided in Chapter 5.

4.2 Choice of models

We observe that the current problem of analyzing and mapping the user's hand trajectory shares some underlying similarities with the pedestrian (or vehicular) trajectory analysis and prediction problem. Both problems require egocentric spatio-temporal feature extraction that attends to the neighbouring objects. In our case, the objects are the vowel centers, whereas in the case of vehicular trajectories the objects are the obstacles, pedestrians, and vehicles. Therefore, the network is not only required to process the user's trajectory (or the pedestrian/vehicular trajectory) but must also consider the influence of the proximity of neighbouring objects on the trajectory coordinates in order to plan the automated mapping or prediction. Both sets of problems organize a similar spatial graph structure on a 2D plane at any given time, with the user's current location and the neighbouring object locations as nodes. Hence, we conclude that the best choice of networks in this setting is the graph neural network, which is widely used in the context of vehicular trajectory prediction and has demonstrated its effectiveness in extracting information from spatial nodes and the edges connecting them [98, 120, 165, 187, 197, 203]. In vehicular or pedestrian trajectory prediction and analysis, the obstacles are typically both static (e.g., trees, buildings, pavements) and dynamic (e.g., pedestrians, other vehicles within range), whereas in our case the obstacles are only static, as the vowel centers have fixed locations in the vowel quadrilateral. In the current problem space, we have only one super-node (i.e., the user's coordinates), which is connected to all the cardinal vowel coordinates. Additionally, we need temporal feature extractors to model the time-varying interactions resulting from the change of the user's hand coordinates. Recurrent neural networks such as the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are known for their powerful temporal feature extraction capabilities and have achieved success in several sequential processing applications. Therefore, taking motivation from the approach commonly followed in vehicular trajectory prediction problems, we couple the spatial graph CNNs with LSTM networks, thereby enabling the model to learn a hidden representation of the temporal dimension of the super-node. This spatio-temporal Graph CNN-LSTM network will be used to achieve the desired non-linear mapping.

In order to model the categorical perception block, we need to classify the input sequences into perceptual sequence categories such as /a/-/i/-/u/ or /a/-/i/-/u/-/o/. This necessitates a temporal sequence classifier that can process the temporal data and accurately classify it into one of the possible output categories. The desired mapping should be many-to-one and alignment-free, and it should be able to ignore duplicate vowels (e.g., it should categorize "aaaaaiiiuuu" as /a/-/i/-/u/) and outliers (e.g., it should recognize "aaaaaiiiiiuoooo" as /a/-/i/-/o/ and not /a/-/i/-/u/-/o/).
Therefore, we choose LSTMs coupled with a Connectionist Temporal Classification (CTC) output, which have demonstrated the desired behaviour in applications such as handwriting recognition, continuous typing on phones, and speech recognition. Having chosen the desired model components, we now turn to describing the mechanisms of each type of network.

4.3 Related Works

In this section, we briefly introduce the reader to deep learning and then describe the mechanisms of Graph Convolutional Neural Networks, Long Short-Term Memory networks, and Connectionist Temporal Classification models, which are the building blocks of our proposed network.

4.3.1 Introduction to Deep Learning

Deep learning [60] is a subfield of machine learning that comprises algorithms based on artificial neural networks, which are structurally and functionally inspired by biological neural networks. It lets a computer model learn to perform classification and regression tasks from images, videos, signals, texts, etc., in a way somewhat close to humans. Recently, it has received a great deal of attention for a variety of tasks, including computer vision applications, medical image analysis, natural language understanding and processing, and audio recognition and synthesis. Deep learning can be supervised, unsupervised, or semi-supervised. In this work, we focus only on the supervised learning paradigm.

The adjective 'deep' derives from the use of multiple layers in the artificial neural network. Deep learning models are known for their ability to model complex non-linear relationships. A deep neural network is composed of multiple layers between the input and output representations and involves the components: neurons, synapses, weights, biases, and non-linearities. The basic unit of a neural network is a neuron. A series of neurons, or nodes, forms a layer. Each node in a layer is generally connected to all nodes in the previous and next layers through 'weights'. Neurons are responsible for receiving inputs from other nodes or external sources, weighting the inputs with the connection weights, and summing all those weighted inputs. Weights are assigned to (or learned by) the neurons based on the relative importance of the mutual connections between the various neurons. The layers hidden between the input and output layers are called hidden layers, and this is where the main computation of the neural network takes place.

The deep learning architecture is hierarchical: it learns features directly from the data without the need for manual feature extraction, via a layer-wise representation of the features, i.e., the deeper layers act on the lower-level features extracted by the previous layers. For example, in object detection applications, the first hidden layer could learn how to detect edges, the next hidden layer could learn to use combinations of edges to determine simple shapes, and eventually the last layer could learn to detect more complex geometries related to the targeted object shape. In a deep neural network, the data first flows from the input layer to the output layer without looping back; this is known as the forward pass. Training generally starts by creating layers of artificial neurons and assigning normalized random numerical values, called weights, to the connections between neurons in consecutive layers. The weights and inputs are then multiplied and propagated through the network to return an output, typically between 0 and 1.
If the network is unable to recognize the intended pattern at the output, the error is propagated back to adjust the weights and biases in order to steer the learning process towards better accuracy. One of the important factors behind the success of deep neural networks is the non-linear activation function attached to each neuron, which decides whether the neuron should be activated (i.e., fired) or not. Activations act as mathematical gates between the previous-layer neurons feeding the current neuron and the output passed on to the next-layer neurons. Non-linear activations such as the sigmoid, ReLU, and tanh functions allow the neural network to develop complex mappings between its input and output spaces. For classification tasks, the last layer generally consists of a softmax activation.

In deep neural networks, the learning process is performed by dividing the data into three different sets: the training, validation, and test datasets. The training dataset primarily allows the neural network to compute the connections, or weights, between the nodes. The validation dataset allows fine-tuning of the performance, and the test dataset is used to evaluate the ultimate accuracy and error margin of the neural network. The learning process is an optimization process, specifically the minimization of a loss function that measures the network's performance. The loss function is often composed of an error term and an additional regularization term that prevents overfitting by controlling the complexity of the architecture. For regression tasks, the mean squared error (MSE) is generally used, whereas for classification tasks cross-entropy is the typical choice of loss function. The training algorithms first compute a training direction and then minimize the selected loss function along that direction. The ultimate aim of the learning task is to find the set of optimal parameter values (weights and biases) that corresponds to the minimal value of the loss function. This is done using multidimensional optimization algorithms such as gradient descent, which decreases the loss function value in the direction of the downhill gradient by varying the weights accordingly.

4.3.2 Graph Convolutional Neural Network

Convolutional Neural Networks are known for their spatial feature extraction capabilities. However, the standard convolution operation, which is suitable for image-like regular grids, is not suitable for general graph structures, as seen in Fig 4.1. Graph CNNs [31, 41, 87, 97, 185, 196] were introduced to extend the concept of CNNs to graph structures and have received increasing attention in recent years. Convolution over graphs is defined as the weighted aggregation of the attributes of a certain target node with the attributes of its neighbouring nodes. There are two avenues of research investigating the generalization of convolutional neural networks to graph-structured data. One approach rearranges the vertices into grid forms that can be processed by the normal convolution operation. The alternative approach introduces a spectral framework and performs graph convolution in the spectral domain; this is termed spectral graph convolution. We follow the latter approach in this work, where the convolution operation acts on the adjacency matrix of the graph, similarly to images. Spectral graph convolutions were motivated by wave propagation.

Figure 4.1: 2D convolution versus graph convolution
Information propagation in spectral graph convolution can be considered similar to signal propagation along the nodes. Spectral convolutions utilize the eigendecomposition of the graph Laplacian matrix in order to implement the information propagation.

Let us construct an undirected spatio-temporal graph G = (V, E) with N points and T frames. The set of nodes is V = \{v_t^i \mid t = 1, 2, ..., T;\ i = 1, 2, ..., N\}. Let t be a time instant such that t \in T. Then G_t is defined as a set of spatial graphs representing the relative locations of the different nodes on a plane at each time instant t: G_t = (V_t, E_t), where the set of vertices is V_t = \{v_t^i \mid \forall i \in \{1, 2, ..., N\}\} and the set of edges is E_t = \{e_t^{ij} \mid \forall i, j \in \{1, 2, ..., N\}\}. The attributes of the vertices are generally designed based on the problem definition and the solution requirements. If two vertices v_t^i and v_t^j are connected, the edge e_t^{ij} = 1, otherwise 0. The strength of influence between two nodes is modeled with an adjacency weight a_t^{ij}. This weight is computed with a chosen kernel function defined using prior knowledge about the problem space. A straightforward kernel function is the L2 distance between two vertices, i.e., \|v_t^i - v_t^j\|_2. Alternative kernels include Gaussian kernels, the inverted L2 norm, etc. The weights are organized into a weighted adjacency matrix A_t.

Standard convolution on a 2D grid map is defined as:

    z^{(l+1)} = \sigma\left(\sum_{h=1}^{k}\sum_{w=1}^{k} p(z^{(l)}, h, w)\, w^{(l)}(h, w)\right)    (4.1)

where k is the kernel size, p(\cdot) is the sampling function accumulating the neighbourhood information centered around z, l is the network layer, and \sigma is the activation function.

Figure 4.2: Overview of a two-layer graph convolutional network with first-order filters

Graph convolution on arbitrarily structured grids is defined as:

    v_i^{(l+1)} = \sigma\left(\frac{1}{\Omega}\sum_{v_j^{(l)} \in B(v_i^{(l)})} p(v_i^{(l)}, v_j^{(l)})\, w(v_i^{(l)}, v_j^{(l)})\right)    (4.2)

where 1/\Omega is a normalization term, i.e., \Omega is the cardinality of the neighbour set, and B(v_i^{(l)}) is the set of neighbouring vertices of v_i, i.e., B(v_i^{(l)}) = \{v_j \mid d(v_i, v_j) \leq D\}, with d(v_i, v_j) denoting the shortest path between the two vertices v_i and v_j.

We know that, in general, every neural network layer can be mathematically represented as a non-linear function f such that:

    H^{(l+1)} = f(W^{(l)} H^{(l)} + b^{(l)})    (4.3)

where W^{(l)} and b^{(l)} denote the weight matrix and bias of the l-th layer of the neural network. In a graph structure, the neural network can be represented as a non-linear function f such that:

    H^{(l+1)} = f(H^{(l)}, A)    (4.4)

where A is the adjacency matrix, H^{(0)} = X (the input), H^{(L)} = Y (the output), and L is the number of layers. Different models differ in how f is selected and parameterized. The adjacency matrix represents the connections between the nodes.

Now, let us consider a simple layer-wise propagation rule:

    f(H^{(l)}, A) = \sigma(A H^{(l)} W^{(l)})    (4.5)

where W^{(l)} denotes the weight matrix of the l-th layer and \sigma is a non-linear activation function (e.g., ReLU, as shown in Fig 4.2).

In this formulation, multiplication with A implies the summation of the feature vectors of all neighbouring nodes except the node itself. We can add the self-loop simply by adding an identity matrix to A, i.e., considering \hat{A} = A + I. Another limitation of the model is that A is typically unnormalized, so the product of A with the other factors changes the scale of the feature vectors entirely. This can be solved by normalizing A such that the rows add up to 1, which is done by computing D^{-1}A, where D is the diagonal node degree matrix.
In practice, symmetric normalization, D^{-1/2} A D^{-1/2}, is used, which enhances the network performance. Therefore, the new propagation rule can be written as:

    f(H^{(l)}, A) = \sigma\left(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H^{(l)} W^{(l)}\right)    (4.6)

where \hat{D} is the diagonal node degree matrix of \hat{A}. Interested readers are directed to [31, 41, 87, 97, 185, 196] for more details on related work.
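The propagation rule of Eq. (4.6) is compact enough to spell out directly. The following NumPy sketch of a single graph-convolution layer is illustrative only; the toy path graph and random features are made up for the example.

```python
import numpy as np

def gcn_layer(H, A, W, activation=np.tanh):
    """One graph-convolution layer implementing Eq. (4.6):
       H_next = sigma( D_hat^{-1/2} (A + I) D_hat^{-1/2} H W ).
    H: (N, F) node features, A: (N, N) adjacency, W: (F, F') weights."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))   # D_hat^{-1/2} as a vector
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return activation(A_norm @ H @ W)

# Tiny example: 4 nodes in a path graph, 3 input features, 2 output features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
print(gcn_layer(H, A, W).shape)   # (4, 2)
```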
4.3.3 Long Short-Term Memory Network

Long Short-Term Memory (LSTM) networks are a special type of recurrent neural network (RNN) capable of capturing long-range dependencies in sequence prediction problems [60]. Basic recurrent neural networks are networks of neurons organized into successive layers and joined with directed connections. The neuron-like nodes receiving data from an external source are called input nodes and the nodes yielding the resulting output are called output nodes; all other nodes are called hidden nodes, lying in layers between the input and output layers. Each node in a given layer has a time-varying, real-valued activation and is connected to all nodes in the next layer through modifiable real-valued weights. RNNs have the ability to utilize the data received at previous instants in order to process the data at the current instant. Simply put, recurrent neural networks are neural networks with loops that allow information to persist, as shown in Fig 4.3 and Fig 4.4. On the left side of Fig 4.3, a chunk of neural network 'A' receives an input x_t at a time instant t and outputs a value h_t. The loop 'A' carries the information that is passed from one time instant to the next. These loops are nothing but multiple copies of the same neural network, which becomes apparent as we unroll the network, as shown in Fig 4.3. This chain-like structure of RNNs enables them to capture the temporal dependencies of sequences. However, they suffer from the problem of being unable to learn long-range dependencies of temporal sequences.

Figure 4.3: Unrolling recurrent neural networks

LSTM networks are explicitly designed to handle the problem of long-range dependencies. Standard RNNs have a very simple repeating module with a tanh activation, as shown in Fig 4.4. LSTMs, on the other hand, have a more complex repeating structure with five non-linear activation functions. The main speciality of the LSTM is the addition of a new state, besides the hidden state, called the cell state. It involves only minor interactions and runs through the entire chain of LSTM cells, thereby creating a passage for information to flow almost unchanged. Information can be added to or removed from the cell state via structured gates composed of sigmoid activations and a pointwise multiplication operation, as shown in Fig 4.4. There are three sigmoid layers, each of which outputs values between 0 and 1, quantifying the extent to which new information should be passed through. A sigmoid activation of 0 implies that no information is allowed to pass, while an activation of 1 implies that all information is allowed to pass.

Figure 4.4: The repeating module in RNNs and LSTMs

The first step in each LSTM cell is to decide which information to remove from the cell state. This is done in the 'forget gate'. The gate is fed the previous hidden state and the current input, based on which it outputs a number between 0 and 1 for each cell-state value. Let x_t be the input at instant t, h_{t-1} the hidden state at instant t-1, W_f the weight of the forget gate, and b_f the bias. The output of the sigmoid gate is then represented as:

    f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)    (4.7)

The second step is to decide which information to store in the cell state. This is composed of two sub-steps. The first includes a sigmoid 'input gate' that decides which values to update and a tanh gate that creates new candidate values to add to the cell state; the next sub-step combines the outputs of these two gates via pointwise multiplication to update the cell state with these values. Let the weights and biases of the sigmoid gate layer be W_i and b_i, and those of the tanh layer be W_C and b_C. Then the output of the sigmoid gate can be represented as:

    i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)    (4.8)

The output of the tanh gate can be represented as:

    \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)    (4.9)

The next step is to update the old cell state C_{t-1} to the new cell state C_t:

    C_t = C_{t-1} * f_t + i_t * \tilde{C}_t    (4.10)

The output is then calculated as a filtered version of the cell state, after passing it through another sigmoid layer with weights W_o and bias b_o:

    o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)    (4.11)

The cell state is put through a tanh activation and the new hidden state is obtained by multiplying the output of the sigmoid gate with the output of the tanh gate, i.e.,

    h_t = o_t * \tanh(C_t)    (4.12)

This is the internal mechanism of one LSTM cell. Multiple such cells are connected recurrently (known as memory blocks).

An application of LSTMs that is useful in sequential modeling tasks and has powerful temporal modeling capabilities is the sequence-to-sequence (seq2seq), or encoder-decoder, LSTM [22, 111, 132, 166]. It involves an LSTM-based encoder that reads the input sequence (one step at a time) to obtain a rich, fixed-dimensional vector representation, followed by an LSTM decoder conditioned on the input sequence. The vector representation acts as the hidden state for the LSTM decoder, which generates the target sequence. The encoder-decoder representation can thus be used to transform a variable-length input representation into an encoded fixed-length representation and then recover a variable-length target representation from the fixed hidden state, as shown in Fig 4.5. The hidden states of the encoder can be written as:

    h_t = \phi(W^{hh} h_{t-1} + W^{hx} x_t)    (4.13)

The hidden state of the decoder can be represented as:

    h_t = \phi(W^{hh} h_{t-1})    (4.14)

Figure 4.5: Encoder-decoder or seq2seq LSTMs
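The gate equations (4.7)-(4.12) translate almost line by line into code. The following NumPy sketch of a single LSTM step uses hypothetical parameter containers W and b and is meant only to mirror the equations, not to replace a library implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """Single LSTM cell step following Eqs. (4.7)-(4.12).  W and b hold the
    forget/input/candidate/output parameters acting on [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate,  Eq. (4.7)
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate,   Eq. (4.8)
    c_hat = np.tanh(W["c"] @ z + b["c"])       # candidate,    Eq. (4.9)
    c_t = f_t * c_prev + i_t * c_hat           # cell update,  Eq. (4.10)
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate,  Eq. (4.11)
    h_t = o_t * np.tanh(c_t)                   # hidden state, Eq. (4.12)
    return h_t, c_t

# Tiny example: input dimension 2, hidden dimension 3.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 5)) for k in "fico"}
b = {k: np.zeros(3) for k in "fico"}
h, c = lstm_cell(rng.normal(size=2), np.zeros(3), np.zeros(3), W, b)
print(h.shape, c.shape)   # (3,) (3,)
```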
4.3.4 Connectionist Temporal Classification

Connectionist Temporal Classification (CTC) [28, 63, 64] is the name given to a particular type of neural network output and the scoring function used to perform the classification task. It is used jointly with other neural network architectures to guide the training process, and is mostly coupled with recurrent neural networks like LSTMs in order to handle sequential applications with variable timing. Note that CTC refers to the scoring strategy and the output rather than a network, and it has to be used with a backbone network; CTC works independently of the underlying neural network architecture. In contrast to standard neural network classification, CTC does not learn decision boundaries or timing information. It is primarily used in scenarios where the alignment between the output sequence yielded by the neural network and the target labels is not known, such as cases where the number of observations exceeds the number of labels, for example where multiple time instants in a speech signal represent the same phoneme. CTC considers all duplicate phonemes to be a single element, and therefore all output sequences that differ only by such duplicates are considered equivalent to the label sequence. This characteristic of CTC makes it particularly suitable for our perceptual categorization application.

Let us consider an input sequence X = [x_1, x_2, x_3, ..., x_T] and a corresponding output sequence Y = [y_1, y_2, y_3, ..., y_N] such that N < T, and suppose our task is to find an accurate mapping between the X's and Y's. In the given problem, both X and Y can vary in length, the ratio of their lengths can vary, and the exact alignment between the corresponding elements of X and Y is unknown. The CTC algorithm provides an output distribution over all possible Y's for any given X. By accumulating the probabilities of this distribution, we can then assess the probability of a particular output. The output sequence with the highest probability is taken to be the final network output.

For a given input, our task is to train the model to maximize the probability of the correct answer. This necessitates the efficient computation of a differentiable conditional probability function p(Y \mid X), so that we can run gradient descent optimization. The CTC algorithm provides an approximate solution for finding the desired output Y^* = \arg\max_Y p(Y \mid X).

The key feature of the CTC algorithm is that it is alignment-free, so it does not need any given alignment between the input and the output. It sums over the probabilities of all possible alignments between the input and output to find the probability of an output candidate Y given an input X. The objective function for an input-output pair (X, Y) is defined as:

    p(Y \mid X) = \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)

In this expression, the product computes the probability of a single alignment, one factor per time step, and the sum marginalizes over the set of valid alignments. The model parameters are tuned by minimizing the negative log-likelihood \sum_{(X,Y)\in D} -\log p(Y \mid X). After training the model, in the inference stage we want to find the most likely output for a given input, Y^* = \arg\max_Y p(Y \mid X). This can be translated into the following problem: A^* = \arg\max_A \prod_{t=1}^{T} p_t(a_t \mid X). Therefore, choosing the most likely output symbol at each time step gives the alignment with the highest probability. This is further explained with an example in Fig 4.6. We start with an input sequence of length 100 that is fed into an LSTM network. The network yields a distribution over the outputs C1, C2, C3, C4, C5 for each input step. Next, with the output distribution, we compute the probability of different sequences. Finally, we get a distribution over outputs by marginalizing over alignments and choose the output sequence with the highest probability, as shown in the last row of the figure.

Figure 4.6: An example illustrating the working of CTC coupled with an LSTM
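In practice, the marginalization over alignments is available as a built-in loss. The sketch below uses PyTorch's nn.CTCLoss together with a greedy best-path decoder (take the most likely class per step, collapse repeats, drop blanks); the shapes and label values are invented for illustration and are not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

# Toy setup: T = 100 input steps, batch of 1, C = 10 classes (index 0 reserved for the CTC blank).
T, B, C = 100, 1, 10
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)   # stand-in for LSTM outputs
targets = torch.tensor([[3, 5, 7]])                    # e.g. a 3-symbol label sequence
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)   # marginalizes over alignments

# Greedy (best-path) decoding: most likely class per step, collapse repeats, drop blanks.
best_path = log_probs.argmax(dim=-1).squeeze(1).tolist()
decoded, prev = [], None
for s in best_path:
    if s != prev and s != 0:
        decoded.append(s)
    prev = s
print(loss.item(), decoded)
```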
4.4 Proposed model

Having described the building blocks of our model, we now turn to the proposed model itself. In our investigation, we perform the following two individual tasks:

• Regression: We formulate the task of projecting the user's trajectories onto the target formant trajectory as a non-linear regression problem.
• Classification: We formulate the task of categorizing the mapped formant trajectories into temporal segments as a multi-class classification problem.

These two machine learning models are trained separately. We first train the classifier network on synthesized formant trajectories (with additive noise) and call it the 'Perception Block'. Once the categorical perception model is trained and its weights are frozen, we separately train the gesture-to-acoustic mapping block with a mean-square-error loss function augmented by a regularization component based on the perceptual block.

4.4.1 Perception Block

Consider the auditory perception problem of determining an accurate mapping from an input formant sequence F = [f_1, f_2, ..., f_T] to a perceptual output sequence S = [s_1, s_2, ..., s_{T_1}]. It is noteworthy that the exact correspondence, or alignment, between f_t and s_t is not known. We model this with LSTM networks cascaded with a connectionist temporal classifier (CTC), as shown in Fig 4.7. The working mechanisms of the LSTM and CTC were illustrated in Section 4.3. The first and second formants are passed into the LSTM modules through a fully connected layer. The relevant temporal features extracted by the LSTM network from the formant trajectories are then fed to the CTC loss function to guide the training of the perceptual module. As already discussed, the CTC algorithm is alignment-free and allows many-to-one mapping with monotonic alignments.

Figure 4.7: Overview of the perception block

In the context of our problem, it works by accumulating the probability of all possible alignments \mathcal{C}_{F,S} between input formants and output vowel categories. For a given input, the CTC outputs a distribution over all possible perceptual output classes, which is then used to infer the most likely output vowel sequence. For a single (F, S) pair, the CTC conditional probability is therefore represented as

    p(S \mid F) = \sum_{C \in \mathcal{C}_{F,S}} \prod_{t=1}^{T} p_t(\pi_t \mid F)

where the sequence posterior probability p_t(\pi_t \mid F) is determined by the LSTM network.

4.4.2 Mapping Block

In this section we illustrate the cross-domain mapping of kinematic trajectories originating from multi-DOF hand movements to the acoustic formant space. The continuously varying distances of the user's instantaneous hand position from the fixed cardinal vowel locations organize as a spatio-temporal graph structure in the 2D formant frequency space, as seen in Fig 4.8 and Fig 4.9. Therefore, we formulate the connection between the current formant position and the cardinal vowels as a graph representation. We first construct a set of spatial graphs G_t which represent the relative locations of the user's formant trajectories at every time instant t, using the set of vertices V_t and the set of connecting edges E_t, viz. G_t = (V_t, E_t), where V_t = \{\nu_t^i \mid \forall i \in \{2, ..., N\}\} and E_t = \{\varepsilon_t^{1,j} \mid \forall j \in \{1, ..., N\}\}. The user's coordinates \nu_t^1 are connected to the cardinal vowel coordinates \nu^i through the edges \varepsilon_t^{1,j}, i.e., \varepsilon_t^{1,j} = 1.

Figure 4.8: Overview of the mapping block
4.4.2 Mapping Block

In this section we illustrate the cross-domain mapping of kinematic trajectories originating from multi-DOF hand movements to the acoustic formant space. The continuously varying distances of the user's instantaneous hand position from the fixed cardinal vowel locations organize into a spatio-temporal graph structure in the 2D formant frequency space, as seen in Fig. 4.8 and Fig. 4.9. Therefore, we formulate the connection between the current formant position and that of the cardinal vowels as a graph representation.

Figure 4.8: Overview of the mapping block

We first construct a set of spatial graphs $G_t$ which represent the relative locations of the user's formant trajectories at every time instant $t$, using the set of vertices $V_t$ and the set of connecting edges $E_t$, viz. $G_t = (V_t, E_t)$ where $V_t = \{\nu_t^i \mid \forall i \in \{2, \ldots, N\}\}$ and $E_t = \{\varepsilon_t^{i,j} \mid \forall j \in \{1, \ldots, N\}\}$. The user's coordinates $\nu_t^1$ are connected to the cardinal vowel coordinates $\nu^i$ through the edge $\varepsilon_t^{1,j}$, i.e., $\varepsilon_t^{1,j} = 1$. Further, to represent the strength of the connection between the super-node $\nu_t^1$, i.e., the user's hand coordinates, and the other nodes $\nu^i$, i.e., the vowel locations, we attach distance-based weights $\alpha_t^{1,j}$ to the edges $\varepsilon_t^{1,j}$ and construct a weighted adjacency matrix $A_t$ encoding the connection between the cardinal vowel locations and the user's trajectory. The choice of the edge kernel function $\alpha_t^{1,j}$ is crucial because it imparts prior knowledge to the network about the relationship between the user's formant and the cardinal vowel formant frequencies. The kernel function maps the attributes at $\nu_t^1$ and $\nu^i$ to the weights $\alpha_t^{1,j}$ and thereby determines the impact of the cardinal vowel nodes on the convolution operation. Motivated by the use of RBFNs in the previous work by Fels and Hinton [45, 46, 48–50], we model the non-zero edge weights using Gaussian RBF kernels $\alpha_t^{1,j} = \exp\!\left(-\frac{\lVert \nu_t^1 - \nu_t^j \rVert^2}{\sigma^2}\right)$. This matches our intuition that the auditory response to articulatory trajectories in the peripheral auditory system tends to be dominated by the closer vowels in formant space. The spatial graph convolution can then be defined as:

$$\nu_t^{i(n+1)} = \sigma\!\left(\frac{1}{M} \sum_{\nu_t^{j(n)} \in \varphi(\nu_t^{i(n)})} p\!\left(\nu_t^{i(n)}, \nu_t^{j(n)}\right) w\!\left(\nu_t^{i(n)}, \nu_t^{j(n)}\right)\right) \qquad (4.15)$$

where $M = 9$ is the cardinality of the neighbor set $\varphi(\nu_t^i)$ and $\sigma$ denotes the activation function. To help the learning process, we normalize the adjacency matrix $A_t$ as:

$$A_t = \Lambda_t^{-\frac{1}{2}} (A_t + I) \Lambda_t^{-\frac{1}{2}} \qquad (4.16)$$

following [87], where $\Lambda_t$ denotes the diagonal node degree matrix of $A_t$ and $I$ is the identity matrix.

Figure 4.9: Overview of the Proposed Model

The spatial graph convolution is then expanded in the temporal direction by constructing a new graph whose attributes are the set of attributes of $G_t$, such that $G = (V, E)$ where $V = \{\nu^i \mid \forall i \in \{2, \ldots, N\}\}$, $E = \{\varepsilon^{i,j} \mid \forall j \in \{1, \ldots, N\}\}$, and the adjacency matrix is $A = \{A_1, A_2, \ldots, A_T\}$. The output embedding of the graph can therefore be represented as:

$$\phi(V, A) = \sigma\!\left(\Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}} V W\right) \qquad (4.17)$$

The temporal dimension of the super-node in the spatio-temporal graph encoding is then passed on to an LSTM-based encoder-decoder architecture [22, 166], where the encoder learns a hidden representation of the two-dimensional output sequence of the spatio-temporal graph CNN and the decoder utilizes this encoded representation to map it to the formant trajectories. During inference, the predicted formant $f_t$ at every time step $t$ is used to obtain the decoder hidden state $h_{t+1}$ and predict the formant value $f_{t+1}$ for the next step $t + 1$.
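To make this construction concrete, the sketch below builds the RBF-weighted adjacency for one time step and applies the symmetric normalization of Equation (4.16). It assumes NumPy, a [0, 1]-normalized formant plane, and a placeholder bandwidth; the degree matrix is computed from A + I here (a common convention), so this is an illustration rather than the exact implementation used in the thesis.

```python
import numpy as np

SIGMA = 0.2  # RBF bandwidth (placeholder value, not necessarily the one used here)

def frame_adjacency(user_xy: np.ndarray, vowels_xy: np.ndarray) -> np.ndarray:
    """Weighted adjacency A_t for one time step.

    user_xy:   (2,) normalized (F1, F2) position of the user (super-node, index 0).
    vowels_xy: (9, 2) normalized positions of the cardinal vowels (nodes 1..9).
    """
    n = 1 + len(vowels_xy)
    A = np.zeros((n, n))
    # Gaussian RBF edge weights between the super-node and each cardinal vowel.
    w = np.exp(-np.sum((user_xy - vowels_xy) ** 2, axis=1) / SIGMA ** 2)
    A[0, 1:] = w
    A[1:, 0] = w
    return A

def normalize(A: np.ndarray) -> np.ndarray:
    """Symmetric normalization in the spirit of Eq. (4.16)."""
    A_hat = A + np.eye(len(A))
    deg = np.clip(A_hat.sum(axis=1), 1e-8, None)   # degrees taken from A + I here
    D_inv_sqrt = np.diag(deg ** -0.5)
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

# Stacking one normalized adjacency per time step yields the spatio-temporal input
# consumed by the graph-convolution layer, i.e. A = {A_1, A_2, ..., A_T}:
# adjacencies = np.stack([normalize(frame_adjacency(p, cardinal_vowels)) for p in trajectory])
```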
4.4.3 Loss Function Regularization

We include two loss components for training our gesture-to-acoustic mapping. The MSE loss, which measures the average displacement between the ground truth and the estimated trajectory, is used as the primary loss function and indicates how far the generated samples are from the ground truth. Minimizing this loss term alone trains the network with the sole motive of transforming the kinematic trajectories to a predefined target formant trajectory. However, we note that the resulting mapping is not robust and does not converge when the input trajectory is infused with additive noise components having a higher standard deviation. This poor regression performance propagates further and in turn drastically hampers the perceptual output that is cascaded to the regression output, in contrast to the expected performance of our coupled mapping-perception network. We address the issue by adding a regularization term to our loss function based on the CTC loss obtained from the trained perception network cascaded to the regression output. This is because a significant objective of learning the mapping is to achieve correct vowel sequence categorization, and adding this regularization constrains the network such that the output vowel trajectories stay as close to the ground-truth trajectory as possible while simultaneously maximizing the likelihood of falling into the right perceptual category. This constraint can be thought of as an internal estimate of the sensory (perceptual) information made in the speech motor control pathway that effectively assists the network in figuring out which cardinal vowels to selectively focus on for generating the particular acoustic trajectory. Jointly with the MSE loss, this regularizer thus helps the network to identify the outlier points in noisy data and leads to improved regression and classification performance. The regularized loss function is given by:

$$\mathcal{L} = \lambda \, \frac{\sum_{t=1}^{T} \lVert \hat{f}_t - f_t \rVert^2}{T} \;-\; (1 - \lambda) \, \log \sum_{C \in \mathcal{C}_{[f_1, \ldots, f_T], [s_1, \ldots, s_{T_1}]}} \prod_{t=1}^{T} p_t(\pi_t \mid f_1, \ldots, f_T) \qquad (4.18)$$

where $\lambda$ is a tunable scalar coefficient such that $0 \le \lambda \le 1$. Minimizing $\mathcal{L}$ therefore implies minimizing the MSE component as well as minimizing the negative log-probability component (the CTC loss), which is equivalent to maximizing the log-probability component. Here, $[f_1, f_2, \ldots, f_T]$ denotes the formant frequency sequence of length $T$ and $[s_1, s_2, \ldots, s_{T_1}]$ denotes the vowel sequence of length $T_1$.

4.4.4 Contribution of network components

In the previous sections, we have described the working mechanisms of the neural network blocks as well as the rationale behind the choice of these networks. However, it might be difficult to intuitively comprehend the contribution of each network in the context of our problem space. We therefore provide further illustrations of the functionalities of the different components of the proposed model, with examples that should give the reader a better intuitive understanding.

4.4.4.1 Perception Block

Let us consider the formant trajectory shown in green in Fig. 4.10. The trajectory has a formant sequence of length N = 38 (sampled at a rate of 1 in every 10 points for ease of explanation). The LSTM-CTC model takes the formant sequence as input and outputs the softmax probabilities of each formant value corresponding to each of the 9 output vowel classes. These classes represent the nine cardinal vowels. The probabilities corresponding to the classes are color-coded in Fig. 4.10. For example, suppose the probabilities of F1 (the first point in the data sequence) being one of /2/, /æ/, /E/, /e/, /i/, /u/, /o/, /O/, and /a/ are .55, .2, .1, 0, 0, 0, 0, .05, and .1 respectively. A similar trend continues up to F3. At F4, the highest probability is observed for /æ/ instead of /2/, indicating a transition, and this new trend then continues. In this way, all the outputs along with their corresponding probabilities define a probability distribution. The conditional probabilities corresponding to each element of a sequence are multiplied to derive the probability of the entire sequence. We therefore now have a probability corresponding to each possible output sequence; some of these sequences are included in the figure. Finally, the most probable output is selected from the distribution over all output sequences to find the perceptual output, which comes out to be /2/-/æ/-/i/-/u/-/a/. However, this is the case assuming that the movement velocity is more or less uniform throughout the trajectory and the major change is only in terms of curvature.
However if the velocity hasconsiderable change in acceleration around some points, the LSTM-based perception network willpick that up. For example, let’s consider the trajectories shown in Fig. 4.11 (a) and Fig. 4.11 (b).Both the trajectories have the same coordinates as that of the previous example in Fig 4.10. Theonly difference is in terms of the movement velocity. The velocities are shown with the help of thecolor coding of the text representing the formant sequences (i.e., Fi). Both the trajectories showmoderately high velocity on average. However, we find considerably low velocity around /E/ in (a)and /O/ in (b). This serves as an indication to the network that the user intends to include thatvowel within the output sequence even without changing the trajectory curvature. Alternatively,let’s assume user chooses to shift the trajectory towards /E/ or /o/ as shown in Fig. 4.12. In thesecases also, our perceptual model will yield an output of /2/-/æ/-/E/-/i/-/u/-/a/ and /2/-/æ/-/i/-/u/-/o/-/a/ respectively. These examples show how the curvature and velocity of the trajectoriescan be manipulated by the user to vary the network outputs.4.4.4.2 Mapping Block with the regularizerThe main purpose of using the mapping block is to transform the user hand trajectories to thedesired formant trajectories, which are then utilizable by the perceptual block for the categorization.The target formant trajectories are the lines joining the cardinal vowels as shown with dashed cyan63Figure 4.10: An example illustrating the function of the perceptual block64Figure 4.11: Examples illustrating how perceptual block responds to speed changes. (a) is detectedas /2/-/æ/-/E/-/i/-/u/-/a/ (b) is detected as /2/-/æ/-/i/-/u/-/o/-/a/. Both the trajectories havesimilar coordinates as the previous example shown in Fig. 4.10, but vary only in terms of velocityprofilesFigure 4.12: Examples illustrating how perceptual block responds to curvature changes. (a) is de-tected as /2/-/æ/-/E/-/i/-/u/-/a/ (b) is detected as /2/-/æ/-/i/-/u/-/o/-/a/. Both the trajectorieshave little deviations from the example shown in Fig 4.10line in Fig. 4.13. For this, the network needs to be aware of the relative location of the usercoordinates with respect to the cardinal vowels. This is achieved with the help of the graphconvolution network. The adjacency matrix has the information about the distances between theuser coordinates and all the cardinal vowels for all instances and is used to extract useful features.The nearer the user’s coordinates from a particular cardinal vowel location, the stronger is theweight or the connection between them. A closer look into the network weights will reveal thatthe stronger connections (nearer vowels) are given more importance than the weaker connections(farther vowels) at any instance. After the extraction of the spatial information, we need to utilizethe temporal information from the dynamic change of connection weights with the user hand65Figure 4.13: Examples illustrating the function of the mapping block. (a), (c), and (e) show theinitial user trajectories whereas (b), (d), and (f) show the final formant trajectories after passingthrough the mapping blockmovements. This is performed by the temporal part of the graph CNNs. Coupling it with LSTMencoder-decoder networks boosts its temporal feature extraction capabilities. After the networkis trained, the LSTM decoder yields the regressed formant trajectories. 
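Putting these pieces together, one training step of the mapping block under the regularized objective of Equation (4.18) can be sketched as follows. This is a minimal illustration assuming PyTorch; the function and argument names are placeholders, the perception block is assumed to expose the CTC setup sketched earlier with its parameters frozen (requires_grad set to False), and the default value of lambda is only an example.

```python
import torch

def mapping_train_step(mapping_net, perception_net, ctc, optimizer,
                       graph_batch, target_formants, target_vowels,
                       in_lens, tgt_lens, lam=0.9):
    """One optimization step of the regularized objective in Eq. (4.18).

    mapping_net:    graph-CNN + LSTM encoder-decoder (trainable).
    perception_net: pre-trained LSTM-CTC perception block with frozen parameters,
                    so only the mapping network is updated.
    lam:            trade-off coefficient lambda, 0 <= lam <= 1.
    """
    optimizer.zero_grad()
    pred_formants = mapping_net(graph_batch)                    # (batch, T, 2)

    # Primary term: mean-square error against the target formant trajectory.
    mse = torch.mean((pred_formants - target_formants) ** 2)

    # Regularizer: CTC loss of the frozen perception block applied to the predicted
    # trajectory; gradients flow back only through pred_formants into mapping_net.
    log_probs = perception_net(pred_formants).permute(1, 0, 2)  # (T, batch, classes)
    ctc_term = ctc(log_probs, target_vowels, in_lens, tgt_lens)

    loss = lam * mse + (1.0 - lam) * ctc_term
    loss.backward()
    optimizer.step()
    return loss.item()
```

A decaying value of lam can be passed in as training proceeds, shifting emphasis from pure trajectory matching toward the perceptual regularizer.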
The mapping networknot only projects the user trajectories to formant trajectories, but also increases the robustnessof the model against noises and helps to smoothen out the occasional roughness (if any) in theuser trajectories. In the absence of this mapping model, the perceptual model can consider thenoisy areas and rapid curvature changes as sites of relevant information and yield wrong perceptualoutput.During the training process, the mapping block is assisted by the perceptual loss functionregularizer. For example, in cases where the user’s trajectory is halfway between two cardinal66vowels and the mapping network cannot figure out which one to project the trajectory towards(based on the Mean Square error loss), the regularizer gives an additional indication through theextra loss incurred when the trajectory is projected towards the wrong direction during trainingprocess.The input-output relationship corresponding to the mapping block is shown in Fig. 4.13 withthe help of three sets of examples. Each of the trajectories (a), (c), and (e) have same vowel sequencelength 4. The first two and the last vowels are kept fixed. Only the third vowel is changed to explainthe influence of the single intermediate vowel on user’s trajectory and the mapped trajectory. Thecases (a) and (b) represent the input and mapped trajectories corresponding to the vowel sequence/i/-/u/-/æ/-/a/. The cases (c) and (d) represent the input and mapped trajectories correspondingto the vowel sequence /i/-/u/-/e/-/a/. The cases (e) and (f) represent the input and mappedtrajectories corresponding to the vowel sequence /i/-/u/-/O/-/a/.4.5 Chapter SummaryIn this chapter, we proposed deep neural network-based approaches to map the kinematic move-ments to target acoustic trajectories in the formant frequency space and to categorize the for-mant trajectories into perceptual classes respectively. The combined perceptually-aware gesture-to-formant mapping system, therefore, connects continuous hand movements and gestures to formanttrajectories and then categorizes the trajectories based on perceived vowel sequence of variablelength. We began by formulating the current problem mathematically and then discussed the ra-tionale behind choosing the particular type of networks. We then explained the mechanisms of therelevant deep learning models with examples and presented how our proposed network architec-ture makes use of these model components. We also introduced a regularizer in the loss functionto assist the training process of the mapping network that utilizes the trained perceptual modeloutput. This implies that the perceptual model has to be trained and the network parameters haveto be frozen before starting to train the mapping network. Finally, we explained the contributionsof each of the model components in solving our problem. In the next chapter, we will describe theexperiments, the training procedure, and the performance of the proposed architecture in detail.67CHAPTER 5Experiments and ResultsThe main goal of our perceptual mapping is to reduce the complexity level of the task and toimprove user performance by increasing the throughput. In order to achieve this, we need tofirst train the proposed neural networks and freeze the network weights. In this chapter, we shalltrain our proposed mapping and categorization models and then run user studies to evaluate theeffect of the perceptual constraints in the hand-to-formant trajectory task. 
Section 5.1 describesthe generation of synthetic data, the collection of real-time data for training, the implementationdetails as well as the performance evaluation of the networks along with the baseline methods. Next,in section 5.2, the descriptions of data acquisition devices and protocols for user study have beenprovided. Section 5.3 presents an analysis of the movement task complexity in the formant spacebefore and after categorization. Thereafter, in Section 5.4, the main results regarding the increasein user throughput leveraging the proposed mapping are provided. Section 5.5 then discusses thesignificance of the results and their limitations. Finally, Section 5.6 concludes and summarizes thekey contributions of the chapter.5.1 Training ProcedureTraining a neural network necessitates the availability of abundant data. In this work, we generatesynthetic data as well as collect real-world data with pilot studies and augment it with noises andperturbations, taking inspiration from the process of training Glove Talk II [50].5.1.1 Data generation and collectionWe generate synthetic trajectory data by fitting splines (with different control points and knots) tothe lines joining different combinations of the cardinal vowels. The length of the vowel sequencesrange from K = 3 to K = 6. Total number of sample trajectories is therefore ∑K=6 NK=3 CK = 420,where N = 9. To increase the size of training datasets, we use a number of data augmentationtechniques. For this, we first incorporate additional control knots (2 to 6) and fit polynomials ofvarying degrees (2 to 6). We also add uniform noise of range [−0.2, 0.2] to F1 (i.e., first formant) andF2 (i.e., second formant) components of the trajectories in the normalized formant space at intervals68of 10 samples and interpolate it to resemble a realistic trajectory. Similarly, we also randomlyvary the coordinates of the control points within a distance of 0.2 from the actual trajectory andadditionally translate (both horizontally and vertically) the trajectories within a range of [−0.2, 0.2]in F1 and F2 axis to increase the number of plausible synthetic training samples. We also include theacceptable real-world data from pilot studies within the training dataset, add noise and apply thetransformations in a similar fashion. The trajectories are truncated before they cross the boundariesof the normalized rectangular space. Combination of all these transformations and noise injectionsaugment the data up to 105000 samples to be utilized in the next step. Fig 5.1 shows differentaugmented versions of data sample corresponding to the vowel sequence /æ/-/i/-/u/-/a/ that weutilize for training.5.1.2 Implementation detailsWe divide the data (105000 samples) into train (80%), development (10%), and test sets (10%)following the standard procedure of training neural networks. We develop our GUI using tkinterin Python and implement our model in Pytorch. We set a training batch size of 64 and train themodels for 200 epochs using Adam optimizer on NVIDIA GeForce GTX 1080 Ti GPU. We useReLU as the activation function across all the models. The initial learning rate is set to 0.005 andthen changed to 0.001 after 100 epochs. The momentum is set to 0.9. To avoid overfitting, weuse a drop-out ratio of 0.25 and Batch Normalization in every layer. The architectural parametersand hyperparameters are selected through an exhaustive grid-search based on the development set.The final perceptual network has 4 LSTM layers. 
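The networks described here are trained on the augmented dataset of Section 5.1.1. As a concrete illustration of that augmentation pipeline, the sketch below fits a spline through a chosen vowel sequence and applies the noise and translation steps quoted above. It assumes NumPy/SciPy and a [0, 1]-normalized formant plane, and it simplifies boundary truncation to clipping, so it is an illustration rather than the exact generation code.

```python
import numpy as np
from scipy.interpolate import splprep, splev

rng = np.random.default_rng(0)

def synth_trajectory(vowel_pts: np.ndarray, n_samples: int = 380) -> np.ndarray:
    """Fit a smooth spline through a sequence of cardinal-vowel (F1, F2) points."""
    k = min(3, len(vowel_pts) - 1)                       # spline degree
    tck, _ = splprep(vowel_pts.T, s=0.0, k=k)
    u = np.linspace(0.0, 1.0, n_samples)
    return np.stack(splev(u, tck), axis=1)               # (n_samples, 2)

def augment(traj: np.ndarray, noise=0.2, shift=0.2) -> np.ndarray:
    """Add interpolated uniform noise and a random translation, as in Section 5.1.1."""
    out = traj.copy()
    # Uniform noise injected every 10 samples, then linearly interpolated in between.
    idx = np.arange(0, len(traj), 10)
    bumps = rng.uniform(-noise, noise, size=(len(idx), 2))
    for d in range(2):
        out[:, d] += np.interp(np.arange(len(traj)), idx, bumps[:, d])
    # Random horizontal/vertical translation of the whole trajectory.
    out += rng.uniform(-shift, shift, size=2)
    # The thesis truncates trajectories at the boundary; clipping is a simplification.
    return np.clip(out, 0.0, 1.0)
```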
Further, we set the number of nodes of the fullyconnected input layer as well as each of the hidden LSTM layers as 256. Our best mapping modelhas a single layer of spatio-temporal graph CNN coupled with LSTM encoder-decoder architecturehaving two layers of LSTM with 256 nodes each. When training the entire model, we first pre-trainthe perceptual network on synthesized data. After pre-training, we freeze the network and starttraining the mapping network. The value of the coefficient λ is initially kept .9 and decreased by.1 after every 50 epochs. This is because we want the network to first learn to reduce the overallerror between the predicted trajectory and the target trajectory using mean square error (MSE)and then fine-tune it further with an increased emphasis on the regularization component.5.1.3 Determination of the best perceptual networkWe first train the perceptual mapping network using labeled vowel sequence data and later useit to bootstrap the perceptual output generation from formant trajectories in the later stage ofthe experiment. Our hypothesis is that the neural networks with temporal feature extractionabilities will be particularly useful in temporally segmenting the vowel sequences. From the deep69Figure 5.1: Augmented versions of a synthetic data sample corresponding to the vowel sequence/æ/-/e/-/u/-/a/. Solid lines represent the trajectories while the dashed and dotted lines representtheir augmented versions resulted due to horizontal and vertical sliding respectively. (a) shows theincorporation of one control knot (colored red), (b) shows another augmented version like (a) butwith changed location of the control knot, (c) shows the incorporation of noises on (a), (d) showsthe transformation of the trajectories due to the shifting the start and end points, (e) shows theincorporation of three control knots, and (f) shows the effect of addition of noise as well as additionalshifting of the start and end points on (e)70Table 5.1: Perceptual categorization accuracy in % (‘K’ represents length of vowel sequence, ‘L’represents the number of layers and ’N’ represents the number of nodes per layer)Cases MLP (L=6, N=128) RNN(L=3, N=256) LSTM(L=4, N=256)w/o CTC w/ CTC w/o CTC w/ CTC w/o CTC w/ CTCk=3 52.44 78.49 62.44 88.42 68.87 96.22k=4 48.90 75.90 57.90 86.47 64.55 94.38k=5 42.57 74.01 52.66 85.30 59.80 93.80k=6 40.28 69.93 50.32 83.58 56.07 93.05All 30.75 62.00 42.72 79.26 52.30 90.13learning literature, we identify and explore three intuitive, simple network architectures useful in ourscenario. The first baseline uses a simple multilayer perceptron model (MLP), with and withoutCTC objective function. Similarly, the second baseline uses standard recurrent neural networks(RNNs), with and without CTC objective function. The last model uses LSTM networks, with andwithout utilizing the CTC objective function. We consider different variants of the architectureswith a varying number of layers and nodes per layer by running a grid search and only report the bestchoices here for simplicity. Also, we notice little to no change in performance while increasing thenumber of LSTM layers after 4 and therefore fix it to 4 to avoid higher computational expense. Theresults demonstrating the performance of the proposed model and baseline models are summarizedin Table 5.1. We make two observations from the reported results. First, stacking the CTC moduleleads to strikingly improved accuracy for all three models. 
Second, regardless of the vowel sequencelengths, LSTM networks are better suited for categorizing the formant sequences and hence arechosen for solving the current problem.5.1.4 Determination of best mapping modelIn our study, we found that the Graph Networks are particularly suitable for extracting featuresfrom the input data. They can effectively capture the relative distance information of the user’sinstantaneous position with respect to the location of different cardinal vowels. We also considerusing different relevant networks including recurrent neural networks and combinations of the graphand recurrent networks to explore the network’s ability to minimize the losses. In this work, weconstrain our experiments to Graph CNN and recurrent neural network models owing to their well-known good performance in temporal modeling as well as their simplicity. Out of those networks,here we selectively present the five best baseline models (with and without regularization in training)with optimally chosen hyperparameter sets. Besides, we also report the performance of differentselected versions of the proposed network.71• Graph CNN Mapping: We report the performance of three variations of spatio-temporalGraph CNN architecture here: (1) with single-pass and RBF adjacency kernel (with regu-larized loss function), (2) with double pass and RBF adjacency kernel (with regularized lossfunction). The third mapping (3) in Table 5.2 corresponds to single-pass spatio-temporalGraph CNN using RBF adjacency kernel as (1) but without any regularization term in theloss function.• GRU Mapping: We report the performance of three variations of Gated Recurrent Unit(GRU) architecture here: (1) with single-layer and 256 nodes (with regularized loss function),(2) with double layer and 256 nodes in each layer (with regularized loss function). The thirdGRU mapping in Table 5.2 corresponds to double layer GRU as (2) but trained without anyregularization term.• Graph CNN-GRU Mapping: We report the performance of three variations of GraphCNN-GRU architecture here: (1) single pass of Spatio-temporal Graph CNN cascaded withsingle layer GRU having 256 nodes (with regularized loss function), (2) single pass of Spatio-temporal Graph CNN cascaded with double layer GRU having 256 nodes in each layer (withregularized loss function). The mapping Graph CNN-GRU (3) in Table 5.2 corresponds tothe architecture (2) but without any regularization term in loss function during training.• LSTM Mapping: We report the performance of three variations of LSTM architecture here:(1) with single-layer and 256 nodes (with regularized loss function), (2) with double layer and256 nodes in each layer(with regularized loss function). The mapping LSTM (3) in Table 5.2corresponds to double layer LSTM like (2) but without any regularization.• Graph CNN-LSTM Mapping: We report the performance of three variations of GraphCNN-GRU architecture here: (1) single pass of Spatio-temporal Graph CNN cascaded withsingle-layer LSTM having 256 nodes (with regularized loss function), (2) single pass of Spatio-temporal Graph CNN cascaded with double layer LSTM having 256 nodes in each layer(with regularized loss function). 
The Graph CNN-LSTM (3) in Table 5.2 corresponds to thearchitecture (2) but without any regularization.• Proposed Mapping: We present the results of five variations of the proposed model here:(1) single pass of spatio-temporal Graph CNN with RBF adjacency kernel cascaded to LSTMencoder-decoder architecture, where both encoder and decoder have single layer LSTMs with256 nodes (with regularized loss function), (2) same as (1) but has double passes of GraphCNN instead of one, (3) single pass of spatio-temporal Graph CNN with RBF adjacencykernel cascaded to LSTM encoder-decoder architecture, where the encoder has two layers ofLSTM with 256 nodes in each layer and decoder have single layer LSTM with 256 nodes72Table 5.2: Performance evaluation (using MSE) for the mapping networkMethods Vowel sequencesK=3 K=4 K=5 K=6 AllGraph CNN (1) 0.04 0.05 0.07 0.07 0.06Graph CNN (2) 0.10 0.11 0.14 0.15 0.12Graph CNN (3) 0.22 0.22 0.24 0.25 0.24GRU (1) 0.05 0.06 0.06 0.06 0.06GRU (2) 0.04 0.05 0.06 0.06 0.05GRU (3) 0.20 0.21 0.22 0.23 0.22Graph CNN-GRU (1) 0.04 0.05 0.05 0.05 0.05Graph CNN-GRU (2) 0.03 0.04 0.04 0.05 0.04Graph CNN-GRU (3) 0.18 0.20 0.21 0.21 0.21LSTM (1) 0.05 0.05 0.05 0.06 0.05LSTM (2) 0.04 0.05 0.05 0.05 0.05LSTM (3) 0.19 0.20 0.20 0.21 0.20Graph CNN-LSTM (1) 0.04 0.04 0.05 0.05 0.05Graph CNN-LSTM (2) 0.03 0.04 0.04 0.05 0.04Graph CNN-LSTM (3) 0.15 0.17 0.18 0.19 0.18Proposed (1) 0.03 0.04 0.04 0.05 0.04Proposed (2) 0.07 0.08 0.09 0.10 0.09Proposed (3) 0.03 0.03 0.04 0.04 0.04Proposed (4) 0.02 0.03 0.03 0.04 0.03Proposed (5) 0.14 0.16 0.17 0.18 0.17(with regularized loss function). The internal state of both the encoder layers is cascaded toform the context vector that is passed on to the decoder architecture. (4) same as (3) exceptthat the decoder has two layers of LSTMs with 256 nodes in each layer. This is our bestperforming model. (5) in Table 5.2 is the same as (4) but without any regularization term inthe loss function.5.1.5 Model performance analysisWe measure and evaluate the performance of the proposed and baseline mapping models adoptingMean Square Errors (MSE) between the target trajectory and the output trajectory in the normal-ized formant space. Table 5.2 summarizes the performance of our baseline methods and selectedversions of the proposed method as discussed in the previous section, for all the sequence lengths.The results show that our proposed mapping outperforms all the baseline architectures. In mostcases, the error increases with the increasing length of the sequence. This is possibly because of73the higher deviation of the trajectories from the target trajectory for higher vowel sequence length.The MSE of each mapping indicates the precision of the transformed trajectory. Single-pass ofspatio-temporal graph CNN is observed to work better than two passes. This is a well-known issueof going deep using graph CNN. However, the combination of graph CNN with the recurrent neuralnetworks (LSTM and GRU) is seen to boost the network performance and outperform individualnetworks. 
The performance of LSTM is marginally above GRU and the performance of the LST-M/GRU networks increase with an increase in the number of layer (from 1 to 2) but no furtherimprovement is seen on stacking more LSTM/GRU layers and hence is not included in the table.In each of the aforementioned cases, mapping performance is seen to degrade considerably withoutthe regularization term which demonstrates the importance of categorization regularizer in trainingthe regression network.With a pilot study, we evaluated the performance of our best performing model i.e., Proposed(4), on actual real-world data with the mouse, glove (2D), and glove (1D+1D) for an initial ver-ification before running the final user study. In this test phase, we used random combinations ofdifferent sequence lengths from 3 to 6 as the target task with varying speed. Utilizing the proposedmodel, we noticed that our mapping network yielded a respective average MSE of 0.04, 0.04, and0.06 while the perceptual model resulted in 97%, 96%, and 92% accuracy respectively with themouse, glove (2D) and glove (1D+1D) over 100 trials for each motor control. With the help ofa thorough sweep, we found the decision boundaries of the LSTM classification in the perceptualcategorization step. The perceptual network-driven boundaries are presented in Fig 5.2. A fewsample trajectories for vowel sequences /i/-/u/-/o/ and /E/-/æ/-/a/-/O/ are also overlaid on topof the quantal perceptual space and shown in Fig. 5.4.We ran more test simulations and pilot studies in order to observe and identify the cases inwhich the coupled network fails to achieve the correct categorization with the simulated and real-world test data. The failure cases are the ones where there are no considerable velocity or curvaturechanges around the intermediate targets. For example, let’s consider a curve joining /æ/-/E/-/e/-/i/ as shown in Fig. 5.3. This path has very little curvature change. If the trajectory is almostlinear and does not show any reduction in velocity around particular intermediate trajectory, thenthe network gets confused among the possible output candidates: /æ/-/E/-/e/-/i/ or /æ/-/E/-/i/or /æ/-/e/-/i/. A little shift of curvature towards /e/ and /E/, for example, makes it easier for thenetwork to identify the correct sequence.The fact that the network takes decisions, to some extent, based on the trajectory curvature andvelocity is beneficial in all other cases. For example, let us consider the case of /æ/-/i/-/O/. Thenetwork can correctly identify the vowel sequence despite there being different possible alternativeoutcomes like /æ/-/i/-/u/-/O/, /æ/-/i/-/u/-/o/-/O/, /æ/-/i/-/o/-/O/, /æ/-/i/-/u/-/o/-/O/, etc.This is possibly due to the above-mentioned feature of the network, i.e., understanding curvatures74Figure 5.2: The quantal formant space with perceptual network-driven decision boundariesand velocities.On the contrary to our expectation, we observed that the network performs better on real-world data rather than on simulated data. Our intuition behind such behavior is that the real-world trajectories have velocity changes at the turns as opposed to the simulated data where thereis relative uniformity in velocity profiles. Nevertheless, in order to ensure that the user studygoes smoothly without any wrong inference by the network, we determined the vowel sequencecandidates to be rejected in the user study, such as the ones in the path: /i/-/e/-/E/-/æ/, /E/-/2/-/O/, /æ/-/2/-/O/, /u/-/o/-/O/. 
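Regarding the decision-boundary sweep mentioned above (Fig. 5.2), one plausible way to obtain such a map from a sequence model is to probe it with short constant-formant sequences on a grid and record the winning class at each point. The sketch below illustrates this idea; the constant-sequence probing, grid resolution, and blank-column handling are assumptions for illustration rather than the exact protocol used in this work.

```python
import numpy as np
import torch

def sweep_boundaries(perception_net, grid_res=200, hold=20):
    """Label every point of the normalized formant plane with its perceived vowel.

    Each grid point is presented to the perception net as a short constant-formant
    sequence of length `hold`; the argmax over summed log-probabilities (ignoring
    the CTC blank, assumed to be the last column) gives the region label.
    """
    f1 = np.linspace(0.0, 1.0, grid_res)
    f2 = np.linspace(0.0, 1.0, grid_res)
    labels = np.zeros((grid_res, grid_res), dtype=int)
    with torch.no_grad():
        for i, a in enumerate(f1):
            for j, b in enumerate(f2):
                seq = torch.tensor([[a, b]] * hold, dtype=torch.float32).unsqueeze(0)
                log_probs = perception_net(seq)[0]         # (hold, num_classes + 1)
                scores = log_probs[:, :-1].sum(dim=0)      # drop the blank column
                labels[i, j] = int(scores.argmax())
    return labels  # plotting this label map reveals boundaries like those in Fig. 5.2
```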
For a final validation before running the ultimate user study, were-ran a few more sets of pilot studies with all the remaining possible vowel sequence candidatesand achieved a 100% accuracy in performance (ignoring the rejected samples). This ensured thatthe user would not get confused with the wrong feedback because of network inaccuracies.5.2 User evaluationWe performed the final study on a single male participant of age 30 years, motivated by the userevaluation in [45, 46, 48–50]. The user was a native English speaker, right-handed and had no75Figure 5.3: Illustration of a potential success and an adversarial case in the quantal formantspace. (a) represents the vowel sequence /i/-/e/-/E/ or /E/-/e/-/i/ and can be easily identified bythe network. (b) represents the vowel sequence /i/-/e/-/E/-/æ/ or /æ-/E/-/e/-/i/. With minimalchange in velocity and curvature, the network can be fooled to output wrong sequences like /i/-/e/-/æ/, /i/-/E/-/æ/ or /æ-/e/-/i/, /æ-/E/-/i/ respectively.Figure 5.4: Sample trajectories showing possible curves for vowel sequences /i/-/u/-/o/ and /E/-/æ/-/a/-/O/ overlaid on the quantal formant spaceknown motor difficulties. Though more user studies will eventually help to rigorously validate theoutcomes of the experiment, one user was enough to check if the method is viable. In other words,this one-user study sets the stage for running extensive user studies in future by acknowledging thatthis is the right way of running experiments and that the study enables us to collect our desired setof data that can strengthen our inference and validate our hypothesis. In this section, we describethe data acquisition devices and protocols involved in the data collection procedure.76Figure 5.5: Data glove based control of formant frequencies. (a) shows three selected instantsinvolving different hand gestures (flexion and extension of finger joints) in the continuous jointcontrol of formant frequencies corresponding to /i/-/a/-/u/. (b) shows the side and front views of thehand gestures at three selected instants involving continuous, independent 1D+1D control of formantfrequencies (flexion and extension of wrist, abduction and adduction of fingers) corresponding tothe same vowelsFigure 5.6: Experimental setup with the gloves, laptop and mouse. (a) and (b) show two differentviews of the mouse-based data collection. (c) and (d) show glove-based 2D and 1D+1D datacollection5.2.1 Data acquisition DevicesWe use a Logitech M100 USB Optical Mouse with medium sensitivity level for the mouse-basedtrajectory task. For the data-glove based experiment, we use CyberGlove II, manufactured byImmersion Inc., to capture continuous hand gestures. It is equipped with 18 sensors that recordkinematic information such as flexion, bend, abduction and adduction of the wrist and fingers. Itfeatures two bend sensors on each finger, four abduction sensors, plus sensors measuring thumbcrossover, palm arch, wrist flexion, and wrist abduction. It provides rotation and translationcoordinates corresponding to each of the 18 sensors. The same data-glove is used to investigate twosets of motor control employed for the same trajectory task. Some sample trajectories correspondingto glove and mouse-based control are shown in Fig. 5.5 and Fig. 5.9.77Figure 5.7: Hand gestures corresponding to glove-based 1D+1D control for making the vowelsequence /u/-/i/-/æ/-/a/. The upper row shows finger adduction and abduction while the lowerrow shows wrist flexion and extension. 
The movement starts with the vowel /u/ which has lowF1 and F2. This is characterized by wrist extension and finger adduction. The next vowel is /i/,which is another cardinal vowel with low F1 but high F2. So to reach /i/ from /u/, the fingersneed to be fully abducted while the wrist remains almost at the same angle of extension. Thisis followed by /æ/ which has higher F1 and lower F2 than /i/. This is achieved by performingwrist flexion and lesser degree of finger abduction (i.e., increased adduction). The last vowel is /a/,with higher F1 and lower F2. This is achieved by further decreasing the degree of abduction whileslightly increasing the wrist flexion. These sequential movements result in the given vowel sequencein acoustic space.Figure 5.8: Hand gestures corresponding to glove-based 2D control for making the vowel sequence/u/-/i/-/æ/-/a/785.2.2 ProcedureThe entire experiment (including practise time for the user as well as testing time before and afterutilizing perceptual constraints) takes about 6 hours in total collected over three days (one motorcontrol paradigm per day) to avoid user fatigue. We develop a Graphical User Interface (GUI) ofsize 12 cm × 18 cm using tkinter [76] with the 2D formant frequency space as our background andcardinal vowels as the landmarks on the formant space. The user is instructed to always look at thedisplay screen (i.e., GUI) for controlling the trajectories and not at his own hand movements. This isbecause our Fitts’ law formulation is based on the display screen. We start by familiarizing the userwith the experimental setup, providing him with the relevant instructions and taking his consent.The user’s forearm is kept fixed at all instants in such a way that the wrist joint can perform radialand ulnar deviation as well as flexion and extension, but there is no significant contribution fromthe elbow via extension. Utmost care is taken in order to avoid all other unintended motions. Theuser is asked to complete each given vowel sequence as fast as possible similar to Fitts’ originalstudy. After the completion of each mouse-based trajectory task, the user has to left-click themouse button to indicate that he believes he has reached the target. For the glove-based study, theuser has to press the key ’A’ on the laptop keyboard with his left hand to stop further recording. Inour pilot study, we also tried avoiding the use of left hand and replacing it with a thumb movementwhich is otherwise not used in the study. But we figured out that the thumb movement adds anextra time to the movements and hence it results in wrong timing information eventually. Howeverif the user keeps a left-hand finger on the key, he can instantaneously indicate the completion ofthe task by quickly pressing the key. As a result, it does not introduce any significant error to thetiming measurement.We keep a practise session of about 10-15 minutes for the user before starting the first set ofdata collection with each input device so that the user can familiarize himself with the process andask questions if necessary. In this period, the user realizes the amount of precision required from histrajectory to achieve the correct categorical perceptual output. This practise time was found to besufficient based on a number of pilot studies and as a result, the user did not perform any mistakesin the current study. 
However, it might be essential to include some error control strategies forlarge-scale user studies in future.For the mouse-based study, the control space was horizontal and of size 6 cm × 9 cm with acontrol display ratio of 0.5. This measurement was decided based on the extent to which a usercan comfortably drag the mouse without any contribution from the elbow. For the glove 2D study,again based on the span of comfortable finger and wrist movements, we set the horizontal dimensionof the control space to be 6 cm and the vertical dimension to be 9 cm. Similarly, for the glove1D+1D study, we keep the horizontal dimension of the control space to be 6 cm (distance between79full finger abduction and adduction positions) and the vertical dimension to be 9 cm (distancebetween wrist flexion and extension). We do not provide any physical boundaries in the user study,rather explain the control space to the user via oral instruction and physical demonstration. Thisis because, based on an initial pilot study we found that the user often tends to take advantageof the physical boundary obstruction. This is a physical constraint as opposed to the perceptualconstraint that we want to investigate through the study and will otherwise interfere with theintended experiment. The glove coordinates are appropriately scaled to match the formant space.The experimental setup is shown in Fig 5.6, while the sample hand movements for glove-basedcontrol are presented in Fig 5.7 and Fig 5.8.The perceptual feedback that the user receives every time after the completion of the task ispurely visual. He is shown the sequence of vowels visually in terms of symbols and alphabetslike “/i/-/a/-/u/” for the trajectory shown in Fig 5.5. This is because our task is a visuomotorcoordination task and so the user is already doing a visual task. The visual symbols can be thoughtto be somewhat equivalent to the perception of a potential listener, particularly at a stage whenhis neural perceptual center has already figured out the perceptual output from the speech. Sothis is a way to look at the listener’s perceptual outcomes and ensuring that the user (speaker)can control acoustics in the formant space with the aim that the listener’s understanding matcheswith his original intentions. In our study, the original intention is represented in terms of thegiven vowel sequences whereas the listener’s understanding is represented via the visual symbolicfeedback. Besides, the addition of an auditory feedback to it will further complicate the task. Thisis because it will then be difficult to segregate the contribution of different feedbacks in the user’sbehavior. So to keep the task simple, we currently allow only visual feedback in our study butit can be extended to auditory feedbacks at a later stage if necessary. The incorporation of theauditory feedback can then account for the relative timing information in the perceptual space aswell, which is out of scope for the current study.The task begins with mouse-based data acquisition (first day), followed by Glove 2D (secondday) and ends with Glove 1D+1D (third day) based data collection. It is ensured that the centralvision of the user is always directed at the GUI and not his own movements. 
We divide each set ofdata collection into three rounds:• Round 1: On initial continuous non-quantal formant space, i.e., without utilizing the cate-gorical perceptual constraints.• Round 2: On transformed quantal formant space, i.e., utilizing the categorical perceptualconstraints, but implicitly (without showing the categorical boundaries to the user).• Round 3: On transformed quantal formant space, while explicitly utilizing categorical per-80ceptual constraints, i.e., in this case the categorical boundaries are revealed to the user.After the completion of the first round (without any perceptual constraints), we give a 15min break to the participant and then start introducing him to the next set of experiments (withperceptual constraints). In this round, by trial and error, the user figures out the advantage that theperceptual mapping offers, but we do not explicitly show him the decision boundaries. So the userwould have to develop a conception about the perceptual boundaries in the formant space implicitlyin his mind through trial and error. This is the round where the user starts utilizing the mappingand leveraging the perceptual feedback. So we keep a practise session of about 10-15 minutes forthe user before actually starting the second set of data collection with each input device so thatthe user can familiarize himself with the perceptual categorization scheme and implicitly learn theperceptual boundaries. In this period, the user also learns the amount of precision required fromhis trajectory and the extent to which he can relax his efforts to achieve the correct categoricalperceptual output. After completion of the second round, we again keep a 15 minutes rest timeand introduce him to the third and last set of experiments, i.e., with explicit categorical constraint.Here we reveal the decision boundaries to the user and let him take advantage of the groupedperceptual space to perform the given tasks. After 10-15 minutes of practise time, we start thethird round of the experiment.Each round consists of 4 sections with different lengths of target vowel sequence (K = 3, 4, 5, 6).We start with the seemingly least difficulty level i.e K=3 and then increase the difficulty level tillK=6. In each set of experiments, we design 15 target samples for each K thereby collecting dataof 60 samples for each device. We keep a 2 seconds preparation time after showing the targettrajectory and a 5 seconds wait-time after the completion of every trajectory task and before givingthe next target trajectory task to the user. Each sample is chosen by joining different cardinalvowels in GUI and the participant is instructed to follow the target trajectory as fast as they can,while being as accurate as possible to match the perceptual output.5.2.3 Motor control paradigms with gloveThe first paradigm of motor control involves simultaneous joint control of both the dimensions (F1& F2). In this control, the user tries to imagine his hand as a moving tongue. Different gesturesare continuously created by bending the phalangeal, carpal and metacarpal joints to appropriatedegrees. Such joint flexion and extension result in the finger tip reaching different cardinal vowelsin the formant space as shown in Fig 5.5 and Fig 5.8.The second paradigm of motor control involves an independent control of two different dimen-sions (F1 & F2). The wrist extension and flexion are used to increase and decrease the first formant81Figure 5.9: Trajectory generation experiment with mouse and glove. 
It shows user-generatedtrajectories in formant space through mouse movements and hand gestures. Blue dots representspatial location of 9 cardinal vowels. Dashed green line indicates the target vowel trajectory (forthe network) joining 2, /u/, /i/, /ae/, /a/ and /ε/. Solid and dotted lines represent denoised user’strajectories.frequency while the finger abduction and adduction are used to increase and decrease the secondformant frequency respectively. Simultaneous control of these two dimensions brings about con-tinuous changes in both the formants in 2D formant space as demonstrated in Fig 5.5 and Fig5.7.Both types of control are intuitive from speech production or articulatory point of view andeasy to learn for a layman. The joint 2D control in general is easier in achieving continuous voweltransitions. This control is also helpful to imagine one’s hand as a tongue. However, one needs to beintroduced to the relationship between the tongue height or frontness/backness and the resultantvowels to be able to fully utilize the control properly.The 1D+1D control is easier for specific target-reaching tasks, i.e., for making peripheral discretevowel configurations like /a/, /i/, /u/ and /æ/. It is comparatively harder to achieve vowel tran-sitions as it involves independent control of two dimensions. Therefore it often becomes piecewisebroken for most part of the trajectory, in some parts of which the user focuses on abduction/adduc-tion of fingers (horizontal control) while at some other parts, the user is mostly concerned aboutflexion and abduction of wrist (vertical control) as shown in Fig 5.5 (b). However it is easy to relateto the speech production process from a layman’s perspective. For example, making /u/ impliesfinger adduction while making /i/ implies finger abduction. Similarly, making /u/ implies wrist82Figure 5.10: Visualization of the non-quantal formant space with nine vowels and paths connectingsome of themflexion while making /a/ means wrist extension.5.3 Index of difficulty calculationAs discussed in Chapters 2 and 3, the index of difficulty provides us a measure of the task complexitythat helps us to understand, how difficult is a particular movement task and how does it comparewith other related movement tasks irrespective of the user movements. In Chapter 3, we haveformulated the indices of difficulty suitable for our task spaces extending Fitts’ and Steering Law.We have also validated the proposed indices on different sets of tasks similar to our task at hand.In this section, we evaluate the indices on the quantal and non-quantal formant space following ourprevious formulation.Let’s consider a rectangular 2D space somewhat similar to Subsection 3.3.1, but with 9 finitewidth circles instead of 5 that are placed at different locations. The center coordinates of the circlesare determined by the formant values of different vowels that they represent. There are 9C2 = 45possible undirected paths connecting a pair of vowels. To avoid clutter, we only show some ofthe connections (in form of 2D tunnels) in Fig 5.10 (b). Next, we compute the difficulty of themovement task in the given space using the previously introduced formula in Equation (3.19).The proposed deep learning model partitions the vowel space into 9 quantal regions surroundingthe 9 chosen vowels. In the transformed space also, there are 9C2 = 45 possible undirected pathsconnecting each pair of vowels. 
Modifying our previous formulation, we redefine our index ofdifficulty for the modified task using Equation (3.20).We consider sequence lengths K = 3, 4, 5, and 6 in this experiment as mentioned before.Therefore, we next compute the index of difficulty of five sample movement tasks of each sequencelength from the user study to show how the difficulty level of the tasks varies based on different83Figure 5.11: Sample trajectory tasks for vowel sequence length K = 3, viz., /a/-/i/-/u/, /e/-/u/-/o/, /u/-/i/-/æ/, /O/-/æ/-/i/, and /u/-/a/-/ E/. (a), (b), and (c) show each of these tasks inRound 1, 2, and 3 respectivelymovement parameters in the vowel formant space. Fig 5.11 visualizes the sample tasks for sequencelength K = 3. We present the indices of both the task spaces, i.e., with and without utilizing theperceptual constraints, in the Table 5.3. Further, the computed indices of difficulty values for all thetrajectories for both the cases are summarized in Fig. 5.13. The examples showing the computationof difficulty metrics are provided in Appendix B. The average ratio of reduction of the index ofdifficulty of the task due to the incorporation of the proposed mapping and the partitioning of theformant space is found to be 3.6.5.4 Index of performance calculationThe index of performance is denoted in terms of the ratio of the index of difficulty of the task and thetime required by the user to complete the task. The harder the task, the lesser is the expected indexof user’s performance. A given task has a fixed index of difficulty, but the user performance mayvary from time to time and from one participant to another. The index of performance, also calledthroughput, is a measure of the human performance. Fitts defined it as the average information permovement task divided by the time per movement. Therefore, lesser the time required to completethe task, the more is the index of performance for a task with particular difficulty level. The indexof performance can vary from person to person based on one’s familiarity with the task. Besides,the index of performance increases with practise. This is presumably because, the informationcapacity of the motor control system increases with repeated training, and can account for a taskwith fixed difficulty index in lesser time. In this section, we will analyze the user’s throughputbefore and after introducing the perceptual mapping and derive insights from the user’s behaviour.Sample user trajectories corresponding to all the three rounds of experiment with all the threemodes of control are shown in Fig. 5.12.84Figure 5.12: Sample trajectories of the user corresponding to vowel sequences of length K =3. The first three rows represent data from Round 1 (tunnel task in uniform space), next threerows represent data from Round 2 (with implicit perceptual constraint) and the last three rowsrepresent data from Round 3 (with explicit perceptual constraint). The first, fourth and seventhrows represent mouse-based control. The second, fifth and eighth rows represent glove-based joint2D control. 
Finally, the third, sixth and ninth rows represent glove-based independent 1D+1Dcontrol85Table 5.3: Index of difficulty (ID) in uniform and partitioned space (in bits)Cases ID in uniform space ID in partitioned space/a/-/i/-/u/ 18.16 4.46/e/-/u/-/o/ 9.58 2.78/u/-/i/-/æ/ 13.19 3.08/O/-/æ/-/i/ 11.09 3.20/u/-/a/-/ E/ 11.21 2.94/a/-/u/-/i/-/æ/ 19.47 5.01/æ/-/O/-/u/-/o/ 14.47 6.20/i/-/u/-/e/-/a/ 25.09 7.48/E/-/a/-/i/-/æ/ 20.69 5.21/o/-/2/-/a/-/e/ 16.94 4.90/æ/-/i/-/a/-/e/-/u/ 30.65 7.66/i/-/E/-/a/-/u/-/2/ 21.32 5.32/æ/-/i/-/u/-/2/-/o/ 23.66 7.17/O/-/u/-/e/-/a/-/E/ 27.76 8.49/a/-/e/-/u/-/O/-/æ/ 24.27 6.77/æ/-/O/-/u/-/e/-/a/-/2/ 28.65 10.12/æ/-/i/-/a/-/u/-/o/-/2/ 26.59 6.97/2/-/o/-/u/-/e/-/E/-/a/ 21.96 6.07/i/-/u/-/o/-/2/-/a/-/e/ 26.52 7.72/æ/-/O/-/u/-/i/-/a/-/E/ 34.79 9.875.4.1 Without perceptual constraintsFor the mouse-based task, the user movement times range from 6.02 to 16.44 seconds. The rate ofinformation processing in this motor control paradigm is nearly constant as evident from the sampleexamples provided in Table 5.4. The mean value of IR is 2.12 bits/s and the SD is 0.20 bits/s.This can be inferred to be the information processing rate of the mouse-based human hand motorcontrol system for trajectory tasks without any categorical constraint. Regressing the movementtime on the index of difficulty results in MT = 0.45 ID + 1.31. The correlation between themovement time and the index of difficulty is 96.90%. The R-squared metric for regression analysisis found to be 92.33%.For the glove-based joint 2D control task, the user movement times range from 5.95 to 12.88seconds. The rate of information processing in this motor control paradigm is presented in Table5.5 with the help of the sample examples and is observed to be nearly constant as seen in theprevious case. The mean value of IR is 3.44 bits/s and the SD is 0.32 bits/s. This can be inferredto be the information processing rate of the glove 2D based human hand motor control system for86Figure 5.13: The mean index of difficulty for different vowel sequence lengths before (Round 1)and after (Round 2) utilizing the perceptual mappingtrajectory tasks without any categorical constraint. Regressing the movement time on the index ofdifficulty results in MT= 0.29 ID + 3.70. The correlation between the movement time and theindex of difficulty is 94.87%. The R-squared metric for regression analysis is found to be 88.64%.For the glove-based independent 1D+1D control task, the user movement times ranged from6.80 to 23.65 seconds. The rate of information processing in this motor control paradigm is seenin table 5.6 is almost constant and lower than the previous two cases. The mean value of IR is1.54 bits/s and the SD is 0.22 bits/s. This can be inferred to be the information processingrate of the Glove 1D+1D based human hand motor control system for trajectory tasks withoutany categorical constraint. Regressing the movement time on the index of difficulty results inMT= 0.65ID + 1.40. The correlation between the movement time and the index of difficulty is90.44%. 
The R-squared metric for regression analysis is found to be 84.70%.5.4.2 With perceptual constraintsIn this subsection, we report the results of Round 2 and Round 3 of the experiments with thedata-glove and the mouse.87Table 5.4: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) forMouse-based controlCases Without perceptual constraint Implicit perceptual constraintID (bits) MT (s) IR (bits/s) ID (bits) MT (s) IR (bits/s)/a/-/i/-/u/ 18.16 8.80 2.06 4.46 2.53 1.76/e/-/u/-/o/ 9.58 6.02 1.59 2.78 2.24 1.24/u/-/i/-/æ/ 13.19 6.30 2.09 3.08 2.55 1.21/O/-/æ/-/i/ 11.09 6.74 1.65 3.20 2.68 1.20/u/-/a/-/ E/ 11.21 6.07 1.85 2.94 2.61 1.13/a/-/u/-/i/-/æ/ 19.47 10.43 1.87 5.01 2.67 1.87/æ/-/O/-/u/-/o/ 14.47 6.85 2.11 6.20 3.24 1.91/i/-/u/-/e/-/a/ 25.09 11.18 2.24 7.48 3.32 2.25/E/-/a/-/i/-/æ/ 20.69 9.13 2.27 5.21 2.60 2.00/o/-/2/-/a/-/e/ 16.94 7.67 2.21 4.90 2.55 2.85/æ/-/i/-/a/-/e/-/u/ 30.65 14.82 2.07 7.66 2.98 2.57/i/-/E/-/a/-/u/-/2/ 21.32 9.68 2.20 5.32 2.75 1.93/æ/-/i/-/u/-/2/-/o/ 23.66 12.41 1.91 7.17 3.02 2.37/O/-/u/-/e/-/a/-/E/ 27.76 12.12 2.29 8.49 3.35 2.83/a/-/e/-/u/-/O/-/æ/ 24.27 10.42 2.33 6.77 2.80 2.42/æ/-/O/-/u/-/e/-/a/-/2/ 28.65 12.66 2.26 10.12 4.05 2.50/æ/-/i/-/a/-/u/-/o/-/2/ 26.59 14.23 1.87 6.97 3.35 2.08/2/-/o/-/u/-/e/-/E/-/a/ 21.96 12.03 1.83 6.07 3.25 1.87/i/-/u/-/o/-/2/-/a/-/e/ 26.52 12.15 2.18 7.72 3.65 2.11/æ/-/O/-/u/-/i/-/a/-/E/ 34.79 16.44 2.12 9.87 3.92 2.525.4.2.1 Round 2: With implicit estimationIn general, we found a striking decrease in user movement times as well as an increase in userthroughput after introducing the mapping. More details regarding this will be presented in theDiscussion section.For the mouse-based task, the user movement times ranged from 2.24 to 4.05 seconds. Therate of information processing in this motor control paradigm lies between 1.13 bits/s and 2.85bits/s as evident from the sample examples provided in Table 5.4 (Implicit perceptual constraint).Regressing the movement time on the index of difficulty results in MT = 0.20 ID + 1.75. Thecorrelation between the movement time and the index of difficulty is 91.45%. The R-squaredmetric for regression analysis is found to be 82.38%. From the slope of the equation, the meanvalue of IR is found to be 5 bits/s and the SD is 0.50 bits/s. 
This can be inferred to be theinformation processing rate of the mouse-based human hand motor control system for trajectorytasks while implicitly utilizing the categorical constraint.88Table 5.5: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) forGlove 2D controlCases Without perceptual constraint Implicit perceptual constraintID (bits) MT (s) IR (bits/s) ID (bits) MT (s) IR (bits/s)/a/-/i/-/u/ 18.16 9.02 2.01 4.46 2.55 1.75/e/-/u/-/o/ 9.58 5.95 1.61 2.78 1.80 1.54/u/-/i/-/æ/ 13.19 7.36 1.79 3.08 1.79 1.72/O/-/æ/-/i/ 11.09 6.49 1.71 3.20 1.68 1.90/u/-/a/-/ E/ 11.21 6.74 1.66 2.94 2.07 1.42/a/-/u/-/i/-/æ/ 19.47 9.30 2.09 5.01 2.62 1.91/æ/-/O/-/u/-/o/ 14.47 6.70 2.16 6.20 2.32 2.67/i/-/u/-/e/-/a/ 25.09 10.32 2.43 7.48 2.60 2.88/E/-/a/-/i/-/æ/ 20.69 10.90 1.90 5.21 2.70 1.93/o/-/2/-/a/-/e/ 16.94 8.02 2.11 4.90 2.48 1.98/æ/-/i/-/a/-/e/-/u/ 30.65 12.88 2.38 7.66 3.20 2.39/i/-/E/-/a/-/u/-/2/ 21.32 9.40 2.27 5.32 2.46 2.16/æ/-/i/-/u/-/2/-/o/ 23.66 10.52 2.25 7.17 2.59 2.79/O/-/u/-/e/-/a/-/E/ 27.76 10.80 2.57 8.49 2.84 2.98/a/-/e/-/u/-/O/-/æ/ 24.27 9.67 2.51 6.77 2.66 2.56/æ/-/O/-/u/-/e/-/a/-/2/ 28.65 12.65 2.56 10.12 3.86 2.62/æ/-/i/-/a/-/u/-/o/-/2/ 26.59 10.44 2.54 6.97 3.14 2.22/2/-/o/-/u/-/e/-/E/-/a/ 21.96 9.32 2.36 6.07 2.48 2.45/i/-/u/-/o/-/2/-/a/-/e/ 26.52 10.26 2.58 7.72 3.20 2.41/æ/-/O/-/u/-/i/-/a/-/E/ 34.79 11.72 2.97 9.87 3.64 2.71For the glove-based joint 2D control task, the user movement times range from 1.68 to 3.86seconds. The index of difficulty, as well as the rate of information processing in this motor controlparadigm, is presented in Table 5.5 with the help of the sample examples. Regressing the movementtime on the index of difficulty results in MT= 0.22 ID + 1.24. The correlation between themovement time and the index of difficulty is 92.74%. The R-squared metric for regression analysisis found to be 84.25%. The mean value of IR is 4.54 bits/s and the SD is 0.45 bits/s. This canbe inferred to be the information processing rate of the glove 2D-based human hand motor controlsystem for trajectory tasks while implicitly utilizing the categorical constraint.For the glove based independent 1D+1D control task, the user movement times range from4.23 to 9.20 seconds. The rate of information processing in this motor control paradigm is seenin table 5.6 is almost constant and lower than the previous two cases. The mean value of IR is1.72 bits/s and the SD is 0.18 bits/s. 
This can be inferred to be the information processing rate of the glove 1D+1D-based human hand motor control system for trajectory tasks while implicitly utilizing the categorical constraint. Regressing the movement time on the index of difficulty results in MT = 0.58 ID + 1.45. The correlation between the movement time and the index of difficulty is 86.78%, and the R-squared metric for the regression is 83.22%.

Table 5.6: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 1D+1D control

Cases                      Without perceptual constraint       Implicit perceptual constraint
                           ID (bits)   MT (s)   IR (bits/s)    ID (bits)   MT (s)   IR (bits/s)
/a/-/i/-/u/                18.16       14.07    1.29           4.46        6.65     0.67
/e/-/u/-/o/                9.58        8.96     1.07           2.78        5.86     0.47
/u/-/i/-/æ/                13.19       6.84     1.93           3.08        4.23     0.73
/O/-/æ/-/i/                11.09       8.88     1.25           3.20        5.81     0.55
/u/-/a/-/E/                11.21       6.80     1.64           2.94        5.97     0.49
/a/-/u/-/i/-/æ/            19.47       14.28    1.36           5.01        7.96     0.63
/æ/-/O/-/u/-/o/            14.47       11.12    1.30           6.20        8.47     0.73
/i/-/u/-/e/-/a/            25.09       13.61    1.84           7.48        9.56     0.78
/E/-/a/-/i/-/æ/            20.69       14.64    1.41           5.21        5.22     1.00
/o/-/2/-/a/-/e/            16.94       11.78    1.44           4.90        5.93     0.83
/æ/-/i/-/a/-/e/-/u/        30.65       18.29    1.68           7.66        7.54     1.02
/i/-/E/-/a/-/u/-/2/        21.32       16.40    1.30           5.32        6.56     0.81
/æ/-/i/-/u/-/2/-/o/        23.66       15.70    1.51           7.17        7.80     0.92
/O/-/u/-/e/-/a/-/E/        27.76       21.23    1.31           8.49        8.18     1.04
/a/-/e/-/u/-/O/-/æ/        24.27       14.80    1.64           6.77        7.05     0.96
/æ/-/O/-/u/-/e/-/a/-/2/    28.65       18.18    1.58           10.12       9.20     1.10
/æ/-/i/-/a/-/u/-/o/-/2/    26.59       22.56    1.18           6.97        7.40     0.94
/2/-/o/-/u/-/e/-/E/-/a/    21.96       14.20    1.55           6.07        7.33     0.83
/i/-/u/-/o/-/2/-/a/-/e/    26.52       16.84    1.57           7.72        8.37     0.92
/æ/-/O/-/u/-/i/-/a/-/E/    34.79       23.65    1.47           9.87        9.17     1.08

5.4.2.2 Round 3: With boundaries shown

We noticed a further significant decrease in user movement times as well as an increase in user throughput after showing the perceptual boundaries to the user. For the mouse-based task, the user movement times range from 1.65 to 3.25 seconds. The rate of information processing in this motor control paradigm is shown with the help of the sample examples in Table 5.7. Regressing the movement time on the index of difficulty results in MT = 0.20 ID + 3.70. The correlation between the movement time and the index of difficulty is 94.47%, and the R-squared metric for the regression is 88.48%. The mean value of IR is found to be 5 bits/s and the SD is 0.51 bits/s.
This can be inferred to be the information processing rate of the mouse-based human hand motor control system for trajectory tasks while explicitly utilizing the categorical constraint.

Table 5.7: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Mouse-based control

Cases                      Without perceptual constraint       Explicit perceptual constraint
                           ID (bits)   MT (s)   IR (bits/s)    ID (bits)   MT (s)   IR (bits/s)
/a/-/i/-/u/                18.16       8.80     2.06           4.46        1.93     2.31
/e/-/u/-/o/                9.58        6.02     1.59           2.78        1.73     1.61
/u/-/i/-/æ/                13.19       6.30     2.09           3.08        1.65     1.87
/O/-/æ/-/i/                11.09       6.74     1.65           3.20        1.89     1.69
/u/-/a/-/E/                11.21       6.07     1.85           2.94        1.90     1.55
/a/-/u/-/i/-/æ/            19.47       10.43    1.87           5.01        1.94     2.58
/æ/-/O/-/u/-/o/            14.47       6.85     2.11           6.20        2.29     2.71
/i/-/u/-/e/-/a/            25.09       11.18    2.24           7.48        2.42     3.09
/E/-/a/-/i/-/æ/            20.69       9.13     2.27           5.21        2.09     2.49
/o/-/2/-/a/-/e/            16.94       7.67     2.21           4.90        2.10     2.33
/æ/-/i/-/a/-/e/-/u/        30.65       14.82    2.07           7.66        2.42     3.17
/i/-/E/-/a/-/u/-/2/        21.32       9.68     2.20           5.32        2.21     2.41
/æ/-/i/-/u/-/2/-/o/        23.66       12.41    1.91           7.17        2.46     2.91
/O/-/u/-/e/-/a/-/E/        27.76       12.12    2.29           8.49        2.64     3.21
/a/-/e/-/u/-/O/-/æ/        24.27       10.42    2.33           6.77        2.41     2.81
/æ/-/O/-/u/-/e/-/a/-/2/    28.65       12.66    2.26           10.12       3.25     3.11
/æ/-/i/-/a/-/u/-/o/-/2/    26.59       14.23    1.87           6.97        2.73     2.55
/2/-/o/-/u/-/e/-/E/-/a/    21.96       12.03    1.83           6.07        2.68     2.26
/i/-/u/-/o/-/2/-/a/-/e/    26.52       12.15    2.18           7.72        3.05     2.53
/æ/-/O/-/u/-/i/-/a/-/E/    34.79       16.44    2.12           9.87        3.07     3.21

For the glove-based joint 2D control task, the user movement times range from 1.33 to 3.07 seconds. The rate of information processing in this motor control paradigm is presented in Table 5.8 with the help of the sample examples. Regressing the movement time on the index of difficulty results in MT = 0.18 ID + 0.87. The correlation between the movement time and the index of difficulty is 91.66%, and the R-squared metric for the regression is 82.95%. The mean value of IR is 5.55 bits/s and the SD is 0.52 bits/s. This can be inferred to be the information processing rate of the glove-based joint 2D human hand motor control system for trajectory tasks while explicitly utilizing the categorical constraint.

For the glove-based independent 1D+1D control task, the user movement times ranged from 1.98 to 4.33 seconds. The rate of information processing in this motor control paradigm is presented in Table 5.9 and is found to be lower than in the previous two cases.
Regressing the movement time on the index of difficulty results in MT = 0.25 ID + 1.41. The correlation between the movement time and the index of difficulty is 92.72%, and the R-squared metric for the regression is 84.24%. The mean value of IR is 4 bits/s and the SD is 0.40 bits/s. This can be inferred to be the information processing rate of the glove-based independent 1D+1D human hand motor control system for trajectory tasks while explicitly utilizing the categorical constraint.

Table 5.8: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 2D control

Cases                      Without perceptual constraint       Explicit perceptual constraint
                           ID (bits)   MT (s)   IR (bits/s)    ID (bits)   MT (s)   IR (bits/s)
/a/-/i/-/u/                18.16       9.02     2.01           4.46        2.02     2.21
/e/-/u/-/o/                9.58        5.95     1.61           2.78        1.55     1.79
/u/-/i/-/æ/                13.19       7.36     1.79           3.08        1.35     2.28
/O/-/æ/-/i/                11.09       6.49     1.71           3.20        1.33     2.41
/u/-/a/-/E/                11.21       6.74     1.66           2.94        1.76     1.67
/a/-/u/-/i/-/æ/            19.47       9.30     2.09           5.01        1.86     2.74
/æ/-/O/-/u/-/o/            14.47       6.70     2.16           6.20        1.81     3.43
/i/-/u/-/e/-/a/            25.09       10.32    2.43           7.48        1.96     3.82
/E/-/a/-/i/-/æ/            20.69       10.90    1.90           5.21        1.74     2.99
/o/-/2/-/a/-/e/            16.94       8.02     2.11           4.90        2.04     2.40
/æ/-/i/-/a/-/e/-/u/        30.65       12.88    2.38           7.66        2.49     3.07
/i/-/E/-/a/-/u/-/2/        21.32       9.40     2.27           5.32        2.12     2.51
/æ/-/i/-/u/-/2/-/o/        23.66       10.52    2.25           7.17        2.40     2.99
/O/-/u/-/e/-/a/-/E/        27.76       10.80    2.57           8.49        2.61     3.25
/a/-/e/-/u/-/O/-/æ/        24.27       9.67     2.51           6.77        2.06     3.29
/æ/-/O/-/u/-/e/-/a/-/2/    28.65       12.65    2.56           10.12       3.07     3.30
/æ/-/i/-/a/-/u/-/o/-/2/    26.59       10.44    2.54           6.97        2.44     2.86
/2/-/o/-/u/-/e/-/E/-/a/    21.96       9.32     2.36           6.07        2.12     2.86
/i/-/u/-/o/-/2/-/a/-/e/    26.52       10.26    2.58           7.72        2.85     2.71
/æ/-/O/-/u/-/i/-/a/-/E/    34.79       11.72    2.97           9.87        2.96     3.33

Table 5.9: Effect of perceptual constraint on index of difficulty (ID) and information rate (IR) for Glove 1D+1D control

Cases                      Without perceptual constraint       Explicit perceptual constraint
                           ID (bits)   MT (s)   IR (bits/s)    ID (bits)   MT (s)   IR (bits/s)
/a/-/i/-/u/                18.16       14.07    1.29           4.46        2.60     1.72
/e/-/u/-/o/                9.58        8.96     1.07           2.78        1.98     1.40
/u/-/i/-/æ/                13.19       6.84     1.93           3.08        2.04     1.51
/O/-/æ/-/i/                11.09       8.88     1.25           3.20        2.00     1.60
/u/-/a/-/E/                11.21       6.80     1.64           2.94        2.56     1.15
/a/-/u/-/i/-/æ/            19.47       14.28    1.36           5.01        2.85     1.76
/æ/-/O/-/u/-/o/            14.47       11.12    1.30           6.20        2.69     2.30
/i/-/u/-/e/-/a/            25.09       13.61    1.84           7.48        2.91     2.57
/E/-/a/-/i/-/æ/            20.69       14.64    1.41           5.21        2.80     1.86
/o/-/2/-/a/-/e/            16.94       11.78    1.44           4.90        2.90     1.69
/æ/-/i/-/a/-/e/-/u/        30.65       18.29    1.68           7.66        3.50     2.19
/i/-/E/-/a/-/u/-/2/        21.32       16.40    1.30           5.32        2.95     1.80
/æ/-/i/-/u/-/2/-/o/        23.66       15.70    1.51           7.17        2.88     2.49
/O/-/u/-/e/-/a/-/E/        27.76       21.23    1.31           8.49        3.24     2.62
/a/-/e/-/u/-/O/-/æ/        24.27       14.80    1.64           6.77        2.92     2.31
/æ/-/O/-/u/-/e/-/a/-/2/    28.65       18.18    1.58           10.12       4.33     2.33
/æ/-/i/-/a/-/u/-/o/-/2/    26.59       22.56    1.18           6.97        3.62     1.93
/2/-/o/-/u/-/e/-/E/-/a/    21.96       14.20    1.55           6.07        2.90     2.09
/i/-/u/-/o/-/2/-/a/-/e/    26.52       16.84    1.57           7.72        3.43     2.25
/æ/-/O/-/u/-/i/-/a/-/E/    34.79       23.65    1.47           9.87        4.00     2.47

5.5 Performance analysis and Discussion

We first start our discussion with the index of difficulty metrics. A quick glance at Fig 5.4 demonstrates that the index of difficulty increases with the increase in vowel sequence length (from K=3 to K=6), which is obvious, as it leads to an increase in the average distance travelled. Besides, the index of difficulty of the trajectory task gets drastically reduced due to the incorporation of the categorization in the formant space.
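The size of this reduction can be read directly off the ID columns of Tables 5.4-5.9. A minimal aggregation sketch, assuming the two ID columns are loaded as arrays (only a small illustrative subset of the values is shown here):

```python
# Compare mean task complexity before and after perceptual categorization.
# id_without / id_with correspond to the two ID columns of Tables 5.4-5.9;
# only an illustrative subset of rows is listed below.
import numpy as np

id_without = np.array([18.16, 9.58, 13.19, 11.09, 11.21])
id_with    = np.array([4.46, 2.78, 3.08, 3.20, 2.94])

print(f"mean ID, continuous space : {id_without.mean():.2f} bits")
print(f"mean ID, partitioned space: {id_with.mean():.2f} bits")
print(f"average fold reduction    : {id_without.mean() / id_with.mean():.2f}x")
```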
The mean index of task difficulty in the non-quantal, continuous space was found to be 21.23, whereas the mean index of task difficulty in the quantal, partitioned space was computed to be 6.05, which implies an average 3.5-fold reduction in task complexity due to the perceptual categorization. From this observation, it can be concluded that the perceptual categorization indeed makes the trajectory tasks easier.

The rate of increase of the index of difficulty was found to gradually decrease with increasing vowel sequence length for both the initial and the transformed task spaces. It is noteworthy that the rate of increase in difficulty level of the trajectory tasks with increasing vowel sequence length was lower in the transformed space than in the initial continuous formant space. This shows that the influence of the length parameter on the task complexity decreases in the quantal space. In other words, the variation in the vowel sequence length has less impact on the perceived difficulty level of the tasks in the categorized formant space. This might be a plausible justification for why the perceived difficulty level does not increase strikingly with the increase in temporal speech sequence length in our running speech.

Figure 5.14: The average movement times with mouse-based control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

Figure 5.15: The average movement times with glove-based joint 2D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

As clearly evident from Fig 5.14, Fig 5.15 and Fig 5.16, the movement time of the user with all the control paradigms is the highest in the case without any constraint, followed by the case with implicit constraint. The user's movement time is the lowest with explicit perceptual constraint, i.e., when the user is directly shown the perceptual boundaries. As expected, the movement time increases with the increase in vowel sequence length, and the rate of increase in movement time with respect to the sequence length is significantly higher in Round 1 than in Rounds 2 and 3. This again shows that the mapping reduces the impact of variation of the length parameter on the user's trajectory movement performance. The average movement times for the same set of tasks with the mouse-based control in Round 1, Round 2 and Round 3 are respectively 10.34 seconds, 3.03 seconds and 2.36 seconds. Therefore, the average reduction in the movement times with implicit (Round 2) and explicit (Round 3) constraints with respect to the case without any perceptual constraint (Round 1) is found to be 3.41 and 4.38 times respectively. Similarly, for the glove-based 2D joint control, the average movement times are observed to be 9.44 seconds, 2.65 seconds and 2.14 seconds for Rounds 1, 2 and 3 respectively. This shows that in this case there is a 3.65-fold and 4.4-fold reduction in movement times with implicit and explicit categorical perceptual constraints respectively with respect to Round 1 (having no constraints). Lastly, for the glove-based 1D+1D control, the average movement times are noted to be 14.69 seconds, 7.17 seconds and 2.96 seconds corresponding to Rounds 1, 2 and 3 respectively. This shows that in this case there is a 2.05-fold and 4.96-fold reduction in movement times with implicit and explicit categorical perceptual constraints respectively with respect to Round 1 (having no constraints).
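The fold reductions quoted above are simple ratios of the per-round average movement times; for example, for the mouse-based control:

```python
# Per-round average movement times for the mouse-based control, as reported
# above (seconds), and the resulting fold reductions relative to Round 1.
mean_mt = {"Round 1": 10.34, "Round 2": 3.03, "Round 3": 2.36}

for rnd in ("Round 2", "Round 3"):
    factor = mean_mt["Round 1"] / mean_mt[rnd]
    print(f"{rnd}: {factor:.2f}-fold reduction vs Round 1")
```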
Figure 5.16: The average movement times with glove-based independent 1D+1D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

In general, for all the rounds, we observe that the movement times are higher in the case of glove 1D+1D control, followed by mouse-based control and then glove 2D control. On average, for Round 1, the glove 1D+1D and mouse-based controls are noted to take 1.56-fold and 1.10-fold more time than the glove 2D control. The ratios take the values of 2.71 and 1.14 respectively for Round 2 and the values of 1.38 and 1.10 for Round 3. The highest benefit in terms of the reduction of movement time is noted in glove 1D+1D with the incorporation of explicit constraints, followed by the glove 2D and mouse-based controls. On the other hand, with the incorporation of implicit constraints, the highest benefit is observed in the glove 2D control, followed by the mouse and glove 1D+1D controls. From this, it can be inferred that the user learns to take maximum advantage of the implicit constraints in the glove 2D control, whereas he takes maximum advantage of the explicit constraints in the glove 1D+1D control. Mouse-based control takes intermediate values in both cases. Among all the cases utilizing the proposed mapping (implicitly or explicitly) with all the control paradigms, the glove 1D+1D control with implicit constraints is seen to take the highest average movement time. This demonstrates that in this case the user could not fully leverage the perceptual categorization constraints implicitly to make his trajectories fast enough. This might be due to the fact that the independent control appeared much harder to the user, and therefore he would have needed more practice time to be able to implicitly figure out the perceptual constraints better. In all other cases, the user performed satisfactorily in terms of reducing the movement times using both the implicit and explicit perceptual constraints.

Figure 5.17: The mean information rates with mouse-based control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

The average ratios of the index of difficulty and the movement time (i.e., the information rate) for different vowel sequence lengths are reported in Fig. 5.17, Fig. 5.18, and Fig. 5.19 for all three control paradigms. The bar chart in Fig 5.17 shows that, using mouse-based control and with the incorporation of implicit constraints, the information rate increases for all vowel sequence lengths except K=3. With the incorporation of explicit constraints, the information rate is seen to increase for all vowel sequence lengths. The case K=3 represents an anomaly, where the information rate increases only marginally with the introduction of explicit constraints and even decreases with the incorporation of implicit constraints. A closer look at Table 5.4 shows that this happens because the reduction in movement time with implicit perceptual constraint is less than the reduction in index of difficulty for all the examples corresponding to K=3.
This is also observed in the glove-based control (discussed later) and is possibly because, with the shortest vowel sequence length (K=3) and implicit constraints, the user cannot utilize the mapping as much as he can for the longer sequences.

The increase in information rate due to both the implicit and explicit constraints in mouse-based control is the highest for vowel sequence length K=5, followed by K=6 and K=4. The average information rates calculated over all trajectories are 1.98 bits/s, 2.06 bits/s and 2.54 bits/s for Rounds 1, 2, and 3 respectively. This shows that while the average information rate does not show significant improvement with the incorporation of implicit constraints, the explicit constraints in Round 3 are capable of increasing the information rate in the mouse study by 1.28 times.

Figure 5.18: The mean information rates with glove-based joint 2D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

The bar chart in Fig 5.18 presents the average information rates in terms of ID/MT for the glove-based joint 2D control paradigm for all the vowel sequence lengths K=3 to K=6 and for all three rounds. It shows that the introduction of explicit constraints increases the information rate for all vowel sequence lengths, whereas the incorporation of implicit constraints only increases the information rate for K=4 and K=5. In both the mouse-based control and glove 2D control scenarios, we find that the user is able to enhance the information rate with implicit constraints for the vowel sequence length K=5 (i.e., with four transitions). With explicit constraints, the user's information rate is around 3 bits/s (i.e., quite similar) for K=4, 5 and 6. This gives us the insight that this value is probably near the maximum allowable information rate for the given set of tasks with mouse-based control; for K=3, the user simply cannot utilize the mapping fully for the reason discussed above.

The increase in information rate is highest in the case of K=4, followed by K=5, K=6 and K=3 for Round 3 (with explicit constraints). The average information rates calculated over all trajectories in glove 2D control are 2.20 bits/s, 2.27 bits/s and 2.80 bits/s for Rounds 1, 2, and 3 respectively. This shows that, similar to the mouse-based control, while the average information rate does not show significant improvement with the incorporation of implicit constraints, the explicit constraints in Round 3 are capable of increasing the information rate in the glove 2D study by 1.27 times. It is to be noted in this context that the increment factors of average information rate due to the incorporation of perceptual constraints in the glove 2D and mouse-based controls are very similar to each other. This validates our study and gives us more insights regarding the information capability of the hand motor control pathway.

Figure 5.19: The mean information rates with glove-based independent 1D+1D control for different vowel sequence lengths without any perceptual mapping (Round 1), with implicit perceptual constraints (Round 2), and with explicit perceptual constraints (Round 3)

The bar chart in Fig 5.19 presents the average information rates in terms of ID/MT for the glove-based independent 1D+1D control paradigm for all the vowel sequence lengths K=3 to K=6 and for all three rounds.
It shows that the introduction of explicit constraints increases the information rate, whereas the incorporation of implicit constraints decreases the information rate, for all vowel sequence lengths. Similar to our observations in both the mouse-based control and glove 2D control scenarios, here also we find that the user is able to enhance the information rate the most with explicit constraints for the vowel sequence length K=5 (i.e., with four transitions). Besides, with explicit constraints, the user's information rate is just above 2 bits/s (i.e., quite similar) for K=4, 5 and 6. This gives us the insight that this value is probably near the maximum information rate that can be utilized by glove-based 1D+1D control for the given set of tasks. Here also, the user simply cannot utilize the mapping fully for K=3 for the reason discussed above.

Figure 5.20: Variation of movement time with index of difficulty without any perceptual mapping (Round 1)

One striking observation from the bar chart is that the information rate with implicit constraints (Round 2) is lower than that without any constraints (Round 1) for all individual cases. This is due to the higher movement times required by the user to perform the task. The possible reason behind this has already been elaborated above. Another important observation is that the average information rate for Round 1 (without any constraint) is relatively constant and increases extremely slowly with increasing vowel sequence length.

The increase in information rate is the highest in the case of K=5, followed by K=6, K=4 and K=3 for Round 3 (with explicit constraints). The average information rates calculated over all trajectories in glove 1D+1D control are 1.46 bits/s, 0.82 bits/s and 2.00 bits/s for Rounds 1, 2, and 3 respectively. This shows that, similar to the other two controls, while the average information rate does not show improvement with the incorporation of implicit constraints, the explicit constraints in Round 3 are capable of increasing the information rate in the glove 1D+1D study by 1.37 times. It is to be noted in this context that the increment factor in average information rate due to the incorporation of explicit perceptual constraints in the glove 1D+1D control is higher than that in both the glove 2D and mouse-based controls. This shows that the user can achieve the highest performance improvement in glove 1D+1D control by leveraging the explicit perceptual constraints.

Figure 5.21: Variation of information rate with index of difficulty without any perceptual mapping (Round 1)

In general, for all the rounds, we observe that the average information rates are higher in the case of glove 2D control, followed by mouse-based control and then glove 1D+1D control. This is evident from the previous discussion as well as Fig. 5.21, Fig. 5.23, and Fig. 5.25. On average, for Round 1, the information rates with glove 2D and mouse-based control are noted to be 1.51 times and 1.36 times that with the glove 1D+1D control. The ratios take the values of 2.77 and 2.51 respectively for Round 2 and the values of 1.40 and 1.27 for Round 3.

The highest benefit in terms of the improvement of average information rate is noted in glove 1D+1D, followed by the mouse-based control and the glove 2D control, with the incorporation of explicit constraints. On the other hand, with the incorporation of implicit constraints, there is no strikingly noticeable benefit observed in any of the controls in terms of improvement of information rate.
Among all the cases utilizing the proposed mapping (implicitly or explicitly) with all the control paradigms, the glove 1D+1D control with implicit constraints is seen to have the lowest average information rate. This again demonstrates that in this case the user could not properly leverage the perceptual categorization constraints implicitly to make his trajectories fast enough to increase his throughput. In all other cases, the user performed satisfactorily in terms of increasing the average information rate (at least marginally) using both the implicit and explicit perceptual constraints.

The increase of movement time with the increase in index of difficulty for all three controls is demonstrated in Fig. 5.20, Fig. 5.22, and Fig. 5.24. The regression lines presented in Fig. 5.20 demonstrate that the movement times corresponding to the glove 1D+1D control are higher than the movement times corresponding to the other controls, for all indices of difficulty, without any perceptual mapping. The movement times corresponding to the glove 2D control are lower than those corresponding to the mouse control for all indices of difficulty above ID = 15 bits. This means that for lower indices of difficulty and without any perceptual mapping, mouse-based control needs less time to perform the trajectory tasks than glove 2D control. The figure also demonstrates that the regression line corresponding to the glove-based independent 1D+1D control has the highest magnitude of slope (m=0.65), followed by the mouse control (m=0.45) and the glove 2D control (m=0.29). This means that the user throughput takes the highest value (i.e., 3.44 bits/s) with the glove 2D control and the lowest value (i.e., 1.54 bits/s) with the glove 1D+1D control. Mouse control takes an intermediate value of 2.12 bits/s. From this, it can be inferred that without the perceptual mapping, the glove-based joint 2D control is the easiest control paradigm followed by mouse-based control, while glove-based independent 1D+1D control is the hardest in the case of the given trajectory tasks in the formant space. Figure 5.21 further shows that the information rate remains more or less constant for the glove 1D+1D control with increasing index of difficulty. For the other two controls, however, the information rate increases slightly with the increase in difficulty level of the task. The glove 2D control has the highest rate of rise with respect to the task difficulty level. This explains why the glove 2D control appears to be a better control paradigm than the mouse-based control for harder tasks.

Figure 5.22: Variation of movement time with index of difficulty with implicit perceptual constraint (Round 2)

Figure 5.23: Variation of information rate with index of difficulty with implicit perceptual constraint (Round 2)

Next, we discuss the impact of the perceptual constraints on all three control paradigms. The regression lines presented in Fig. 5.22 demonstrate that the movement times corresponding to the glove 1D+1D control are higher than the movement times corresponding to the other controls, for all indices of difficulty, with implicit perceptual constraint. The movement times corresponding to the glove 2D control are lower than those corresponding to the mouse control for all indices of task difficulty.
This means that, for all indices of difficulty and with implicit perceptual constraint, the glove-based joint 2D control demands overall less time than the mouse-based control to perform the trajectory tasks. The figure also demonstrates that the regression line corresponding to the glove-based independent 1D+1D control has the highest magnitude of slope (m=0.58), followed by the glove 2D control (m=0.22) and the mouse control (m=0.20). This means that the user throughput takes the highest value (i.e., 5 bits/s) with the mouse-based control and the lowest value (i.e., 1.72 bits/s) with the glove 1D+1D control. Glove 2D control in this case takes an intermediate value of 4.15 bits/s. From this, it can be inferred that with the implicit perceptual constraint, the mouse-based control is the easiest control paradigm followed by the glove-based joint 2D control, while glove-based independent 1D+1D control is the hardest in the case of the given trajectory tasks in the formant space. It is noteworthy here that the Fitts' law slope is less steep for all control paradigms with the implicit constraints than for their corresponding no-constraint versions. This proves that the overall user throughput increases due to the incorporation of the perceptual categorization, although the boundaries are implicitly estimated by the user. Another striking observation in this round is that, due to the addition of the constraints, mouse-based control appears to be easier than the glove 2D control (which was the other way round before the addition of the perceptual constraints).

Figure 5.23 further shows that the increase in information rate with the increase in the index of difficulty is the least for the glove 1D+1D control. For the other two controls, however, the information rate increases slightly with the increase in difficulty level of the task. The glove 2D control and the mouse-based control both have almost similar rates of rise in information rate with respect to the task difficulty level. The rate of increase of information rate for all the control paradigms is observed to be higher than for their no-constraint versions. This again proves the effectiveness of incorporating the perceptual constraints on user performance.

Figure 5.24: Variation of movement time with index of difficulty with explicit perceptual constraint (Round 3)

Figure 5.25: Variation of information rate with index of difficulty with explicit perceptual constraint (Round 3)

Now, we turn to discuss the impact of the explicit perceptual constraints on all three control paradigms, i.e., the stage where the actual perceptual boundaries are revealed to the user. The regression lines presented in Fig. 5.24 demonstrate that the movement times corresponding to the glove 1D+1D control are higher than the movement times corresponding to the other controls, for all indices of difficulty, with explicit perceptual constraint. Besides, the movement times corresponding to the glove 2D control are lower than those corresponding to the mouse control for all indices of task difficulty.
This means that, for all indices of difficulty and with explicit perceptual constraint, the glove-based joint 2D control demands overall less time than the mouse-based control to perform the trajectory tasks. The figure also demonstrates that the regression line corresponding to the glove-based independent 1D+1D control has the highest magnitude of slope (m=0.25), followed by the mouse-based control (m=0.20) and the glove 2D control (m=0.18). This means that the user throughput takes the highest value (i.e., 5.55 bits/s) with the glove 2D control and the lowest value (i.e., 4 bits/s) with the glove 1D+1D control. Mouse-based control in this case takes an intermediate value of 5 bits/s and is very close to that of the glove-based joint 2D control. From this, it can be inferred that with the explicit perceptual constraint, the glove-based joint 2D control is the easiest control paradigm followed by the mouse-based control, while glove-based independent 1D+1D control is the hardest in the case of the given trajectory tasks in the formant space. It is noteworthy here that the Fitts' law slope is less steep for all control paradigms with the explicit constraints than for their corresponding no-constraint and implicit-constraint versions. This again proves that the overall user throughput increases due to the incorporation of the perceptual categorization. Another striking observation in this round is that, due to the addition of the constraints, glove 2D control appears to be easier than the mouse-based control (which was the other way round with the implicit perceptual constraints). However, the difference in user throughput is very small, and therefore a concrete conclusion regarding this can only be reached after more empirical research with more users to eliminate any user-related biases.

Figure 5.25 further shows that the increase in information rate with the increase in the index of difficulty is the least for the glove 1D+1D control. It is closely followed by the other two controls. The glove 2D control and the mouse-based control both have almost similar rates of rise in information rate with respect to the task difficulty level. The rate of increase of information rate for all the control paradigms is observed to be higher than for both their implicit-constraint and no-constraint versions. This demonstrates the high effectiveness of incorporating the perceptual constraints explicitly for harder tasks. It follows that the perceptual categorization constraint can be utilized more in tasks with greater difficulty levels. In other words, the more complex the task, the greater the demand for leveraging any available perceptual constraints that allow the user to complete the task faster.
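Comparisons of this kind (the per-control line graphs of Fig. 5.26-5.31 below) can be reproduced with a short plotting script. A minimal sketch, assuming (ID, MT) arrays per round are available from the experiment logs; the numbers below are rough placeholders, not the measured data:

```python
# Sketch of a Fitts-law comparison plot across rounds for one control paradigm,
# in the style of Fig. 5.26-5.31. The (ID, MT) arrays are placeholders only.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rounds = {
    "Round 1 (no constraint)":       (np.array([9.6, 18.2, 25.1, 34.8]),
                                      np.array([6.0, 8.8, 11.2, 16.4])),
    "Round 2 (implicit constraint)": (np.array([2.8, 4.5, 7.5, 9.9]),
                                      np.array([2.2, 2.5, 3.3, 3.9])),
    "Round 3 (explicit constraint)": (np.array([2.8, 4.5, 7.5, 9.9]),
                                      np.array([1.7, 1.9, 2.4, 3.1])),
}

for label, (ids, mts) in rounds.items():
    fit = stats.linregress(ids, mts)
    xs = np.linspace(ids.min(), ids.max(), 50)
    plt.scatter(ids, mts, s=15)
    plt.plot(xs, fit.intercept + fit.slope * xs,
             label=f"{label}: MT = {fit.slope:.2f} ID + {fit.intercept:.2f}")

plt.xlabel("Index of difficulty (bits)")
plt.ylabel("Movement time (s)")
plt.legend()
plt.show()
```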
The correlation coefficients and regression metrics (reported in Section 5.4) are satisfactory for all the control paradigms, which shows that Fitts' law is indeed applicable to these trajectory tasks, both with and without the perceptual constraints. It also demonstrates that the formulation of the index of difficulty metrics is reasonable, although there is scope for further improvement.

In order to allow easier comparison of the effects of implicit and explicit constraints on the different control paradigms, we include Figures 5.26, 5.28, and 5.30, which show the variation of movement time with index of difficulty separately for each individual control. The solid lines indicate implicit constraints whereas the dashed lines indicate explicit constraints. A quick glance at these figures clearly shows that the movement times and the slopes of the lines corresponding to explicit constraints are lower than those of their respective implicit-constraint cases. This proves that explicitly showing the perceptual boundaries in the formant space leads to a significant improvement in user throughput and a reduction in the movement times required to perform the trajectory tasks using all three control paradigms under consideration. Further, we add the regression lines for the initial no-constraint cases to these line graphs, as shown in Figures 5.27, 5.29 and 5.31. This is to point out the overall improvement in the index of performance due to the incorporation of the mapping, as well as the reduction of the index of difficulty of the task space due to the categorization of the formant space (which is itself a consequence of the perceptual mapping). All three graphs depict that the incorporation of the perceptual mapping has shifted the line graphs to the left, thereby demonstrating the reduction of task complexity. Besides, the movement time for the respective trajectory tasks also decreases due to the impact of the perceptual constraints, which is reflected in the vertical downward shift of all the line graphs. Furthermore, the slope of the lines decreases after the introduction of the perceptual constraints, which implies an increase in user throughput.

Figure 5.26: Variation of movement time with index of difficulty for Mouse-based control with implicit (Round 2) and explicit (Round 3) perceptual constraints

Figure 5.27: Variation of movement time with index of difficulty for Mouse-based control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints

Figure 5.28: Variation of movement time with index of difficulty for Glove-based joint 2D control with implicit (Round 2) and explicit (Round 3) perceptual constraints

Figure 5.29: Variation of movement time with index of difficulty for Glove-based joint 2D control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints

Figure 5.30: Variation of movement time with index of difficulty for Glove-based independent 1D+1D control with implicit (Round 2) and explicit (Round 3) perceptual constraints

Figure 5.31: Variation of movement time with index of difficulty for Glove-based independent 1D+1D control without any perceptual constraint (Round 1) as well as with implicit (Round 2) and explicit (Round 3) perceptual constraints

To conclude, our study leads to the following significant findings:

1. The proposed mapping reduces the difficulty level of the trajectory tasks by including perceptual boundaries in the formant space, thereby introducing speech-like quantal effects.
2. The user can utilize the proposed mapping to drastically reduce the movement times while still reaching the perceptually relevant targets with hand movements.

3. The rate of information processing is nearly constant in different hand motor control paradigms. This means that the user is not producing more information in a certain amount of time as the task gets harder, but rather takes better advantage of the information they can produce by making the task easier.

4. The utilization of the perceptual mapping and constraints leads to an increased throughput or index of performance.

5. Joint 2D controls are better choices of control paradigms than independent 1D+1D controls for performing continuous trajectory tasks.

5.6 Chapter summary and Contribution

In this chapter, we investigated the possibility of the user leveraging the quantal nature of the formant space via the proposed mapping, with the aim of improving task performance, i.e., increasing the information rate of the motor control paradigms. We first elaborated our training procedure, including data collection and generation, architectural choices, selection of training hyperparameters, etc. We also presented the baseline methods, compared the proposed methods with them for both models, and further critically analyzed the performance of our network. We then moved on to the user study, including descriptions of the devices used and the paradigms of control, as well as the step-by-step protocol of data collection through the devices. We next examined the importance of the perceptual constraints by comparing and contrasting the index of difficulty of the trajectory tasks in the quantal and non-quantal space with the help of several case studies. To observe the impact of the proposed neural network models, we then evaluated the index of performance of the user under three conditions: (a) without utilizing the mapping, (b) utilizing the mapping but without seeing the perceptual boundaries, and (c) utilizing the mapping while explicitly seeing the perceptual boundaries. Again, all these experiments were performed with three different motor control paradigms using the mouse and data glove-based input devices. Through these studies, we quantitatively demonstrated that the user can indeed utilize the proposed perceptual mapping as well as the quantal nature of the formant space to enhance his performance. Finally, we addressed the significance of the proposed mapping and discussed some of the important factors and issues related to the task.

This chapter has four major contributions. Our first contribution is related to the quantification of the difficulty level of a speech-like task. We provide a quantitative measure indicating the difficulty level of different formant trajectory tasks in quantal as well as non-quantal 2D space. In other words, we preliminarily answer the questions: How difficult will it be to speak with one's hand by navigating a 2D acoustic space? How difficult is a vowel trajectory task under such conditions, and how can the difficulty levels of different hand-to-speech tasks be compared from a kinematic point of view? This lays the foundation of the information-theoretic view of speech articulation tasks. Our second contribution is related to the investigation of why speech articulation appears easy.
We demonstrate that, on one hand, the proposed perceptual network is capable of reducing the complexity of the task space by categorizing a continuous 2D plane into discrete regions and, on the other, the user is capable of utilizing the network and the categorized task space to effectively increase his throughput. This result is of utmost importance and is one of the primary contributions of the thesis. This is because the result shows that it is possible for the human motor control system to take advantage of given perceptual constraints to reduce the effort level of a task. This provides us with the intuition that the quantal effects of speech are partly responsible for making the speech task considerably easier. Our third contribution is that we show the possibility of the human motor control pathway possessing a certain amount of information capability. With increasing difficulty level, the motor control pathway does not produce drastically more information in a given amount of time but tries to take better advantage of the constraints to make the task easier. Last but not least, we quantitatively show that the coupled control (joint 2D) is more convenient for achieving a trajectory task than a decoupled control (1D+1D). This is clearly reflected in the movement time of the user, both before and after utilizing the perceptual constraints.

CHAPTER 6
Exploration of interfaces and mappings

The primary objective of the current work is to study the information-theoretic background of neuro-muscular speech motor control, and the last three chapters provide an initial step for exploring the problem space. However, further investigations are required to analyze the underlying mechanism behind such control. For this, we need to extend the current study to the actual vocal tract space, muscle activation space, as well as the neural control (or active thought) space. In what follows, we elaborate our preliminary works on the exploration of different control interfaces, signals and medical images targeting possible extensions of the hand-to-formant study in the near future. This includes performing feasibility studies with kinematic and force-based controls of virtual as well as physical vocal tract interfaces, mapping tongue movements recorded via medical imaging modalities to speech, connecting imagined speech tokens acquired from brain signals to respective speech tokens, etc. This exploration is of secondary importance, also has other applications beyond the scope of the current work, and has led to a number of publications [2, 62, 102-104, 143-145, 145-151, 162].

In Section 6.1, we discuss the feasibility of using two PC/web-based sound interfaces named Pink Trombone and VTDemo for controlling vocal tract geometry. In Section 6.2, we introduce a novel mechanical sound interface and investigate a kinematic control strategy to manipulate the shape and position of a tongue-like structure. Further, in Section 6.3, we explore a force-based strategy to control selected tongue muscles in order to vary the vocal tract geometry and consequently the sound output via an articulatory speech synthesis pathway. All these experiments will be useful for designing the next set of studies investigating the difficulty level of an actual speech articulation control task using speech-related interfaces. Next, we move towards establishing a connection between tongue/vocal tract movements and speech in Section 6.4.
This involves our initial study on developing an MRI-based speech recognition system and, more importantly, our work on mapping ultrasound tongue movements to formant frequencies. Furthermore, we also present a joint articulatory-acoustic representation using the Pink Trombone interface to allow an invertible mapping between the vocal tract configuration and the corresponding acoustics. These works prepare the background for extending our current investigation from the formant frequency space to the articulatory space using imaging modalities. Last but not least, we develop imagined speech recognition systems from EEG in Section 6.5 to explore the information content of available speech imagery EEG acquisition devices. This study demonstrates the existing challenges in EEG-based imagined speech analysis and control and suggests possible ways to overcome the limitations. An overview of the investigations is shown in Fig 6.1 and Fig 6.2. We believe that these investigations will facilitate further research on finding better indices of difficulty and performance metrics corresponding to tasks that are closer to the original speech articulation or the neuro-muscular processes underlying it. In other words, the investigations presented in this chapter will provide more insights for our future attempts at extending the present study to the vocal tract space, muscle synergy space or even active thought space.

Figure 6.1: Different sound control interfaces discussed in Sections 6.1, 6.2 and 6.3 (a) Pink Trombone, (b) VTDemo, (c) Sound Stream, (d) Sound Stream II

Figure 6.2: Different mappings discussed in Section 6.4 and Section 6.5. (a) MRI-based speech recognition, (b) US-based speech synthesis, (c) EEG-based speech recognition

6.1 Kinematic control of interfaces

A major part of our previous study dealt with the mouse-based control of a trajectory task in the 2D formant space. In the following studies, we attempt to perform a fundamental study exploring mouse-based control of vocal tract structures in articulatory space instead of acoustic space. For this, we select two popular articulatory speech interfaces named Pink Trombone and VTDemo. Unlike the previous investigation, we do not incorporate any quantitative difficulty metric in this study and rely mostly on the user's personal ratings as a sanity check. Nevertheless, the study essentially presents an initial investigation of alternative ways of approaching the same problem space via articulatory control and gives valuable insights on exploring the speed-accuracy trade-off in mouse-to-tongue motor control.

Figure 6.3: Vocal tract configurations in Pink Trombone

6.1.1 Pink Trombone

Pink Trombone [171] is an online voice synthesizer application that presents an interactive mid-sagittal view of the human vocal tract, which can be manipulated by users through mouse control to simulate various vocal sounds. The users can slide the variable circular purple tongue, lips, hard palate and velum, and consequently hear the vocal sounds in real time, ranging from shrill screams to low rumbles. It is an integration of the vocal tract, the nasal tract and the glottis, from which sound is generated.

The tongue position can be changed through the mouse by manipulating a circular point moving freely in a triangular control space.
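Concretely, this kind of control amounts to clamping a 2D cursor position to a triangular region and mapping it to two tongue parameters. The sketch below illustrates the idea only; the parameter names, ranges, and triangle vertices are hypothetical and are not Pink Trombone's actual internals.

```python
# Sketch: clamp a cursor position to a triangular control region and map it
# to two hypothetical tongue parameters (not Pink Trombone's real API).
import numpy as np

TRIANGLE = np.array([[0.5, 0.1], [0.1, 0.9], [0.9, 0.9]])  # normalised coords

def clamp_to_triangle(p, tri=TRIANGLE):
    """Approximately clamp a 2D point into the triangle via barycentric coords."""
    a, b, c = tri
    m = np.column_stack([b - a, c - a])                  # 2x2 basis of the triangle
    u, v = np.linalg.solve(m, np.asarray(p, dtype=float) - a)
    u, v = max(u, 0.0), max(v, 0.0)                      # clip to the two near edges
    s = u + v
    if s > 1.0:                                          # outside the far edge: pull back
        u, v = u / s, v / s
    return a + u * (b - a) + v * (c - a)

def cursor_to_tongue(p):
    """Map a clamped cursor position to hypothetical (index, diameter) values."""
    x, y = clamp_to_triangle(p)
    tongue_index = 12 + 16 * x                           # front-back position (made up)
    tongue_diameter = 2.0 + 1.5 * (1.0 - y)              # constriction size (made up)
    return tongue_index, tongue_diameter

print(cursor_to_tongue((0.7, 0.3)))
```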
The cursor can be dragged over the tongue surface and held at a particular point with the left mouse button pressed to set the tongue to the corresponding shape; as soon as the left button is released, the tongue shape changes again. The glottal excitation can be varied through a separate set of controls, known as the Voice-Box Control, which performs the pitch (frequency) and gain variation. There are two discrete coarse levels of gain. Moreover, there is another option for continuous fine tuning of gain and frequency through a slider at the bottom.

The aim of this study was to analyze the convenience of controlling a web-based interface of the human vocal tract based on user feedback and performance. We asked the research question: How convenient is it from the user's perspective to control multiple degrees of freedom simultaneously using the Pink Trombone control paradigm? We did not use sound quality as a performance metric since we are primarily exploring the input control space rather than the resultant acoustic space.

A total of 13 participants (8 females and 5 males, in an age range of 20-34) were recruited from UBC for the study. We showed the users different target tongue shapes of varied complexity levels and asked them to eventually rate the control paradigm on a scale of 1 to 5, based on their convenience or ease in achieving the given tongue shapes via mouse control within a fixed time. A user rating of 5 means that they found the control very convenient, while a rating of 1 means that they struggled hard to perform the given task and do not consider the control to be suitable for the task. The average rating of the 13 users was 3.46 out of 5. We could not explore the joint control of tongue position/shape and vocal fold parameters as Pink Trombone does not allow simultaneous dual control of the vocal tract and vocal fold with the mouse.

The users found the Pink Trombone very engaging as it provides a good visual representation along with interesting sound variations. According to some of them, the interface is very clear and reactive and it simulates a real tongue. A direct manipulation of the tongue for position control is convenient to use for educational purposes. However, users having previous background knowledge in speech articulation found the changing shape of the tongue less promising in this interface. They provided the deeper insight that, despite looking like a real tongue, it somewhat fails to provide the flexibility that the tongue is supposed to give. The tongue shapes and trajectories were pre-programmed, such that it can take only a limited number of shapes. In general, they found the Pink Trombone tongue moderately easy to control with the mouse. Different tongue configurations corresponding to vowels and consonants are presented in Fig 6.3.

6.1.2 VT Demo

VTDemo [74] is a Windows PC interface that investigates the variation of acoustics corresponding to changes in vocal tract shape. The interactive application gives the user an opportunity to change sliders mapped to several vocal tract and glottal excitation parameters and thereby synthesize vocal sounds in real time.
The sound synthesis engine relies on an area function-based one-dimensional wave equation, similar to Pink Trombone, to generate the sounds.

The control panel has a total of 10 sliders, aimed at individually changing the Jaw Height (JW), Tongue Position (TP), Tongue Shape (TS), Tongue Apex (TA), Lip Area (LA), Lip Protrusion (LP), Larynx Height (LH), Glottal Area (GA), Fundamental Frequency (FX) and Velo-pharyngeal port opening (NS), as shown in Fig 6.4. The sliders can be varied only one at a time, implying that the user has to change the parameters one after another to achieve a target vocal tract shape and target sound. However, in the real world, the human vocal tract articulators work simultaneously in an extremely interdependent manner, due to the intermingling of muscles, to bring about the articulatory movements. Hence, this kind of control becomes highly unrealistic when compared to articulatory speech production. For this work, we were mostly concerned with the control of the tongue surface, involving sliders TP, TS and TA.

Figure 6.4: VTDemo interface

We ran a similar study to that mentioned in the previous section, with the same users and for the same task of matching the given tongue shapes/positions through mouse control. The average rating of the 13 users in this case was 3.07.

VTDemo received comparatively lower feedback than Pink Trombone, in terms of control and design. All the users found it hard to understand the mapping between their slider control action and the final shape change. The lack of a direct control made it difficult for them to interpret the effects of slider variations. The users also remarked that controlling one feature at a time makes it more unrealistic and that it needs more coordination between the sliders to make it more interactive. In general, the users found it quite difficult to control.

6.1.3 Discussion and Future Studies

Our initial study with these two interfaces revealed that they can potentially be used to investigate motor control in articulatory space as an extension of the current work on hand-to-formant mapping.

The VTDemo interface has disentangled different degrees of freedom of the tongue into independent 1D slider-based controls. This given slider-to-tongue movement mapping can be utilized (e.g., using TP, TS, TA) to explore 1D+1D control somewhat analogous to our glove-based 1D+1D control. The only difference is that the control space here is articulatory as opposed to the acoustic one used in our glove/mouse study. Considering one slider at a time essentially translates the problem to an equivalent 1D Fitts' task and can be used to find index of difficulty and throughput metrics related to the vocal tract geometry. This will help us get one step closer to investigating articulatory speech control in a simulated environment. The ability to control multiple sliders like TP and TS simultaneously will allow more realistic tongue motion in the horizontal and vertical directions (in the mid-sagittal plane) through hand gestures. Particularly, since it has definite semi-realistic vocal tract boundaries, this interface can be utilized to explore the impact of physical constraints on the index of difficulty of a speech task.
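As a concrete illustration of the slider-as-1D-Fitts'-task idea mentioned above, the standard Shannon formulation of the index of difficulty could be applied to single-slider movements. Note that the thesis itself uses the trajectory-based ID of Chapter 3; the distances and tolerances below are hypothetical slider units.

```python
# Standard Shannon formulation of the 1D Fitts' index of difficulty for a
# single slider movement; distances/widths below are hypothetical slider units.
import math

def fitts_id(distance, width):
    """Index of difficulty (bits) for moving a distance into a target of given width."""
    return math.log2(distance / width + 1)

# e.g. moving the Tongue Position (TP) slider by 60 units into a +/-5 window
for d, w in [(60, 10), (30, 10), (60, 4)]:
    print(f"D={d}, W={w}: ID = {fitts_id(d, w):.2f} bits")
```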
For example, to probe the role of such physical constraints, we can carefully design target tongue shapes for an experiment to see whether the human motor control system can take advantage of them to decrease the effort level of the hand-to-speech task.

On the other hand, Pink Trombone also provides us with the opportunity to examine the advantage of physical constraints (if any), but with a joint control of the cursor within the triangular vowel space. This is closer to our current study with the formant space but offers the added advantage of its available connection with the tongue shape. Therefore, it can be investigated as a natural extension of our current work. It would be interesting to study how motor learning and control change with or without seeing the triangular vowel space that the cursor is confined to. The presence or absence of the vowel space changes the control paradigm to a great extent by changing the visual feedback. Intuitively, the presence of the vowel space makes the control essentially a point-control in 2D space, which is easier, whereas blocking the user's access to the triangle will lead to more emphasis on controlling the tongue shape rather than the cursor position, thereby increasing the difficulty level of the task.

One important consideration in the motor control analysis of both these interfaces is the index of difficulty calculation. Now that we have performed an initial pilot analysis of the Pink Trombone and VTDemo control, we can extend our study to understanding the difficulty level of the tasks by utilizing our knowledge from Chapters 3 and 5. This can therefore be used to explore the difficulty level of a speech task which is fundamentally more similar to the vocal tract configurations and encompasses the articulatory constraints.

In general, for computation of the index of difficulty, we decouple the controller from the task space and only focus on the task space. However, it is crucial to find out the manifold space for computing the index of difficulty in some controls. For example, with a joint 2D controller like the Pink Trombone, it is permissible to define the index of difficulty in the non-manifold space. But for some other controllers that involve independent controls, the difficulty level of a task defined in the actual manifold space may be lower than that in the non-manifold space. For instance, in a Cartesian coordinate control space (with abscissa and ordinate controllers), drawing a circle is difficult, but the same task is easier in a polar coordinate control space (with radius and angle controllers), which is possibly its manifold space. Similarly, the tongue may appear to perform complex speech articulation tasks, but it is possible that it has a manifold space dictated by the available constraints, as a result of which the articulation task becomes easier. Future work could be directed towards analysing the manifold space of the synthetic tongue or vocal tract in these interfaces and further investigating how the availability of constraints affects the throughput of the articulatory control.

Figure 6.5: The proposed mechanical interface: SOUND STREAM

6.2 Mechanical interface control

The previous section deals with the control of web-based/online sound interfaces. In this section, we instead discuss a mechanical sound interface called "Sound Stream" towards investigating multi-degree-of-freedom (DOF) hand-to-speech motor control. We develop a 2D mechanically controlled tongue-like (i.e., closely relating to tongue properties) structure with a novel five-DOF control scheme targeting vocal sound synthesis.
The target of the study is to develop and validate a convenient, easy-to-learn and cost-effective physical interface with improved mechanical control leveraging the multi-DOF capability of the human arm. We use a simple spiral spring to model the upper surface of the tongue and cylindrical clay wrapped with black tape to represent the fixed upper palate and lingual base (upper and lower boundaries). This arrangement enables the users to utilize their fingers to vary a set of sliders and their wrists/elbows to control a mobile platform on which the sliders are mounted. This allows them to modify the shape of the tongue surface and its position in the 2D plane, which in turn alters the anterior part of the upper airway, thereby modulating sound propagation through it. Additionally, we provide a two-way (slider- and mouse-based) control of the gain and frequency of the sound source (glottal excitation), to investigate user preference in controlling acoustic parameters through the other hand.

Figure 6.6: 3 DOF Slider Control Scheme

6.2.1 Interface design

In order to build the experimental apparatus, we use a cork board as the base, cardboard for the control space and movable platform, an Arduino, a document camera, slider sensors, a mouse, a laptop and a speaker. To create a sagittal view of the tongue surface on a 2D plane, a spiral spring element is used, and three different points of it (tip, top and root) are connected to the three sliders through thin aluminium rods, as shown in Fig 6.5 and Fig 6.6. The user is able to vary the shape and the position of the spiral element within an articulatory space bounded by the upper palate and the lingual base. The hardware interface is divided into two controlling blocks. The primary one is targeted at changing the tongue shape and position, while the secondary one is aimed at controlling the source (glottal) frequency and gain. The user is able to use both blocks simultaneously through an ambidextrous (dual-handed) control scheme. The overall latency of the interface is found to be 0.08 seconds.

Figure 6.7: 2 DOF movement of controlling block

6.2.1.1 Primary Block: Tongue Shape and Position control

Three sliders are mounted over a movable platform and attached to the spring to control and change the tongue shape, as shown in Figures 6.5-6.8. Keeping the platform fixed, the users can vary the slider positions with three fingers to create the desired tongue shapes. Similarly, to change the tongue position separately, the user can move the platform on a glossy cardboard base with specified boundaries, as shown in Fig 6.6 and 6.8. To restrict unrealistic angular rotation of the tongue-like structure, the movable controller is enabled with only two DOF, by the arrangement demonstrated in Fig 6.7, which allows its displacement along the x and y directions only. Therefore, in total, this design provides a five-DOF control, which can be used to simultaneously manipulate the tongue position and shape, as evident from Fig 6.5, Fig 6.8 and Fig 6.9.

6.2.1.2 Secondary Block: Source (Glottal) frequency and gain control

In order to simulate this functionality, we came up with two design ideas which can reduce the user's effort while controlling both blocks simultaneously. In the first design, we use two slider sensors connected to a microcontroller. The positions of both sliders are tracked to vary the source frequency and gain, as demonstrated in Fig 6.8.
In the alternative second design, we use a mouse connected to the laptop, which can be moved within a predefined window so that the cursor position can be tracked and mapped to a gain-frequency coordinate system, as demonstrated in Fig 6.9. With the left click button pressed, moving the mouse towards the right (or left) correspondingly increases (or decreases) the frequency, and moving down (or up) correspondingly increases (or decreases) the gain. The user can easily follow any desired trajectory in any direction in order to sequentially vary the glottal frequency and gain. The whole set-up is placed under a document camera as shown in Fig 6.5, and the camera is connected to the PC for processing.

Figure 6.8: Dual-handed Simultaneous Control (Scheme 1)

Figure 6.9: Dual-handed Simultaneous Control (Scheme 2)

6.2.2 Acoustic System Design

Here we briefly discuss the computation of vocal tract area functions from the tongue configuration as well as the sound synthesis procedure using the estimated area functions.

6.2.2.1 Real-time Area function Computation

The real-time video captured by the document camera is read through the Image-Mate software in the MATLAB environment. The corresponding image frames are extracted at a frame rate of 30 fps. The rigid upper palate and the tongue-like spring structure are detected and extracted by utilizing three distinct variations in the vertical image intensity profiles. At regular spatial intervals, i.e., at 30 control points (i) along the structures, we compute the vertical distance (di) between the palate and the tongue surface or the lower boundary, as shown with red arrows in Fig 6.10, and using these values as diameters, we derive the corresponding 2D area functional values (Ai) [116] that are utilized in the next step.

Figure 6.10: 1D area function Computation

6.2.2.2 Sound Synthesis Engine

The goal of this design is to approximately model vocal tract sound propagation by using a waveguide model of a 1D acoustical tube. A well-known physical model for sound propagation in the vocal tract is the Kelly-Lochbaum (KL) model, which employs a 1D acoustical tube structure characterized by an area function. The underlying idea of the KL model is that a 1D plane wave originating from the far end of the vocal tract (the glottis) travels to the open mouth end through a line of concentric cylindrical segments with varying cross-sectional areas defined by the area function. In this work, we implement a method described in [181] which eliminates the drawbacks of the KL model. The vocal tract is modeled as an acoustic tube whose shape changes according to the area functions received from the image processing module. The glottal excitation pulse is generated according to Rosenberg's model [139]. This vocal fold model is coupled to discretized acoustic equations in the vocal tract. The acoustic wave propagation is simulated by numerically integrating the linearized 1D Navier-Stokes pressure-velocity PDE in time and space on a non-uniform grid. The synthesis mechanism involves excitations acting as a source placed in the tube and sound propagation being simulated by approximating the pressure-velocity wave equations [181]. All these models are implemented in the Java Audio Synthesis System (JASS) [182], written in Java.
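To make the area-function step concrete, the sketch below converts a set of measured palate-to-tongue distances into cross-sectional areas by treating each distance as the diameter of a circular tube section. This is a minimal illustration under that circular-section assumption; the pixel-to-metre scale factor and the distance values are hypothetical.

    import numpy as np

    def area_function(diameters_px, metres_per_px=1e-3):
        # Treat each vertical palate-to-tongue distance as the diameter of a
        # circular tube section and return the cross-sectional areas (m^2).
        d = np.asarray(diameters_px, dtype=float) * metres_per_px
        return np.pi * (d / 2.0) ** 2

    # Hypothetical distances (in pixels) at the 30 control points.
    diameters = np.linspace(20, 35, 30)
    areas = area_function(diameters)   # one area value per control point
    print(areas[:3])

The resulting array of area values is what the synthesis engine consumes at each video frame.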
6.2.3 Experiments and results

As the experimental focus is primarily on how the users control a physical interface of the human vocal tract (i.e., the tongue-like structure), we eliminated the vocal sound generation as a measured variable.

6.2.3.1 Task Design

We design a set of six tasks to explore the controlling elements of the Sound Stream interface. The first three tasks are designed to measure the users' preferences while controlling multiple degrees of freedom to change the tongue shape and position simultaneously. Though the source (glottal) frequency and gain are not related to the tongue shape and position change, they play an important role in producing vocal sound with varying pitch and loudness. So, in order to provide the users with more control over the sound, we ask them to use both the sliders and a mouse, to understand their flexibility and convenience in using a two-handed control. The number of degrees of freedom involved increases across the tasks, to increase their difficulty level. First, the participants are asked to change only the tongue shape with the three sliders. Second, the participants are asked to change the shape and position by controlling all three sliders as well as the movable platform; and finally, they are instructed to change the tongue structure (shape and position) along with the acoustic parameter variation (gain and frequency). The last three tasks are designed to investigate how the Sound Stream interface facilitates the users in creating different tongue shapes with less effort. We pre-define three tongue structures based on their complexity (simple to high), and participants are asked to recreate these tongue shapes within a limited time so that the error can be measured.

After completing the tasks, the user fills out a Google questionnaire form. The questionnaire is designed to collect the demographic data of the participants and their feedback in a scaled format for quantitative analysis. Following this, an interview session is arranged to get more insight into the challenges that the users went through while interacting with the interfaces.

6.2.3.2 Research questions

We intend to answer the following research questions by performing a quantitative analysis of the data collected during the user study.

1. Will there be any significant difference in users' preference for three-DOF, five-DOF or ambidextrous (AMB), i.e., five-DOF plus gain and frequency control, in Sound Stream?

2. Will the users find controlling multiple parameters simultaneously in Sound Stream difficult?

3. Will there be any noticeable difference in the users' performance, i.e., error ratio, while creating a predefined tongue shape under a limited time?

6.2.4 Results

A total of 3 repeated measures ANOVA tests are used to analyze the data and test our hypotheses. The first two tests focus on the measurement of the controlling aspects of the proposed physical interface, while the last test measures the error ratio. We also compare the user feedback on our interface with that of Pink Trombone and VTDemo discussed in the previous section.
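As a rough illustration of this analysis (a sketch only; the column names and ratings below are hypothetical and not the study data), a one-way repeated measures ANOVA on per-participant preference ratings can be run with statsmodels:

    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    # Hypothetical long-format ratings: one row per participant x condition.
    data = pd.DataFrame({
        "participant": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "condition":   ["3DOF", "5DOF", "AMB"] * 3,
        "rating":      [4, 3, 3, 5, 4, 3, 4, 4, 4],
    })

    # Repeated measures ANOVA with `condition` as the within-subject factor.
    result = AnovaRM(data, depvar="rating", subject="participant",
                     within=["condition"]).fit()
    print(result)   # reports F, degrees of freedom and the p-value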
The first test is targeted at comparing whether the users are comfortable enough with ambidextrous controllers having multiple degrees of freedom. There is no significant difference in the average value of participants' preferences (M_3-DOF = 3.84, SD_3-DOF = 0.68; M_5-DOF = 3.53, SD_5-DOF = 1.19; M_AMB = 3.38, SD_AMB = 1.04). The variance analysis shows the following result: F(2,12) = 1.23 (Fcrit = 3.40), p = 0.31 (> 0.05), η² = 1.43.

The second test evaluates which interface provides better opportunities to control multiple degrees of freedom with ambidextrous controllers. We run a statistical analysis on the users' preferences, rated on a scale of 1 to 5 for each interface. The average values of the users' preference across the three interfaces are M_VTDemo = 2.23, M_SoundStream = 3.76 and M_PinkTrombone = 3.15. The variance analysis shows the following result: F(2,12) = 10.16 (Fcrit = 3.40), p = 0.0006 (< 0.05), η² = 15.58.

To address the third question, we compute the differences in curvature metrics and in the positions of the tongue tip, top and root points between the target images and the corresponding tongue shapes created by the users with all three interfaces, for each task level. The computed differences are averaged and taken as the error ratio across the three interfaces. After the statistical analysis of the error ratios for all three task levels (simple, medium, complex), we found the corresponding p-values to be greater than 0.05. The results are furnished in Table 6.1.

Table 6.1: Variance analysis of Hypothesis 3

Level of Complexity    η²       F       p
Simple                 0.038    0.029   0.86
Medium                 3.26     0.34    0.56
Complex                3.84     3.26    0.096

6.2.5 Interpretation of results

We did not find any statistically significant difference in user preferences for the simultaneous control of multiple degrees of freedom, i.e., while using the 3-DOF, 5-DOF or AMB control. Though many participants agreed that they are comfortable with a controller having five DOF, they would still like a better design of the controller that eases their actions while providing the same functionality.

To control the source frequency and gain, most participants preferred the mouse controller over the slider sensors (M_mouse > M_slider). We found a statistically significant difference (p < 0.05) in user preferences while controlling multiple degrees of freedom across the three interfaces.

We did not find any statistically significant difference for creating different tongue shapes using all three sliders. One possible reason could be the predefined structure of the vocal tract in the VTDemo and Pink Trombone interfaces, which is more intuitive for the users, as it provides a better mental model of which controlling parameters need to be changed to achieve the target. However, this would be very difficult to generalize if we increase the number of control points, as users might take more time to adjust each controller to achieve a given shape.

6.2.6 Qualitative analysis and design issues

Based on the user feedback, we summarize the key points as well as propose a few possible improvements to the design aspects of our interface.

The users found that Sound Stream gives a better physical interpretation of slider-based control. They mostly seem content with the 'shape control' side of the interface, producing various shapes with finger movements, which is evident from their comparative ratings shown in Fig 6.11. However, according to some, the sliders could be replaced by control elements like scroll wheels or elements fitting the hand or fingers, which would increase the convenience of control while serving the same function. They find position control of the tongue relatively more difficult.
This is because of friction arising from the paper base on which the spring element lies, as well as the friction between the movable control block and the control space. It is therefore necessary to reduce the friction in this prototype to help the platform glide more smoothly, which will increase the convenience of the proposed simultaneous control. Nevertheless, the interface received an overall satisfactory response, as shown in Fig 6.11, with respect to the simultaneous control of articulatory and acoustic parameters, which was one of the targets of this work.

Users also felt that the width of the movable control block could be decreased and that of the control space could be increased, to help them get a better grip and move it more freely. Based on their feedback, a promising idea is to replace the square-shaped movable block with a mouse-like structure having three scroll wheels, i.e., a mouse with its left and right click buttons replaced by scroll wheels. This would require help from manufacturing experts and demand a re-thinking of the control idea, as the new arrangement would need motors and actuators to vary the tongue movement based on the control commands.

The users were more or less satisfied with the relative positioning and arrangement of the control blocks. Some suggested that bringing the left-hand and right-hand controls onto the same horizontal axis could improve the convenience of control. While most users were satisfied with the mouse-based control, a few others preferred the slider-based control of gain and frequency. They argued that the mouse movement is highly sensitive, making it difficult to achieve the precise control needed for fine tuning of gain and frequency.

Figure 6.11: Comparative user ratings on: (a) the suitability of varying tongue shape, (b) the effectiveness of joint control of articulatory and acoustic parameters, (c) the availability of effective control actions

The availability of a five degree-of-freedom control arrangement is advantageous to the users, as evident from Fig 6.11, because it utilizes the functionality of the hand and fingers. However, we are not accustomed to performing such demanding tasks, requiring extensive simultaneous finger and elbow movements, in our daily lives. Thus, even though the users have the option of changing parameters simultaneously, they would like to stick to two or three controls at a time. To summarize, the users realize that this arrangement gives them an opportunity to control more parameters at the same time, but an increase in the number of controls also implies an increase in the training required to precisely control the tongue movement. Eventually, the applicability of these interfaces depends upon the user types and the context of use. For new users or for learning purposes, users would prefer fewer controls. However, for research purposes, they feel that the design nicely leverages the DOFs of the control of both hands.

The users also suggest improving the aesthetic aspects of the design so that it helps them create a better mental model, which in turn will improve the learnability and convenience of control. In this prototype, we used a simple image processing algorithm to detect the structures and compute the distances between them, in order to avoid computational complexity, and hence used black colour for the structures. The colour and texture of the background that represents the oral tract, the upper and lower boundaries representing the palate (hard and soft), and the lingual base could be changed to make it look more realistic.
Besides, drawing artificial lips, throat and other mid-sagittal structures would give the users a better understanding. In that case, a more robust image processing algorithm would need to be used, which would require more computational time.

6.2.7 Summary and Future Direction

In this work, we explored the effect of simultaneous multiple degree-of-freedom control of a tongue-like structure in articulatory speech synthesis. We assessed our interface with respect to Pink Trombone and VTDemo based on the proposed hypotheses. The users unanimously concluded that Sound Stream, being a physical interface, is much more intuitive and helps the user feel the interaction better than the other two. Many users also felt that Sound Stream is better in terms of the simultaneous controlling aspects, which lies at the core of our research question. However, there are still a few design issues in the prototype, such as the dimensions of the control blocks and the friction between the spring element and the articulatory space or between the movable platform and the cardboard base. We therefore conclude that our prototype needs a more robust design to make it more convenient and easy to use for practical control. Nevertheless, conceptually, it can be seen to perform the targeted tasks better than the available interfaces in terms of controllability, expressibility and learnability, and hence turns out to be the preferred interface, compared with Pink Trombone and VTDemo, for simultaneously controlling a multiple-DOF system like the vocal tract.

The possible areas of improvement are the sound synthesis engine and some aspects of convenience and controllability. Firstly, we plan to implement a Finite Difference Time Domain (FDTD) [198] based synthesis engine through the graphics pipeline to leverage the high computational capabilities of a GPU, as the numerical integration of a second-order partial differential equation is necessarily a computationally heavy problem (a minimal sketch of such a scheme is given at the end of this section). This makes the vocal sounds more natural-sounding and intelligible without trading off execution time. Secondly, we also want to enhance the convenience and control aspects, as some of the users faced particular difficulties in handling the Sound Stream interface. We will therefore revisit the design aspects and try to replace and rearrange the control elements based on the valuable suggestions received from the users. Furthermore, we would like to decrease the latency between the mechanical control of the interface and the synthesis engine.

The mechanical tongue arrangement will allow us to formulate the index of difficulty based on the shape and position change of the tongue in future. This will undoubtedly give us more insight regarding the difficulty level of articulatory tasks in a mid-sagittal 2D vocal tract space. This mechanical interface involves an independent 1D+1D+1D shape control coupled to a joint 2D position control and thereby gives us the opportunity to investigate the throughput of such a combined controller. Further investigation into ways of increasing the throughput would also lead to the improvement of mechanical articulatory speech interfaces.
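The FDTD plan mentioned above could, in its simplest form, resemble the following leapfrog update of the 1D wave equation on a uniform grid. This is only a sketch under simplifying assumptions (uniform tube, lossless propagation, ends held at zero pressure), not the planned GPU implementation.

    import numpy as np

    # 1D wave equation p_tt = c^2 * p_xx, explicit leapfrog FDTD update.
    c, dx = 343.0, 5e-3                  # sound speed (m/s), grid spacing (m)
    dt = 0.9 * dx / c                    # time step satisfying the CFL condition
    N, steps = 200, 1000
    lam2 = (c * dt / dx) ** 2

    p_prev = np.zeros(N)
    p = np.zeros(N)
    p[N // 2] = 1.0                      # initial pressure impulse mid-tube

    for _ in range(steps):
        p_next = np.zeros(N)
        # Interior update: p_next = 2p - p_prev + lam2 * (p[i+1] - 2p[i] + p[i-1]).
        p_next[1:-1] = (2 * p[1:-1] - p_prev[1:-1]
                        + lam2 * (p[2:] - 2 * p[1:-1] + p[:-2]))
        p_prev, p = p, p_next            # ends stay at p = 0 for simplicity

A GPU version would parallelize the interior update across grid points at each time step.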
6.3 ArtiSynth Tongue muscle control

Articulatory speech synthesis is of utmost importance for understanding the mechanism of human speech production. It encompasses the production of speech sounds using an artificial vocal tract model and the simulation of the movements of speech articulators such as the tongue, lips and velum. However, despite its considerable significance for research and learning purposes, there is a dearth of intuitive user interfaces to effectively control articulatory parameters based on the simultaneous variation of speech articulators. This is because of the complexity of the vocal tract articulators that participate in the speech production process. The vocal tract comprises several organs, carefully controlled by muscles. One of the key principles involved in articulatory synthesis lies in the simultaneous activation of these muscles to perform a multidimensional control of various parts of the vocal tract. Such movement occurs in an extremely interdependent manner, due to the intermingling of the muscles.

The prevalent user interfaces targeting such movements, like Pink Trombone and VTDemo, utilize simple mouse-based kinematic control of a mid-sagittally sliced tongue, lips, hard palate and velum. These controllers allow the user to manipulate individual parts of the tract, one at a time, to synthesize vocal sounds. Furthermore, these changes occur along predefined trajectories, which are less intuitive and difficult to relate to the slider changes triggered by the user. There is a lack of user flexibility, since a user can achieve only one particular shape among a number of predefined tongue shapes corresponding to changes in the slider values. Furthermore, it essentially enables the user to explore the effect of only one articulatory parameter, or the shape/deformation of one part of the tongue, on the production of vocal sound. The other parts of the same articulator, or the other articulators, are assumed to be fixed. Therefore, this kind of control becomes highly unrealistic when compared to the actual articulatory speech production process. In particular, the tongue is a highly deformable, muscular hydrostat organ with effectively infinite degrees of freedom, equipped with eleven (extrinsic and intrinsic) muscles controlling its shape and position. Kinematic control of a handful of points on the tongue surface ignores the practical biomechanical constraints behind speech. Hence, more research needs to be directed towards user interfaces facilitating the control and manipulation of the tract contour, including the tongue. Besides, most of the interfaces allow merely independent control of the various parts, which means that control of one part of an articulator does not reflect any changes in the other parts, nor does it provide any feedback to the user about the variations in other interrelated parts. In reality, however, our muscles and articulators are intimately interleaved and subject to biomechanical constraints, because of which movement in one part of an articulator renders changes in other parts as well. To this end, we develop our SOUND STREAM II interface, a hand-manipulated, force-based, realistic tongue-control strategy for sound production.

6.3.1 Proposed methodology

We present an interface involving four degrees-of-freedom (DOF) mechanical control of a two-dimensional, mid-sagittal artificial tongue structure through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS, towards articulatory sound synthesis. The overview of the proposed interface is shown in Fig. 6.12. As a demonstration of the project, the user learns to produce a range of JASS vocal sounds by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four force-based sensors.
In other words, the user can physically play with these four sensors, thereby virtually controlling the magnitude of four selected muscle excitations of the tongue to vary the articulatory structure. This variation is computed in terms of 'area functions' in the ArtiSynth environment and communicated to the JASS-based audio synthesizer, coupled with a two-mass glottal excitation model, to complete this end-to-end gesture-to-sound mapping.

Our hardware interface consists of four mini-joystick force sensors mounted on a fixed platform. These sensors are potentiometer-based force-sensitive resistors that measure the applied force. The finger pressure exerted on each of the joysticks results in a change of resistance connected as part of each voltage divider. Consequently, the analog input of an Arduino microcontroller measures the output voltages, which are then translated to the tongue muscle excitations. The software interface consists of communication protocols between Arduino, ArtiSynth and JASS.

Figure 6.12: The proposed SOUND STREAM II hand gesture-to-sound control pathway

6.3.2 Detailed Mechanism

The proposed real-time gesture-controlled sound synthesizer, operating through a biomechanically-driven articulatory pathway, has three main phases, as discussed below.

6.3.2.1 Gesture-to-muscle activations

The first step is force-activated tongue muscle control, where we essentially replace the high-dimensional neural control of muscles by low-dimensional hand gesture-based tongue muscle manipulation. Here, we follow a simple force-to-muscle mapping strategy, where the tongue muscle activation, ranging from 0 to 1, varies proportionally with the force exerted by the fingers (a minimal sketch of this mapping is given at the end of Section 6.3.2).

Figure 6.13: ArtiSynth Tongue Control (showing the extrinsic and intrinsic muscles, the beams computing the cross-sectional area function, and the one-hand/other-hand control assignment)

6.3.2.2 Muscle-to-movement

As shown in Fig. 6.13, we select two intrinsic (inferior and superior longitudinal) and two extrinsic (anterior and posterior genioglossus) muscle groups to be controlled by the ambidextrous hand gestures. The longitudinal muscles are responsible for tongue retraction, making it short and thick. The genioglossus, on the other hand, plays a major role in tongue protrusion and in moving the tongue tip back and down. Variation of these muscle group excitations therefore has a significant effect on tongue shape and position. The established forward biomechanical pathway in ArtiSynth allows the conversion of muscle excitations to the resultant movements. We utilized this to obtain tongue shape and position changes from the real-time variation of the selected muscle activations. Next, we constructed a series of beams around the tongue, with 22 fixed markers set at regular intervals along the vocal tract surface. We further computed the distance between the tongue surface (varying with the muscle activation changes) and these markers, and derived the effective cross-sectional area function to feed into the articulatory audio synthesizer.

6.3.2.3 Movement-to-Sound output

The array of area functional values is sent in real time to the Java Audio Synthesis System (JASS), which considers the vocal tract as an acoustic tube whose shape changes according to the area functions, as shown in Fig 6.12. The glottal excitation pulse was generated according to Rosenberg's model and coupled to discretized acoustic equations in the vocal tract. The acoustic wave propagation was simulated by numerically integrating the linearized 1D Navier-Stokes pressure-velocity PDE in time and space on a non-uniform grid. The synthesis mechanism involved excitations acting as a source placed in the tube and sound propagation being simulated by approximating the pressure-velocity wave equations.
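As a rough illustration of the gesture-to-activation step in Section 6.3.2.1, the sketch below linearly rescales raw 10-bit ADC readings from the four force sensors into muscle activations in [0, 1]; the resting and full-press readings used here are hypothetical calibration values, and the serial/ArtiSynth plumbing is omitted.

    def force_to_activation(adc_value, rest=60, full=900):
        # Map a raw 10-bit ADC reading (0-1023) to a muscle activation in [0, 1],
        # proportional to the applied force between a resting and a full-press level.
        a = (adc_value - rest) / float(full - rest)
        return min(max(a, 0.0), 1.0)

    # Hypothetical readings for the four controlled muscle groups
    # (anterior/posterior genioglossus, superior/inferior longitudinal).
    readings = [120, 640, 60, 910]
    activations = [force_to_activation(v) for v in readings]
    print(activations)   # approximately [0.07, 0.69, 0.0, 1.0]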
6.3.3 Summary

In this work, we explored a low-dimensional subspace of the high-dimensional neuro-muscular control of the tongue muscles, towards articulatory vocal sound synthesis. Using this interface, the user can use his/her fingers to manipulate the muscle activations, achieving real-time changes in tongue shape and position, resulting in simultaneous variation of the vocal sound. This work therefore offers an alternative pathway to the conventional kinematic approach of controlling vocal tract movements. A qualitative pilot study on the proposed interface revealed that, although inexperienced users find it somewhat difficult to achieve target tongue movements quickly, they agree that this interface provides them with more variability of input and a more intuitive understanding of human voice synthesis. Hence, it can be concluded that the proposed force-activated vocal sound controller is a step towards natural articulatory speech production and control.

6.4 Mapping Tongue Motion to Speech

To enable the future extension of our current work (on the hand-to-formant task) to the vocal tract space, we need to establish a successful mapping between vocal tract configurations and acoustic signals. For this, we perform three different image-based investigations. In Subsection 6.4.1, we start with the recognition, or classification, of static vowels, consonants and vowel-consonant-vowel transitions from MRI data. Based on the insights from that study, we perform continuous synthesis of vowels from ultrasound data as the next step in Subsection 6.4.2. Finally, in Subsection 6.4.3, we develop an invertible mapping between vocal tract configurations and the acoustic space. All these investigations are briefly discussed below.

6.4.1 MRI-based speech recognition

Vocal tract configurations play a vital role in generating distinguishable speech sounds, by modulating the airflow and creating different resonant cavities during speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, we employ an effective video action recognition technique, the Long-term Recurrent Convolutional Network (LRCN) model, to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and for various classification tasks is discussed.
Interestingly, the results show a marked difference in model performance for speech classification compared with generic sequence or video classification tasks.

6.4.1.1 Illustration of the Proposed Methodology

Among the most widely used algorithms for video-based action recognition, 3D Convolutional Networks [176], Two-stream Convolutional Networks [157] and Long-term Recurrent Convolutional Networks (LRCN) [38] deserve special mention in our context. This class of algorithms incorporates spatial and temporal feature extraction steps to capture complementary information from the individual still frames as well as between the frames, which is aligned with our primary goal in this work. However, the two-stream architecture involves a CNN trained on multi-frame dense optical flow images for the temporal recognition stream. Since the dataset we are using [160] has a significant amount of non-stationary noise and artifacts, such an algorithm places more emphasis on the noise than on the vocal tract shape change, which in turn leads to erroneous sound classification. On the other hand, though 3D ConvNets run faster on our dataset, their temporal pooling layers are too weak to extract long-term temporal features spanning a duration of 3 seconds. The end-to-end trainable LRCN networks combine a CNN-based deep visual feature extractor with an LSTM-based long-term temporal dynamics extractor and are shown to have sufficient recurrence of the latent variables over the temporal domain. Thus, we utilize the LRCN model to investigate whether the speech recognition problem can be viewed similarly to a video action classification/recognition task, with the vocal tract movement resembling the action and the VCV sequences as the final mapped output interpreted from the action. Since the performance of the other two models is not significant enough in the current context, we confine our discussion to LRCN.

Figure 6.14: Overview of the LRCN model

Figure 6.15: Frames of rtMRI videos for speaker F1 producing [asa]. Time progresses from left to right.

We consider VCV identification from MRI as a 'sequential input to fixed output' problem, with videos of arbitrary length T as input and the prediction of a single label corresponding to each video as the output. The video is split into T individual frames, which are fed to T convolutional network layers, each implying a complete ResNet architecture, and then connected through two fully connected layers of 2048 and 1024 neurons (each with 0.5 drop-out) to an N-layered LSTM with M hidden nodes. Next, the LSTM predicts the probabilities of the speech output classes for each of the time frames, and these are averaged to yield the final class probability score across the entire sequence, as demonstrated in Fig 6.14.
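A minimal PyTorch sketch of this LRCN pipeline is shown below, assuming a ResNet50 backbone, a single-layer LSTM with 256 hidden nodes, and per-frame class probabilities averaged over time. The exact interpretation of the two fully connected layers (2048 and 1024 units) and the handling of grayscale frames are our assumptions, not the thesis's exact implementation.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class LRCN(nn.Module):
        # Frame-wise ResNet50 features -> two FC layers -> LSTM -> per-frame
        # class probabilities averaged over time to score the whole clip.
        def __init__(self, num_classes=51, hidden=256):
            super().__init__()
            backbone = models.resnet50(pretrained=True)   # ImageNet-initialized
            backbone.fc = nn.Identity()                   # 2048-d frame features
            self.cnn = backbone
            self.fc = nn.Sequential(
                nn.Linear(2048, 2048), nn.ReLU(), nn.Dropout(0.5),
                nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.5),
            )
            self.lstm = nn.LSTM(1024, hidden, num_layers=1, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, video):                 # video: (B, T, 3, H, W);
            b, t = video.shape[:2]                # grayscale frames replicated to 3 channels
            feats = self.cnn(video.flatten(0, 1)) # (B*T, 2048)
            feats = self.fc(feats).view(b, t, -1) # (B, T, 1024)
            out, _ = self.lstm(feats)             # (B, T, hidden)
            probs = torch.softmax(self.head(out), dim=-1)
            return probs.mean(dim=1)              # averaged over time -> (B, num_classes)

For training with categorical cross-entropy one would typically keep the per-clip logits rather than averaging probabilities, but the averaging above matches the scoring described in the text.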
6.4.1.2 Experiments and Results

1. Dataset preparation

We evaluate our architecture on the USC Speech and Vocal Tract Morphology MRI Database [160], which includes 2D real-time MRI of vocal tract shaping of 17 speakers (9 female and 8 male) along with simultaneous denoised audio recordings. This database provides the resources to relate rtMRI data capturing dynamic vocal tract shapes and speech variability, thereby supporting our attempt to explore the forward mapping pathway from imaging data. The imaging sequence has a frame rate of 23.18 frames/second, a slice thickness of 5 mm, a spatial resolution of 2.9 mm²/pixel and a field of view of 200 mm × 200 mm. Further details regarding the database are available in [124].

The database contains 3 repetitions per speaker for each of 51 VCV utterances (apa, upu, ipi, ata, utu, iti, aka, uku, iki, aba, ubu, ibi, ada, udu, idi, aga, ugu, igi, aTa, ithi, uthu, asa, usu, isi, aSa, uSu, iSi, ama, umu, imi, ana, unu, ini, ala, ulu, ili, afa, ufu, ifi, a Ïa, u Ïu, i Ïi, aha, uhu, ihi, awa, uwu, iwi, aja, uju, iji). We pre-segment these into a total of 2754 videos, each containing the entire length of a single VCV utterance. We further divide the available data into a training dataset of 2268 videos from 14 speakers and a testing dataset of the remaining 486 videos. The testing dataset, containing videos corresponding to 3 speakers with 3 repetitions each, was entirely unseen during the training phase. Eight frames of a sample test video, where female speaker F1 produces the VCV token [asa], are shown in Fig 6.15.

2. Training

Sixteen image frames are extracted from each video with a stride of 3, excluding the silent frames. The target is to classify the videos into one of the 51 VCV labels. For the CNN part, we utilize the 50-layer Residual Network (ResNet50) pre-trained on the ILSVRC-2012 dataset [141]. This speeds up the training process through the optimized weights and further restricts overfitting on the current small dataset. We vary the parameters one by one, keeping all others fixed, and simultaneously conduct evaluations for various combinations to select the set of optimum values. The batch size is varied from 32 to 128 and set to the optimum value of 64. Similarly, the number of steps per epoch for training is varied from 100 to 1000 and fixed at 500, which gives better accuracy. The ranges of values tried for the layers and nodes of the LSTM network are 1 to 5 and 32 to 512 respectively; the respective optimal values are 1 and 256. The total number of epochs is kept at 140. The drop-out value for the LSTM network is varied from 0.5 to 0.97 and finally fixed at 0.9. The performance decreases considerably when any output activation function other than softmax is used. With the optimal set of parameters fixed, the number of target classes was changed from 51 to 3 and 17 for individual vowels and consonants, respectively, and the target class labels were modified accordingly.

3. Performance Evaluation

We conducted evaluations for various combinations of parameters and classes (V, C and VCV). In the objective evaluation, the loss measure was given by categorical cross-entropy, and metrics such as Top-1, Top-5 and Top-10 categorical accuracy were employed to analyze the performance.

From Table 6.2, it can be observed that the mapping accuracy markedly improves from VCV to vowels. The algorithm achieves its highest performance of 0.96 when we classify the data into 3 vowel classes, as shown in Fig 6.16. The performance unexpectedly drops to 0.68 in terms of Top-1 accuracy for the 17 consonant identifications. A careful analysis of the predicted classes shows that the videos are wrongly classified because of their spatial characteristics rather than their temporal features. In other words, the model tends to shift towards speaker-based classification rather than the targeted consonant-based identification.

Figure 6.16: The Top-1 accuracy for vowel identification
This is because the number of speakers is also 17 and hence, in this case, the ResNet architecture dominates the trained LSTM layer, which makes this identification task much more difficult than vowel identification.

The Top-1 classification accuracy for test VCV sequences is 0.42, as shown in Fig 6.17, while the Top-5 and Top-10 accuracies reach 0.76 and 0.93 respectively. Though LRCN has demonstrated a Top-1 accuracy of 0.8 for video action recognition, it fails to maintain that accuracy for the current sound identification task. The degradation of accuracy may be due to the excessive increase in the number of controllable parameters as the number of classes increases, which would require more training data for estimation.

Table 6.2: Speech identification performance

Identification task    Top-1 accuracy
Vowel                  0.96
Consonant              0.68
VCV                    0.42

To investigate this issue further, we divided the dataset into 3 parts with 17 target classes, to avoid ambiguity among tokens involving similar articulator movements. As a consequence, the results increased by a considerable margin of 0.12 and the maximum accuracy reached around 0.55, which indicates that similar articulatory movements being mapped to different sounds is a key issue that the LRCN algorithm is unable to resolve.

6.4.1.3 Discussion and Limitation

Automatic speech recognition from the movements and deformations of the vocal tract articulators, as visualized by an imaging modality, is an incredibly challenging task. The first issue associated with this is the inter-speaker anatomical differences in the vocal tract. This is not a considerable problem for EMA, as EMA recordings are more concerned with tracking the time-varying trajectories of selected points on the tongue and palate. However, while extracting features from imaging modalities, these anatomical differences restrict a particular model from generalizing over structural features like the shapes and lengths of the articulators, which play a secondary role in speech identification. Hence, it is difficult to identify the articulator positions and contours due to this inter-subject variability. This makes conventional methods like Hidden Markov Models [70], Maximum Likelihood Mapping Models [72] and Gaussian Mixture Models [173] quite ineffective in extracting appropriate sequential, structural features from the imaging dataset.

Secondly, the subtle and rapid changes of the tract physiology make the speech recognition task more difficult. This is conceptually different from conventional action recognition tasks, where the target is to classify the video action space based on widely varying user movements. In our case, a speaker-specific frame-by-frame analysis of the image sequences corresponding to various VCV tokens demonstrates nearly similar tongue motion and vocal tract contour variation. This makes it difficult to detect significant changes in articulator movements for the consonant part of the VCV utterances. Also, action recognition tasks involve an underlying spatial object detection task to interpret the actions from the video clips through spatial networks. However, such networks [38, 157, 176] fail to extract meaningful features from rtMRI frames to map them to distinct articulatory movements towards speech identification.

Figure 6.17: The Top-1 accuracy for VCV identification

Thirdly, more than two-thirds of the total duration of each video clip is occupied by uttering the two 'V's, and the remaining time is occupied by the intermediate 'C'.
So there are more frames corresponding to the vowels than to the consonants, which means more feature extraction from those frames and a heavier emphasis on the vowels in the classification task. Besides, the training data corresponding to each vowel class is larger than that corresponding to each consonant or each transition. The availability of training data is seen to have an enormous influence on the accuracy of these methods. Thus, while the algorithm shows great potential for vowel identification from the tokens, the accuracy is reduced for consonants and decreases further for the entire VCV identifications.

Fourthly, rtMRI images have low spatial and temporal resolution and are infested with various noises and reconstruction artifacts [100]. This work mostly focuses on addressing the speech identification issue from the raw images and hence, we did not incorporate any preprocessing module. However, it might be interesting to investigate how noise and artifact reduction or resolution enhancement affects the performance accuracy.

Lastly, we observed that the acoustic pitch and energy level of the audio signals exhibit significant variations for several VCV tokens, which could be utilized to achieve better classification. One way to incorporate this is to associate the target class labels with such acoustic variables and take advantage of a multi-variate loss function towards more accurate prediction scores.

6.4.1.4 Summary and Future Directions

A promising deep learning-based video action classification technique, the Long-term Recurrent Convolutional Network (LRCN), has been trained and tested on 2754 videos with 51 VCV tokens. Results showed that the targeted articulatory-to-acoustic mapping was approximated satisfactorily for vowels and acceptably for consonants. However, the classification performance demonstrated a considerable decrease in accuracy when identifying the entire VCV transitions, which implies that the model alone is not sufficient to distinguish these transitional tokens.

Along with the structural vocal tract model, it might be interesting to explore whether augmentation from a vocal tract flow model [198] can assist the algorithm in differentiating between consonants that involve similar articulator movements. In future, we will investigate this issue as well as address whether the performance can be increased by utilizing phonological or biomechanical constraints [58, 163, 186].

6.4.2 Ultrasound-to-formant estimation

The previous work discusses the classification of vocal tract configurations into categories of discrete vowels, consonants and vowel-consonant-vowel transitions. This is one way of achieving an articulatory-to-acoustic mapping. However, we found that the classification accuracy decreases considerably for vowel-consonant-vowel transitions. Therefore, it is evident that with increasing vocabulary and sequence length, it is not feasible to use classification techniques for establishing a practical large-scale articulatory-to-acoustic mapping. Instead of recognition techniques, we therefore turn towards exploring the synthesis pathway, discussed below.

Investigation of speech motor control requires access to the speech articulators via position sensors, signal acquisition devices like EMA, or medical imaging modalities like MRI, US and CT. This enables a better understanding of the connection between the articulatory movements and the resultant acoustic variations, which in turn allows us to better explore the information-theoretic view of speech motor control.
To this end, we address the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images. The overview of the proposed method is presented in Fig 6.18. Our approach targets automatically extracting tongue movement information by selecting an optimal feature set from US images and mapping these features to the acoustic space. We use a novel deep learning architecture, which we call the Ultrasound2Formant (U2F) Net, to map US tongue images captured by a probe placed beneath the subject's chin to formants. It uses hybrid spatio-temporal 3D convolutions followed by feature shuffling for the estimation and tracking of vowel formants from US images. The formant values are then utilized to synthesize continuous time-varying vowel trajectories via the Klatt synthesizer. Our best model achieves an R-squared (R²) measure of 99.96% on the regression task. Our network lays a foundation for articulatory motor control, as it successfully tracks the tongue contour automatically as an internal representation without any explicit annotation.

6.4.2.1 Background and related works

1. Previous works

The research in this field was initiated with an attempt [33] to synthesize speech based on 12 GSM vocoder parameters estimated from tongue contour points using multilayer perceptrons (MLPs). The input was later modified and combined with lip coordinates and mapped to 12 line spectral frequencies (LSF) using similar MLP networks [34]. The input space was further altered in [75] to additionally include Eigentongue features. Another related work [25] utilized different combinations of feature representations, including eigen-tongue and correlation-based features, to map to 13 MGC-LSP features using a 5-layer DNN. In a follow-up work [61], the authors replaced the hand-engineered features by an autoencoder, whose bottleneck features were then fed to the DNN layers for mapping to the MGC-LSP. In order to take advantage of a shared representation, a multi-task DNN was further employed in [174] to simultaneously classify phone states and synthesize spectral parameters. The latest work in this direction has been the application of CNN-LSTMs to US images denoised via convolutional autoencoders, intended to predict 24th-order MGC-LSP coefficients for synthesizing speech in Hungarian [80]. However, the vowel transitions in the synthesized speech differ considerably from the desired speech, leading to poor performance of the synthesized version. A promising way of achieving closer vowel trajectories is to explore the formant frequency space. The connection of the articulatory space with the 2D formant space is particularly important to explore, because the formant representation is a very powerful acoustic encoding that efficiently describes the essential aspects of speech using limited parameters. It also provides substantial insight into continuous vowel trajectories, the most dynamic part of speech production, which can be utilized for better control of speech in SSIs. To the best of our knowledge, this is the first investigation of deep learning-based ultrasound-to-formant mapping for speech synthesis.

2. Formant estimation and tracking

Estimation and tracking of formant frequencies is one of the fundamental problems in speech processing [131]. This involves determining the formant frequencies corresponding to a stationary speech segment and tracking them throughout the signal.
Since these formant frequencies are the direct result of resonances brought about by the tongue movement, the problem largely boils down to identifying and tracking the tongue contour. This is, to some extent, analogous to applications like video object tracking and action recognition, which utilize different spatio-temporal feature encoding schemes [14, 105, 112, 177]. However, there are numerous additional challenges in ultrasound-based tongue localization and tracking. For example, we cannot use the pre-trained networks popular in video processing as backbones for our application. Besides, the ultrasound images are grayscale, contain less information, possess low spatial resolution and are infested with noise and artifacts.

Figure 6.18: Overview of proposed Ultra2Speech. The arrows indicate the data flow.

3. Challenges in tongue tracking

The tongue is a muscular hydrostat with no conventional skeletal support, which results in its remarkably diverse and complex movements [163]. Having multiple degrees of freedom, different parts of the tongue can move simultaneously in different directions. As such, each tiny movement or shape change of the tongue results in corresponding changes of the vocal tract resonances, which in turn change the formant values at that time instant. However, there is a dearth of tongue contour annotations and a lack of fully automated, generalizable contour extraction methods, which makes the U2S task much more challenging. As a result, it is crucial for a successful U2S mapping algorithm to be able to automatically track the tongue contour as a hidden representation, in order to understand the variation of formants from ultrasound.

6.4.2.2 Proposed Ultra2Speech (U2S) model

We face three fundamental challenges in US-based formant estimation and tracking: (1) extracting relevant spatial information for accurate tongue contour detection; (2) encoding temporal information for understanding the dynamics of tongue movement; and (3) reaching the desired mapping between the extracted spatio-temporal features and the formant trajectories. In this section, we first introduce our Ultrasound2Formant (U2F) Net, aimed at tackling these challenges using different kernels of 3D CNNs, as illustrated in Fig 6.19.

• First hidden layer: The input video is first convolved with a set of pointwise convolutional filters with a kernel of size 1 × 1 × 1. Such filters are known to reduce the computational complexity before the expensive 3 × 3 × 3 operations. Besides, they also extract an efficient low-dimensional embedding and apply extra non-linear activations that help the network model complex functions.

• Hybrid convolutional layer: The output channels of the pointwise convolutions are split into three groups, one for intra-frame spatial feature extraction, another for cross-frame temporal modeling and the third for joint spatio-temporal encoding (a sketch of this block appears after this list). The spatial branch is composed of 2D CNN kernels of size 1 × 3 × 3; the temporal branch is composed of 1D CNN time-kernels of size 3 × 1 × 1; and the joint spatio-temporal branch is composed of 3D CNN kernels of size 3 × 3 × 3. In this way, we constrain some feature channels to focus more on static spatial features, while others focus on dynamic motion representation and the remaining ones encode joint information.
Factorizing part of the standard 3D convolution kernel [105] into orthogonal parallel components reduces the number of parameters, thereby making the network easier to train. Besides, the separation of orthogonal features also contributes towards better optimization of the loss function, as reflected in the performance later. This partial decoupling of the spatial and temporal kernels of the 3D CNN makes it both effective in performance and efficient in computation.

• Feature shuffling, grouped convolution and fully connected layer: The outputs from the three branches are concatenated together and shuffled in three groups. Consider the concatenated features as 3 groups, each having N channels. For shuffling, the output channel dimension is first reshaped into (3, N), then transposed and flattened back to its previous shape before being fed to the next layer. This shuffling facilitates cross-group information exchange and strengthens the spatio-temporal encoding within a computational budget, as shown in a different context in [201]. The feature representation is further compressed by passing it through a grouped convolutional layer of kernel size 1 × 1 × 1. The output features are finally flattened and connected to two parallel sets of 30 output nodes through task-specific fully-connected layers. This joint learning paradigm, aimed at estimating two sets of formant frequencies using the same network, creates a shared representation beneficial for the model.

Figure 6.19: Architecture of the proposed Ultra2Formant (U2F) Net

• Formant2speech block: We utilize the Klatt synthesis software, which accepts the U2F outputs (formants) as its input parameters and generates the speech output as shown in Fig 6.18. Due to space constraints, we refer the readers to [88] for further details on the joint parallel-cascaded formant-to-speech synthesis methodology.
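A minimal PyTorch sketch of the hybrid layer and the channel shuffling described above is given below. The channel counts are placeholders rather than the exact values in Fig 6.19, and batch normalization, ReLU and pooling are omitted for brevity; the kernel sizes and paddings follow the description in the text.

    import torch
    import torch.nn as nn

    def shuffle_channels(x, groups=3):
        # x: (B, C, T, H, W) with C divisible by `groups`. Reshape the channel
        # dimension to (groups, C // groups), transpose and flatten back so that
        # channels from the spatial, temporal and joint branches interleave.
        b, c, t, h, w = x.shape
        x = x.view(b, groups, c // groups, t, h, w)
        x = x.transpose(1, 2).contiguous()
        return x.view(b, c, t, h, w)

    class HybridBlock(nn.Module):
        # Splits the pointwise-layer output into three equal channel groups and
        # applies 2D spatial (1x3x3), 1D temporal (3x1x1) and joint 3D (3x3x3)
        # convolutions in parallel, then concatenates and shuffles the outputs.
        def __init__(self, in_ch=96, out_ch=96):
            super().__init__()
            g_in, g_out = in_ch // 3, out_ch // 3
            self.spatial = nn.Conv3d(g_in, g_out, (1, 3, 3), padding=(0, 1, 1))
            self.temporal = nn.Conv3d(g_in, g_out, (3, 1, 1), padding=(1, 0, 0))
            self.joint = nn.Conv3d(g_in, g_out, (3, 3, 3), padding=(1, 1, 1))

        def forward(self, x):                      # x: (B, C, T, H, W)
            xs, xt, xj = torch.chunk(x, 3, dim=1)  # split into three groups
            y = torch.cat([self.spatial(xs), self.temporal(xt), self.joint(xj)], dim=1)
            return shuffle_channels(y, groups=3)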
6.4.2.3 Experiment and results

• Data acquisition: We collected mid-sagittal US videos of a single male participant over a number of sessions. Throughout the data collection procedure, the participant was seated with his head stabilized against a headset and was asked to make continuous open vocal tract sounds with intervals in between. For imaging the tongue, the ultrasound transducer was placed beneath the chin. The imaging was done using an ALOKA SSD-5000 ultrasound system at 30 fps, with a 180-degree, 9 mm radius UST-9118 3.5 MHz convex ultrasound probe. Mono-channel audio recording was done simultaneously using the Praat software, a Sennheiser MKH 416 P48 shotgun microphone and a Focusrite Scarlett 2i2 preamplifier. In order to align the audio soundtrack and the video recording, the participant was asked to produce sounds that involve a sudden salient acoustic change and a quick noticeable tongue movement, such as /ga/.

• Audio-visual alignment, image extraction and pre-processing: Unlike the AAA system [195], the device used in this study does not align audio and video recordings automatically. Since the audio recording started prior to the video recording, there was a time lag between the two recordings which had to be calculated. For this, the frame of the release of /ga/ in the ultrasound imaging data and the timestamp of the same event in the audio recording were identified and used for synchronization. Vowel sequences were identified and segmented from the audio recordings, and acoustic landmarks were prepared. Further, the ultrasound video recordings were converted to image sequences at 30 fps using QuickTime 7 Pro, and the frames corresponding to each vowel sequence were extracted considering the times of the acoustic landmarks and the lag between the audio and video recordings. The frames, of spatial resolution 480 × 640, were cropped using a bounding box of 200 × 330 that contained the tongue for the entire image sequence and were further down-sampled to 50 × 82. We also converted the images to grayscale and normalized the pixel intensities to [0, 1]. We chose a time window of 30 frames, resulting in a total of 13,082 videos of duration 1 second each. This time window was chosen as a trade-off between the dynamic information to be imparted to the network and the computational time required for training.

• Formant extraction from recorded speech: We applied a Hamming window to the frame-blocked acoustic signal of 1470 samples per frame. Following the traditional approach, we then employed a 1D filter with transfer function 1/(1 + 0.63z⁻¹) and computed the linear predictive coefficients of the filtered signal using the autocorrelation method. Furthermore, we computed the roots of the predictor polynomial in order to locate the peaks in the spectra of the LPC filters. Only positive frequencies up to half of the sampling frequency were considered and sorted in ascending order. In this study, we explored the first two formant frequencies, the most dominant parameters for speech trajectories (see the sketch after this list for a rough outline of this procedure).

• Implementation details and performance analysis: The final model consists of 4 hidden layers, 3 of which are convolutional, with 48, 96 and 32 filters respectively, and the last of which is a fully-connected layer of 5760 nodes, connected in parallel to the output nodes as illustrated in Fig 6.19. Each 3D convolutional layer has a stride of 1 and is followed by 3D Batch Normalization [78], ReLU activation [26] and 3D max-pooling with a stride of 2. We use padding of (0,1,1), (1,0,0) and (1,1,1) for the spatial, temporal and spatio-temporal convolutions of the hybrid layer, respectively.

• Training and evaluation: Our U2S model was implemented in PyTorch. We randomly shuffled and partitioned the data (13,082 videos) into train (80%), development (10%) and test (10%) sets. The network was trained with a batch size of 10 on an NVIDIA GeForce GTX 1080 Ti GPU. A mean absolute error (MAE) loss function was optimized using Adam with a learning rate of 0.001 for a total of 100 epochs. All the parameters were randomly initialized. In order to mitigate overfitting, we used Batch Normalization after every convolutional layer and before applying the non-linearity. The architectural parameters and hyperparameters shown in Fig 6.19 were selected through an exhaustive grid search. Since the results were computed as a sequence of f1 and f2 values (30 each), as shown in Fig 6.20 (a), we used Mean Absolute Error (MAE) and Mean R-squared (R²) as metrics for quantifying the regression performance.
• Results: We showcase our results on a randomly chosen sample in Fig 6.20, which demonstrates that there is almost no visible distinction between the target and the predicted acoustic signal. We also show the visual explanation behind the estimations made by U2F in Fig 6.21, in the form of saliency maps corresponding to the last CNN layer. This, surprisingly, reveals its striking ability to accurately represent the tongue contour internally. Table 6.3 presents the quantitative results and comparisons, contrasting the joint f1-f2 prediction task with the individual formant prediction tasks. The joint configuration consistently achieves better performance by taking advantage of a shared representation, despite having fewer parameters. Our baseline method is the Conv-BiLSTM presented as the state-of-the-art approach in [80]. The network in the fourth row of Table 6.3 has exactly the same architecture as U2F, except that the hybrid CNN block is replaced by a regular CNN block. The results show that the proposed U2F model outperforms the CNN-RNN baselines as well as the standard 3D CNN models, even with a smaller number of parameters.

Figure 6.20: (a) Time-varying formants (red indicates target and blue indicates predicted trajectories), (b) Original speech signal, (c) Synthesized speech signal

Figure 6.21: Saliency maps from U2F showing internal tongue contour localization

Table 6.3: Performance comparison with baseline methods (MAE and Mean R² for f1, f2 and joint f1-f2 prediction)

Method                         f1 MAE   f1 R²    f2 MAE   f2 R²    f1-f2 MAE   f1-f2 R²
CLSTM (2-layers) + 2-FCN       .0419    86.36    .0423    85.34    .0444       86.45
CBiLSTM (2-layers) + 2-FCN     .0352    89.40    .0380    89.12    .0293       90.01
3D CNN (2-layers) + 1-FCN      .0233    96.79    .0242    96.22    .0069       98.87
3D CNN (4-layers) + 1-FCN      .0204    98.40    .0174    98.13    .0118       98.78
Our U2F (with Hybrid Conv)     .0097    99.80    .0092    99.76    .0052       99.96

• Ablation study: We conduct several ablation experiments on our dataset to analyze the contribution of the different modules of the U2F Net. Here, we report (in Table 6.4) three primary variants of our model obtained by dropping the spatial layer, the temporal layer, and the channel shuffling. We can see that the network performance decreases by 1.88% and 1.44% in the absence of the individual spatial and temporal encodings, respectively. This shows that the hybrid block is a significant part of U2F, capturing contrasting features to jointly learn the localization and tracking of the tongue contour better. Similarly, the removal of the shuffling block leads to an approximate decrease of the Mean R² by 0.84%. This is because channel shuffling mixes the independent as well as shared encodings and thereby enriches the input feature space for the last grouped CNN layer. All the ablation studies provide evidence in favour of our original model design.

Table 6.4: Ablation experiments - Removal of spatial, temporal and shuffling blocks (MAE and Mean R²)

Method                      f1 MAE   f1 R²    f2 MAE   f2 R²    f1-f2 MAE   f1-f2 R²
U2F w/o spatial kernels     .0178    97.78    .0267    97.69    .0113       98.08
U2F w/o temporal kernels    .0188    98.01    .0180    98.30    .0161       98.52
U2F w/o shuffling block     .0103    98.84    .0097    99.09    .0087       99.12
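As a rough outline of the formant extraction used for the ground-truth labels (a sketch only: the sampling rate and LPC order are illustrative assumptions, and librosa.lpc uses Burg's method rather than the autocorrelation method described above), the classic LPC-root approach can be written as:

    import numpy as np
    import librosa

    def lpc_formants(frame, sr=44100, order=12):
        # Apply the 1/(1 + 0.63 z^-1) filter as an IIR recursion, window the
        # frame, fit LPC coefficients and convert root angles to frequencies.
        x = np.asarray(frame, dtype=float)
        y = np.zeros_like(x)
        for n in range(len(x)):
            y[n] = x[n] - 0.63 * y[n - 1] if n > 0 else x[n]
        y *= np.hamming(len(y))
        a = librosa.lpc(y, order=order)            # predictor polynomial coefficients
        roots = [r for r in np.roots(a) if np.imag(r) >= 0]
        freqs = sorted(np.angle(roots) * sr / (2 * np.pi))
        freqs = [f for f in freqs if f > 90]       # discard near-DC roots
        return freqs[:2]                           # approximate f1 and f2

A full implementation would also screen candidate roots by bandwidth before picking the formants, but this sketch captures the pipeline of filtering, LPC fitting and root-angle conversion.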
For the first time, we established a successful end-to-end mapping between ultrasound tongue images and formant frequencies, which bridges a gap in silent speech interfaces and opens a new dimension for articulatory speech research.
3. We provide evidence that our network has the ability to model an internal representation of the tongue by optimizing a non-image-based loss function. This demonstrates that the network has the potential to replace the manual selection of points for semi-automatic tongue contour extraction. It also shows the promise of using acoustic labels for tongue contour detection, thereby replacing the need for tedious manual annotation for tongue tracing.
4. Our approach shows a striking improvement in performance over the baseline methods. We present an ablation study to explain the contribution of individual components towards better performance. Our network has the potential to encode robust spatio-temporal information in other related tasks.
6.4.3 Pink Trombone VT and acoustics
The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. In our previous works, we investigated a one-way relationship from articulatory configurations to acoustic outputs. This work aims at finding a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features. Our model utilizes a convolutional autoencoder architecture and normalizing flow-based models to allow both forward and inverse mappings, in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two degrees-of-freedom articulatory synthesizer with a 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance in both articulatory-to-acoustic and acoustic-to-articulatory mapping, thereby demonstrating our success in achieving a joint encoding of the two domains. The reversible articulatory-acoustic representation will enable us to explore speech motor control better by reflecting changes in one domain onto the other. We next explain the relevance of articulatory-acoustic mappings and the significance of our project in this context.
The articulatory-to-acoustic forward mapping, i.e., estimating the acoustic response to articulatory behaviour, is of utmost importance in the development of articulatory speech synthesizers and other silent speech interfaces, as well as in the detailed study of speech production and articulatory phonetics. On the other hand, applications of the inverse mapping, i.e., inferring articulatory information from speech acoustics, include the estimation of vocal tract parameters for efficient speech coding, enhanced speech recognition systems, and the development of visual articulatory feedback systems. The related works mostly address either of these two problems independently and, as such, there is a lack of a unified end-to-end forward and inverse mapping approach that can reversibly map vocal tract shapes and the corresponding speech sounds.
This is because it is incredibly challengingto accurately determine a joint distribution of the articulatory and acoustic domains, both havingcomplex generative processes involving a series of motor control and estimation tasks, biomechanicalmechanisms and aero-dynamic flow - some being shared across both generations while some beingspecifically important to one of them.In order to address this issue, we employ a semi-supervised, invertible, bijective cross-domainmapping between vocal tract geometries and the acoustic outputs, leveraging a pair of deep convo-lutional autoencoders and normalizing flow-based probability density estimation technique. In thispaper, we particularly consider the mid-sagittal vocal tract configurations and synthesized vowelsounds, simulated in the online articulatory speech synthesizer application named Pink Trombone[171], as our input-output space. Our approach involves a separated yet shared encoding of theimages, capturing diverse vocal tract shape, as well as the mel-spectrograms, possessing the acous-tic information pertaining to the resultant speech signals, in an unsupervised manner. The doubleautoencoders are simultaneously aligned in a supervised fashion by stacking a chain of invertiblebijective transformation functions between the bottleneck feature distributions. The core idea isto constrain the latent representations of both the domains to have some domain-specific featurespertaining to self-reconstruction as well as a joint feature space that encodes the mutual charac-teristics for enabling cross domain VT geometry-to-speech and speech-to-VT geometry synthesis.Furthermore, the domain-specific latent codes is kept conditional on the shared cross-domain la-tent space by enforcing a normalizing flow-based [85] conditional prior in the articulatory-acousticlatent representation. In the next section, we will lay the foundation of our approach and presenta systematic study on the variational model employed to achieve the target mapping.1516.4.3.1 Proposed Mapping Strategy• Problem formulation and overviewIn order to investigate the joint distribution of vocal tract shapes and acoustics p(xg, xs)which follow the generative processes pg(xg) and ps(xs) respectively, we define a commonlatent variable z such that the marginal likelihood p(xg, xs) =∫p(xg, xs, z) dz, where thejoint probability distribution p(xg, xs, z) = p(xg, xs|z)p(z). The likelihood p(xg, xs|z) indi-cates the probability distribution over the observed variables in articulatory and acousticspace, given the latent representation z. The standard practise is to compute the likelihoodusing the posterior distribution p(z|xg, xs) via Bayes’ rule. However computing the posteriordistribution is intractable in general as there exists no closed form solution. Alternatively, avariational distribution Ψ(z|xg, xs) is used to approximate the posteriori by optimizing theevidence lower bound (ELBO) [86]. Therefore in our case, the maximization of the likeli-hood of p(xg, xs) can be achieved by involving a posterior encoding distribution Ψφ(z|xg, xs)parameterized by φ. As such, our objective boils down to learning the variational posteriordistribution Ψφ(zĝs|xg, xs) related to the shared latent space (zĝs) between the VT geometry(xg) and the acoustic representation (xs). 
Considering that the shared latent variable is ca-pable of encoding joint articulatory and acoustic information, this implies, for a given pair ofarticulatory-acoustic data sample (xg, xs), the learnt posterior distributions, Ψφ(zĝs|xg, xs) ≡Ψφ(zĝs|xg) ≡ Ψφ(zĝs|xs). This can be ensured by enforcing the encoders in both the domainsto generate same latent information. However, since the data distribution of articulatory andacoustic domains follow distinct underlying generative models as discussed earlier, it is notadmissible to try to enforce exactly same latent variable representation by removing individ-ual domain-specific information from the acoustic or articulatory space. For the same reason,minimizing the mean squared error between the encoder output features for enhancing sharedinformation is not ideal.To this end, the shared information encoding is modified in two ways. The first is to partitionthe encodings of both the articulatory and acoustic domains into two parts: one that containssole articulatory or acoustic information (zg \ ĝs or zs \ ĝs) and the other which has the jointor shared information (zĝs). Therefore, our model consists of two domain-specific encoders :one for encoding the vocal tract geometry from the input image that learns an articulation-related posterior distribution Ψγ(zg \ ĝs|xg, xs) parameterized by γ and the other for encodingthe acoustic information from the mel spectrograms, that learns the latent posterior distri-bution of acoustic domain Ψα(zs \ ĝs|xg, xs) parameterized by α. The second modificationis that, instead of constraining the encoders to learn the exact same shared latent encodingdimensions, we respect the constraints specific to articulation or acoustics and alternatively152Figure 6.22: The self-attention module in convolutional autoencoder architecturelearn an invertible bijective mapping Ωωĝs: Rdĝs −→ Rdĝs between the shared representation ofthe articulatory and acoustic space. This invertible mapping performs transformation of thedĝs dimensional latent vector zĝs between the domains g and s.• Self-attention-based Convolutional AutoencoderThe input-output space of our problem being artificial vocal tract images and mel-spectrograms,both are in the image representation. And our target is to encode the pair of image data into2 sets of effective latent vectors or bottleneck features which best represent the respectivedomains and contain maximum relevant information required for their individual reconstruc-tion. Convolutional architecture is a natural choice for the autoencoder network in this caseas the convolutional autoencoders preserve the spatial information of the input image data,by incorporation of convolutional filter kernels in the network [115]. Additionally, the self-attention mechanism of [184, 190, 199] is also utilized in the encoder-decoder architectureas shown in Fig. 6.22, leveraging its capability of modeling non-local relationships betweenwidely separated spatial regions - an equivalent of long-range dependency in images. 
An attention map β is generated by first transforming the feature set from the previous layer into two parallel layers f(x) and g(x), followed by exponentiating the product of these two feature sets and normalizing it as shown below:

f(x) = W_f x, \quad x \in \mathbb{R}^{C \times N}, \; W_f \in \mathbb{R}^{C \times C}    (6.1)

g(x) = W_g x, \quad W_g \in \mathbb{R}^{C \times C}    (6.2)

\beta_{j,i} = \frac{\exp\big(f(x_i)^{T} g(x_j)\big)}{\sum_{i=1}^{N} \exp\big(f(x_i)^{T} g(x_j)\big)}, \quad \beta \in \mathbb{R}^{N \times N}    (6.3)

where β_{j,i} denotes the impact of the ith location while rendering the jth location, i.e., the extent to which the network attends to the ith location while synthesizing the jth location.
Next, the previous layer features are again transformed into another feature set h(x) and multiplied with the computed attention map β to generate the self-attention output o:

h(x) = W_h x, \quad W_h \in \mathbb{R}^{C \times C}    (6.4)

v(u_i) = W_v u_i    (6.5)

o_j = v\Big( \sum_{i=1}^{N} \beta_{j,i} \, h(x_i) \Big), \quad o \in \mathbb{R}^{C \times N}    (6.6)

W_f, W_g, W_h and W_v are learned weight matrices, implemented as 1 × 1 convolution operations. The final layer of the self-attention convolution block is represented as the addition of a weighted self-attention mask (with the learnable scalar weight η) to the previous layer feature:

y_j = \eta \, o_j + x_j, \quad y \in \mathbb{R}^{C \times N}    (6.7)

η is initialized as 0 to let the model explore local spatial information before starting to capture non-local features via self-attention-based refinement.
Let the encoder networks corresponding to the vocal tract geometry x_g of dimension d_g and the acoustic representation x_s of dimension d_s be denoted as G_{E_g} and S_{E_s}, with parameters E_g and E_s respectively, such that G_{E_g} : (x_g)^{d_g} → (z_g)^{d_{l_g}} with E_g = {β, γ} and S_{E_s} : (x_s)^{d_s} → (z_s)^{d_{l_s}} with E_s = {α, γ}, where d_{l_g} and d_{l_s} are the latent dimensions of the articulatory and acoustic domains. Similarly, let the decoder networks corresponding to the vocal tract geometry and the acoustic representation be denoted as G_{D_g} and S_{D_s} with parameters D_g and D_s respectively, such that G_{D_g} : (z_g)^{d_{l_g}} → (x_g)^{d_g} and S_{D_s} : (z_s)^{d_{l_s}} → (x_s)^{d_s}. Further, let the decoded vocal tract geometry and acoustic representation outputs be denoted as x̃_g and x̃_s respectively; then x̃_g = G_{D_g}(G_{E_g}(x_g)) and x̃_s = S_{D_s}(S_{E_s}(x_s)). For the vocal tract geometry image, the reconstruction loss between the input image (x_g) and the reconstructed image (x̃_g) from the image decoder is computed as L^{rec}_g(x_g, x̃_g) = ‖x_g − G_{D_g}(G_{E_g}(x_g))‖, where ‖·‖ denotes the l2 norm. Similarly, for the Mel-spectrogram image, the reconstruction loss between the input Mel-spectrogram (x_s) and the reconstructed Mel-spectrogram (x̃_s) from the spectrogram decoder is computed as L^{rec}_s(x_s, x̃_s) = ‖x_s − S_{D_s}(S_{E_s}(x_s))‖.
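To make Equations (6.1)–(6.7) concrete, the following is a minimal PyTorch sketch of such a self-attention layer. The 1 × 1 convolutions play the roles of W_f, W_g, W_h and W_v; the channel count and feature-map size in the usage line are illustrative assumptions rather than the configuration actually used in our autoencoders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over the N = H*W spatial positions of a C x H x W feature
    map, following Eqs. (6.1)-(6.7). All projections are 1x1 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, kernel_size=1)  # W_f
        self.g = nn.Conv2d(channels, channels, kernel_size=1)  # W_g
        self.h = nn.Conv2d(channels, channels, kernel_size=1)  # W_h
        self.v = nn.Conv2d(channels, channels, kernel_size=1)  # W_v
        self.eta = nn.Parameter(torch.zeros(1))                # eta, initialized to 0

    def forward(self, x):
        b, c, hgt, wdt = x.shape
        n = hgt * wdt                                   # N spatial locations
        fx = self.f(x).view(b, c, n)
        gx = self.g(x).view(b, c, n)
        hx = self.h(x).view(b, c, n)
        # beta[i, j] = softmax over i of f(x_i)^T g(x_j), Eq. (6.3)
        beta = F.softmax(torch.bmm(fx.transpose(1, 2), gx), dim=1)
        # o_j = v( sum_i beta_{j,i} h(x_i) ), Eq. (6.6)
        o = self.v(torch.bmm(hx, beta).view(b, c, hgt, wdt))
        return self.eta * o + x                          # Eq. (6.7)

# usage on a dummy encoder feature map (sizes are assumptions)
attn = SelfAttention2d(channels=64)
y = attn(torch.randn(2, 64, 16, 16))   # output has the same shape as the input
```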
• Derivation of ELBO

Figure 6.23: The proposed articulatory-acoustic forward and inverse mapping (the VT-geometry and Mel-spectrogram convolutional encoders and decoders are coupled through a GLOW block of act-norm, invertible 1×1 convolution and affine coupling layers, repeated K times)

Since our entire latent variable representation is partitioned into three major components, our posterior distribution accordingly factorizes as follows:

\Psi_\phi(z \mid x_g, x_s) = \Psi_\gamma(z_{g \setminus \hat{g}s} \mid x_g, z_{\hat{g}s}) \, \Psi_\alpha(z_{s \setminus \hat{g}s} \mid x_s, z_{\hat{g}s}) \, \Psi_\beta(z_{\hat{g}s} \mid x_g, x_s)    (6.8)

Similarly, assuming conditional independence of the latent codes, our prior probability distribution factorizes as:

p(z) = p(z_{\hat{g}s}, z_{g \setminus \hat{g}s}, z_{s \setminus \hat{g}s}) = p(z_{g \setminus \hat{g}s} \mid z_{\hat{g}s}) \, p(z_{s \setminus \hat{g}s} \mid z_{\hat{g}s}) \, p(z_{\hat{g}s})    (6.9)

The computation of the optimal likelihood requires the marginalization of the latent variable, which is potentially challenging to compute. Instead, we optimize the lower bound over the encoding distribution using the standard procedure, as follows:

\log p(x_g, x_s) = \log \Big( \sum_{z} \Psi_\phi(z \mid x_g, x_s) \, \frac{p(x_g, x_s, z)}{\Psi_\phi(z \mid x_g, x_s)} \Big)    (6.10)

Now, using Jensen's inequality,

\log p(x_g, x_s) \ge \sum_{z} \Psi_\phi(z \mid x_g, x_s) \log \frac{p(x_g, x_s, z)}{\Psi_\phi(z \mid x_g, x_s)} = \sum_{z} \Psi_\phi(z \mid x_g, x_s) \log \frac{p(x_g, x_s \mid z) \, p(z)}{\Psi_\phi(z \mid x_g, x_s)} \ge \sum_{z} \Psi_\phi(z \mid x_g, x_s) \big[ \log p(x_g, x_s \mid z) + \log p(z) - \log \Psi_\phi(z \mid x_g, x_s) \big]    (6.11)

With the help of Equations (6.8) and (6.9), the first term (the data-likelihood term) in Equation (6.11) can be further simplified as:

\Psi_\phi(z \mid x_g, x_s) \log p(x_g, x_s \mid z) = \Psi_\gamma(z_{g \setminus \hat{g}s} \mid x_g, z_{\hat{g}s}) \, \Psi_\beta(z_{\hat{g}s} \mid x_g, x_s) \log p(x_g \mid z_{\hat{g}s}, z_{g \setminus \hat{g}s}) + \Psi_\alpha(z_{s \setminus \hat{g}s} \mid x_s, z_{\hat{g}s}) \, \Psi_\beta(z_{\hat{g}s} \mid x_g, x_s) \log p(x_s \mid z_{\hat{g}s}, z_{s \setminus \hat{g}s})    (6.12)

Similarly, the second term can be further simplified as:

\Psi_\phi(z \mid x_g, x_s) \log p(z) = \Psi_\beta(z_{\hat{g}s} \mid x_g, x_s) \log p(z_{\hat{g}s}) + \Psi_\gamma(z_{g \setminus \hat{g}s} \mid x_g, z_{\hat{g}s}) \log p(z_{g \setminus \hat{g}s} \mid z_{\hat{g}s}) + \Psi_\alpha(z_{s \setminus \hat{g}s} \mid x_s, z_{\hat{g}s}) \log p(z_{s \setminus \hat{g}s} \mid z_{\hat{g}s})    (6.13)

And the third term in Equation (6.11) can be simplified as:

-\Psi_\phi(z \mid x_g, x_s) \log \Psi_\phi(z \mid x_g, x_s) = -\Psi_\beta(z_{\hat{g}s} \mid x_g, x_s) \log \Psi_\beta(z_{\hat{g}s} \mid x_g, x_s) - \Psi_\gamma(z_{g \setminus \hat{g}s} \mid x_g, z_{\hat{g}s}) \log \Psi_\gamma(z_{g \setminus \hat{g}s} \mid x_g, z_{\hat{g}s}) - \Psi_\alpha(z_{s \setminus \hat{g}s} \mid x_s, z_{\hat{g}s}) \log \Psi_\alpha(z_{s \setminus \hat{g}s} \mid x_s, z_{\hat{g}s})    (6.14)

Therefore, our task is now to maximize the lower bound [73, 86, 110, 180] obtained by plugging the expressions of Equations (6.12), (6.13) and (6.14) into Equation (6.11).
• Normalizing flow
Normalizing flow [36, 37, 73, 85, 110, 138] is a flow-based generative model used as a powerful probability density estimator. It is constructed by stacking a sequence of invertible transformation functions which transform a simple distribution into a complex one, eventually learning an explicit data distribution p(x). The probability distribution of the final target variable is obtained by substituting the variables for new ones, flowing through a chain of transformations f_i, following the change-of-variables theorem.
Given an initial distribution z_0, the output x can be obtained by applying such a chain of transformations in a step-by-step fashion:

x = z_K = f_K(f_{K-1}(\cdots f_2(f_1(z_0)) \cdots))    (6.15)

Using the change-of-variables rule, the log-density of the model can therefore be written as:

\log p(x) = \log \pi_K(z_K) = \log \pi_{K-1}(z_{K-1}) - \log \left| \det \frac{d f_K}{d z_{K-1}} \right| = \log \pi_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i}{d z_{i-1}} \right|    (6.16)

The sequence formed by the successive distributions π_i is known as a normalizing flow. Both the conditional priors Ω_{ω_g} = Ψ_γ(z_{g∖ĝs} | x_g, z_{ĝs}) and Ω_{ω_s} = Ψ_α(z_{s∖ĝs} | x_s, z_{ĝs}), as well as the mapping between the shared latent codes Ω_{ω_{ĝs}}, are modeled with GLOW [85], a normalizing flow-based generative model using invertible 1 × 1 convolutions. A single step of GLOW involves three substeps: activation normalization (act-norm), an invertible 1 × 1 convolution and an affine coupling layer. The act-norm is an affine transformation using trainable parameters, a scale (s) and a bias (b) per channel, similar to batch normalization except that it works for a mini-batch size of 1. The transformation for the kth layer can be expressed as y^{(k)}_{i,j} = s^{(k)} ⊙ z^{(k)}_{i,j} + b^{(k)}. Next, the 1 × 1 convolution with equal input and output dimensions is a generalized way of permuting the channel ordering between layers of the flow, thereby ensuring that the ordering of channels is shuffled so that the flow acts on the entire data sample. Assuming the weight matrix to be W : [c × c], where c is the number of channels, this step can be written as v^{(k)}_{i,j} = W y^{(k)}_{i,j}. The last substep consists of an affine coupling layer, where the convolved outputs v^{(k)} are split into two parts, v^{(k)}_a and v^{(k)}_b, of which v^{(k)}_a remains the same whereas the other part v^{(k)}_b undergoes an affine transformation involving scaling s(·) and translation t(·) computed from v^{(k)}_a. This can be denoted as: v^{(k)}_a, v^{(k)}_b = split(v^{(k)}); (log s, t) = NN(v^{(k)}_a); u^{(k)}_a = v^{(k)}_a; u^{(k)}_b = exp(log s) ⊙ v^{(k)}_b + t; z^{(k+1)} = concat(u^{(k)}_a, u^{(k)}_b).
As shown in Fig 6.23, the mapping between the shared latent components Ω_{ω_{ĝs}} is achieved using a sequence of such transformations. The cost of mapping the latent space of the VT geometry image to that of the mel-spectrogram, L_{g2s}(x_g, x_s), is defined as the mean squared error between the encoded spectrogram representation (z_s)^{d_{l_s}} and the transformed image representation Ω_{ω_{ĝs}}((z_g)^{d_{l_g}}). Similarly, the cost of mapping the latent space of the mel-spectrogram to that of the VT geometry image, L_{s2g}(x_s, x_g), is defined as the mean squared error between the encoded VT geometry representation (z_g)^{d_{l_g}} and the transformed spectrogram representation Ω_{ω_{ĝs}}((z_s)^{d_{l_s}}).

Figure 6.24: (a) and (c) respectively show the mel-spectrograms corresponding to the original vowels /a/ and /u/; (b) and (d) respectively show their synthesized versions from VT geometry
Figure 6.25: The synthesized pink trombone images corresponding to the VT configurations for /a/, /ae/, /i/ and /u/ (left to right)

6.4.3.2 Experiments and Results
• Dataset and Training
We varied the pink trombone tongue controller1 that changes the VT shape, and correspondingly captured videos of the pink trombone VT at a frame rate of 30 fps and audio at a sampling rate of 22,020 Hz. Our model was implemented in PyTorch, and we converted the audio into mel-spectrograms using Librosa [117]. We randomly shuffled and partitioned the data (36,081 audios and images extracted from the videos) into train (80%), development (10%) and test (10%) sets. The images were downsampled to dimensions 90 × 98 × 3 to reduce the computational time.
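Before turning to the training details, the single GLOW step described above (act-norm, invertible 1 × 1 convolution, affine coupling) can be sketched as follows. This is a simplified, illustrative PyTorch version operating on flat latent vectors, where the invertible 1 × 1 convolution reduces to an invertible linear map; it is not the exact implementation used in our model, and the hidden width is an assumption.

```python
import torch
import torch.nn as nn

class GlowStep(nn.Module):
    """One flow step on a d-dimensional latent vector: act-norm, an invertible
    linear map (the flat-vector analogue of the invertible 1x1 convolution),
    and an affine coupling layer. The per-step log-determinants needed for the
    change-of-variables likelihood in Eq. (6.16) are returned as well."""
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.d_a, self.d_b = d // 2, d - d // 2
        self.s = nn.Parameter(torch.ones(d))             # act-norm scale
        self.b = nn.Parameter(torch.zeros(d))            # act-norm bias
        self.w = nn.Parameter(torch.eye(d))              # invertible linear map
        self.coupling_nn = nn.Sequential(                # NN(.) of the coupling
            nn.Linear(self.d_a, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.d_b))

    def forward(self, z):
        y = self.s * z + self.b                          # act-norm
        logdet = self.s.abs().log().sum()
        v = y @ self.w.t()                               # channel mixing
        logdet = logdet + torch.slogdet(self.w)[1]
        v_a, v_b = v.split([self.d_a, self.d_b], dim=-1)
        log_sc, t = self.coupling_nn(v_a).chunk(2, dim=-1)
        u_b = torch.exp(log_sc) * v_b + t                # affine coupling
        logdet = logdet + log_sc.sum(dim=-1)
        return torch.cat([v_a, u_b], dim=-1), logdet

# a GLOW-style block is K such steps applied in sequence (K = 4 here is arbitrary)
flow = nn.ModuleList([GlowStep(d=32) for _ in range(4)])
z, total_logdet = torch.randn(8, 32), 0.0
for step in flow:
    z, ld = step(z)
    total_logdet = total_logdet + ld
```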
The network was trained with a batch size of 10 on NVIDIA GeForce GTX1080 Ti GPU. The loss function was optimized using Adam with a learning rate of .0001 fora total of 200 epochs. In order to mitigate the problem of overfitting, Batch Normalizationwas used after every convolutional layer and before applying non-linearity.1https://dood.al/pinktrombone/158• Qualitative and quantitative performance analysisThe original and synthesized Mel-Spectrograms of the vowels /a/ and /u/ corresponding tothe respective VT shapes have been shown in Fig 6.24. A qualitative analysis of the figuredemonstrates that although the generated mel-spectrograms are blurrier than the originalcrisp mel-spectrograms, they are indeed recognizable and significantly similar to the groundtruth data. In order to quantitatively evaluate the performance of the proposed method inacoustic domain, we further computed the average formant frequencies of the synthesizedaudio signal and the original audio signal after recovering the synthesized audio with Griffin-Lim-based spectrogram inversion method [65]. The mean error of the first three formantsof synthesized vowels w.r.t the original vowels are 18.57%, 24.21%, 7.69% respectively. Thesynthesized pink trombone images corresponding to the cardinal vowels /a/, /ae/, /i/ and/u/ have been presented in Fig. 6.25. The generated images are found to be quite similar tothe actual VT geometries of pink trombone corresponding to respective vowels. It shows thatour model is able to properly recognize the VT shape changes with changes in acoustic input,thereby demonstrating the success of our approach. The mean absolute error between thenormalized synthesized pink trombone images and original images is 0.0397, most of whichevidently comes from the non-VT part.6.4.3.3 Summary and future worksIn this work, we have developed an invertible mapping between the articulatory and acousticspaces for an online articulatory speech synthesizer application named pink trombone. To thebest of our knowledge, this is the first attempt to study an invertible joint articulatory-acousticrepresentation utilizing the best of deep autoencoder architectures and normalizing flow-basedtechniques. However, this work investigates a joint articulatory-acoustic representation for staticvowels only, as we are considering VT input image for a particular instant. This can be furtherextended to continuous vowel spaces by including a sequence of images reflecting the dynamic VTshape changes with time in the articulatory space.This investigation is particularly important as it reversibly connects articulation and acoustics.Further investigation in this direction will help us to establish automatic connections betweenarticulatory and acoustic trajectories. The analysis of the trajectories can then be utilized toformulate and relate the difficulty metrics defined in the individual spaces, thereby opening a newdimension in the information-theoretic study of speech in both articulatory and acoustic domains.1596.4.4 Section SummaryIn this section, we addressed three different problems related to articulatory-acoustic mapping.The first one dealt with automatic speech recognition from MRI-based vocal tract configurationsusing video-recognition techniques. The second one involved formant-based speech synthesis fromUltrasound video and the third one incorporated the development of a reversible joint articulatory-acoustic representation using Pink Trombone-based synthetic data. 
These studies gave us moreinsights on extracting vocal tract features underlying acoustic representations. Besides, these workswill serve as the basis for the replacement of our hand kinematics by articulatory movements andtherefore will facilitate our future works.6.5 Mapping Active Thoughts to SpeechIn the previous section, we studied the connection of articulatory movements to speech. In thissection, we will take a step forth towards understanding the neural bases behind such movementsthrough EEG. In other words, we will try to study and establish the connection between brainsignals and speech via an articulatorily relevant pathway.While speaking, we dynamically coordinate the movements of our articulators inlcuding lips,tongue, larynx, etc. These articulatory movements are mostly encoded in our sensorimotor cortex,that are responsible for complex kinematic articulatory trajectories. Therefore, in order to under-stand the underlying neural mechanism of speech production, we need to analyse the speech-relatedbrain signals in human speech sensorimotor cortex. Our intention is to primarily infer the speechtokens and the corresponding articulatory movements from the brain signals. As an initial steptowards it, we perform deep-learning based classification of imagined speech EEG signal, with andwithout using phonological features. This can be considered as the first step towards mapping ac-tive thoughts to speech and the insights derived from the EEG studies will be useful to further planand design the next set of experiments. In the following subsections, we present three different butrelated works on imagined speech EEG: (i) the direct classification of speech tokens (like vowels,short and long words), (ii) classification of phonological categories (like the presence or absenceof consonants, phonemic nasal, bilabial, high-front vowels, low-back vowels) and (iii) utilization ofsuch phonological information for the classification of speech tokens.6.5.1 EEG-based direct recognition of vowels and wordsWe propose a mixed deep neural network strategy, incorporating parallel combination of Con-volutional (CNN) and Recurrent Neural Networks (RNN), cascaded with deep autoencoders andfully connected layers towards automatic recognition of imagined speech from EEG. Instead of160utilizing raw EEG channel data, we compute the joint variability of the channels in the form of acovariance matrix that provide spatio-temporal representations of EEG. The networks are trainedhierarchically and the extracted features are passed onto the next network hierarchy until the fi-nal classification. Using a publicly available EEG-based speech imagery database we demonstratearound 23.45% improvement of accuracy over the baseline method. Our approach demonstratesthe promise of a mixed DNN approach for complex spatial-temporal classification problems.6.5.1.1 Brief backgroundIn the past decade, numerous methods have been proposed to decode speech and motor-relatedinformation from electroencephalography (EEG) signals for Brain-Computer Interface (BCI) appli-cations. However, EEG signals are highly session-specific; infested with noises and artifacts. 
Inter-preting active thoughts underlying vocal communication involving labial, lingual, naso-pharyngealand jaw motion, is even more challenging than inferring motor imagery, since utterances involvehigher degrees of freedom and additional functions in comparison to hand movement and gestures.As a result, it is extremely challenging to recognize phonemes, vowels and words from single-trialEEG data. The reported classification accuracy of existing methods [30, 119, 126] are not satisfac-tory, showing that manual handcrafting of features or traditional signal processing algorithms lacksufficient discriminative power to extract relevant features for classification.This work addresses these issues by implementing a deep learning-based feature extractionscheme, targeting classification of speech imagery EEG data corresponding to vowels, short andlong words. It is mentionworthy that most of the previous works applied to vowels and phonemesclassification show a degradation of performance when applied to words and vice versa. Therefore,this is the first work that aims to automatically learn a discriminative EEG manifold applicable toboth word and vowel/phoneme-based classification at the same time, using deep learning techniques.6.5.1.2 Proposed frameworkThe problem of categorizing EEG data based on speech imagery can be formulated as a non-linearmapping fˆ of a multivariate time-series input sequence Xct to fixed output y, i.e, mathematicallyfˆ : Xct −→ y.We found that well-known deep learning techniques, like fully connected networks, CNN, RNN,autoencoders etc. fail to individually learn such complex feature representations from single-trialEEG data. Also, our investigation demonstrated that it is crucial to capture the informationtransfer between the electrodes rather than using the multi-channel high-dimensional EEG datathat otherwise needs large training times and resource requirements. Therefore, instead of utilizing161V1V2V3V3V2V1C1,1 C1,2      C1,N-1 C1,N C2,1                  C2,NCM-1,1                CM-1,NCM,1CM,2     CM,N-1 CM,NV3V2V11927681927689696646464 2x64 4x64 128 12864 64 64 64 6464x64CNNRNNCCVEEG Classes  O/P  I/P ClassesDAEFCNFigure 6.26: Overview of the proposed approachraw EEG, we compute channel covariance, resulting in positive, semi-definite matrices encodingthe joint variability of electrodes. We define channel covariance between any two electrodes c1 andc2 as: Cov(Xc1t , Xc2t+τ ) = E[Xc1(t)− µXc1 (t)][Xc2(t+ τ)− µXc2 (t+ τ)].This is particularly important because higher cognitive processes underlying speech synthesisand utterances involve frequent information exchange between different parts of the brain. Hence,such matrices often contain more discriminative features and hidden information than mere rawsignals.Besides, cognitive learning process underlying articulatory speech production involves incor-poration of intermediate feedback loops and utilization of past information stored in the form ofmemory as well as hierarchical combination of several feature extractors. To this end, we de-velop our mixed neural network architecture composed of three supervised and single unsupervisedlearning step as shown in Fig 6.26.In order to decode spatial connections between the electrodes from the channel covariancematrix, we use six-layered 1D convolutional networks stacking two convolutional and two fullyconnected hidden layers with ReLU as activations. The kth feature map at a given CNN layer withinput x, weight matrix W k and bias bk is obtained as: hk = ReLU(W k ∗ x + bk). 
The network is162trained with the corresponding labels as target outputs, optimizing cross-entropy cost function viaAdam Optimizer.In parallel, we apply a six-layered recurrent neural network on the channel covariance matricesto explore the hidden temporal features of the electrodes. It consists of two fully connected hiddenlayers, stacked with two LSTM layers and is trained in a similar manner as CNN.Since these parallel networks are trained individually and 5th layer of both the networks has adirect relationship with respective output layers, we claim that these layers are powerful discrim-inative spatial and temporal representations of the data. Therefore, we concatenate these featurevectors to form joint spatio-temporal encodings of covariance matrix.The second level of hierarchy encompasses unsupervised training of deep autoencoders (DAE)having two encoder-decoder layers, with mean squared error (MSE) as the cost function. This leadsto further dimensionality reduction of the spatio-temporal encodings [200].At the third level of hierarchy, the discrete latent vector representation of the deep autoencoderis fed into a two-layered fully connected network (FCN) followed by softmax classification layer.This is again trained in a supervised manner similarly as the CNN network, to output the finalpredicted classes corresponding to the speech imagery.6.5.1.3 Experiments and ResultsWe evaluate our model on the publicly available imagined speech EEG dataset [126]. It consists ofimagined speech data corresponding to vowels, short words and long words, for 15 healthy subjects.The short words included in the study were ‘in’, ‘out’ and ‘up’; the long words were ‘cooperate’ and‘independent’; the vowels were /a/, /i/ and /u/. The participants were instructed to pronouncethese sounds internally in their minds, avoiding muscle movements and overt vocalization.We trained the networks with 80% of the data in the training set and the remaining 20% in thevalidation set. Table 6.5 shows the average test accuracy for all subjects corresponding to long wordclassification. Fig 6.27 provides a comparison of test accuracy of our approach with others on vowelsand short word classification. Our results illustrate that our model achieves significant improvementover the other methods, thereby validating that deep learning-based hierarchical feature extractioncan learn a better discriminative EEG manifold for decoding speech imagery.6.5.1.4 SummaryTowards recognizing active thoughts from EEG corresponding to vowels, short words and longwords, this work presents a novel mixed neural network strategy as a combination of convolutional,recurrent and fully connected neural networks stacked with deep autoencoders. The network is1630 20 40 60 80 100 S4 S5 S8 S9 S11 S12 S13 S15 0 20 40 60 80 100 S1 S3 S5 S6 S8 S12 Proposed approach Nguyen et al. Min et al. Dasalla et al. Log+LDA Accuracy (in %) Figure 6.27: Performance comparison for all subjects on vowels (left) and short words (right)Table 6.5: Classification accuracy on long wordsMethod S2 S3 S6 S7 S9 S11Nguyen et al. 70.0 64.3 72.0 64.5 67.8 58.5Proposed 77.5 90.7 73.7 86.8 80.1 71.1trained hierarchically on a channel covariance matrix for categorizing respective EEG signals tothe imagined speech classes. 
Our model achieves satisfactory performance with different types oftarget classification, on different subjects and hence can be considered as a reliable and consistentapproach for classifying EEG-based speech imagery.This study was done as a sanity check to determine if the proposed model is effective in ex-tracting speech-related information from the noisy EEG data. The studies that follow (in the nextsubsections) will utilize fundamentally same network and procedure, with a few modifications doneon top of that, in order to improve its performance and will evaluate the network on new datasets.6.5.2 EEG-based phonological categorizationAs a step towards full decoding of imagined speech from active thoughts, in this subsection, wepresent a speech-related Brain Computer Interfaces (BCI) system for subject-independent classifica-tion of phonological categories exploiting a novel deep learning-based hierarchical feature extractionscheme. Motivated by the success of our previous work (as discussed in Section 6.6.1), we computethe joint variability of EEG electrodes in the form of a channel covariance matrix in order to bettercapture the complex representation of high-dimensional electroencephalography (EEG) data. We164then extract the spatio-temporal information encoded within the matrix using a mixed deep neuralnetwork strategy in a similar fashion as discussed in the previous subsection. The only differencehere is that we train the individual networks hierarchically feeding their combined outputs in a finalgradient boosting classification step. Our best models achieve an average accuracy of 77.9% acrossfive different binary classification tasks, providing a significant 22.5% improvement over previousmethods. As we also show visually, our work demonstrates that the speech imagery EEG possessessignificant discriminative information about the intended articulatory movements responsible fornatural speech synthesis.6.5.2.1 Related worksDecoding intended speech or motor activity from brain signals is one of the major research areasin Brain Computer Interface (BCI) systems [42, 69]. In particular, speech-related BCI technologiesattempt to provide effective vocal communication strategies for controlling external devices throughspeech commands interpreted from brain signals [57]. Not only do they provide neuro-prosthetichelp for people with speaking disabilities and neuro-muscular disorders like locked-in-syndrome,nasopharyngeal cancer, and amytotropic lateral sclerosis (ALS), but also equip people with a bettermedium to communicate and express thoughts, thereby improving the quality of rehabilitation andclinical neurology [114, 125]. Such devices also have applications in entertainment, preventivetreatments, personal communication, games, etc. Furthermore, BCI technologies can be utilizedin silent communication, as in noisy environments, or situations where any sort of audio-visualcommunication is infeasible.Among the various brain activity-monitoring modalities in BCI, electroencephalography (EEG) [68,134] has demonstrated promising potential to differentiate between various brain activities throughmeasurement of related electric fields. EEG is non-invasive, portable, low cost, and provides satis-factory temporal resolution. This makes EEG suitable to realize BCI systems. EEG data, however,is challenging: these data are high dimensional, have poor SNR, and suffer from low spatial reso-lution and a multitude of artifacts. 
For these reasons, it is not particularly obvious how to decode the desired information from raw EEG signals.
Although the area of BCI-based speech intent recognition has received increasing attention in the research community in the past few years, most research has focused on the classification of individual speech categories in terms of discrete vowels, phonemes and words [9, 29, 30, 35, 59, 77, 83, 121, 189]. This includes categorization of imagined EEG signals into binary vowel categories like /a/, /u/ and rest [29, 30, 77]; binary syllable classes like /ba/ and /ku/ [9, 35, 42, 83]; a handful of control words like 'up', 'down', 'left', 'right' and 'select' [59]; or others like 'water', 'help', 'thanks', 'food', 'stop' [121], Chinese characters [189], etc. Such works mostly involve traditional signal processing or manual feature handcrafting along with linear classifiers (e.g., SVMs). In our recent work [142], we introduced deep learning models for the classification of vowels and words that achieved a 23.45% improvement in accuracy over the baseline.
Figure 6.28: Overview of the proposed approach
Production of articulatory speech is an extremely complicated process, thereby rendering understanding of the discriminative EEG manifold corresponding to imagined speech highly challenging. As a result, most of the existing approaches failed to achieve satisfactory accuracy in decoding speech tokens from speech imagery EEG data. Perhaps for these reasons, very little work has been devoted to relating the brain signals to the underlying articulation. The few exceptions include [164, 202]. In [202], Zhao et al. used manually handcrafted features from EEG data, combined with speech audio and facial features, to achieve classification of phonological categories that vary in terms of the underlying articulatory steps. However, the imagined speech classification accuracy based on EEG data alone, as reported in [164, 202], is not satisfactory in terms of accuracy and reliability. We now turn to describing our neural network model.
Figure 6.29: Cross-covariance matrices: rows correspond to two different subjects; columns (from left to right) correspond to sample examples for bilabial, nasal, vowel, /uw/, and /iy/.
6.5.2.2 Proposed Framework
The cognitive learning process underlying articulatory speech production involves the incorporation of intermediate feedback loops and the utilization of past information stored in the form of memory, as well as a hierarchical combination of several feature extractors. To this end, we develop our mixed neural network architecture composed of three supervised learning steps and a single unsupervised learning step, as shown in Fig. 6.28.
We formulate the problem of categorizing EEG data based on speech im-agery as a non-linear mapping fˆ of a multivariate time-series input sequence Xct to fixed outputy, i.e, mathematically fˆ : Xct −→ y, where c and t denote the EEG channels and time instantsrespectively.We follow similar pre-processing steps on raw EEG data as reported in [202] (ocular artifactremoval using blind source separation, bandpass filtering and subtracting mean value from eachchannel) except that we do not perform Laplacian filtering step since such high-pass filtering maydecrease information content from the signals in the selected bandwidth.Multichannel EEG data is high dimensional multivariate time series data whose dimensionalitydepends on the number of electrodes. It is a major hurdle to optimally encode information fromthese EEG data into lower dimensional space. In fact, our investigation based on a developmentset (as we explain later) showed that well-known deep neural networks (e.g., fully connected net-works such as convolutional neural networks, recurrent neural networks and autoencoders) fail toindividually learn such complex feature representations from single-trial EEG data. Besides, wefound that instead of using the raw multi-channel high-dimensional EEG requiring large trainingtimes and resource requirements, it is advantageous to first reduce its dimensionality by capturing167the information transfer among the electrodes. Instead of the conventional approach of selecting ahandful of channels as [164, 202], we address this by computing the channel covariance resultingin positive, semi-definite matrices encoding the connectivity of the electrodes. This is essentiallydifferent than our previous work [142] where we extract per-channel 1-D covariance informationand feed it to the networks. we define channel covariance (CCV) between any two electrodes c1 andc2 as: Cov(Xc1t , Xc2t+τ ) = E[Xc1(t)−µXc1 (t)][Xc2(t+τ)−µXc2 (t+τ)]. Next, we reject the channelswhich have significantly lower cross-covariance than auto-covariance values (where auto-covarianceimplies CCV on same electrode). We found this measure to be essential as the higher cognitiveprocesses underlying speech planning and synthesis involve frequent information exchange betweendifferent parts of the brain. Hence, such matrices often contain more discriminative features andhidden information than mere raw signals. We present our sample 2-D EEG cross-covariance ma-trices (of two individuals) in Fig. 6.29.In order to decode spatial connections between the electrodes from the channel covariancematrix, we use a CNN [95], in particular a four-layered 2D CNN stacking two convolutional andtwo fully connected hidden layers. The kth feature map at a given CNN layer with input x, weightmatrix W k and bias bk is obtained as: hk = ReLU(W k ∗x+ bk). At this first level of hierarchy, thenetwork is trained with the corresponding labels as target outputs, optimizing a cross-entropy costfunction. In parallel, we apply a four-layered recurrent neural network on the channel covariancematrices to explore the hidden temporal features of the electrodes. Namely, we exploit an LSTM[71] consisting of two fully connected hidden layers, stacked with two LSTM layers and trained ina similar manner as CNN.As we found the individually-trained parallel networks (CNN and LSTM) to be useful (seeTable 6.7), we suspected the combination of these two networks could provide a more powerfuldiscriminative spatial and temporal representation of the data than each independent network. 
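Before describing how the two branches are fused, here is a minimal sketch of the cross-covariance construction and the channel-rejection heuristic described above. The lag and the rejection threshold are illustrative assumptions; the text only specifies that channels with markedly lower cross-covariance than auto-covariance are dropped.

```python
import numpy as np

def channel_covariance(eeg, lag=0):
    """Cross-covariance between all EEG channel pairs.
    eeg: array of shape (channels, time); the lag tau is an assumed choice."""
    c, t = eeg.shape
    x = eeg - eeg.mean(axis=1, keepdims=True)       # remove per-channel mean
    a = x[:, : t - lag] if lag else x
    b = x[:, lag:] if lag else x
    return a @ b.T / (t - lag)                      # (channels, channels) CCV

def reject_channels(ccv, ratio=0.1):
    """Drop channels whose average cross-covariance with the other electrodes is
    much smaller than their auto-covariance (threshold ratio is an assumption)."""
    auto = np.diag(ccv)
    cross = (np.abs(ccv).sum(axis=1) - auto) / (ccv.shape[0] - 1)
    keep = cross >= ratio * auto
    return ccv[np.ix_(keep, keep)], keep

# usage on a dummy 64-channel trial of 5000 samples
ccv, kept = reject_channels(channel_covariance(np.random.randn(64, 5000)))
```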
Assuch, we concatenate the last fully-connected layer from the CNN with its counterpart in the LSTMto compose a single feature vector based on these two penultimate layers. Ultimately, this forms ajoint spatio-temporal encoding of the cross-covariance matrix.In order to further reduce the dimensionality of the spatio-temporal encodings and cancel back-ground noise effects[200], we train an unsupervised deep autoenoder (DAE) on the fused hetero-geneous features produced by the combined CNN and LSTM information. The DAE forms oursecond level of hierarchy, with 3 encoding and 3 decoding layers, and mean squared error (MSE)as the cost function.At the third level of hierarchy, the discrete latent vector representation of the deep autoencoderis fed into an Extreme Gradient Boost-based classification layer[16, 17] motivated by [200]. It isa regularized gradient boosted decision tree that performs well on structured problems. Since our168Table 6.6: Selected parameter setsParameters CNN LSTM DAEBatch size 64 64 64Epochs 50 50 200Total layers 6 6 7Hidden layers’ de-tailsConv:32,64masks:3x3 Dense:64,128LSTM: 128,256Dense: 512,1024512,128,32 (En-coder) 32,128,512(Decoder)Activations ReLU, last-layer :softmaxall ReLU, last-layer : softmaxReLU, ReLU,sigm, sigm,ReLU, tanhDropout .25, .50 .25, .50 .25, .25, .25Optimizer Adam Adam AdamLoss Binary cross en-tropyBinary cross en-tropyMean Sq Errorl-rate .001 .001 .001Figure 6.30: tSNE feature visualization for ±nasal (left) and V/C classification (right). Red andgreen colours indicate the distribution of two different types of features169EEG-phonological pairwise classification has an internal structure involving individual phonemesand words, it seems to be a reasonable choice of classifier. The classifier receives its input from thelatent vectors of the deep autoencoder and is trained in a supervised manner to output the finalpredicted classes corresponding to the speech imagery.6.5.2.3 Experiments and Results• DatasetWe evaluate our model on a publicly available dataset, KARA ONE [202], composed ofmultimodal data for stimulus-based, imagined and articulated speech state corresponding to7 phonemic/syllabic ( /iy/, /piy/, /tiy/, /diy/, /uw/, /m/, /n/ ) as well as 4 words(pat, pot,knew and gnaw). The dataset consists of 14 participants, with each prompt presented 11times to each individual. Since our intention is to classify the phonological categories fromhuman thoughts, we discard the facial and audio information and only consider the EEGdata corresponding to imagined speech. It is noteworthy that given the mixed nature of EEGsignals, it is reportedly challenging to attain a pairwise EEG-phoneme mapping[164]. In orderto explore the problem space, we thus specifically target five binary classification problemsaddressed in [164, 202], i.e., presence/absence of consonants, phonemic nasal, bilabial, high-front vowels and high-back vowels.• Training and hyperparameter selectionWe performed two sets of experiments with the single-trial EEG data. In PHASE-ONE, ourgoal was to identify the best architectures and hyperparameters for our networks with areasonable number of runs. For PHASE-ONE, we randomly shuffled and divided the data (1913signals from 14 individuals) into train (80%), development (10%) and test sets (10%). 
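Before moving to the second phase of experiments, the final classification stage described above, gradient-boosted trees on the autoencoder's bottleneck features, can be sketched as follows. The boosting hyperparameters follow the values stated in the text; the latent width, the stand-in data and the reading of the "regularization coefficient" as an L2 penalty are assumptions.

```python
import numpy as np
from xgboost import XGBClassifier

# z_train / z_test stand in for the bottleneck vectors of the trained deep
# autoencoder (an assumed latent width of 32); labels are a binary task.
z_train, y_train = np.random.randn(1500, 32), np.random.randint(0, 2, 1500)
z_test, y_test = np.random.randn(190, 32), np.random.randint(0, 2, 190)

clf = XGBClassifier(
    max_depth=10,            # values stated in the text
    n_estimators=5000,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.4,
    reg_lambda=0.3,          # "regularization coefficient" -- assumed to be L2
)
clf.fit(z_train, y_train)
print("test accuracy:", clf.score(z_test, y_test))
```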
InPHASE-TWO, in order to perform a fair comparison with the previous methods reported on thesame dataset, we perform a leave-one-subject out cross-validation experiment using the bestsettings we learn from PHASE-ONE.The architectural parameters and hyperparameters listed in Table 6.6 were selected throughan exhaustive grid-search-based on the validation set of PHASE-ONE. We conducted a seriesof empirical studies starting from single hidden-layered networks for each of the blocks and,based on the validation accuracy, we increased the depth of each given network and selectedthe optimal parametric set from all possible combinations of parameters. For the gradientboosting classification, we fixed the maximum depth at 10, number of estimators at 5000,learning rate at 0.1, regularization coefficient at 0.3, subsample ratio at 0.8, and column-sample/iteration at 0.4. We did not find any notable change of accuracy while varying other170Table 6.7: Results in accuracy on 10% test data in the first studyMethod ± Bilab ± Nasal C/V ± /uw/ ± /iy/LSTM 46.07 45.31 45.83 48.44 46.88CNN 59.16 57.20 67.88 69.56 68.60CNN+LSTM 62.03 60.89 70.04 72.76 63.75Our Mixed 78.65 74.57 87.96 83.25 77.30hyperparameters while training gradient boost classifier.• Performance analysis and discussion To demonstrate the significance of the hierarchicalCNN-LSTM-DAE method, we conducted separate experiments with the individual networksin PHASE-ONE of experiments and summarized the results in Table 6.7. From the averageaccuracy scores, we observe that the mixed network performs much better than individualblocks which is in agreement with the findings in [200]. A detailed analysis on repeated runsfurther shows that in most of the cases, LSTM alone does not perform better than chance.CNN, on the other hand, is heavily biased towards the class label which sees more trainingdata corresponding to it. Though the situation improves with combined CNN-LSTM, ouranalysis clearly shows the necessity of a better encoding scheme to utilize the combinedfeatures rather than mere concatenation of the penultimate features of both networks.The very fact that our combined network improves the classification accuracy by a mean mar-gin of 14.45% than the CNN-LSTM network indeed reveals that the autoencoder contributestowards filtering out the unrelated and noisy features from the concatenated penultimatefeature set. It also proves that the combined supervised and unsupervised neural networks,trained hierarchically, can learn the discriminative manifold better than the individual net-works and it is crucial for improving the classification accuracy. In addition to accuracy,we also provide the kappa coefficients [169] of our method in Fig. 6.31. Here, a highermean kappa value corresponding to a task implies that the network is able to find betterdiscriminative information from the EEG data beyond random decisions. The maximumabove-chance accuracy (75.92%) is recorded for presence/absence of the vowel task and theminimum (49.14%) is recorded for the ±nasal.To further investigate the feature representation achieved by our model, we plot T-distributedStochastic Neighbor Embedding (tSNE) corresponding to ±nasal and V/C classification tasksin Fig. 6.30 . We particularly select these two tasks as our model exhibits respectivelyminimum and maximum performance for these two. 
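A visualization of this kind can be produced along the following lines; the use of scikit-learn's t-SNE, the perplexity setting and the stand-in feature arrays are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# dae_features: (n_trials, latent_dim) latent codes for one binary task,
# labels: the corresponding 0/1 class labels (random stand-ins here)
dae_features = np.random.randn(400, 32)
labels = np.random.randint(0, 2, 400)

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(dae_features)

for cls, colour in [(0, "tab:red"), (1, "tab:green")]:
    pts = embedded[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=8, c=colour, label=f"class {cls}")
plt.legend()
plt.title("tSNE of DAE latent features for a binary task")
plt.show()
```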
The tSNE visualization reveals that thesecond set of features are more easily separable than the first one, thereby giving a rationalefor our performance.Next, we provide performance comparison of the proposed approach with the baseline methods171Figure 6.31: Kappa coefficient values for above-chance accuracy based on Table 6.7Table 6.8: Comparison of classification accuracy± Bilabial ± Nasal C/V ± /uw/ ± /iy/[202] 56.64 63.5 18.08 79.16 59.6[164] 53 47 25 74 53Ours 75.55 73.45 85.23 81.99 73.30for PHASE-TWO of our study (cross-validation experiment) in Table 6.8. Since the modelencounters the unseen data of a new subject for testing, and given the high inter-subjectvariability of the EEG data, a reduction in the accuracy was expected. However, our networkstill managed to achieve an improvement of 18.91, 9.95, 67.15, 2.83 and 13.70 % over [202].Besides, our best model shows more reliability compared to previous works: The standarddeviation of our model’s classification accuracy across all the tasks is reduced from 22.59%[202] and 17.52%[164] to a mere 5.41%.6.5.2.4 SummaryIn an attempt to take a step towards understanding the speech information encoded in brain signals,we developed a novel mixed deep neural network scheme for a number of binary classification tasksfrom speech imagery EEG data. Unlike previous approaches which mostly deal with subject-dependent classification of EEG into discrete vowel or word labels, this work investigates a subject-invariant mapping of EEG data with different phonological categories, varying widely in terms ofunderlying articulator motions (eg: involvement or non-involvement of lips and velum, variation oftongue movements etc). Our model takes an advantage of feature extraction capability of CNN,LSTM as well as the deep learning benefit of deep autoencoders. We took [164, 202] as the baselineworks investigating the same problem and compared our performance with theirs. Our proposedmethod highly outperforms the existing methods across all the five binary classification tasks bya large average margin of 22.51%. This concept of phonological categorization will be utilized as172C1,1 C1,2      C1,N-1 C1,N C2,1                  C2,NCM-1,1                CM-1,NCM,1CM,2     CM,N-1 CM,NCCVEEG INPUTCNNTCNNCNNTCNNDAELatent VectorsDAEXgBoostStacked latent vectors    11 classes    11 classes(SPEECH TOKEN IDENTIFICATION)PHASE ONE     PHASE TWOXgBoost       2 classes(PHONOLOGICAL CATEGORIZATION)Figure 6.32: Overall framework of the proposed approachintermediate articulatory constraints for word and phoneme classification in the next subsection.6.5.3 EEG-based word/phoneme recognition via phonological categorizationIn order to infer imagined speech from active thoughts, we propose a modification to our previoushierarchical deep learning BCI system for subject-independent classification of 11 speech tokensincluding phonemes and words. Our novel approach exploits predicted articulatory information ofsix phonological categories (e.g., nasal, bilabial) as an intermediate step for classifying the phonemesand words, thereby finding discriminative signal responsible for natural speech synthesis. Theproposed network is composed of hierarchical combination of spatial and temporal CNN cascadedwith a deep autoencoder. Our best models on the KARA database achieve an average accuracy of83.42% across the six different binary phonological classification tasks, and 53.36% for the individualtoken identification task, significantly outperforming our baselines. 
Ultimately, our work suggests the possible existence of a brain imagery footprint for the underlying articulatory movements related to different sounds, which can be used to aid imagined speech decoding.
In this work, as an extension of our previous work, our goal is to detect speech tokens from speech imagery (active thoughts or imagined speech [189]). Speech imagery refers to representing speech in terms of sounds inside the human brain without overt vocalization or articulatory movements. We hypothesize the existence of some sort of brain footprint for the articulatory movements underlying the related speech token imagery. Hence, we attempt to first predict phonological categories and then use these predictions to aid the recognition of imagined speech at the token level (phonemes and words). We introduce our framework for solving this problem next.
6.5.3.1 Proposed Deep Learning Framework
We denote the multivariate time-series data as X ∈ R^{C×T}, with a set of labels Y ∈ {y_1, y_2, ..., y_11}, where X corresponds to the single-trial EEG data with C channels and T time steps. Y is a one-hot encoded vector over 11 labels corresponding to the individual words and phonemes. In our case, C is 64 and the time interval is represented by 5,000 time steps. As discussed earlier, we essentially build our system in two consecutive steps. The first step is the binary classification of X ∈ R^{C×T} into the presence or absence of 6 phonological categories: {z_1, z̄_1}, {z_2, z̄_2}, {z_3, z̄_3}, {z_4, z̄_4}, {z_5, z̄_5}, {z_6, z̄_6}. The second step is to classify the concatenated autoencoder latent vectors from these 6 classification models, viz. W = ∪_{i=1}^{6} w_i, into the 11 classes {y_1, y_2, ..., y_11}, where w_i corresponds to the latent vector space of the ith phonological classification into {z_i, z̄_i}.
We build on our hypothesis that the active thought process underlying covert speech does have relevant features corresponding to the intended activity of the nasopharynx, lips, tongue movements and positions, etc. Hence, in the first phase, we target the five binary classification tasks addressed in [164, 202], i.e., presence/absence of consonants, phonemic nasal, bilabial, high-front vowels and high-back vowels. Additionally, we add a voiced vs. voiceless classification task whose goal is to provide information about the intended involvement of the vocal folds. In this way, rather than directly discriminating the individual phonemes and words, we first attempt to accurately classify the imagined phonological categories on the basis of the underlying intended articulatory movements.
Crucially, our target is to model the directional relationship and dependency among the electrodes over the entire time interval. Hence, instead of the conventional approach of selecting a handful of channels as in [164, 202], we address this issue by computing the channel covariance (CCV) as in the previous work. We then use convolutional neural networks (CNNs) [89] to extract spatial features from the covariance matrix. Each layer decodes non-linear spatial feature representations from the previous layer using convolutional filters and non-linear ReLU [123] activation functions applied to the resulting feature maps. We employ a four-layered 2D CNN stacking two convolutional and two fully connected hidden layers. This is the first level of the hierarchy, where the network is trained with the corresponding labels as target outputs, optimizing a cross-entropy cost function.
We describe architectural and hyper-parameter choices for our networks174L1L2L2L1C1,1 C1,2      C1,N-1 C1,N C2,1                  C2,NCM-1,1                CM-1,NCM,1CM,2     CM,N-1 CM,NL2L1CNNT-CNNCCVEEG ClassesOUTPUT INPUT ClassesDAEXG BoostCausal ConvDilated  Conv1x1tanh sig1x1ReLUSkip ConnectionsFigure 6.33: Overview of phonological prediction of our novel architecturein Table 6.9.In parallel with CNN, we apply a temporal CNN (TCNN) [7, 183] on the channel covariancematrices to explore the hidden temporal features of the electrodes. Namely, we flatten the lowertriangular matrix of the CCV and feed the data of length 1,891 to the TCNN. In order to capture thelong term dependencies and temporal correlations of the signal, we exploit a 6 layer stacked TCNNand train in a similar manner as CNN, using Adam [84] to optimize cross-entropy function. We usestacked dilation filters with a dilation factor of 2, resulting in exponential growth of receptive fieldwith depth and increase in model capacity. This essentially enhances the non-linear discriminativepower of the network, which is vital for our problem space. We concatenate the last fully-connectedlayer from the CNN with its counterpart in the TCNN to compose a single feature vector basedon these two penultimate layers thereby forming a joint spatio-temporal encoding of the cross-covariance matrix. In order to further reduce the dimensionality of the spatio-temporal encodingsand cancel background noise effects [200], we train an unsupervised deep autoenoder (DAE) [60]on the fused heterogeneous features produced by the combined CNN and TCNN information. The175DAE forms our second level of hierarchy, with 3 encoding and 3 decoding layers, and mean squarederror (MSE) as the cost function.At the third level of hierarchy, the discrete latent vector representation of the deep autoencoderis fed into an Extreme Gradient Boost based classification layer [16, 17] motivated by [200].The classifier receives its input from the latent vectors of the deep autoencoder and is trained ina supervised manner to output the final predicted phonological classes corresponding to speechimagery.Next, our goal is to use the combined information available from all the six phonological cate-gories to predict the 11 individual speech tokens present in our EEG dataset. Such a hierarchicalapproach essentially differs from the direct speech classification approach as it imposes richer con-straints on the information space by involving features from all the phonological categorizationtasks. Our results show the utility of this approach as we report in Section 6.5.3.2. To this end,we first stack the bottleneck features of the autoencoders corresponding to the aforementioned sixclassification tasks, into a matrix of dimensions 6× 256. In order to explicitly exploit phonologicalinformation in the imagined speech recognition task, we feed this stacked latent matrix as the inputto our classification model similar to the first phase.6.5.3.2 ExperimentsWe evaluate our models on a publicly available dataset, KARA ONE [202]. It is composed ofmultimodal data for stimulus-based, imagined and articulated speech state corresponding to 7phonemic/syllabic ( /iy/, /piy/, /tiy/, /diy/, /uw/, /m/, /n/) as well as 4 words (pat, pot,knew and gnaw). The study comprising the dataset consists of 14 participants, with each promptpresented 11 times to each individual. 
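Returning briefly to the temporal branch described above, a stack of dilated 1-D convolutions of the kind used in the TCNN can be sketched as follows. The kernel size of 5, the dilation factor of 2 and the six layers follow the text; the channel width is an assumption, and the gating and skip connections of the full architecture are omitted for brevity.

```python
import torch
import torch.nn as nn

class DilatedTCNN(nn.Module):
    """Six stacked 1-D convolutions with exponentially growing dilation
    (1, 2, 4, ...), so the receptive field grows exponentially with depth."""
    def __init__(self, channels=32, kernel=5, layers=6, classes=2):
        super().__init__()
        convs, in_ch = [], 1
        for i in range(layers):
            d = 2 ** i                                   # dilation doubles per layer
            convs += [nn.Conv1d(in_ch, channels, kernel, dilation=d,
                                padding=d * (kernel - 1) // 2),
                      nn.ReLU()]
            in_ch = channels
        self.features = nn.Sequential(*convs)
        self.head = nn.Linear(channels, classes)

    def forward(self, x):                  # x: (batch, length) flattened CCV
        h = self.features(x.unsqueeze(1))  # (batch, channels, length)
        return self.head(h.mean(dim=-1))   # global average pool + classifier

# the flattened lower triangle of a 61 x 61 covariance matrix has length 1891
logits = DilatedTCNN()(torch.randn(4, 1891))
```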
6.5.3.2 Experiments

We evaluate our models on a publicly available dataset, KARA ONE [202]. It is composed of multimodal data for stimulus-based, imagined and articulated speech states corresponding to 7 phonemic/syllabic prompts (/iy/, /piy/, /tiy/, /diy/, /uw/, /m/, /n/) as well as 4 words (pat, pot, knew and gnaw). The study comprising the dataset consists of 14 participants, with each prompt presented 11 times to each individual. Since our intention is to classify the phonological categories from human thoughts, we discard the facial and audio information and only consider the EEG data corresponding to imagined speech. More details regarding the database can be found in [202].

We randomly shuffle and divide the data (1,913 signals from 14 individuals) into train (80%), development (10%) and test (10%) sets. The architectural parameters and hyperparameters listed in Table 6.9 were selected through an exhaustive grid search based on the development set. We conduct a series of empirical studies starting from single hidden-layered networks for each of the blocks and, based on the validation accuracy, we increase the depth of each given network and select the optimal parametric set from all possible combinations of parameters. For the gradient boosting classification, we fix the maximum depth at 10, the number of estimators at 5,000, the learning rate at 0.1, the regularization coefficient at 0.3, the subsample ratio at 0.8, and the column sample per iteration at 0.4. We did not find any notable change in accuracy when varying the other hyperparameters of the gradient boosting classifier. For the phonological categorization task, the input data for the CNN and the TCNN (covariance matrix) are of length 61 × 61 and 1,891 respectively, while for the speech recognition task, the input data (phonological features) are of length 6 × 256 and 1,536 respectively. The input data for the deep autoencoders pertaining to the two tasks are of length 2,915 (1,891 TCNN + 1,024 CNN features).

Table 6.9: Selected parameter sets

Parameters       CNN                          TCNN                         DAE
Epochs           50                           50                           200
Total layers     6                            6                            7
Hidden layers    Conv: 32, 64 (3x3 masks);    mask: 5, dilation: 2         E: 1024, 512, 128;
                 Dense: 64, 128                                            D: 128, 512, 1024
Activations      ReLU; last layer: softmax    sigm, tanh, ReLU;            ReLU, ReLU, sigm,
                                              last layer: softmax          sigm, ReLU, tanh
Dropout          .25, .50                     .25, .50                     .25, .25, .25
Optimizer        Adam                         Adam                         Adam
Loss             Categorical cross entropy    Categorical cross entropy    Mean squared error
Learning rate    .001                         .002                         .001

Baselines. We use two baselines, one based on an individual LSTM and another based on an individual CNN. In each case, we pass the data from the cross-covariance matrix and classify directly based on the output of each of these networks. In addition, we compare to previous works on the same dataset [164, 202]. For meaningful comparisons, since these previous works follow a cross-validation setup (14-fold, where the model is trained on 13 subjects' data and tested on the 14th), we mimic the same data splits and report accuracy. To establish a benchmark for computationally costly deep learning work, we choose our 80%, 10%, 10% data splits after shuffling the data.
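For reference, the gradient boosting stage with the hyperparameters fixed above could be configured roughly as follows. This is a sketch using the xgboost Python package, not the thesis's exact code: the feature matrix stands in for the stacked DAE latent vectors, the variable names are ours, and mapping "regularization coefficient" and "column sample per iteration" onto specific xgboost arguments is our own reading.

```python
import numpy as np
from xgboost import XGBClassifier

# Stand-in for the DAE bottleneck features and binary phonological labels.
rng = np.random.default_rng(0)
latents = rng.normal(size=(1913, 256)).astype(np.float32)
labels = rng.integers(0, 2, size=1913)

clf = XGBClassifier(
    max_depth=10,            # maximum tree depth
    n_estimators=5000,       # number of boosting rounds
    learning_rate=0.1,
    reg_alpha=0.3,           # one reading of the "regularization coefficient"
    subsample=0.8,           # subsample ratio
    colsample_bytree=0.4,    # one reading of "column sample per iteration"
    objective="binary:logistic",
)
clf.fit(latents, labels)
print(clf.predict(latents[:5]))
```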
Results of phonological category prediction

To demonstrate the significance of the hierarchical CNN-TCNN-DAE method, we also conduct separate experiments with the individual networks and summarize the results in Table 6.10. From the average accuracy scores, we observe that our proposed network performs much better than the individual networks.

Table 6.10: Results in accuracy on 10% test data for phonological prediction. C-L-D: CNN + LSTM + DAE

Method       ±Bilab    ±Nasal    C/V      ±/uw/    ±/iy/    Avg
LSTM         46.07     45.31     45.83    48.44    46.88    46.51
CNN          59.16     57.20     67.88    69.56    68.60    64.48
CNN+LSTM     62.03     60.89     70.04    72.76    63.75    65.89
C-L-D        78.65     74.57     87.96    83.25    77.30    80.35
Our model    81.67     78.33     89.16    85.00    87.20    84.27

A detailed analysis on repeated runs further shows that, in most of the cases, the LSTM alone does not perform better than chance. The CNN, on the other hand, is heavily biased towards the class label from which it sees more training data. Although the situation improves with the combined CNN-LSTM, our analysis clearly shows the necessity of a better encoding scheme to utilize the combined features, rather than mere concatenation of the penultimate features of both networks. CNN-LSTM-DAE improves classification accuracy by a significant margin, thus demonstrating the utility of the autoencoder in filtering out the unrelated and noisy features from the concatenated penultimate feature set. Replacing the LSTM block with the TCNN block endows the network with more temporal discriminative power, resulting in an increase of 3.93% in mean accuracy, as shown in Table 6.10.

In addition to accuracy, we provide the precision, recall, specificity, F1 score and Kappa coefficients of our method for all six classification tasks in Table 6.11. Kappa coefficients offer a metric for evaluating the utility of classifier decisions beyond mere chance [169]. Here, a higher mean kappa value for a task implies that the network is able to find better discriminative information in the EEG data beyond random decisions. The maximum above-chance accuracy (78.32%) is recorded for the presence/absence-of-vowel (C/V) task and the minimum (56.66%) for the ±nasal task.

Table 6.11: Classification performance metrics on 10% test data in the phonological prediction task

Metrics    Precision    Recall    Specificity    F1 score    Kappa
±Bilab     72.09        75.61     84.81          73.81       63.34
±Nasal     67.44        70.73     82.28          69.05       56.66
C/V        86.36        65.52     96.7           74.51       78.32
±/uw/      77.27        56.67     94.44          65.39       70.00
±/iy/      86.04        78.72     91.78          82.22       74.40
±Voiced    78.95        86.96     68.63          82.76       58.32
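The per-task metrics reported in Table 6.11 can be computed from a vector of binary predictions with standard tooling; a minimal sketch (using scikit-learn, with made-up predictions purely for illustration) is shown below. Specificity is simply the true-negative rate.

```python
import numpy as np
from sklearn.metrics import (cohen_kappa_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Illustrative ground truth and predictions for one binary phonological task.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "precision":   precision_score(y_true, y_pred),
    "recall":      recall_score(y_true, y_pred),
    "specificity": tn / (tn + fp),                      # true-negative rate
    "f1":          f1_score(y_true, y_pred),
    "kappa":       cohen_kappa_score(y_true, y_pred),   # agreement beyond chance
}
print({k: round(v, 4) for k, v in metrics.items()})
```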
Further, to evaluate the robustness of our model against the availability of data, we run a set of experiments varying the train-test ratio of the data (results shown in Figure 6.34).

Figure 6.34: Variation of performance accuracy of phonological prediction with varying training-validation-test data ratio

As Figure 6.34 shows, even with less training data (40%) and a larger, potentially more diverse test set (50%), our model performs above chance, which indicates its reliability even under such extreme data distribution conditions.

We provide the performance of the baseline methods on direct covariance data and on phonological feature data in Table 6.12.

Table 6.12: Comparison of accuracy on 10% test data for the speech token prediction task

Method          EEG data    Phonological features
LSTM            8.45        15.83
CNN             8.88        16.02
CNN+LSTM        12.44       22.10
CNN+LSTM+DAE    23.45       49.19
Our model       28.08       53.36

For a closer look at the results, we report a sample confusion matrix of our model under a leave-one-subject-out classification strategy in Figure 6.35. In this step, we essentially train the network on the data of 13 subjects and test on the 14th subject, to check the inter-subject variability of our model.

Figure 6.35: Inter-subject confusion matrix for speech token prediction with covariance data (left) and with phonological feature data (right)

As is evident from the figure, with direct covariance data, the predicted classes corresponding to each true label are widely distributed throughout the matrix and hardly give any significant information about the actual speech token. However, the involvement of the phonological categorization as an intermediate step increases the prediction accuracy. Interestingly, the false negatives corresponding to each of the tokens also inform us about the respective structure of the word or phoneme. For example, the misclassification of /n/ as /m/, 'knew' and 'gnaw' in a few cases shows that, while the network gets strong discriminative features from the other five networks, features pertaining to the nasal category require more discriminative ability to categorize the phoneme /n/ more accurately. Such an observation indicates that the phonological features play a significant role in achieving an accurate classification of the speech tokens.

Furthermore, Figure 6.36 records the precision and recall scores of all the speech tokens on the 80-10-10 train-dev-test split. In Figure 6.37, we again vary the train-test ratio of the data and present the performance accuracy for speech token prediction corresponding to the top 4 models indicated in Table 6.12.

Figure 6.36: Precision and recall metrics corresponding to each speech token on 10% test data

Figure 6.37: Variation of performance accuracy of speech token prediction for the top 4 algorithms with varying training-validation-test data ratio

6.5.3.3 Summary

We reported a novel hierarchical deep neural network architecture composed of a parallel spatio-temporal CNN and a deep autoencoder for phonological and speech token prediction from imagined speech EEG data. Overall, we made the following contributions: (1) we proposed a novel method for embedding the high-dimensional EEG data into a cross-covariance matrix that captures the joint variability of the electrodes. Rather than attempting to directly decode speech thoughts into speech tokens, (2) we exploited the cross-covariance matrix to successfully classify the phonological attributes of these thoughts into 6 categories; and (3) we used these predicted phonological categories to identify speech tokens. Ultimately, (4) our work suggests the existence of a brain imagery footprint for the underlying articulatory movements representing speech tokens.

6.5.4 Section Summary

In this section, we developed models for decoding speech-related information from EEG signals. We first introduced a hierarchical model in Section 6.5.1 for the recognition of vowels, short words and long words. Next, we modified the model and examined its application in phonological categorization. These phonological categories (like the presence or absence of consonants, nasal and bilabial sounds, etc.) can be viewed as articulatory information extracted from the EEG signal.
Finally, with some additional modifications to our backbone network, we used this phonological information to classify words and phonemes.

Though the final classification accuracy is not satisfactory, it nevertheless demonstrated an improvement in performance due to the inclusion of the articulatory information in the proposed pathway. This result is particularly important in our investigation, as it shows that, although the information content of the EEG signal is limited, the articulatory footprint embedded in it can still be leveraged to aid imagined speech decoding.

6.6 Summary and Future Directions

In this chapter, different approaches were introduced to facilitate extensions of the present work on hand-to-formant mapping. First, we studied several speech-related interfaces that can be used to control artificial vocal tract geometry via hand movements. We started with the analysis of Pink Trombone and VTDemo and presented a discussion of their comparative control along with their advantages and disadvantages. Then we developed two mechanical interfaces, one kinematic and one force-based, for the manipulation of tongue structures. The kinematic interface, also called Sound Stream, was used to control a 2D mechanical tongue-like structure directly with a novel 5 DOF hand movement-based control strategy using sliders. On the other hand, the force-based interface, called Sound Stream II, was used to control selected tongue muscle excitations (longitudinal and genioglossus muscles) via joystick force sensors. This force-activated 4 DOF mechanical control can be used as an alternative pathway to the conventional kinematic approaches of controlling vocal tract movements. This can be considered a natural extension of our work, where our hand movements are used to control a vocal tract structure instead of an acoustic space. This will provide us with a better estimate of the difficulty level of speech motor control tasks.

After investigating different control strategies, we next turned to exploring different ways of mapping vocal tract movements to speech. First, we classified vocal tract configurations from MRI videos based on vowels, consonants, and vowel-consonant-vowel transitions using coupled convolutional and recurrent neural networks. Next, we performed continuous vowel synthesis from ultrasound images using an improved version of 3D CNNs and a formant-based speech synthesiser named Klatt. Furthermore, we introduced a reversible articulatory-acoustic mapping for an online articulatory speech synthesiser (called Pink Trombone) corresponding to static vowels, using deep convolutional autoencoder and normalizing flow techniques. These works are particularly important because they form the basis of future investigations on determining the information capacity of the articulatory process using imaging modalities.

Finally, we explored an EEG-to-speech mapping using hierarchical deep learning-based techniques, where we attempted to infer speech tokens and phonological categories from the brain signals. For this, we first computed the EEG channel covariance and then extracted spatio-temporal features from the covariance matrices. We introduced different modifications of the proposed initial deep learning model to improve its feature extraction capability. We found evidence of the presence of an articulatory footprint in the brain signals. However, the information content of the EEG signals appeared to be poor, which limited the possibility of performing further investigations on this.
Future work should be directed towards understanding the actual speech-related information content of EEG signals and seeking advanced brain signal acquisition devices with better resolution.

Possible future directions related to all these works will be discussed further in the next chapter (Chapter 7).

CHAPTER 7

Conclusions

In this work, we addressed some of the gaps associated with the quantitative understanding of the difficulty level of a speech-like complex motor control task. The main ideas of the work can be grouped into two categories. Firstly, this thesis presents the formulation and validation of indices of difficulty in different trajectory tasks, targeting the advancement of the information theoretic view of motor control. Driven by the need for better trajectory complexity metrics that capture the movement parameters and constraints, we define two indices of difficulty: one in a quantal space and the other in a non-quantal space. We also show their effectiveness with toy examples (in Chapter 3) as well as with formant trajectories (in Chapter 5). Secondly, this work proposes computational models of kinematics-to-acoustics and acoustics-to-perception mapping (in Chapter 4). Further, we utilize the said mappings to reduce the difficulty of the task by categorizing the formant space, as well as to improve the user's performance by helping them decrease the movement duration required to perform the task (in Chapter 5). The quantitative analysis and experimental validation strengthen the view that if our hand motor control can learn to take advantage of the proposed mapping and the perceptual constraints, then it is plausible that the speech articulators likewise leverage any such available means of reducing the difficulty of the otherwise complex speech motor task. We also report the progress of our works connecting the articulatory and acoustic domains as well as other alternative strategies for addressing the motor control problems (in Chapter 6).

In this context, it is to be noted that we performed our investigation with a hand-to-formant mapping and a computational perception model. However, a full-fledged, comprehensive evaluation of the quantal effects related to speech production and perception utilizing the proposed pathway calls for experiments in the actual vocal tract space, which in turn needs further analysis of several critical factors that will be discussed in Section 7.2. In what follows, we summarize the main contributions of the thesis and their impacts in the area of motor control, particularly speech motor control.

7.1 Summary of Contributions

A detailed summary of the contributions of this work is presented below:

• Identified and included a missing factor in the existing difficulty metric of trajectory tasks. We showed with toy examples that the prevalent Steering Law-based metric used for the quantification of trajectory task difficulty is not effective in differentiating the difficulty levels of tasks with varying curvatures. Simultaneously, we introduced a new component that indicates the sharpness of bends in the trajectories. We further evaluated with toy examples how the proposed index of difficulty allows us to quantify the number and angle of bends in tunnel tasks.

• Formulated a difficulty metric of movement tasks in a partitioned 2D space. We found no existing metric that could quantify the level of difficulty of movement tasks in a quantal or partitioned space.
Therefore, we extended Fitts' law and the Steering Law into a discrete, categorized 2D space and evaluated the resulting metric on toy examples. We also demonstrated the relative difficulty of one task with respect to another by varying different task specifications, such as the length and tolerance of the path, the 2D target area, etc.

• Developed a novel mapping between hand kinematics and categorical perception. We designed and implemented a deep learning-based method for developing a perceptual mapping between hand movements and perceptual categories in a quantal formant space. We found that a spatio-temporal graph CNN with Gaussian radial basis functions, coupled to LSTM encoder-decoder networks, can efficiently capture the dynamics of the user's hand movement while simultaneously taking into account the hand coordinates relative to the different cardinal vowels in the 2D formant space. We also found that the use of LSTM networks coupled to a CTC loss function is appropriate in the context of modeling categorical perception in a kinematic space.

• Demonstrated the appropriateness of the proposed difficulty metric with a validation study. Our results from a number of pilot studies and a final single-user study demonstrated that the proposed difficulty metrics in quantal and uniform 2D spaces correlate well with the movement times required by the user to perform the respective tasks. This implies that the proposed indices of difficulty are good representatives of the task complexities and can be used, with minor modifications, for quantifying the difficulty levels of trajectory tasks in 2D space.

• Demonstrated that movement tasks have a lower index of difficulty in a quantal space than in a uniform space. Using examples of trajectory tasks in a square space (toy example), we showed an average four-fold reduction in task difficulty due to the transformation of a uniform continuous space into a quantal categorized space. Additionally, we showed that a movement task in a continuous formant space is, on average, around 3.5 times more difficult than in a quantal formant space.

• Demonstrated user throughput improvement for movement tasks leveraging the proposed mapping. We showed that the time taken by the user to perform a movement task in the transformed quantal space, leveraging the perceptual mapping, is around 4.5 times lower than in the original continuous space. Additionally, we showed a significant improvement in the index of performance of the user owing to the transformation of the task space.

• Estimated the information rate of movement tasks with three motor control paradigms in continuous and quantal task spaces. We determined the rate of information transmission through the human motor control pathway for a set of control paradigms - mouse-based control, a joint 2D control and a decoupled 1D+1D hand movement control - for performing trajectory tasks. For this, we theoretically evaluated the difficulty level of a quantal and a non-quantal task space and experimentally recorded the movement time duration of the user in both task spaces.

• Showed the possibility of the human motor control pathway possessing a certain amount of information capability for performing trajectory tasks. From the experiments, we observed that with an increase in difficulty level, the motor control pathway does not produce drastically more information in a given amount of time but tries to take better advantage of the constraints to make the task easier. The rate of information processing is nearly constant across the different hand motor control paradigms.
This means that the user is not producing more information in a given amount of time as the task gets harder, but rather takes better advantage of the information they can produce by making the task easier.

• Demonstrated that independent control of trajectory tasks is harder than coupled control. We quantitatively investigated the user's throughput corresponding to a joint 2D control and an independent 1D+1D control under a variety of trajectory tasks. In each case, with a fixed level of task difficulty, we found that the movement time required by the 1D+1D control is much higher than that of the coupled 2D control. Consequently, we showed that the former control paradigm has a lower index of performance than the latter.

• Took a step towards understanding the information theoretic view of speech motor control. The information theoretic view of speech movement trajectories has not previously been examined in the way we approached the problem. We, for the first time, present a study exploring the task difficulty of hand movements in a vowel formant space. Although the motor control paradigm of a musculoskeletal structure like the hand is fundamentally different from that of a muscular hydrostat like the tongue, this study essentially reveals the capability of human motor control to leverage a given constraint through a non-linear mapping. Our quantitative study with hand motor control and its ability to utilize a speech-like quantal space to make movement tasks easier and faster undoubtedly gives an intuitive explanation of the quantal effects in speech production and perception.

• Developed articulatory-acoustic mappings to facilitate extension of the current work. To expedite future investigation in the actual vocal tract space, we established an automated connection between ultrasound-based tongue movements and the formant trajectories without explicitly tracking tongue motion. Furthermore, we developed a joint representation of the articulatory and acoustic domains to encourage extension of the current work. The joint representation enables an invertible mapping from the articulatory to the acoustic (forward) and from the acoustic to the articulatory (reverse) domain. This allows a movement trajectory in the formant space to be transformed into an equivalent movement trajectory in the articulatory space.

• Developed hardware sound interfaces and validated the feasibility of software sound interfaces for extending the current work. We assessed the feasibility of two software interfaces, VTDemo and Pink Trombone, for investigating other ways of hand-to-speech motor control. We also developed a kinematic hardware interface called "Sound Stream" connected to an artificial tongue-like structure and a force-based hardware interface called "Sound Stream II" connected to the ArtiSynth tongue. Both interfaces connect the hand space to the sound space via intermediate articulatory movements. This will enable further investigation of the current problem in the articulatory as well as the muscle excitation space.

7.2 Limitations and Future Directions

Our results demonstrate some significant contributions in the field of (speech) motor control. However, we also perceive some limitations of the current study that can be addressed in future investigations. The important limitations and future directions are summed up in the following subsections.

7.2.1 Main limitations and possible improvements

The present study involves only one user and the results are primarily based on the performance of that single user.
Therefore, there is a faint possibility that some of the results are particular to the behavior of this participant. The study needs to be run with more participants in the future to validate the outcomes of our experiment and to rule out outliers (if any). The trajectory tasks performed by the user with the three modes of control in all three rounds are somewhat similar. Therefore, it is possible that the user performed better in the later rounds or experiments because of the order of the experiments and his familiarity with the task from the earlier rounds or experiments. Though nothing strikingly different is visible in the user data, it is also possible that the user was fatigued towards the end of each round. We will have to investigate this further and find a way of identifying any source of bias in the experiments. Besides, with more participants and less practice time, it is possible that we will encounter participant errors or network-related errors. We will have to come up with effective ways of controlling and overcoming such situations. We do not have any physical boundaries in the current study, and hence the user can go outside the control space, which can lead to wrong inferences. In our future studies, we need to figure out ways to deal with the boundary issue while making sure not to include the contribution of physical constraints.

In the current study, we did not consider the effect of variation of the control-display ratio on the index of difficulty and throughput computation. This is because our index of difficulty formulation was based on the display space, following the majority of the HCI tasks studying Fitts' law. In our future work, we need to determine the impact of its variation on the difficulty and performance metrics, following the analysis done in [15, 81].

Further, we considered the coefficients of the difficulty metric components to be one in this study. A rigorous user study needs to be performed on the toy example space to evaluate the proper coefficients of those components, which would reveal the relative importance of different movement parameters in the trajectory task. Different Fitts' law type studies can be conducted directly in the toy example space to reach a conclusion regarding this. These were not done as a part of the current study because they are outside its scope. But in order to further validate the index of difficulty formulae, one can follow the experimental pathways detailed in the previous studies examining Fitts' law and its variants [3–5, 53, 54, 107–109].

Moreover, the perceptual feedback in our user study was purely based on visual symbols. As a result, timing information was missing from the feedback. The study assumed a relatively uniform speed of the user in performing the trajectory tasks, and hence there was no relative variation in the timing of the perceptual categories. For example, we did not have any vowel with twice the duration of another in the perceptual output, like /a/-/a/-/i/-/u/. Further studies could be directed at finding how such situations can be tackled. This will need a modification of the perceptual network as well as the inclusion of a time factor in the index of difficulty formulation. As a result, the user will also need to learn to control the relative speed of the trajectories.
The time factor can be incorporated into the study by dynamically varying the control-display gain or by including additional time-based constraints in such a way that the user can leverage them to achieve the desired relative timing. In that case, the perceptual vowel sequences can be visually represented using additional time-signature-based notations of the kind used for playing musical instruments like the piano. If auditory feedback is preferred instead, one can draw on the recent study by Thompson et al. [172] that investigates the learning of audiomotor skills with a touch screen-based speech synthesizer.

Finally, in this study, we did not consider reverse movements like /a/-/i/-/a/, as our index of difficulty metric is not defined for a bend angle of zero degrees. We also left out a few vowel trajectories lying approximately on the same trajectory in order to avoid confusion arising from possible erroneous network outputs. Further studies need to be performed to improve the mapping in order to include the missing vowel sequences in the study.

7.2.2 Investigation on throughput improvement

This work proposes new indices of difficulty along with intuitive explanations, illustrations with toy examples, and a validation study with three different motor control paradigms in acoustic space. While the correlation between the task difficulty and movement time is satisfactory, more effort needs to be invested in finding the coefficients or weights of the different factors included in the indices of difficulty formulae. Besides, in this work, we only focus on the original Fitts formulation as a starting point for determining the trajectory task difficulty in formant space. More research should be performed to check whether the Welford formulation, the MacKenzie formulation, a polynomial formulation, or any other version is more suitable for modeling the relationship between the task parameters and the movement time. Moreover, this is the first work to explore the index of difficulty in a partitioned quantal space. Therefore, more careful investigations need to be performed to determine what the effective width of the tunnel or the effective 2D target width should be in such a partitioned space. The aforementioned studies will not only reveal more information about the quantal speech task but will also contribute to the field of Human-Computer Interaction (HCI), as this is a significant aspect of many HCI tasks that has never previously been investigated. Finally, rigorous evaluation of the proposed metrics is essential to reach a definite conclusion. This can be done by conducting empirical studies with more users and can be considered one of the most obvious extensions of the current study.

7.2.3 Investigation of physical and biomechanical constraints

The current study is not a complete investigation of all the constraints available in speech production. Besides the perceptual constraints considered in the current study, there are some other constraints that the speech articulators are known to take advantage of. These mainly include physical and biomechanical constraints. The vocal tract is approximately a cylindrical structure that acts as the device generating the sounds. The articulators, including the tongue, leverage the structure of the jaw, palate, etc. to reduce the difficulty of the speech task. For example, bracing of the tongue during speech against opposing vocal tract surfaces like the teeth and palate has been found to be very effective in speech.
During most speech tasks, the tongue maintains continuous contact with the upper molars. This indicates that the tongue finds it essential to actively brace in running speech, which is an example of a reduction of the available degrees of freedom of the high-dimensional control system. Such physical and biomechanical constraints can also be incorporated into the current hand-to-formant motor control study through the inclusion of solid boundaries and tunnels within the formant space, as well as careful consideration of the finger and hand movements. For example, we can introduce obstructions and force fields in the 2D control space to help in movement planning or to modify the movement trajectories. The presence of obstructions will motivate rapid bang-bang control in certain parts of the space, whereas the availability of force fields will facilitate some "sweet spots" - movements that will be easier to achieve than others. These additional conditions will help us move closer towards the actual articulatory task space. Besides, the use of a force feedback mouse can incorporate additional force constraints into the current task. Similarly, by careful examination of the finger movements, we can involve additional joint constraints in the study. For example, the movement of the ring finger is highly coupled to that of the pinky finger as well as the middle finger. Similarly, it is easier to reach some spatial locations than others due to wrist and finger constraints. Thoughtful investigations along these directions will result in more interesting experiments leading to alteration of the difficulty level of the tasks, which can then be quantified using the proposed metrics or their extended versions.

7.2.4 Investigation in force space

In this thesis, we only consider kinematic trajectories within the formant frequency space. However, the kinematic movements are the results of muscle excitation trajectories. Motor control is a set of time-varying muscle activations that generates desired movements in a biomechanical system. The synergies of muscle excitation represent the neural control processes underlying the movements. In other words, they connect the causal neural activities to the resultant observed motions. Therefore, the kinematic trajectories of the limbs and articulators are simply the resultants of the continuously varying muscle activations. However, there is no current study that investigates the muscle activation space from the information theoretic point of view, and hence it would be interesting to explore and formulate the index of difficulty metrics in the muscle excitation trajectory space rather than the kinematic trajectory space.

For this, we have already developed the Sound Stream II interface, which can act as a starting point for isometric force-based investigations with tongue muscles in the ArtiSynth toolkit. It is highly possible that the speed-accuracy trade-offs associated with speech articulation occur in the muscle activation space rather than in the kinematic space, which is the subject of the current investigation. That might potentially be one of the reasons why the kinematic movements appear to be significantly easier than expected, as the targets might actually lie in the force domain instead of the kinematic space.
Put differently, the force-based targets might be wide enough that the perceived index of difficulty is much lower than that computed in the kinematic space. In such a case, minimal computational and activation effort would be required to reach a permissible excitation value (as the range of permissible values is wide), and the biomechanical and physical constraints could then take care of the remaining procedures in running speech to complete the intended motions. Detailed investigation in this direction can reveal significant information about the effective muscle activation synergies underlying speech articulation, which will help us understand speech motor control better.

7.2.5 Investigation with vocal tract movements

As previously mentioned, the primary motivation behind the hand-to-formant study is to understand the complexity of articulatory movements within the vocal tract. Now that our current study demonstrates clear evidence of the capability of the human motor control system to utilize perceptual constraints, the next step will be to investigate similar phenomena in the articulatory space with additional constraints. This can include bite-block experiments or experiments with tongue twisters. Besides, perturbations of the auditory feedback, or analysis of speech in a non-native language, can also be utilized to impose new constraints on the speech production process and consequently be explained in terms of the difficulty and throughput metrics. Similar experiments can also be conducted to evaluate the differences in movement complexity of braced versus non-braced tongue movements (e.g., 'la-ma-ma-ma' vs 'ca-ma-ma-ma'), slow versus rapid flap/tap sequence movements (e.g., 'ta-da-ta-da-ta-da'), etc. To facilitate this, we have already developed articulatory-acoustic mappings with Pink Trombone and ultrasound images that can act as the starting point of the relevant investigations. As is evident, investigation of the information theoretic view of such articulatory motor control in the vocal tract space will require rethinking the index of difficulty formulation as well as more empirical data.

7.2.6 Investigation in active thought space

In this work, we investigated the mapping of speech-imagery EEG to speech tokens and phonological categories through a classification pathway. We concluded that the EEG signal is extremely noisy and possibly does not carry enough information to perform successful classification of speech tokens over a large vocabulary. Therefore, as a next step, it would be interesting to try to decode the inner representation of articulatory kinematic trajectories from the cortical areas of the brain with electrocorticography or magnetoencephalography data, which allow monitoring brain activity with high spatio-temporal as well as spectral resolution. By locating the most important sites of speech-related activity in the sensorimotor cortex, and with appropriate pre-processing of the data using advanced signal processing techniques, it might be possible to extract the maximum information related to articulatory kinematic trajectories from the brain. Correspondingly, we could infer the articulatory movements or the resultant speech acoustics from the imagined speech brain signal data using a deep learning-based mapping. Using similar data-acquisition techniques, we could also investigate some of the fundamental questions raised at the beginning of the thesis (e.g., why are tongue twisters so difficult? How are they encoded in the human brain?
How do auditory feedback perturbations vary the encoded trajectories? etc.).

7.3 Concluding Remarks

To conclude, this thesis has revealed an information theoretic view of a speech-related hand motor control task which has not been previously investigated. The work has filled in gaps related to the analysis of the difficulty level of trajectory tasks in the context of general motor control theory. It has also conclusively demonstrated that movements with categorical perceptual constraints are easier to perform than those without such constraints. This work opens a new research direction on understanding how difficult the original speech task is and how the human motor control system takes advantage of the available constraints to reduce the degrees of freedom of the system, thereby making the task easier. We provide a starting point for answering the question we began the thesis with, i.e., "How hard is it to speak?". Further research in this direction will eventually enable a clearer understanding of speech motor control and the development of better silent speech interfaces leveraging the right motor control paradigm.

References

[1] VTDemo. https://www.phon.ucl.ac.uk/resource/vtdemo/. Accessed: 2010-09-30.
[2] A. H. Abdi, P. Saha, V. P. Srungarapu, and S. Fels. Muscle excitation estimation in biomechanical simulation using naf reinforcement learning. In Computational Biomechanics for Medicine, pages 133–141. Springer, 2020.
[3] J. Accot and S. Zhai. Beyond fitts' law: models for trajectory-based hci tasks. In Proceedings of the ACM SIGCHI Conference on Human factors in computing systems, pages 295–302, 1997.
[4] J. Accot and S. Zhai. Performance evaluation of input devices in trajectory-based tasks: an application of the steering law. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 466–472, 1999.
[5] J. Accot and S. Zhai. Scale effects in steering law tasks. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 1–8, 2001.
[6] T. Baer, J. C. Gore, L. C. Gracco, and P. W. Nye. Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels. The Journal of the Acoustical Society of America, 90(2):799–828, 1991.
[7] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[8] R. Behroozmand, R. Shebek, D. R. Hansen, H. Oya, D. A. Robin, M. A. Howard III, and J. D. Greenlee. Sensory–motor networks involved in speech production and motor control: An fmri study. Neuroimage, 109:418–428, 2015.
[9] K. Brigham and B. V. Kumar. Imagined speech classification with eeg signals for silent communication: a preliminary investigation into synthetic telepathy. In iCBBE, 2010, pages 1–4. IEEE, 2010.
[10] S. K. Card, W. K. English, and B. J. Burr. Evaluation of mouse, rate-controlled isometric joystick, step keys, and text keys for text selection on a crt. Ergonomics, 21(8):601–613, 1978.
[11] R. Carlson and B. Granström. Model predictions of vowel dissimilarity. In STL-QPSR, volume 3, pages 84–104. 1979.
[12] R. Carlson, B. Granström, and G. Fant. Some studies concerning perception of isolated vowels. Speech Transmission Laboratory Quarterly Progress and Status Report, 11(2-3):19–35, 1970.
[13] R. Carlson, G. Fant, and B. Granström. Two-formant models, pitch, and vowel perception. Acta Acustica united with Acustica, 31(6):360–362, 1974.
[14] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[15] G. Casiez, D. Vogel, R. Balakrishnan, and A. Cockburn. The impact of control-display gain on user performance in pointing tasks. Human–computer interaction, 23(3):215–250, 2008.
[16] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
[17] T. Chen, T. He, M. Benesty, et al. Xgboost: extreme gradient boosting. R package version 0.4-2, pages 1–4, 2015.
[18] T. Chiba and M. Kajiyama. The vowel: Its nature and structure, volume 652. Phonetic society of Japan Tokyo, 1958.
[19] L. A. Chistovich. Central auditory processing of peripheral vowel spectra. The Journal of the Acoustical Society of America, 77(3):789–805, 1985.
[20] L. A. Chistovich and V. V. Lublinskaya. The 'center of gravity' effect in vowel spectra and critical distance between the formants: Psychoacoustical study of the perception of vowel-like stimuli. Hearing research, 1(3):185–195, 1979.
[21] L. A. Chistovich, R. L. Sheikin, and V. V. Lublinskaja. Centres of gravity and spectral peaks as the determinants of vowel quality. 1979.
[22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[23] C. G. Christou, D. Michael-Grigoriou, D. Sokratous, and M. Tsiakoulia. Buzzwirevr: An immersive game to supplement fine-motor movement therapy. In ICAT-EGVE, pages 149–156, 2018.
[24] C. G. Christou, D. Michael-Grigoriou, D. Sokratous, and M. Tsiakoulia. A virtual reality loop and wire game for stroke rehabilitation. 2018.
[25] T. G. Csapó, T. Grósz, G. Gosztolya, L. Tóth, and A. Markó. Dnn-based ultrasound-to-speech conversion for a silent speech interface. Proc. Interspeech 2017, pages 3672–3676, 2017.
[26] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using rectified linear units and dropout. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8609–8613. IEEE, 2013.
[27] N. d'Alessandro, R. Pritchard, J. Wang, and S. Fels. Ubiquitous voice synthesis: interactive manipulation of speech and singing on mobile distributed platforms. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, pages 335–340. 2011.
[28] A. Das, J. Li, R. Zhao, and Y. Gong. Advancing connectionist temporal classification with attention modeling. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4769–4773. IEEE, 2018.
[29] C. S. DaSalla, H. Kambara, Y. Koike, and M. Sato. Spatial filtering and single-trial classification of eeg during vowel speech imagery. In iCREATe '09, page 27. ACM, 2009.
[30] C. S. DaSalla, H. Kambara, M. Sato, and Y. Koike. Single-trial classification of vowel speech imagery using common spatial patterns. Neural networks, 22(9):1334–1339, 2009.
[31] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
[32] P. Delattre, A. M. Liberman, F. S. Cooper, and L. J. Gerstman. An experimental study of the acoustic determinants of vowel color; observations on one- and two-formant vowels synthesized from spectrographic patterns. Word, 8(3):195–210, 1952.
[33] B. Denby and M. Stone. Speech synthesis from real time ultrasound images of the tongue. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–685. IEEE, 2004.
[34] B. Denby, Y. Oussar, G. Dreyfus, and M. Stone. Prospects for a silent speech interface using ultrasound imaging. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pages I–I. IEEE, 2006.
[35] S. Deng, R. Srinivasan, T. Lappas, and M. D'Zmura. Eeg classification of imagined syllable rhythm using hilbert spectrum methods. Journal of neural engineering, 7(4):046006, 2010.
[36] L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[37] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
[38] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
[39] C. G. Drury. Application of fitts' law to foot-pedal design. Human factors, 17(4):368–373, 1975.
[40] M. Duarte and S. M. Freitas. Speed–accuracy trade-off in voluntary postural movements. Motor control, 9(2):180–196, 2005.
[41] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
[42] M. D'Zmura, S. Deng, T. Lappas, S. Thorpe, and R. Srinivasan. Toward eeg sensing of imagined speech. In HCI, pages 40–48. Springer, 2009.
[43] L. Fadiga, L. Craighero, G. Buccino, and G. Rizzolatti. Speech listening specifically modulates the excitability of tongue muscles: a tms study. European journal of Neuroscience, 15(2):399–402, 2002.
[44] G. Fant. Acoustic description and classification of phonetic units. Ericsson Technics, 1(1973):32–83, 1959.
[45] S. Fels and G. Hinton. Glove-talkii: an adaptive gesture-to-formant interface. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 456–463, 1995.
[46] S. Fels and G. E. Hinton. Glove-talkii: Mapping hand gestures to speech using neural networks. In Advances in Neural Information Processing Systems, pages 843–850, 1995.
[47] S. Fels, B. Pritchard, E. Vatikiotis-Bateson, and V. V. Team. Gesture controlled synthetic speech and song. The Journal of the Acoustical Society of America, 125(4):2495–2495, 2009.
[48] S. S. Fels. Using normalized rbf networks to map hand gestures to speech. In Radial basis function networks 2, pages 59–101. Springer, 2001.
[49] S. S. Fels and G. Hinton. Glove-Talk II: Mapping Hand Gestures to Speech Using Neural Networks: an Approach to Building Adaptive Interfaces. Citeseer, 1994.
[50] S. S. Fels and G. E. Hinton. Glove-talkii-a neural-network interface which maps gestures to parallel formant speech synthesizer controls. IEEE transactions on neural networks, 9(1):205–212, 1998.
[51] S. S. Fels, R. Pritchard, and E. Vatikiotis-Bateson. Building a portable gesture-to-audio/visual speech system. In AVSP, pages 13–18. Citeseer, 2008.
[52] S. S. Fels, B. Pritchard, and A. Lenters. Fortouch: A wearable digital ventriloquized actor. In NIME, pages 274–275, 2009.
[53] P. M. Fitts. The information capacity of the human motor system in controlling the amplitude of movement. Journal of experimental psychology, 47(6):381, 1954.
[54] P. M. Fitts and J. R. Peterson. Information capacity of discrete motor responses. Journal of experimental psychology, 67(2):103, 1964.
[55] H. D. Fraser, S. Fels, and R. Pritchard. Walk the walk, talk the talk. In 2008 12th IEEE International Symposium on Wearable Computers, pages 117–118. IEEE, 2008.
[56] B. Galantucci, C. A. Fowler, and M. T. Turvey. The motor theory of speech perception reviewed. Psychonomic bulletin & review, 13(3):361–377, 2006.
[57] P. Ghane. Silent speech recognition in EEG-based Brain Computer Interface. PhD thesis, Indiana University-Purdue University Indianapolis, 2015.
[58] B. Gick, P. Anderson, H. Chen, C. Chiu, H. B. Kwon, I. Stavness, L. Tsou, and S. Fels. Speech function of the oropharyngeal isthmus: a modelling study. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2(4):217–222, 2014.
[59] E. F. González-Castañeda, A. A. Torres-García, C. A. Reyes-García, and L. Villaseñor-Pineda. Sonification and textification: Proposing methods for classifying unspoken words from eeg signals. Biomedical Signal Processing and Control, 37:82–91, 2017.
[60] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT Press, 2016.
[61] G. Gosztolya, Á. Pintér, L. Tóth, T. Grósz, A. Markó, and T. G. Csapó. Autoencoder-based articulatory-to-acoustic mapping for ultrasound silent speech interfaces. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
[62] H. Goyal, P. Saha, B. Gick, and S. Fels. Eeg-to-f0: Establishing artificial neuro-muscular pathway for kinematics-based fundamental frequency control. Canadian Acoustics, 47(3):112–113, 2019.
[63] A. Graves. Connectionist temporal classification. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 61–93. Springer, 2012.
[64] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376, 2006.
[65] D. Griffin and J. Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
[66] F. H. Guenther. Neural control of speech. Mit Press, 2016.
[67] F. H. Guenther and T. Vladusich. A neural theory of speech acquisition and production. Journal of neurolinguistics, 25(5):408–422, 2012.
[68] C. Guger, W. Harkam, C. Hertnaes, and G. Pfurtscheller. Prosthetic control by an eeg-based brain-computer interface (bci). In AAATE, pages 3–6, 1999.
[69] C. Herff and T. Schultz. Automatic speech recognition from neural signals: a focused review. Frontiers in neuroscience, 10:429, 2016.
[70] S. Hiroya and M. Honda. Estimation of articulatory movements from speech acoustics using an hmm-based speech production model. IEEE Transactions on Speech and Audio Processing, 12(2):175–185, 2004.
[71] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[72] J. Hodgen and P. Valdez. A stochastic articulatory-to-acoustic mapping as a basis for speech recognition. In IMTC 2001. Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics (Cat. No. 01CH37188), volume 2, pages 1105–1110. IEEE, 2001.
[73] C.-W. Huang, L. Dinh, and A. Courville. Augmented normalizing flows: Bridging the gap between generative flows and latent variable models. arXiv preprint arXiv:2002.07101, 2020.
[74] M. Huckvale. VT Demo. https://www.phon.ucl.ac.uk/resource/vtdemo/, 2010. version 3.6.0.
[75] T. Hueber, G. Aversano, G. Cholle, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, and M. Stone. Eigentongue feature extraction for an ultrasound-based silent speech interface. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07, volume 1, pages I–1245. IEEE, 2007.
[76] P. Hughes. Python and tkinter programming. Linux Journal, 2000(77es):23, 2000.
[77] B. M. Idrees and O. Farooq. Vowel classification using wavelet decomposition during speech imagery. In SPIN, 2016, pages 636–640. IEEE, 2016.
[78] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[79] R. J. Jagacinski, D. W. Repperger, M. S. Moran, S. L. Ward, and B. Glass. Fitts' law and the microstructure of rapid discrete movements. Journal of Experimental Psychology: Human Perception and Performance, 6(2):309, 1980.
[80] E. M. Juanpere and T. G. Csapó. Ultrasound-based silent speech interface using convolutional and recurrent neural networks. Acta Acustica united with Acustica, 105(4):587–590, 2019.
[81] B. H. Kantowitz and G. C. Elvers. Fitts' law with an isometric controller: effects of order of control and control-display gain. Journal of Motor Behavior, 20(1):53–66, 1988.
[82] G. D. Kessler, L. F. Hodges, and N. Walker. Evaluation of the cyberglove as a whole-hand input device. ACM Transactions on Computer-Human Interaction (TOCHI), 2(4):263–283, 1995.
[83] J. Kim, S.-K. Lee, and B. Lee. Eeg classification in a single-trial basis for vowel speech perception using multivariate empirical mode decomposition. Journal of neural engineering, 11(3):036010, 2014.
[84] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[85] D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[86] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[87] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[88] D. H. Klatt. Software for a cascade/parallel formant synthesizer. the Journal of the Acoustical Society of America, 67(3):971–995, 1980.
[89] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[90] S. R. Kuberski and A. I. Gafos. Fitts' law in tongue movements of repetitive speech. Phonetica, pages 1–20, 2019.
[91] S. R. Kuberski and A. I. Gafos. The speed-curvature power law in tongue movements of repetitive speech. Plos one, 14(3):e0213851, 2019.
[92] A. Kunikoshi, Y. Qiao, N. Minematsu, and K. Hirose. Speech generation from hand gestures based on space mapping. In Tenth Annual Conference of the International Speech Communication Association, 2009.
[93] D. R. Lametti, S. M. Nasir, and D. J. Ostry. Sensory preference in speech production revealed by simultaneous alteration of auditory and somatosensory feedback. Journal of Neuroscience, 32(27):9351–9358, 2012.
[94] A. C. Lammert, C. H. Shadle, S. S. Narayanan, and T. F. Quatieri. Speed-accuracy tradeoffs in human speech production. PloS one, 13(9):e0202180, 2018.
[95] Y. LeCun et al. Generalization and network design strategies. Connectionism in perspective, pages 143–155, 1989.
[96] A. Lenters. An adaptive gesture-to-speech interface for diva.
[97] C. Li, Z. Cui, W. Zheng, C. Xu, and J. Yang. Spatio-temporal graph convolution for skeleton based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[98] X. Li, X. Ying, and M. C. Chuah. Grip: Graph-based interaction-aware trajectory prediction. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 3960–3966. IEEE, 2019.
[99] B. Lindblom. Spectrographic study of vowel reduction. The journal of the Acoustical society of America, 35(11):1773–1781, 1963.
[100] S. G. Lingala, B. P. Sutton, M. E. Miquel, and K. S. Nayak. Recommendations for real-time speech mri. Journal of Magnetic Resonance Imaging, 43(1):28–44, 2016.
[101] Y. Liu, P. Saha, A. Shamei, B. Gick, and S. Fels. Mapping a continuous vowel space to hand gestures.
[102] Y. Liu, P. Saha, B. Gick, and S. Fels. Deep learning based continuous vowel space mapping from hand gestures. In Acoustics Week in Canada 2019, 2019.
[103] Y. Liu, P. Saha, and B. Gick. Visual feedback and self-monitoring in speech learning via hand movement. The Journal of the Acoustical Society of America, 148(4):2765–2765, 2020.
[104] Y. Liu, P. Saha, A. Shamei, B. Gick, and S. Fels. Mapping a continuous vowel space to hand gestures. Canadian Acoustics, 48(1), 2020.
[105] C. Luo and A. L. Yuille. Grouped spatial-temporal aggregation for efficient action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 5512–5521, 2019.
[106] M. K. Ma, S. S. Fels, and R. Pritchard. A parallel-formant speech synthesizer in max/msp. In ICMC, 2006.
[107] I. S. MacKenzie. A note on the information-theoretic basis for fitts' law. Journal of motor behavior, 21(3):323–330, 1989.
[108] I. S. MacKenzie. Fitts' law as a research and design tool in human-computer interaction. Human-computer interaction, 7(1):91–139, 1992.
[109] I. S. MacKenzie and W. Buxton. Extending fitts' law to two-dimensional tasks. In Proceedings of the SIGCHI conference on Human factors in computing systems, pages 219–226, 1992.
[110] S. Mahajan, I. Gurevych, and S. Roth. Latent normalizing flows for many-to-many cross-domain mappings. arXiv preprint arXiv:2002.06661, 2020.
[111] P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff. Lstm-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148, 2016.
[112] M. Mandal, L. K. Kumar, M. S. Saran, et al. Motionrec: A unified deep framework for moving object recognition. In The IEEE Winter Conference on Applications of Computer Vision, pages 2734–2743, 2020.
[113] K. J. Maner, A. Smith, and L. Grayson. Influences of utterance length and complexity on speech motor performance in children and adults. Journal of Speech, Language, and Hearing Research, 43(2):560–573, 2000.
[114] S. Martin, P. Brunner, I. Iturrate, J. d. R. Millán, G. Schalk, R. T. Knight, and B. N. Pasley. Word pair classification during imagined speech using direct brain recordings. Scientific reports, 6:25803, 2016.
[115] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In International conference on artificial neural networks, pages 52–59. Springer, 2011.
[116] S. Mathur and B. H. Story. Vocal tract modeling: Implementation of continuous length variations in a half-sample delay kelly-lochbaum model. In Signal Processing and Information Technology, 2003. ISSPIT 2003. Proceedings of the 3rd IEEE International Symposium on, pages 753–756. IEEE, 2003.
In Signal Processing and Information Technology, 2003.ISSPIT 2003. Proceedings of the 3rd IEEE International Symposium on, pages 753–756. IEEE, 2003.[117] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto. librosa: Audioand music signal analysis in python. In Proceedings of the 14th python in science conference,volume 8, 2015.[118] D. E. Meyer, R. A. Abrams, S. Kornblum, C. E. Wright, and J. Keith Smith. Optimality in humanmotor performance: ideal control of rapid aimed movements. Psychological review, 95(3):340, 1988.[119] B. Min, J. Kim, H.-j. Park, and B. Lee. Vowel imagery decoding toward silent speech bci usingextreme learning machine with electroencephalogram. BioMed research international, 2016, 2016.[120] A. Mohamed, K. Qian, M. Elhoseiny, and C. Claudel. Social-stgcnn: A social spatio-temporal graphconvolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition, pages 14424–14432, 2020.[121] K. Mohanchandra and S. Saha. A communication paradigm using subvocalized speech: translatingbrain signals into speech. Augmented Human Research, 1(1):3, 2016.[122] R. Mo¨tto¨nen and K. E. Watkins. Motor representations of articulators contribute to categoricalperception of speech sounds. Journal of Neuroscience, 29(31):9819–9825, 2009.[123] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. InProceedings of the 27th international conference on machine learning (ICML-10), pages 807–814,2010.[124] S. Narayanan, A. Toutios, V. Ramanarayanan, A. Lammert, J. Kim, S. Lee, K. Nayak, Y.-C. Kim,Y. Zhu, L. Goldstein, et al. Real-time magnetic resonance imaging and electromagneticarticulography database for speech production research (tc). The Journal of the Acoustical Society ofAmerica, 136(3):1307–1311, 2014.[125] R. Netsell. Speech motor control and selected neurologic disorders. Speech Motor Control, pages247–261, 1982.[126] C. H. Nguyen, G. K. Karavas, and P. Artemiadis. Inferring imagined speech using eeg signals: a newapproach using riemannian manifold features. Journal of neural engineering, 15(1):016002, 2017.[127] K. I. Nordstrom, S. Fels, C. D. Hassall, and B. Pritchard. Developing vowel mappings for aninteractive voice synthesis system controlled by hand motions. The Journal of the Acoustical Societyof America, 127(3):2021–2021, 2010.[128] K. Ogata and Y. Matsuda. The effect of training in producing continuous vowels with adata-glove-driven vocal tract configuration tool. In Proceedings of Meetings on Acoustics 172ASA,volume 29, page 060009. Acoustical Society of America, 2016.[129] K. Ogata, K. Matsumura, and Y. Matsuda. Data-glove-driven vocal tract configuration methods forvowel synthesis. Acoustical Science and Technology, 36(6):527–536, 2015.200[130] M. J. Orr et al. Introduction to radial basis function networks, 1996.[131] D. OShaughnessy. Formant estimation and tracking. In Springer handbook of speech processing,pages 213–228. Springer, 2008.[132] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi. Sequence-to-sequence prediction ofvehicle trajectory via lstm encoder-decoder architecture. In 2018 IEEE Intelligent VehiclesSymposium (IV), pages 1672–1678. IEEE, 2018.[133] G. E. Peterson and H. L. Barney. Control methods used in a study of the vowels. The Journal of theacoustical society of America, 24(2):175–184, 1952.[134] G. Pfurtscheller, R. Scherer, and C. Neuper. Eeg-based brain-computer interface. 
OXFORD SERIESIN HUMAN-TECHNOLOGY INTERACTION, page 315, 2008.[135] R. K. Potter and J. C. Steinberg. Toward the specification of speech. The Journal of the AcousticalSociety of America, 22(6):807–820, 1950.[136] B. Pritchard and S. Fels. Grassp: Gesturally-realized audio, speech and song performance. InProceedings of the 2006 conference on New interfaces for musical expression, pages 272–276. Citeseer,2006.[137] F. Pulvermu¨ller. Brain mechanisms linking language and action. Nature reviews neuroscience, 6(7):576–582, 2005.[138] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprintarXiv:1505.05770, 2015.[139] A. E. Rosenberg. Effect of glottal pulse shape on the quality of natural vowels. The Journal of theAcoustical Society of America, 49(2B):583–590, 1971.[140] A. C. Roy, L. Craighero, M. Fabbri-Destro, and L. Fadiga. Phonological and lexical motorfacilitation during speech listening: a transcranial magnetic stimulation study. Journal ofPhysiology-Paris, 102(1-3):101–105, 2008.[141] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal ofComputer Vision, 115(3):211–252, 2015.[142] P. Saha and S. Fels. Hierarchical deep feature learning for decoding imagined speech from eeg. toappear in AAAI, 2019. 2 pg abstract.[143] P. Saha and S. Fels. Hierarchical deep feature learning for decoding imagined speech from eeg. InProceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 10019–10020, 2019.[144] P. Saha and S. Fels. Learning joint articulatory-acoustic representations with normalizing flows.Proc. Interspeech 2020, pages 3196–3200, 2020.[145] P. Saha, D. R. Mohapatra, S. Praneeth, and S. Fels. Sound-stream ii: Towards real-timegesture-controlled articulatory sound synthesis. Canadian Acoustics, 46(4):58–59, 2018.[146] P. Saha, P. Srungarapu, and S. Fels. Towards automatic speech identification from vocal tract shapedynamics in real-time mri. arXiv preprint arXiv:1807.11089, 2018.201[147] P. Saha, M. Abdul-Mageed, and S. Fels. Speak your mind! towards imagined speech recognition withhierarchical deep learning. Proc. Interspeech 2019, pages 141–145, 2019.[148] P. Saha, S. Fels, and M. Abdul-Mageed. Deep learni