Open Collections

UBC Theses and Dissertations

Grammaticus ex machina: Tone inventories as hypothesized by machine. Fry, Michael David, 2020.



Full Text

Grammaticus ex machina: Tone inventories as hypothesized by machine

by Michael David Fry

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate and Postdoctoral Studies (Linguistics)

The University of British Columbia (Vancouver)

April 2020

© Michael David Fry, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Grammaticus ex machina: Tone inventories as hypothesized by machine

submitted by Michael David Fry in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Linguistics.

Examining Committee:

Molly Babel, Linguistics (Supervisor)
Douglas Pulleyblank, Linguistics (Supervisory Committee Member)
Paul Tupper, Simon Fraser University, Mathematics (Supervisory Committee Member)
Valter Ciocca, Audiology & Speech Sciences (University Examiner)
Gunnar Hansson, Linguistics (University Examiner)

Abstract

A fundamental task of linguistics is to accurately describe the sound patterns of a language. In the field of phonology, this often starts with identifying the set of contrastive sounds in the language, its phoneme inventory. If the language under investigation is a tone language, then identifying the contrastive tones in the language, its tone inventory, is also needed. Historically, phonologists have identified phoneme and tone inventories through lengthy elicitation sessions in order to determine contrasting units. Yet, given the recent advances in machine learning, there may be another way. In this thesis, I argue, by way of demonstration, that machine learning has become a valuable tool for field and theoretical linguists in the description of language and in the development of linguistic theory. Specifically, I present empirical support, using machine learning methods, for the theory of Emergent Phonology, which holds that phonology emerges as the "consequence of accumulated phonetic experience" (Lindblom, 1999, p. 195). This support comes in the form of hypothesized tone inventories (part of one's phonology) that emerge, via an unsupervised learning model, from acoustic-phonetic data for a given language. Since the hypothesized inventories match fairly well with the tone inventories standardly reported in the literature, an aspect of phonology is shown to have emerged from phonetics, and support for Emergent Phonology is achieved. To test the robustness of the unsupervised learning method, it is applied to four languages: Mandarin, Cantonese, Fungwa and English. Finally, since the identification of tone inventories has hitherto been under the purview of human linguists, success in this project provides a first step towards creating a grammaticus ex machina: a linguist (grammarian) from the machine.

Lay Summary

The primary goal of this thesis is to demonstrate that machine learning is a valuable tool that can be used by linguists to analyze language. To achieve this, machine learning is used to generate tone inventories for several languages. Tones are distinctive pitch patterns in a language that change the lexical or grammatical meaning of a word. For example, in the tone language of Mandarin, /ma/ produced with a high-flat pitch means 'mother' and /ma/ produced with a falling-then-rising pitch means 'horse.' A tone inventory lists all distinctive pitch patterns that occur in a language.
In the past, it has been the job of human linguists to determine the tone inventory for a language; this thesis, however, considers how machine learning may be used to automate the process.

Preface

This dissertation is original work by the author. I wrote all chapters and computer code used in this project. None of this work has been published in written form, but aspects of it have been presented at the Acoustical Society of America conference held in Victoria, BC in November 2018 (Fry, 2018). The idea for this thesis is primarily my own, but early conversations with the late Dr. Eric Vatikiotis-Bateson and, subsequently, with my committee helped refine the project. As the work herein is entirely computational, no ethics approval was required.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Unpacking the introduction
    1.1.1 Machine learning as a tool for linguists
    1.1.2 Bridging computational and theoretical linguistics
      1.1.2.1 A word on biases and machine learning
    1.1.3 The intersection of language acquisition
  1.2 Outlining the project
    1.2.1 Operationalizing Phonological Units and Processes
    1.2.2 Operationalizing the unsupervised learning model
    1.2.3 Operationalizing acoustic parameters of digitized speech
  1.3 Summary of the project: Thesis statement
  1.4 Thesis Structure
2 Literature Review: Emergent phonology, tone, machine learning and the overlap
  2.1 The linguistics side: Emergent Phonology and tone
    2.1.1 Emergent Phonology
      2.1.1.1 Phonology
      2.1.1.2 Emergence
      2.1.1.3 Phonetic Experiences
      2.1.1.4 Emergent Phonology Summary
    2.1.2 Tone
      2.1.2.1 Phonetics of Tone
      2.1.2.2 Phonology of Tone
    2.1.3 The linguistic side: Summary statement
      2.1.3.1 An important caveat
  2.2 The computational side: Machine learning
    2.2.1 Supervised Learning
      2.2.1.1 A brief overview of neural networks
      2.2.1.2 Supervised learning in neural networks
      2.2.1.3 The rise of deep learning
    2.2.2 Unsupervised Learning
      2.2.2.1 Autoencoders
      2.2.2.2 Hierarchical clustering
    2.2.3 The computational side: Summary
  2.3 The overlap: Combining computation and linguistics
3 Methodology: Implementation and Explication
  3.1 Method Summary
  3.2 Data Preprocessing
    3.2.1 Syllable Demarcation
      3.2.1.1 A word on syllable-frames
    3.2.2 Acoustic Parameters
      3.2.2.1 Implementation
  3.3 Dimensionality Reduction
    3.3.1 Training an adversarial autoencoder
    3.3.2 Model Parameters
    3.3.3 Evaluating training
    3.3.4 Reducing the dimensionality of acoustic parameters
  3.4 Clustering
    3.4.1 Evaluation Metrics
    3.4.2 Visualizing Results
  3.5 Chapter Summary
4 Case Studies
  4.1 Case Study I: Mandarin
    4.1.1 Mandarin Tones
    4.1.2 Motivation for inclusion
    4.1.3 Corpus Data
    4.1.4 Data Preprocessing
    4.1.5 Results
      4.1.5.1 Adversarial Autoencoder Performance
      4.1.5.2 Hypothesized Tone Inventories
      4.1.5.3 Cluster Evaluation
    4.1.6 Discussion
  4.2 Case Study II: Cantonese
    4.2.1 Cantonese Tones
      4.2.1.1 Cantonese Tone Mergers
    4.2.2 Motivation for inclusion
    4.2.3 Corpus Data
      4.2.3.1 Data Preprocessing
    4.2.4 Results
      4.2.4.1 Adversarial Autoencoder Performance
      4.2.4.2 Hypothesized Tone Inventories
      4.2.4.3 Cluster Evaluation
    4.2.5 Discussion
  4.3 Case Study III: Fungwa
    4.3.1 Motivation for inclusion
    4.3.2 Corpus Data
      4.3.2.1 Data preprocessing
    4.3.3 Results
      4.3.3.1 Adversarial Autoencoder Performance
      4.3.3.2 Cluster Evaluation
    4.3.4 Discussion
  4.4 Case Study IV: English
    4.4.1 Motivation for Inclusion
    4.4.2 Corpus Data
      4.4.2.1 Data Preprocessing
    4.4.3 TIMIT Results
      4.4.3.1 Adversarial Autoencoder Performance
      4.4.3.2 Hypothesized Tone Inventories
      4.4.3.3 Cluster Evaluation
    4.4.4 Buckeye Results
      4.4.4.1 Adversarial Autoencoder Performance
      4.4.4.2 Hypothesized Tone Inventories
      4.4.4.3 Cluster Evaluation
    4.4.5 Discussion
  4.5 Cross-Language Comparison
5 General Discussion, Future Directions and Conclusion
  5.1 Summary of the project and results
  5.2 Presently available uses of the method
    5.2.1 Allotones in Mandarin
    5.2.2 Speaker differences in Cantonese
  5.3 Future applications of the method in phonological research
    5.3.1 The emergence of phonological patterns
    5.3.2 Comparison to language acquisition
    5.3.3 Language Typology
  5.4 Refinements for the method
    5.4.1 Incorporating additional acoustic-phonetic features
    5.4.2 New clustering evaluation metrics to discern optimal clustering
    5.4.3 Achieving a fully unsupervised method
  5.5 Conclusion
Bibliography
A Supplementary Figures
  A.1 Distribution of groundtruth labels in clusters (hypothesized tones) identified using the method for Mandarin
  A.2 Hypothesized tones for Mandarin by clustering latent codes from a vanilla autoencoder
  A.3 Hypothesized tones for Mandarin by clustering acoustic-parameters without an autoencoder

List of Tables

Table 1.1: Mandarin Tones, adapted from (Xu, 1997).
Table 2.1: The questions outlining the exposition of this chapter.
Table 2.2: Mandarin Tones, adapted from (Xu, 1997).
Table 2.3: Mandarin Tones, adapted from (Xu, 1997).
Table 3.1: Method Summary.
Table 4.1: Mandarin tones, adapted from (Xu, 1997).
Table 4.2: Cantonese tones, adapted from (Lam et al., 2016).
Table 4.3: A comparison of the optimal number of tones for a language as determined by the method with that standardly reported in the literature.

List of Figures

Figure 1.1: A visualization of feature extraction from (Unni, 2018). The columns correspond to categories; the rows (from bottom up) correspond to increasingly higher-level features for the categories.
Figure 1.2: An example of clustering in two dimensions with idealised data.
Figure 2.1: A zoomed-in waveform of a sustained /i/. With 6 quasi-periodic cycles present in 0.043 s, this utterance has an approximate f0 of 140 Hz.
Figure 2.2: F0 contours of a high-level and a rising tone exemplar of Mandarin, produced by a native male talker of Mandarin. Tones were produced as the first syllable of a two-syllable utterance.
Figure 2.3: F0 contours overlaid on a single graph. The left-hand image overlays f0 contours for all tones produced by a single talker in the Mandarin Chinese Phonetic Segmentation and Tone Corpus (Yuan et al., 2015); the right-hand image overlays f0 contours for only the rising tone for the same talker.
Figure 2.4: An autosegmental example of a tone associating with a phoneme.
Figure 2.5: Phonological structure for the English word "level".
Figure 2.6: An image of a neuron and its dendrites and axon.
Figure 2.7: A simple ANN.
Figure 2.8: Example MNIST images.
Figure 2.9: A simple neural network design for learning MNIST digits.
Figure 2.10: A simple autoencoder design with labels for the encoding portion (to the latent code) and the decoding portion (from a latent code to reconstruction).
Figure 2.11: A dendrogram of linked data points/clusters. Each arch corresponds to the distance needed to connect two points or clusters. These data are from the Fungwa case study in §4.3.
Figure 3.1: MFA output for Cantonese. The result is a Praat TextGrid (Boersma et al., 2002) that has syllables and sound segments aligned to audio.
Figure 3.2: A visualization of the Adversarial Autoencoder model generated using Tensorboard (Mané et al., 2015). The autoencoder, like that described in §2.2.2.1, is outlined in blue.
Figure 3.3: Adversarial Autoencoder loss functions.
Figure 3.4: A dendrogram of linked data points/clusters. The upper image presents just the dendrogram. The lower image presents the same dendrogram with the longest distance cut, suggesting the optimal clustering for these data is four.
Figure 3.5: Hypothesized Tones.
Figure 4.1: Mean f0 contours of the four contrastive Mandarin tones and the neutral tone. These contours were derived from the corpus of this case study using the ground-truth tone labels.
Figure 4.2: Adversarial autoencoder convergence for the Mandarin corpus data, as shown in the reduction of reconstruction error on the test set.
Figure 4.3: Reconstructed Mandarin tones represented as normalized f0 contours. The top row shows ground-truth exemplars; the bottom row shows corresponding reconstructions.
Figure 4.4: Hypothesized tones as generated by the method for Mandarin, visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.
Figure 4.5: Visualization of the variability of each tone cluster identified by the method for Mandarin for a preset number of tones. Each plotted line corresponds to an f0 contour within the identified cluster.
Figure 4.6: Hypothesized tones for Mandarin with an inventory comprising two tones (left) and three tones (right).
Figure 4.7: Hypothesized tones for Mandarin with an inventory comprising four tones (left) and five tones (right).
Figure 4.8: Hypothesized tones for Mandarin with an inventory comprising six through nine tones.
Figure 4.9: Dendrogram evaluation of Mandarin tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises five tones.
Figure 4.10: Variance evaluations of Mandarin tone clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.
Figure 4.11: A comparison of the standard analysis of Mandarin tones (left) with hypothesized tone inventories (generated by the method). The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language; (b) the optimal number of tones as determined by variance metrics; and (c) the optimal number of tones as determined by the dendrogram.
Figure 4.12: Mean f0 contours of the six contrastive Cantonese tones and the high-falling variant of Tone 1. These contours were derived from the corpus of this case study using the ground-truth tone labels. Note: the high-falling variant of Tone 1 was added manually (an average of raw exemplars manually extracted from the corpus) because it was not annotated in the corpus data.
Figure 4.13: Adversarial autoencoder convergence for the Cantonese corpus data, as shown in the reduction of reconstruction error on the test set.
Figure 4.14: Reconstructed Cantonese f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.
Figure 4.15: Hypothesized tones as generated by the method for Cantonese, visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.
Figure 4.16: Visualization of the variability of each tone cluster identified by the method for Cantonese for a preset number of tones. Each plotted line corresponds to an f0 contour within the identified cluster.
Figure 4.17: Hypothesized tones for Cantonese with an inventory comprising two tones (left) and three tones (right).
Figure 4.18: Hypothesized tones for Cantonese with an inventory comprising four tones (left) and five tones (right).
Figure 4.19: Hypothesized tones for Cantonese with an inventory comprising six tones (left) and seven tones (right).
Figure 4.20: Hypothesized tones for Cantonese with an inventory comprising eight through eleven tones.
Figure 4.21: Dendrogram evaluation of Cantonese tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises ten tones.
Figure 4.22: Variance evaluations of Cantonese tone clusterings. The CH-Index (left) indicates the optimal number of clusters is nine (although five is quite close); the DB-Index (centre) indicates five; and the Silhouette Index (right) also indicates five.
Figure 4.23: A comparison of the standard analysis of Cantonese tones (left) with hypothesized tone inventories (generated by the method). The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language; (b) the optimal number of tones as determined by variance metrics; and (c) the optimal number of tones as determined by the dendrogram.
Figure 4.24: Hypothesized tones for Cantonese with an inventory comprising five tones (left) and six tones (right). The focus here is on the separation of the low-level tone in the five-tone analysis into a low-level and a low-falling tone in the six-tone analysis. This pattern mirrors the on-going tone merger of T4 and T6 in Cantonese.
Figure 4.25: Hypothesized tones for Cantonese with an inventory comprising six tones (left) and seven tones (right). The focus here is the separation of the low-rising tone in the six-tone analysis into a low-rising and a high-rising tone in the seven-tone analysis. This pattern mirrors the on-going tone merger of T2 and T5 in Cantonese.
Figure 4.26: F0 contours of the two tones in Fungwa. These contours are exemplars of high and low tones taken from a single male speaker in the corpus of this case study.
Figure 4.27: Adversarial autoencoder convergence for the Fungwa corpus data, as shown in the reduction of reconstruction error on the test set.
Figure 4.28: Reconstructed Fungwa f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.
Figure 4.29: Hypothesized tones as generated by the method for Fungwa, visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.
Figure 4.30: Visualization of the variability of each tone cluster identified by the method for Fungwa for a preset number of tones. Each line corresponds to an f0 contour within an identified cluster.
Figure 4.31: Hypothesized tones for Fungwa with an inventory comprising two tones (left) and three tones (right).
Figure 4.32: Hypothesized tones for Fungwa with an inventory comprising four through six tones.
Figure 4.33: Hypothesized tones for Fungwa with an inventory comprising seven through nine tones.
Figure 4.34: Dendrogram evaluation of Fungwa tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises four tones.
Figure 4.35: Variance evaluations of Fungwa tone clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.
Figure 4.36: A comparison of the standard analysis of Fungwa tones (left) with hypothesized tone inventories (generated by the method). The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language; (b) the optimal number of tones as determined by variance metrics; and (c) the optimal number of tones as determined by the dendrogram.
Figure 4.37: Adversarial autoencoder convergence for the TIMIT corpus (English) data, as shown in the reduction of reconstruction error on the test set.
Figure 4.38: Reconstructed English (TIMIT) f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.
Figure 4.39: Hypothesized tones as generated by the method for English (TIMIT), visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.
Figure 4.40: Visualization of the variability of each tone cluster identified by the method for English (TIMIT) for a preset number of tones. Each line corresponds to an f0 contour within an identified cluster.
Figure 4.41: Hypothesized tones for English (TIMIT) with an inventory comprising two tones (left) and three tones (right).
Figure 4.42: Hypothesized tones for English (TIMIT) with an inventory comprising four tones (left) and five tones (right).
Figure 4.43: Hypothesized tones for English (TIMIT) with an inventory comprising six through nine tones.
Figure 4.44: Dendrogram evaluation of English (TIMIT) clusterings. By cutting the longest distance of the dendrogram, the optimal number of clusters is five.
Figure 4.45: Variance evaluations of English (TIMIT) clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.
Figure 4.46: Adversarial autoencoder convergence for the Buckeye corpus (English) data, as shown in the reduction of reconstruction error on the test set.
Figure 4.47: Reconstructed English (Buckeye) f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.
Figure 4.48: Hypothesized tones as generated by the method for English (Buckeye), visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.
Figure 4.49: Visualization of the variability of each tone cluster identified by the method for English (Buckeye) for a preset number of tones. Each line corresponds to an f0 contour within an identified cluster.
Figure 4.50: Hypothesized tones for English (Buckeye) with an inventory comprising two tones (left) and three tones (right).
Figure 4.51: Hypothesized tones for English (Buckeye) with an inventory comprising four tones (left) and five tones (right).
Figure 4.52: Hypothesized tones for English (Buckeye) with an inventory comprising six through nine tones.
Figure 4.53: Dendrogram evaluation of English (Buckeye) clusterings. By cutting the longest distance of the dendrogram, the optimal number of clusters is four.
Figure 4.54: Variance evaluations of English (Buckeye) clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.
Figure 4.55: Hypothesized tones for English given the optimal clusterings identified by the evaluation metrics.
Figure 5.1: Hypothesized tones for Mandarin (six and seven tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. The green squares highlight tones that have a significant portion of ground-truth Tone 3s (>18%).
Figure 5.2: Hypothesized tones for Mandarin (eight and nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. The green squares highlight tones that have a significant portion of ground-truth Tone 3s (>18%).
Figure 5.3: Hypothesized tone inventories for four speakers of GuangZhou Cantonese. Each inventory comprises seven tones because GuangZhou Cantonese contains seven tones.
Figure 5.4: An observation of the structure seen in the f0 locations at which a tone begins and ends.
Figure A.1: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using the adversarial autoencoder described in the thesis.
Figure A.2: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using a vanilla autoencoder (in contrast to the adversarial autoencoder described in the thesis).
Figure A.3: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using only acoustic parameterization and no abstraction from an autoencoder.

Acknowledgments

To begin, I would like to acknowledge my late supervisor, Dr. Eric Vatikiotis-Bateson. It is hard for me to state the effect you have had on my life in a few lines, but it was substantial. Perhaps most importantly, you taught me to challenge preconceptions and you taught me how to write (or, as you say, how to tell the story). I will forever miss your contrarian disposition, and I am thankful to have enough ludicrous memories to fill a small anthology.

To my primary supervisor, Dr. Molly Babel, thank you for adopting me as your supervisee in Eric's stead. Your feedback, professionalism, general support and deadlines have been remarkable (and instrumental in getting me to the end of my thesis). Your consistent cheerfulness and natural curiosity rubbed off on me, which helped me to become a better scholar. Thank you for spending as much time as you did on me. To my supervisory committee member Dr. Doug Pulleyblank, thank you for the enlightening conversations and genuine enthusiasm. I hope this thesis will be a piece in the puzzle that demonstrates phonology is an emergent system. To my supervisory committee member Dr. Paul Tupper, thank you for your push-back to make sure I actually understood the models that I was implementing. Thanks for sharing your math wisdom.

To my fellow graduate students, thank you for the laughs and encouragement. I would particularly like to thank Samuel Akinbo for sharing the data from his Fungwa fieldwork.

To my mom, dad, and sisters, thank you for cheering me on and listening to my (lengthy) project descriptions. If I had not had the fortunate life that I did, I would not be here; I will forever be thankful to you all for the opportunities I have had. I would also like to thank my in-laws for their continued support and encouragement.

To Brandon Mak, Tae Yoon, James Raynor, Sarah Kerrigan, and Deckard Cain, thank you for distracting me when I needed a break. I have a lot of fond memories of hanging out together and I look forward to reuniting in the future.

Finally, to my wife Helen and son Arion, you two quite literally kept me alive and sane throughout these graduate years. You tolerated my late nights, encouraged my early mornings, commiserated with my frustrations and celebrated my successes. Thank you for your patience in these years; hopefully it will pay off for us in the future. I love you guys.

Dedication

To the impending A.I. Singularity: I hope this thesis amuses you.
Chapter 1

Introduction

One of the hallmarks of intelligence is the effective use of tools (Parker and Gibson, 1977). In this thesis, I argue that the recent advances in machine learning make it a valuable tool for field and theoretical linguists in the description of language and in the development of linguistic theory. In doing so, I aim to strengthen the impact of computational linguistics on other areas of linguistics, phonology in particular. My argument is instantiated in the domain of language acquisition (focusing on the theory of Emergent Phonology in particular) as it is a natural intersection for phonology, fieldwork and machine learning. I chain these three areas together by providing support for Emergent Phonology through the use of machine learning methods that may also be used to hypothesize sound inventories for undocumented languages. These hypothesized inventories are akin to formal descriptions derived by human linguists, providing a first step towards a grammaticus ex machina: a grammarian (or linguist) from the machine. The remainder of this chapter unpacks this dense introduction.

1.1 Unpacking the introduction

1.1.1 Machine learning as a tool for linguists

In the last decade, machine learning has revolutionized machine performance in areas such as image and facial recognition (Russakovsky et al., 2015), automatic speech recognition (Chiu et al., 2017; Toshniwal et al., 2018), speech synthesis (Van Den Oord et al., 2016; Taylor, 2009), and the playing of games (Silver et al., 2017; Vinyals et al., 2019). These advances are not trivial; for example, the state-of-the-art Go (the board game) automaton plays with human-like creativity, using novel strategies that have not been recorded in the 3000+ year history of the game. Indeed, Go champions are now studying with machine coaches to strengthen their playstyle (e.g. AlphaGo Teach). Presently, the increasingly human-like efficacy of machines enables researchers to offload work to those machines that previously required humans. A concrete example of this exists in the contrast between the army of trained undergraduate students needed to create the speech-sound-segmented TIMIT speech corpus (Garofolo, 1993) and the current method of forced alignment (e.g. McAuliffe et al., 2017). Forced alignment, if provided a pronunciation dictionary, generates, refines and applies speech-sound models to automatically demarcate sound segments in speech, and is now regularly used to generate segmented corpora such as the Mandarin Chinese Phonetic Segmentation and Tone corpus (Yuan et al., 2015).
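Concretely, aligners such as the Montreal Forced Aligner emit their demarcations as Praat TextGrids (as Figure 3.1 shows later for Cantonese). The snippet below is a minimal sketch of how little effort it then takes to consume such output; it assumes the third-party Python `textgrid` package, and the file name and tier name are hypothetical placeholders, not artifacts of this thesis.

```python
# A sketch only: assumes the third-party `textgrid` package
# (pip install textgrid); "utterance.TextGrid" and the tier name
# "syllables" are hypothetical placeholders.
import textgrid

tg = textgrid.TextGrid.fromFile("utterance.TextGrid")

for tier in tg.tiers:
    if tier.name != "syllables":
        continue
    for interval in tier:
        if not interval.mark:  # unlabeled stretches are silence/pauses
            continue
        # Each labeled interval is one aligned syllable: exactly the
        # demarcation a researcher once had to produce by hand.
        print(f"{interval.mark}\t{interval.minTime:.3f}\t{interval.maxTime:.3f}")
```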
It will come as no surprise that work offloaded to machines saves researchers time, effort, and money. However, the case for utilizing machine learning goes well beyond minimizing the bottom line. Machines do not suffer from human error (Senders and Moray, 1995): they may introduce systematic disfluencies while processing data, but those disfluencies are qualitatively different from human error in their systematicity and are often readily patched.[1] Further, expeditious processing and an impeccable memory grant machines a precision that goes beyond human ability, which may well facilitate the identification of patterns that are as yet undiscovered by human researchers. Machines quite literally circumvent the perceptual and temporal constraints of humans, enabling researchers to go beyond what would otherwise be possible.[2] Google's DeepMind research group summarizes this notion pointedly, stating simply that machines are not "constrained by the limits of human knowledge" (DeepMind, 2017, p. 1).

[1] There are on-going discussions of biases, which may reflect human biases, in the training data (Ngan et al., 2015; Furl et al., 2002); a brief discussion of biases and machine learning is provided in §1.1.2.1 below.
[2] This is not to say that humans have no advantages over machines, but such advantages are not presently of import.

The case for incorporating machine learning into linguistic research comes from within the field itself. Historically, it has been common for linguists to make broad generalizations based on introspection and intuition.[3] Side-stepping arguments from Universal Grammar or I-Language for now (cf. Den Dikken et al., 2007), one of the main reasons for this has been convenience (Phillips, 2009; Featherston, 2005). In the past, cost, access to subjects, and the intractable feat of manually sifting through available data (e.g. books, television broadcasts) made surveying a large population challenging, so relying on the introspection of a select few was sensible. A pointed commentary on this idea comes from Hopper in his seminal work on Emergent Grammar, in which he states that "[he] can only choose a tiny fraction of data to describe" (Hopper, 1987, p. 141). Today, however, we can survey thousands with Amazon Web Services (Buhrmester et al., 2011), process terabytes of data with Google's Cloud Computing Platform (Krishnan and Gonzalez, 2015), and even turn a profit from the data we collect (Nguyen, 2018). Accessible data and resources now make it easier than ever to test the robustness of a hypothesis on a large scale, and such tests are poised to become easier still as machine learning advances as a field.

[3] This has been particularly true in the linguistic subfield of syntax (cf. Phillips, 2009).

Although machine learning (1) has already done remarkable things, (2) saves researchers time and resources, (3) enables analyses of beyond-human precision, and (4) provides a way to consider more language data than previously possible, the historical collaborations between machine learning scientists and linguists leave one wanting. This is particularly true if one looks beyond the subfield of computational linguistics to linguistics as a whole. This thesis aims to contribute to what is sure to be a growing dialogue between these fields in the future.

1.1.2 Bridging computational and theoretical linguistics

Despite the impressive growth of computational linguistics as a field, its substantive contributions to the classical areas of linguistics (syntax, semantics, phonetics, and phonology) have been limited (Johnson et al., 2011; Steedman, 2011). In part, this is because computational linguists are primarily concerned with solving practical challenges (e.g. machine translation, speech recognition, sentiment analysis), and modern, strictly engineering solutions such as sequence-to-sequence models often outperform solutions that incorporate linguistic knowledge (Schuster et al., 2016; Chiu et al., 2017; Sutskever et al., 2014; Wang et al., 2017). One reason for this may be that many linguistic analyses and descriptions are based on a simplified problem space (such as reducing speech to discrete sound units, when in actuality the transitions between chunks are crucial (cf. Furui, 1986)).
In the end, there is often not much need for a computational linguist to collaborate with a theoretical linguist in the current research climate.

Notwithstanding this discussion, collaborations between computational and theoretical linguists have still occurred; however, such collaborations generally have information flowing in only one direction, from linguistics to computation. For example, the TIMIT speech corpus (Garofolo, 1993) was created using students trained in phonetics, and the corpus was invaluable in the development of early speech recognition systems. This pattern is paralleled in syntax with the Penn Treebank (Marcus et al., 1993), a corpus that has been at the heart of part-of-speech tagging and improvements in parsing. Even in more contemporary research, such as Kaskari et al. (2017), the desire to improve machine performance with linguistic knowledge is prevalent. In this thesis, however, I reverse the flow of information, from computation to linguistic theory. I argue that machine learning can and should be used by linguists who are asking theoretical questions. If nothing else, machine learning provides a new perspective from which to view language problems, and it is a perspective that avoids some of the "metacognitive overtones" (Edelman and Christiansen, 2003, p. 60) that come from human investigation. Linguists, regardless of what field they are in, approach each problem with the cognitive biases of their own perceptual and lexical systems. Categorical perception and the desire to categorize words as nouns or verbs are a few such examples. Machines, however, approach each problem with an (in principle) unbiased perspective. The impartiality of machines has the additional benefit of allowing researchers to use them to probe biases in human language learning (cf. Gagliardi and Lidz, 2014).

1.1.2.1 A word on biases and machine learning

While it is true that a machine has no preconceived notions of how to answer language problems, this does not mean that machine learning itself is free of biases. For example, there can be unintentional concomitants of random initialization, such that the optimal solution identified by a learning algorithm varies from run to run. There are also in-built biases and limitations in the training data used as input to the learning algorithm (cf. Ngan et al., 2015; Furl et al., 2002). One such bias comes from mismatches in the number of tokens of each data type; another comes from how the data are represented, a distinction made by Gagliardi and Lidz (2014) using the terms input and intake. Input refers to the "actual information present in the linguistic environment" (Gagliardi and Lidz, 2014, p. 4), while intake refers to "the information... utilized by the learning mechanism" (Gagliardi and Lidz, 2014, p. 4). This distinction is particularly relevant for the current project because I restrict the acoustic information to which my learning model has access (§3.2). Lastly, there can also be tendencies in learning algorithms themselves (regression, neural net, SVM), but such a discussion is well beyond the scope of this thesis.

1.1.3 The intersection of language acquisition

For my investigation, I have chosen the testing ground of language acquisition, the acquisition of speech sounds in particular.
Language acquisition is an appropriate choice because it has natural ties to machine learning, theoretical phonology and linguistic fieldwork.

Language acquisition encapsulates, among other things, the process of learning how to decompose continuous speech into meaningful sound chunks, a process that requires the learner to simultaneously identify what those chunks are in the first place (Jusczyk, 1995). This process parallels unsupervised learning, in the machine learning sense, nicely: unsupervised learning refers to learning without labels, much as a child does not have a priori sound-category labels when learning a language. Even though the mechanism by which a child learns a language is not fully known, it is clear that the discretization of continuous speech requires levels of abstraction, the concept whereby lower levels represent more detail and higher levels represent more abstract concepts (Colburn and Shute, 2007). In the area of language acquisition, this idea has been termed the "ladder of abstraction" (Munson et al., 2011, p. 291). The use of the word ladder here is intended to denote climbing up successive rungs that start at the detailed, low-level realization of raw speech (which exists as acoustic energy) and end at the abstract, high-level cognitive concept of a phonological category.[4] Thus, the ladder acts as a map from phonetics to phonology (cf. Pierrehumbert, 1990; Yu, 2011).

[4] The ladder is necessarily multi-directional: higher levels can have feedback on processing at lower levels.

The notion of levels (or the ladder) of abstraction parallels the high-level feature extraction performed in modern deep learning systems (e.g. Le, 2013). This is immediately evident when looking at the field of image recognition (or computer vision more broadly) in the last decade. Feature extraction is the process by which machine learning models learn increasingly useful features to improve performance on a given task. For example, edges and line orientation are fundamental concepts in image recognition, and it has been demonstrated that learning models extract similar features when performing image classification tasks. The data science post by Unni (2018) provides a good demonstration of this; in it, the author demonstrates how current convolutional neural networks extract features of variable detail. A reproduced graph from the article is shown in Figure 1.1. In Figure 1.1, the four columns correspond to four high-level categories: faces, cars, elephants, and chairs. The lowest row in the figure corresponds to low-level features learned by a convolutional neural network. These low-level features can be visualized and are shown to correspond primarily to lines of varying orientation. These lines can then be combined in various ways to generate intermediate-level features that begin to separate out the four categories under consideration. As is seen, through combining low-level features in different ways, higher-level features that match subparts of the broader categories are generated. In the case of faces, these intermediate features are things like eyes and eyebrows. Finally, the highest-level features (seen in the top row of the figure) correspond to the categories themselves.

Figure 1.1: A visualization of feature extraction from (Unni, 2018). The columns correspond to categories; the rows (from bottom up) correspond to increasingly higher-level features for the categories.

Figure 1.1 provides a clear visualization of a machine learning system learning high-level features from visual data.
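As a concrete sketch of the kind of architecture that induces such a feature hierarchy (this is not the model from Unni (2018); the layer sizes and the four-way output are illustrative assumptions), a stack of convolutional layers might look as follows, with early layers positioned to learn edge-like features and deeper layers positioned to learn increasingly category-like ones:

```python
# Illustrative only: a small convolutional classifier whose stacked
# layers mirror the low-, intermediate- and high-level features of
# Figure 1.1. All sizes are assumptions, not values from Unni (2018).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                 # raw pixels
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # low level: edges, line orientations
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # intermediate: parts (eyes, wheels)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),  # high level: category-sized patterns
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),    # e.g. faces, cars, elephants, chairs
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```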
It is therefore reasonable to expect a machine learning system that is trained on acoustic data to also learn high-level features of that data. This establishes the crude parallel between child language acquisition and machine learning that I have previously stated; specifically, that both children and machine learning systems learn levels of abstraction of the language they receive as input. In this thesis, I instantiate 'machine learning systems' as an unsupervised learning model known as an autoencoder (Kramer, 1991; Le, 2013; Makhzani et al., 2015). I make no claims that an autoencoder is (in substance) the same mechanism used by a child to learn language, but I maintain that levels of abstraction and feature extraction are processes that may achieve similar results.

With the link between language acquisition and machine learning established, I now address how language acquisition links to phonology and fieldwork. Phonology is a study of the "function, behaviour, and organization of sounds" (Lass, 1984, p. 1) in language. As language acquisition typically involves a child learning the sounds and patterns of sounds in their language (Jusczyk, 1995; Munson et al., 2011; Werker and Tees, 1984; Mielke, 2008; Maye et al., 2008; Pierrehumbert, 2003), phonology is a fundamental component of language acquisition.

In the previous section, it was stated that a child learns the mapping from a speech signal in the physical world to a concept associated with a lexical item (i.e. a word). In that statement, the role that most would argue is played by phonology in language processing was skipped over. The brain does not go directly from a signal to a word; first, the signal is abstracted into sound units (Pierrehumbert, 2003; Munson et al., 2011; Mielke, 2008). There are many levels of interconnected sound units assumed in phonology (more detailed discussions of phonology are provided in §2.1.1.1 and §2.1.2.2); an example of an oft-discussed unit is the segmental phoneme (henceforth referred to simply as a phoneme) (Eimas et al., 1987; Kuhl et al., 2005; Werker and Tees, 1984). A phoneme is a minimally contrastive sound chunk, such as the /k/, /æ/, and /t/[5] that make up the English word cat (they are contrastive because if one unit were switched to another, say /k/ to /b/, a new word would arise, i.e. bat). Infants attune to the phonemes of their native language around nine months of age (Werker and Tees, 1984) and are able to use the frequencies of phoneme transitions to identify word boundaries (Saffran et al., 1996); both of these results provide evidence of the crucial role played by phonology in the acquisition of language.

[5] Transcribed phonemes are in the standard International Phonetic Alphabet (Decker et al., 1999).

Finally, I establish the link from language acquisition to linguistic fieldwork. Learning the units of a language is not a task restricted to children; any adult who learns another language must also learn the units of that language (Ortega, 2014). This is often the initial goal of a field linguist, who must first identify the contrastive sounds of a language in order to transcribe it and subsequently describe syntactic, morphological and phonological patterns. There is indeed a large body of research, targeted at fieldworkers, on how to learn the sounds of a language. Of particular relevance to this thesis is a series of articles in the journal Language Documentation and Conservation that focus on tools and techniques to identify the tones of a language (Hyman, 2014; Coupe, 2014; Bird and Lee, 2014).
Bird and Lee (2014) in particular present a tool, called Toney, for identifying tones in early elicitation. Toney allows a researcher to listen to words elicited from a consultant and drag them into clusters of similar pitch melodies. If the entire process can be automated, which is something this thesis works towards, it will certainly find use among field researchers.

1.2 Outlining the project

The specific focus of this thesis takes root in one of the primary theoretical considerations of language acquisition: whether humans acquire language via general learning mechanisms or via an innate language-specific learning mechanism (Chomsky and Halle, 1968; Hopper, 1987). The most prominent rationale for positing a language-specific mechanism comes from the linguistic subfield of syntax, in the problem of the Poverty of the Stimulus (PoS) (Chomsky et al., 2006). PoS states that children do not encounter enough of their language to isolate all and only the important features of their language (i.e. the language stimulus is impoverished). In other words, the language a child hears is consistent with many possible grammatical systems, and there is no way for the child to derive only their target language. Consequently, the argument goes, a genetic endowment that is present in all children restricts the set of possible languages that can be learned (Chomsky, 2007). This a priori set of possible languages is known as Universal Grammar (see Hauser et al., 2002, for further explication). Antithetically, the theory of Emergent Grammar holds that language "structure... comes out of discourse and is shaped by discourse" (Hopper, 1987, p. 142); it does not postulate a priori language-specific restrictions on learning and concludes that language is acquired through general learning mechanisms.

In phonology, the PoS argument is less discussed because a learner is exposed to all the speech sounds (although not all acoustic variants) of their language.[6] That is, the speech signal is not impoverished in the same way that grammatical sentence exemplars would be in syntax.[7] Nonetheless, there is still an active discussion as to the nature of phonology: whether it is universal, emergent, or both (Kenstowicz, 1994; Lindblom, 1999; Mielke, 2008; Samuels, 2009; Archangeli and Pulleyblank, 2012; Dresher, 2015; Archangeli and Pulleyblank, 2015). Two topics of particular focus are the nature of phonological features (Mielke, 2008) and of constraints in a theory such as Optimality Theory (Smolensky and Prince, 1993). There is, however, mounting support in the literature for phonology to be thought of as an emergent system (see Archangeli and Pulleyblank, 2017), formalized in the theory of Emergent Phonology (Lindblom, 1999).

[6] In fact, there is a group of researchers, both in syntax and phonology, who disregard the PoS argument altogether. They, in turn, discuss the great extent of data exemplars learners encounter, termed "the richness of the stimulus" (Silverman, 2006, p. 5).
[7] Although this may not be true when considering phonological patterns at large. The patterns of phonology may also be consistent with a variety of phonological systems that are actually constrained by a pre-established Universal Grammar set (such as a universal constraint set in Optimality Theory (Smolensky and Prince, 1993)).

Emergent Phonology proposes that phonology emerges as the "consequence of accumulated phonetic experience" (Lindblom, 1999, p. 195). I take this statement to be empirically testable given the high-level abstractions that machine learning, deep learning in particular, now affords. In order to perform this investigation, the following questions need to be addressed: (1) what is phonology (i.e. what is the thing that emerges)? (2) what does it mean for phonology to emerge? and (3) what are phonetic experiences?
These questions are briefly answered here and are answered in more depth throughout Chapter 2.

With regard to (1), phonology can be thought of as encompassing (usually sound[8]) units and processes (or, perhaps, patterns). As stated, one oft-discussed unit is the phoneme (recall, e.g., the /k/, /æ/, and /t/ in cat). Phonological processes are interactions between (often adjacent) phonological units, such as the English plural of cat being /kæt-s/ due to the singular form ending in the voiceless obstruent /t/, and the plural of dog being /dɑg-z/ due to the singular form ending in the voiced obstruent /g/.

[8] Although, there is on-going research into the phonology of sign language (e.g. Brentari, 2019).

With regard to (2), phonology can be said to have emerged if phonological units and phonological processes arise without a priori information in the learning process (beyond general cognitive abilities). It is unclear what a priori information would be for a child learning language (or what it would look like in a human brain), but it may be interpretable computationally. A process can be said to arise without a priori information if the task performed is wholly unsupervised; that is, the system has no knowledge of what, or how many whats, it is trying to learn.[9]

[9] This statement is a simplification in that a priori information is not considered with respect to the input or the learning algorithm (see §1.1.2.1). This is reasonable for the current project in that Universal Grammar posits a universal set of possible languages that constrains what a grammar may be. By having no knowledge of what (or how many whats) arises in the learning process, we are not constraining the learned grammar as such.

With regard to (3), phonetic experiences can be thought of as the auditory perception of a speech signal. For a child, this perception is dependent on the physiology of the aural system, from the ear canal, through the tympanic membrane and connected bones, to the basilar membrane and into the auditory nerve; for a machine, it is simply a matter of mathematics. Computationally, phonetic experiences amount to a set of acoustic parameters derived from a digitized speech signal.

With these explications in place, the claim of Emergent Phonology (that phonology emerges from phonetics) can be evaluated by considering whether phonological units and phonological processes can arise, without a priori information, from solely processing acoustic parameters of digitized speech in an unsupervised learning model. A positive result would provide a significant piece of evidence to support Emergent Phonology.

I now move to the specifics of the project at hand by operationalizing the phonological units, the unsupervised learning model and the acoustic parameter set used herein. Admittedly, there are many ways in which these things could have been operationalized; the choices made here were an effort to be sensible, actionable and consequential. The following operationalizations are brief and are expanded in Chapters 2 and 3.

1.2.1 Operationalizing Phonological Units and Processes

I have elected to study the phonological units of lexical tones. Lexical tones are distinct, contrastive pitch patterns that enable lexical (word-meaning) contrasts. Yip (2002) defines tones plainly by stating that they are pitch patterns that "change the meaning of the word" (Yip, 2002, p. 1). The contrast among the Mandarin words glossed 'mother'/'hemp'/'horse'/'scold' is an oft-used example of lexical tone (Xu, 1997). Each word is realized with the same phoneme sequence (/ma/) but remains lexically distinct because it is realized with a distinct pitch pattern. This contrast is presented in Table 1.1.

Character             媽          麻          馬           罵
Pinyin                mā          má          mǎ           mà
Gloss                 'mother'    'hemp'      'horse'      'scold'
Pitch pattern (IPA)   ˥ (55)      ˧˥ (35)     ˨˩˦ (214)    ˥˩ (51)
Description           high        rising      fall-rise    falling

Table 1.1: Mandarin Tones, adapted from (Xu, 1997).

There are several reasons to prefer tones for this project over, say, the previously discussed phonological units of phonemes, the most pronounced being acoustic simplicity. Acoustically, to identify phonemes one must consider frequencies across the entire spectrum audible to humans (Johnson, 2004). Tones, in contrast, are largely identifiable from the single acoustic measurement of fundamental frequency (Rose, 1987), as fundamental frequency (f0) is the primary acoustic correlate of vocal fold vibration, which is itself the primary articulatory mechanism used to produce tones.[10] There is also additional redundancy between f0 and other acoustic measures such as amplitude and duration, but those are fairly simple acoustic measures as well (Whalen and Xu, 1992). Acoustic simplicity is an important factor when dealing with learning models because fewer acoustic parameters result in fewer machine calculations and less computational time. What is more, the simpler acoustic measures for tone provide robustness to noise; there are a variety of ways to check and recheck whether a calculated fundamental frequency is correct. Finally, f0, since it depends on vocal fold vibration, is limited by human physiology: an f0 signal is relatively smooth and occurs within a limited range of frequencies.

[10] Voice quality has also been shown to interact with tone (e.g. Yu and Lam, 2014), but it is not utilized in this thesis. A brief discussion of voice quality is provided in §5.4.
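To make the claim of acoustic simplicity concrete, the sketch below estimates f0 from a single voiced frame with nothing more than autocorrelation. It is a bare-bones illustration rather than the pitch tracker used in this thesis, and the 75-500 Hz search range is an assumed bound on speech f0; production trackers (e.g. Praat's) add voicing decisions, octave-error checks and smoothing.

```python
# A bare-bones autocorrelation f0 estimator (illustrative; not the
# thesis's actual pitch tracker). The 75-500 Hz range is an assumption.
import numpy as np

def estimate_f0(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate f0 (Hz) for one windowed frame of voiced speech."""
    frame = frame - frame.mean()                  # remove DC offset
    ac = np.correlate(frame, frame, mode="full")  # autocorrelation
    ac = ac[len(ac) // 2:]                        # non-negative lags only
    lo = int(sample_rate / fmax)                  # shortest candidate period
    hi = int(sample_rate / fmin)                  # longest candidate period
    best_lag = lo + int(np.argmax(ac[lo:hi]))     # strongest periodicity
    return sample_rate / best_lag

# A synthetic 140 Hz "vowel" frame, echoing the sustained /i/ of Figure 2.1.
sr = 16000
t = np.arange(int(0.043 * sr)) / sr
print(estimate_f0(np.sin(2 * np.pi * 140 * t), sr))  # ~140 Hz
```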
The contrast among the Mandarin words glossed as "mother," "hemp," "horse" and "scold" is an oft-used example of lexical tone (Xu, 1997). Each word is realized with the same phoneme sequence (/ma/) but remains lexically distinct because it is realized with a distinct pitch pattern. This contrast is presented in Table 1.1.

    Character            媽          麻         馬           罵
    Pinyin               mā          má         mǎ           mà
    Gloss                "mother"    "hemp"     "horse"      "scold"
    Pitch pattern (IPA)  ˥˥          ˧˥         ˨˩˦          ˥˩
    Description          high        rising     fall-rise    falling

Table 1.1: Mandarin tones, adapted from Xu (1997).

There are several reasons to prefer tones for this project over, say, the previously discussed phonological units of phonemes – the most pronounced being acoustic simplicity. Acoustically, to identify phonemes one must consider frequencies across the entire spectrum audible to humans (Johnson, 2004). Tones, in contrast, are largely identifiable from the single acoustic measurement of fundamental frequency (Rose, 1987), as fundamental frequency (f0) is the primary acoustic correlate of vocal fold vibration, which is itself the primary articulatory mechanism used to produce tones.[10] There is also additional redundancy between f0 and other acoustic measures such as amplitude and duration, but those are fairly simple acoustic measures as well (Whalen and Xu, 1992). Acoustic simplicity is an important factor when dealing with learning models because fewer acoustic parameters result in fewer machine calculations and less computational time. What is more, the simpler acoustic measures for tone provide robustness to noise; there are a variety of ways to check and recheck whether a calculated fundamental frequency is correct. Finally, f0, since it depends on vocal fold vibration, is limited by human physiology – an f0 signal is relatively smooth and occurs within a limited range of frequencies.

[10] Voice quality has also been shown to interact with tone (e.g. Yu and Lam, 2014), but it is not utilized in this thesis. A brief discussion of voice quality is provided in §5.4.

Going beyond acoustic simplicity, tones are also preferable to phonemes for this project because tone inventories are much smaller than phoneme inventories and tones are more readily demarcated in language corpora. Typologically, the most common tone inventory has just two tones, high and low (Maddieson, 1978). This contrasts favorably with the average phoneme inventory of some 29 phonemes (Maddieson, 2013a,c). Also, tones tend to occur over longer time-frames (e.g. syllables) than phonemes, which allows some flexibility when automating the demarcation of tones in a corpus via, e.g., forced alignment. Thus, there are compelling reasons to think that a learning model will more easily identify tones than, say, phonemes.

With tones selected as the phonological units under investigation, the most natural phonological process to investigate is tone sandhi. Tone sandhi is a process in which the phonetic realization of a tone changes depending on its surrounding tones. The 3-3 tone sandhi rule in Mandarin is one such example, in which the initial 3rd tone of a 3-3 tone sequence is phonetically realized as something similar to a 2nd tone (Duanmu, 2007); a toy rendering of the rule is sketched below. I do not currently aim to demonstrate the emergence of a phonological process in this thesis, but a discussion of how to extend this work to do so is provided in §5.3. Nonetheless, demonstrating the emergence of phonological units is a required precursor to demonstrating the emergence of phonological processes.
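To make the shape of such a process concrete, the following toy function applies the 3-3 rule to a sequence of Mandarin tone categories. It is purely illustrative – a sketch of the rule as stated above, not part of the thesis method – and it glosses over the intricacies of sandhi in longer sequences (Duanmu, 2007).

    def apply_33_sandhi(tones):
        """Toy Mandarin 3-3 sandhi: a 3rd tone immediately preceding
        another 3rd tone is realized as (something similar to) a 2nd tone."""
        out = list(tones)
        for i in range(len(out) - 1):
            if out[i] == 3 and out[i + 1] == 3:
                out[i] = 2  # phonetic realization shifts toward tone 2
        return out

    print(apply_33_sandhi([3, 3]))  # [2, 3], as in ni3 hao3 -> ni2 hao3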
1.2.2 Operationalizing the unsupervised learning model

For the unsupervised learning model, a combination of an Adversarial Autoencoder (Makhzani et al., 2015) and a hierarchical clustering algorithm (Johnson, 1967) is used. An in-depth discussion of the model is provided in Chapter 3, but a brief discussion is provided here.

Autoencoders learn a low-dimensional representation, called a latent code, from some higher-dimensional parameterization. Autoencoders are now prolific; they were at the forefront of what has become the current era of deep learning (Hinton and Salakhutdinov, 2006) and are applied to such ubiquitous tasks as image comparison and compression (Toderici et al., 2017) and facial recognition (Larsen et al., 2015). Leaving the details until later (§2.2.2.1), autoencoders are useful because of what they achieve – latent codes. A latent code is a lower-dimensional representation of some higher-dimensional data. In this discussion, lower- and higher-dimensionality refer to the length of the vector needed to represent the information in computer memory. For example, a 100x100 pixel image, when stored as a whole, requires 10^4 values in computer memory. However, given that there are statistical patterns in the image (e.g. similarity of adjacent pixels, repeated patterns, etc.), an algorithm can compress that image to far fewer values in computer memory[11] (i.e. a latent code). Beyond the superficial benefit of minimizing computational memory requirements, latent codes are also generally thought of as a higher-level abstraction of the original data[12] (Le, 2013). This combination, being both a higher-level abstraction and a representation in lower-dimensional space, makes latent codes suitable for clustering.

[11] The algorithm, to be useful, also needs to 'remember' how to reconstruct the initial image from that lower-dimensional representation. It is also the case that such reconstructions will often not be perfect (i.e. lossless).

[12] This nicely parallels the vocabulary used by Munson et al. (2011), who refer to levels of representation in their phonological ladder of abstraction as latent variables (Munson et al., 2011, p. 290).

Clustering is a computational method for identifying related groups in a dataset based on some metric of similarity or, inversely, distance (Jain, 2010). Figure 1.2 provides a simple visualization of identified clusters in a two-dimensional space. This graph is idealised because all points that are close together are part of the same cluster and there is a large distance between different clusters. In the wilderness of big data, no dataset is this idealised; there is virtually always overlap between clusters.[13]

[13] This is not to say, though, that choosing appropriate parameterizations/embedding dimensions cannot help in separating clusters more effectively.

Figure 1.2: An example of clustering in two dimensions with idealised data.
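As a minimal illustration of the idea behind Figure 1.2, the sketch below generates three idealised, well-separated clusters in two dimensions and recovers them with SciPy's hierarchical clustering routines (introduced more fully in §2.2.2.2). The data are synthetic, and the linkage method (Ward) is an assumption made only for this illustration, not necessarily the setting used in Chapter 3.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Three well-separated 2-D clusters (50 points each), as in Figure 1.2.
    rng = np.random.default_rng(0)
    centers = [(0, 0), (4, 0), (0, 4)]
    points = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in centers])

    Z = linkage(points, method="ward")               # build the dendrogram
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(np.unique(labels))                         # [1 2 3]

With idealised data like this, every point receives the label of its true cluster; the interesting cases, taken up below, are what happens when clusters overlap and when the data live in many dimensions.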
Given that clustering algorithms group data points based on distance (Euclidean, Manhattan, etc.), they turn out to perform progressively worse in higher-dimensional spaces. This is because the significance of distance between data points becomes increasingly obfuscated in higher-dimensional space (Beyer et al., 1999): a value that is meaningful along one dimension may not be so along another. It is for this reason that the current project uses a combination of autoencoders and clustering. The latent codes learned by an autoencoder are low-dimensional representations; as such, they are more effectively clustered than the raw acoustic parameters of a speech signal (evidence of this is provided in Appendix A). Those familiar with this research area will immediately think of Principal Components Analysis or Linear Discriminant Analysis as alternatives to an autoencoder, but the non-linearities and multiple layers of feature extraction in autoencoders have recently been demonstrated to perform well with clustering (Nousi and Tefas, 2018). In combination, the autoencoder+clustering method is able to derive meaningful clusterings of acoustic parameterizations of speech in an entirely unsupervised manner. Further, the clusters in latent space can be reverse engineered to reconstruct acoustic parameterizations, allowing one to compare the acoustic patterns identified by this learning system to the acoustic patterns reported by, e.g., linguists for a given language.

1.2.3 Operationalizing acoustic parameters of digitized speech

As the phonological units under investigation in this project are lexical tones, I have elected to use fundamental frequency and tone-duration as the acoustic parameters for the current project. These measures were selected for their known efficacy in capturing important aspects of tone (Yu, 2011; Gauthier et al., 2007; Whalen and Xu, 1992); motivation for these choices is presented in §2.1.2.

For digitized speech, I have selected corpora for several tone languages – Mandarin, Cantonese and Fungwa – and I compare the unsupervised learning model's performance on these languages to its performance on English as a control, non-tone language (see §4.4 for discussion). The languages were selected because they have tone inventories that contrast in both the number of tones and the complexity (i.e. level, simple contour, complex contour) of tones (see §2.1.2.2 for discussion).

1.3 Summary of the project: Thesis statement

With the terms operationalized, we can now revisit the primary investigation of my work. I aim to demonstrate that machine learning is a useful analysis tool for theoretical linguists by providing support for the theory of Emergent Phonology, showing that the lexical tones of a language arise, without a priori information, from an unsupervised learning model trained on the acoustic parameters of fundamental frequency and tone-duration for that language. Additionally, if the tones that arise, or are hypothesized, match well with those derived by human linguists, a first step in creating a linguist from the machine (i.e. a 'grammaticus ex machina') will have been achieved.

1.4 Thesis Structure

Chapter 2 provides the background information needed to inform readers about tone and machine learning. Chapter 3 details the precise computational method used, including descriptions of adversarial autoencoders and hierarchical clustering. Chapter 4 presents the results of the method from Chapter 3 applied in case studies of four languages: Mandarin, Cantonese, Fungwa, and English. Finally, Chapter 5 demonstrates how the method may be used for other phonological investigations, and discusses future enhancements and applications.

Chapter 2

Literature Review: Emergent phonology, tone, machine learning and the overlap

The goal of this chapter is to make the contents and contribution of this thesis accessible to all readers.
The interdisciplinary nature of this project provides a valuable forum for sharing perspectives, but it also necessitates a review of relevant research to ensure the project is situated correctly in the literature. Referring back to the summary of this project in §1.3, this chapter is intended to answer the questions of Table 2.1. These questions can be grouped into two categories, linguistics and machine learning, and the chapter is organized along the same lines.

    Section  Subject           Question
    §2.1     Linguistics       What is Emergent Phonology?
                               What is lexical tone?
                               What are the acoustic parameters of tone?
    §2.2     Machine Learning  What is machine learning?
                               What is unsupervised learning?
    §2.3     Overlap           What can unsupervised learning tell us about
                               Emergent Phonology and tone systems?

Table 2.1: The questions outlining the exposition of this chapter.

The first section (§2.1) of this chapter considers the linguistic side of this investigation. Specifically, it outlines Emergent Phonology and provides an overview of tone in linguistics. By considering tone from phonological and phonetic perspectives, the groundwork is laid for evaluating the claim of Emergent Phonology that phonology emerges as the "consequence of accumulated phonetic experience" (Lindblom, 1999, p. 195). Next, §2.2 provides an overview of the theory behind, and implementations of, machine learning. It first introduces supervised learning and then extends the discussion to unsupervised learning models such as autoencoders and clustering. The final section (§2.3) addresses why unsupervised learning is a useful tool for supporting the theory of Emergent Phonology.

2.1 The linguistics side: Emergent Phonology and tone

2.1.1 Emergent Phonology

At its most rudimentary, Emergent Phonology holds that a speaker's phonology emerges from the language they hear (from their phonetic experiences) and their general cognitive abilities (Archangeli and Pulleyblank, 2017; Lindblom, 1999). This statement seems simple, yet when faced with the immensity of all that phonology encompasses, it is far from it. Indeed, the introduction to phonology provided in the next section barely scratches the surface of the field. Additionally, to properly consider Emergent Phonology, what emergence is and what phonetic experiences are need to be clearly established. This section aims to provide a sufficient introduction to the ideas of phonology, emergence and phonetic experience so that the complexity of the claim that phonology emerges can be framed properly. At the same time, excessive detail is avoided to keep the investigation from becoming opaque.

2.1.1.1 Phonology

Extending the quote from Chapter 1, phonology is the study of the "function, behaviour, and organization of sounds as linguistic items" (Lass, 1984, p. 1). The additional words 'as linguistic items' are crucial because they make it explicit that the sounds under consideration are explicitly concerned with language (i.e. linguistic) and are individual, discrete units (i.e. items). These qualities are typified in the previously mentioned phonological unit of the phoneme – a contrastive speech sound/segment. Phonemes are contrastive in that, if a phoneme in a word were changed, the result would be either a different word or a non-word. The concept of a phoneme is fairly intuitive and easily demonstrated, as in the English contrast between the words hiss and his. Hiss comprises three phonemes: /h/, /ɪ/, and /s/; his also comprises three phonemes: /h/, /ɪ/, /z/.
Appealing to the notion of contrastiveness, comparing hiss and his allows one to conclude that /s/ and /z/ are different phonemes in English.

Going deeper into phonological theory, the distinction between /s/ and /z/ can actually be reduced to one of voicing.[1] Voicing, or phonation, refers to vocal fold vibration throughout the articulation of a speech sound. The sound /s/ is voiceless (there is no vocal fold vibration) and /z/ is voiced. Consequently, the distinction between /s/ and /z/ can be described as /s/ being associated with some [-voice] property and /z/ being associated with some [+voice] property. This description is formalized in phonology in what is called Feature Theory (Jakobson et al., 1951). In Feature Theory, the [±voice] property is termed a voicing feature. Feature Theory posits that speech sounds, such as phonemes, can be thought of as bundles of features and, as such, features are often considered to be the basic units of phonology. To further demonstrate phonological features, two readily interpretable features of a speech sound are [±nasal] (for whether or not a sound is articulated with nasal airflow) and [±continuant] (for whether a sound is articulated with continuous airflow through the oral cavity). While an in-depth knowledge of Feature Theory is not crucial for this thesis, two pertinent points follow from this introduction: (1) there are multiple levels of phonological analysis (e.g. features, phonemes, syllables, etc.); and (2) the existence of features raises the question of whether they are innate or emergent (see Mielke, 2008, for evidence towards emergence). The former point has relevance for the phonology of tone discussed in §2.1.2.2; the latter point is a natural tie-in to Emergent Phonology, outlined in §2.1.1.2. The reader interested in Feature Theory is encouraged to follow up with other sources (Jakobson et al., 1951; Chomsky and Halle, 1968; Clements, 1985; Mielke, 2008; Samuels, 2009).

[1] Note that this discussion is specific to phonology. In phonetics, the difference between hiss and his is often durational (shorter duration for the 'voiced' segment or longer duration for the vowel preceding a 'voiced' segment).

In the terminology of the first chapter, phonemes and phonological features are examples of phonological units. In general, phonological units are thought of as abstractions; for example, the phoneme /p/ remains constant regardless of who utters it, which word a speaker is saying, or whether a speaker is whispering or yelling. In other words, phonological units remain invariant in the face of phonetic variability. There are also other phonological units, or phonological categories, such as tones and stress. Consider the English contrast between the verb per.MIT and the noun PER.mit (capitals denote stress; '.' denotes a syllable boundary); a fluent speaker of English may vary in how they stress one syllable or another (e.g. varying phonetic cues associated with stress such as duration, pitch and amplitude), but there is a categorical, invariant distinction as to which syllable is stressed, and that distinction has consequences for which word class an utterance of 'permit' belongs to (see Hayes, 1995, for an introduction to metrical stress). The categorical contrast of stress may be clearer still with the nouns desert and dessert in a phrase such as 'I like the dessert/desert'.
To borrow an eloquent description from Rose, phonological categories capture "the Accentual and Linguistic content of the acoustic stimulus [separated] from the components determined by the individual speaker" (Rose, 1987, p. 343).[2]

[2] The restriction of this phrase to phonological categories is of my own design.

In light of this discussion, Lass's definition of phonology can be reframed as the study of the "function, behaviour, and organization" of phonological categories. Thus, after a phonologist has identified the categories of a language (e.g. phonological features, phonemes, tones, etc.), their job becomes investigating how those categories pattern in the language. An example of such patterning was shared in Chapter 1, in which the plural of cat-s has a voiceless /s/ and the plural of dog-s has a voiced /z/. This thesis, however, stops short of patterns in phonology and focuses solely on the emergence of phonological categories. Nonetheless, there are an inordinate number of phonological patterns in the world's languages, and the interested reader is encouraged to learn about them in other sources (e.g. Hayes, 2011; van Oostendorp et al., 2011; Goldsmith et al., 1995).

2.1.1.2 Emergence

In linguistics, emergence refers to language "structure... com[ing] out of discourse and [being] shaped by discourse" (Hopper, 1987, p. 142). This definition conveys the two key aspects of emergence: (1) discourse itself gives rise to structures, and (2) discourse continually refines those structures. I interpret 'structure coming out of discourse' to mean that there are regularities in discourse that naturally chunk into structural units. This interpretation is consistent with that of other researchers, such as Leong and Goswami (2015), who state that "children may use acoustic spectro-temporal patterns in speech to derive phonological units" (Leong and Goswami, 2015, p. 1). Reframing the terminology to suit this thesis, 'discourse' refers to spoken communication, 'regularities' refers to repeating acoustic patterns, and 'structural units' refers to phonological categories. As I am primarily concerned with simulations, I do not speak to what such natural chunking means in human cognition; computationally, however, it may refer to the results of an unsupervised learning system.

As there are no labels in unsupervised learning (§2.2.2), there is no a priori knowledge of the number of categories to learn or the content of those categories. Thus, if an unsupervised learning system is able to discern the correct number of categories and the correct content of those categories (verifiable via some preestablished standard), we can state that those categories have emerged from the data processed by the unsupervised learning system. For example, an unsupervised learning model applied to spoken Mandarin could be evaluated by comparing how well its hypothesized tones match the standard analysis of tone in Mandarin Chinese.

2.1.1.3 Phonetic Experiences

The term 'phonetic experiences' is somewhat nebulous because (from a philosophical standpoint) each person's experiences are unknowable to every other person (Nagel, 1974). Thus, instead of referring to an experience, we will refer to aspects of the speech signal that phoneticians often investigate, namely acoustics and audition.
Acoustics considers how speech exists in the physical world; audition considers how speech is processed by the human auditory system (Johnson, 2004; Ladefoged and Johnson, 2014).

In the physical world, speech is simply sound: pressure fluctuations through a medium (the standard medium in human communication being air).[3] These pressure fluctuations, which originate as manipulations of air in a speaker's vocal tract, flow like waves through the air, and those waves can be measured by instruments such as a microphone, which transduces acoustical energy (i.e. pressure) into electrical energy that can be digitized and read into a computer. The pressure fluctuations of speech are not random. They occur at regular intervals of time, and those intervals have corresponding frequencies (a frequency is 1/interval). As such, speech contains many component frequencies, and those frequencies are what is analyzed by (acoustical) phoneticians. Thus, one interpretation of phonetic experiences is having knowledge of the component frequencies of a speech stimulus.[4]

[3] Note that this ignores the role visual or tactile information play in speech, as well as signed communication.

[4] To be precise, the knowledge would be of the frequency, its amplitude and its phase at a given point in (and then throughout) time.

The field of auditory phonetics can help refine this interpretation by considering how the human auditory system processes sound. The process involves the tympanic membrane of the ear transducing acoustical energy (sound) into mechanical energy, which is then transferred via the bones of the inner ear to the cochlea. The cochlea houses two fluid-filled sacs and the basilar membrane, and has a tapered shape. As the mechanical energy enters the cochlea, vibrations resonate at specific frequency-sensitive sections of the cochlea (because of the tapered shape and differences in stiffness). These vibrations are picked up by sensory hair cells within the basilar membrane, which convert the mechanical energy into the electrical impulses that feed into one's auditory cortex via the auditory nerve (Schuknecht, 1993; Dallos and Fay, 2012). Researchers have determined that the human aural system has more resolution for lower frequencies than higher ones, and that our frequency response follows a logarithmic curve.[5] Thus, the interpretation of phonetic experiences can be further refined as knowledge of the logarithm of the component frequencies of a speech stimulus.

[5] Stating that frequency selectivity is logarithmic in nature is a simplification; in fact, the aural system is argued to be best modeled as a series of equivalent rectangular bandwidth (ERB) filters (Glasberg and Moore, 1990).
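To ground this, the snippet below converts a frequency in Hz to a log scale and to the ERB-rate scale of footnote 5, using the Glasberg and Moore (1990) formula. This is only a sketch of how a machine might approximate the ear's warped frequency resolution; the exact parameterization used in this thesis is specified in Chapter 3.

    import numpy as np

    def erb_rate(f_hz):
        """Number of ERBs below f_hz (Glasberg & Moore, 1990)."""
        return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

    for f in (100, 200, 1000, 2000):
        print(f, round(np.log2(f), 2), round(erb_rate(f), 2))
    # Equal frequency ratios are equal steps on the log2 scale; the ERB
    # scale is roughly linear in Hz at low frequencies and roughly
    # logarithmic above ~500 Hz.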
Finally, phonetic experiences happen in time, so the 'temporal resolution' (Yu, 2017, p. 127) with which one samples component frequencies also needs to be considered. This is particularly true given the computational nature of the project, because the learning algorithm used herein requires a fixed-size vector of input parameters. The role of time in phonetic experiences is addressed further in §3.2.2.

2.1.1.4 Emergent Phonology Summary

Emergent Phonology holds that a speaker's phonology emerges from their phonetic experiences. In §2.1.1.1, phonology was reduced to the identification of phonological units; this is a simplification, but it is an important starting point that is in line with other researchers in the field (e.g. Mielke, 2008). In §2.1.1.2, emergence was reduced to an unsupervised learning system generating the correct quantity and quality of phonological units. Finally, in §2.1.1.3, phonetic experiences were reduced to knowledge of the component frequencies of speech. Thus, the present investigation into Emergent Phonology can be specified as an investigation into whether an unsupervised learning system is able to discern the correct number and correct content of the phonological units of a language when provided component frequencies of speech in that language.

It should be reinforced here that this project approaches Emergent Phonology from a limited viewpoint – it has both reduced Emergent Phonology to its most rudimentary claim and simplified phonology as a field of study. There is much more to Emergent Phonology than has been reported here, not least the drastic undertaking of formalizing a framework through which Emergent Phonology occurs (e.g. Archangeli and Pulleyblank, 2017). Nonetheless, a successful result in this project would provide a significant piece of evidence for the field – evidence that phonetics does concretely give rise to aspects of phonological structure (phonological units in particular). Comparing the results of the present learning simulations with a formal framework is left for the future.

2.1.2 Tone

As lexical tones are the phonological units under investigation in this project, I now provide an overview from the perspectives of phonetics and phonology (for an in-depth introduction to tone, see Yip, 2002). The SIL Glossary of Linguistic Terms defines tone as 'a pitch element added to a syllable to convey grammatical or lexical information.' This definition is succinct and accessible, but it is not quite an accurate representation of how linguists think of tone. In particular, the terms pitch and syllable here may be contested by researchers who work on tone – contentions left for the sections on the phonetics of tone and the phonology of tone, respectively. Lexical tone, as distinct from grammatical tone, distinguishes the meaning of words. This means that, in a language with lexical tones, pitch is used to differentiate words that are otherwise composed of the same sequence of speech sounds/segments. Table 2.2 presents a classic example of lexical tone in Mandarin. Here, the same phoneme sequence /ma/ has four distinct meanings depending on which pitch pattern is used while uttering it. With a high-flat pitch, /ma/ means 'mother'; with a rising pitch, /ma/ means 'hemp'; with a falling-then-rising pitch, /ma/ means 'horse'; and with a falling pitch, /ma/ means 'scold'. As will be discussed in §2.1.2.1, the actual pitch trajectories of these patterns can be visualized using fundamental frequency contours. Such visualizations are useful for studying the pitch range and timing qualities of a tone.

    Character            媽          麻         馬           罵
    Pinyin               mā          má         mǎ           mà
    Gloss                "mother"    "hemp"     "horse"      "scold"
    Pitch pattern (IPA)  ˥˥          ˧˥         ˨˩˦          ˥˩

Table 2.2: Mandarin tones, adapted from Xu (1997).

Grammatical tone, in contrast to lexical tone, distinguishes the grammatical function of a word, such as its tense, aspect or number. For example, in the Central Sudanic language Ngiti, tone is used to distinguish singular and plural number. In Ngiti, the phoneme sequence /kamà/, with a low-flat pitch on the second syllable, means the singular 'chief,' and the phoneme sequence /kámá/, with high-flat pitch patterns on both syllables, means the plural 'chiefs' (Kutsch Lojenga, 1994, p. 135).
To limit the scope of this project, I focus only on lexical tones.

Much like other units in phonology, lexical tones are thought of as abstractions. A lexical tone (henceforth tone) remains constant regardless of who is speaking or how they are speaking. This constancy is, in some regards, surprising; tone depends on pitch, yet pitch varies drastically across talkers. The pitch of a high-level tone of a male is, on average, much lower than that of a female, which is in turn lower than that of a child. Thus, a tone is abstracted from pitch – abstracted from the "components determined by the individual speaker" (Rose, 1987, p. 343).

The contrast of tone and pitch is somewhat mirrored in the distinction between phonology and phonetics. As discussed in §2.1.1, phonology is the study of the function and organization of discrete linguistic items – tone is a speaker-independent, discrete unit of language; phonetics is concerned with the speech signal as it exists in the physical world – pitch is a speaker-dependent signal that changes continuously through time. Thus, as was stated in §1.1.3, mapping phonetics to phonology proceeds through levels of abstraction (cf. Yu, 2011). To fully appreciate this process, both the phonetics of tone and the phonology of tone need to be addressed in more detail. An introduction to both is provided now.

2.1.2.1 Phonetics of Tone

Up until this point, the term pitch has been used. However, pitch is contentious in linguistics. This is because pitch describes one's auditory perception (of, say, a voice or musical instrument), and humans are unable to experience each other's perceptions (Nagel, 1974). As an alternative to pitch, phoneticians rely on the fundamental frequency of a sound. The fundamental frequency, or f0, of a sound is the greatest common divisor of the frequencies of its component sine waves, which in speech is (nearly always) the lowest-frequency component of the (quasi)periodic wave. With regards to the articulation of speech, f0 corresponds to the rate at which the vocal folds vibrate during a voiced segment of speech. F0 is also noted to be the "basic acoustic correlate of perceived pitch" (Rose, 1987, p. 343).

As a demonstration of calculating f0, consider Figure 2.1, which presents the waveform of a recorded /i/. The wave is quasi-periodic, evidenced by the fact that it repeats at quasi-regular intervals. Given that the wave repeats 6 times in 0.043 s, we can calculate its fundamental frequency as 6/0.043 ≈ 140 Hz. This in turn means that the speaker's vocal folds were vibrating (i.e. opening and closing) ≈140 times per second during this speech interval.

Figure 2.1: A zoomed-in waveform of a sustained /i/. With 6 quasi-periodic cycles present in 0.043 s, this utterance has an approximate f0 of 140 Hz.
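The period-counting arithmetic above is easy to automate. As a hedged sketch (not the pitch-tracking method used in this thesis; see §3.2), the following estimates f0 from a synthetic 140 Hz quasi-periodic wave by locating the strongest peak of its autocorrelation:

    import numpy as np

    sr = 16000                                  # sampling rate (Hz)
    t = np.arange(int(0.043 * sr)) / sr         # 0.043 s of signal, as in Figure 2.1
    x = np.sin(2 * np.pi * 140 * t) + 0.3 * np.sin(2 * np.pi * 280 * t)

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # autocorrelation, lags >= 0
    min_lag = sr // 400                         # ignore lags above a 400 Hz ceiling
    period = min_lag + np.argmax(ac[min_lag:])  # lag of the strongest repetition
    print(sr / period)                          # ~140 Hz

The autocorrelation peaks at the lag where the signal best matches a shifted copy of itself, i.e. at one period, so dividing the sampling rate by that lag recovers the fundamental frequency.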
Because fundamental frequency can be calculated at various points throughout an utterance, we can plot f0 contours through time. F0 contours (sometimes termed pitch traces) are regularly used to visualize the pitch patterns associated with a tone. Figure 2.2 presents f0 contours for two Mandarin tones (high-level and rising) from a male, native speaker of Mandarin.[6]

[6] Thanks to my colleague Yadong Liu for providing the recordings.

Figure 2.2: F0 contours of a high-level and a rising tone exemplar of Mandarin, produced by a native male talker of Mandarin. Tones were produced as the first syllable of a two-syllable utterance.

Although Figure 2.2 gives the appearance that there is a simple correspondence between f0 contours and what would be described as a tone, it is somewhat misleading. This is because the phonetic variability produced in the real world is exorbitant. To demonstrate this, Figure 2.3 presents two graphs: the first overlays all of the f0 contours produced by a single talker in the Mandarin corpus used in this thesis (Yuan et al., 2015); the second restricts the f0 contours to only the rising tone of that talker. There is extensive variability in both images.

Figure 2.3: F0 contours overlaid on a single graph. The left-hand image overlays f0 contours for all tones produced by a single talker in the Mandarin Chinese Phonetic Segmentation and Tone Corpus (Yuan et al., 2015); the right-hand image overlays f0 contours for only the rising tone of the same talker.

Visualizations of f0 contours are used extensively throughout the case studies in Chapter 4, and are further discussed in §3.4.2.

The above discussion of f0 pulls from three separate subfields of phonetics: auditory phonetics, articulatory phonetics and acoustic phonetics.

• Auditory phonetics studies the processing/hearing of speech by the human auditory system. Auditorily, f0 is converted into perceived pitch, which is an integral component of tone.

• Articulatory phonetics studies the generation of speech via the manipulation of airflow by the speech organs of the human vocal tract. Articulatorily, f0 originates as glottal pulses – the opening and closing of the vocal folds while vibrating.

• Acoustic phonetics studies the physical speech signal, such as one recorded via a microphone. Acoustically, f0 is visible in a waveform and measurable by dividing 1 by the period of the wave.

This thesis considers tone information extracted from the acoustic-auditory signal, but a brief discussion of the proposed articulatory mechanisms behind tone is appropriate here to contextualize the phenomenon. Tones are phonological categories abstracted from the pitch patterns of a speaker's utterance, and pitch is correlated with the fundamental frequency of an acoustic speech signal. Fundamental frequency is, in turn, determined by the rate of vocal fold vibration. As such, articulatory phoneticians who study tone are primarily interested in the mechanisms responsible for modulating the rate of vocal fold vibration. The standard proposal is that the rate of vocal fold vibration depends on the tension of the vocal folds (Stevens and Halle, 1971). Duanmu (2007) states that the cricothyroid muscle is likely the primary mechanism used to change vocal fold tension. In addition to a single 'pitch-control mechanism' (Edmondson and Esling, 2006, p. 188), however, Esling and Harris (2005) propose that there is a series of six valves that interact to facilitate the production of, for example, tones. These valves also interact to enable voice quality distinctions such as creaky voice and breathy voice, providing an explanation for the frequently occurring pattern of certain voice qualities corresponding to certain tone categories in a variety of languages (Edmondson and Esling, 2006). The mechanisms that enable the production of tonal contrasts are still being unraveled.

In this thesis, the only data used are acoustic (audio recordings from a corpus). As such, I consider only the acoustic parameterization of tones. There are multiple ways to parameterize the speech signal for an investigation of tone; for example, Surendran (2007) considers f0, f0 change, intensity, intensity change, duration (i.e. length of sampling window) and voice quality metrics in his investigation of tone classification in Mandarin. The present project, however, is not a classification task; it is an unsupervised exploration of data. As such, there is a trade-off in using too many metrics to parameterize tone, because additional types of data introduce additional noise (making it harder for unsupervised learning algorithms to identify what is meaningful). Additionally, parameterizations that are useful for machine classification may not be clearly relatable to human experience[7]; indeed, such parameterizations may not be attended to by humans at all. In light of this discussion, I restrict my acoustic parameterization of tone to three metrics: two time-series parameterizations and one scalar parameterization. For the time-series data, fundamental frequency and its differential are used. For the scalar data, the duration of the tone is used.

[7] This is part of a much larger conversation on how engineering solutions relate to human experience.
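As a rough sketch of how these three metrics might be bundled for a learning model – assuming an f0 track has already been extracted and resampled to a fixed number of points per syllable-frame (the thesis's actual preprocessing is specified in §3.2) – consider:

    import numpy as np

    def tone_parameters(f0_track_hz, frame_duration_s):
        """Bundle f0, its differential (d1), and duration into one vector.

        f0_track_hz: f0 samples (Hz) taken at evenly spaced points across
        one syllable-frame; frame_duration_s: the frame's length in seconds.
        """
        f0 = np.log(np.asarray(f0_track_hz))  # log-scaled, per §2.1.1.3
        d1 = np.gradient(f0)                  # how f0 changes over time
        return np.concatenate([f0, d1, [frame_duration_s]])

    # A schematic rising tone sampled at 8 points over a 0.25 s syllable-frame:
    print(tone_parameters(np.linspace(120, 180, 8), 0.25))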
Fundamental frequency (f0) is the primary correlate of pitch, so its inclusion in the acoustic parameterization of tone is self-evident. Next, as the differential of a function measures how it changes over time, the differential of the fundamental frequency (d1) captures how f0 changes over time. In light of this, d1 has clear utility for identifying contour tones – tones that change pitch through time, such as the falling tone in Mandarin (Gauthier et al., 2007). Finally, duration provides a partial cue for distinguishing different tones because the more complex a tone is, the more time it generally takes to realize it phonetically (Gordon, 2001). For example, the fall and then rise of Mandarin's 2-1-4 tone regularly takes more time than the fall of Mandarin's 5-1 tone (Whalen and Xu, 1992). Note that I have not stated the duration of what – this is addressed in the following section. The acoustic parameterization of tone is fully incorporated into the method of this thesis and is further discussed in §3.2.

2.1.2.2 Phonology of Tone

As previously stated, a tone is an abstract phonological unit. In phonology, these units are generally described as relative pitch targets in a speaker's range. The most common (typologically) number of pitch targets in a tone language is two: high (H) and low (L) (Odden, 1995). Maddieson states that there may be up to five pitch targets in a given language: extra high, high, central, low, extra low (Maddieson, 1978, p. 340). In the Sinological tradition (which will generally be used throughout this thesis), these targets are labeled with numbers, a value of 5 corresponding to the highest pitch target (or level) in a speaker's range and a value of 1 corresponding to the lowest (Chao, 1930). The high tone in Mandarin, then, can be described as a 5-5 tone, meaning it starts and ends at the highest pitch level within a speaker's register. Similarly, the rising tone in Mandarin can be described as a 3-5 tone, the falling-rising tone as 2-1-4, and the falling tone as 5-1.
These pitch level descriptions are shown in Table 2.3.

    Character            媽          麻         馬           罵
    Pinyin               mā          má         mǎ           mà
    Gloss                "mother"    "hemp"     "horse"      "scold"
    Pitch levels         5-5         3-5        2-1-4        5-1
    Pitch pattern (IPA)  ˥˥          ˧˥         ˨˩˦          ˥˩
    Description          high        rising     fall-rise    falling

Table 2.3: Mandarin tones, adapted from Xu (1997).

Despite the use of the five pitch levels throughout this thesis, it is important to note that the number of pitch levels depends on a phonologist's analysis, theoretical framework and choice of convention. For example, Gussenhoven et al. (2004) present an analysis of Mandarin with only a high (H) and low (L) tone (5-5 = H; 3-5 = LH; 2-1-4 = L; 5-1 = HL) (Gussenhoven et al., 2004, p. 28). There are well-grounded theoretical reasons for doing this, but they are beyond the scope of this thesis.

Like phonemes, tones are primarily motivated by the notion of contrastiveness. Two tones are said to contrast if, all else being equal, interchanging them results in distinct words or non-words. For example, in Mandarin the falling tone contrasts with the rising tone in that /ma/ with a falling tone means 'scold' and /ma/ with a rising tone means 'hemp.' This brings us to an important practical consideration for tonal phonology – tones do not generally exist as isolated entities; they are associated with other parts of the phonological structure abstracted from a speech signal. These associations are formalized in the theory of Autosegmental Phonology (Goldsmith, 1976). In Autosegmental Phonology, a tone exists on a 'tone tier,' which can be associated with other parts of the phonological structure, such as a 'phoneme tier.'[8] As a demonstration, Figure 2.4 provides a simple visualization of the high tone of Mandarin (as analyzed by Gussenhoven) being associated with the vocalic portion of the phoneme string /ma/.

[8] Given the vertical associations assumed in Autosegmental Phonology, it is considered to be part of a broader theory termed Non-linear Phonology (i.e. phonology does not occur as beads-on-a-string through time).

    Tone tier:       H
                     |
    Phoneme tier:  / m a /

Figure 2.4: An autosegmental example of a tone associating with a phoneme.

Associations such as the one shown in Figure 2.4 are the tip of the iceberg for phonologists who wrestle with phonological structure, because standard phonological structure contains many tiers. The standard tiers include the aforementioned phonological features and phonemes, but also syllables and their constituents, words and phrases (Ewen and Van der Hulst, 2001; Gussenhoven and Wright, 2015). A visual representation of this structure is given in Figure 2.5 for the bisyllabic English word level.

Figure 2.5: Phonological structure for the English word level (/lɛvəɫ/): the word dominates two syllables (/lɛ/ and /vəɫ/); each syllable comprises an onset and a rime; each rime comprises a nucleus and, optionally, a coda; and each of these positions is filled by a phoneme, itself a bundle of phonological features.

This structure can also be broken down in prose to clearly highlight the compositionality of phonology (a computational rendering is sketched below):

• words comprise syllables
• syllables comprise onsets and rimes
• rimes comprise nuclei and codas
• onsets, nuclei and codas comprise phonemes
• phonemes comprise phonological features
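This compositionality lends itself naturally to a nested data structure. The rendering below of Figure 2.5's word level is purely illustrative; feature bundles are elided to keep the sketch small.

    # Nested-dictionary rendering of Figure 2.5 for 'level' (/lɛvəɫ/).
    # Each phoneme would itself expand into a bundle of phonological features.
    level = {"word": [
        {"syllable": {"onset": "l", "rime": {"nucleus": "ɛ", "coda": None}}},
        {"syllable": {"onset": "v", "rime": {"nucleus": "ə", "coda": "ɫ"}}},
    ]}

    # e.g. the coda of the second syllable:
    print(level["word"][1]["syllable"]["rime"]["coda"])  # ɫ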
This sidebar into phonological structure has been necessary because it shows that determining which tier tones are associated with is non-trivial in phonology. As shown in Figure 2.4, one could associate a tone with a vowel, but this is not the only option. A tone could be incorporated into a set of phonological features, or associated with a vowel, a rime, a syllable or a word. In fact, there are arguments that tones are associated with an additional part of phonological structure not discussed in this thesis, the phonological unit known as a mora – a timing unit in the rime of a syllable (Hyman, 1985; Gussenhoven and Teeuw, 2008; Hock, 1986; Pulleyblank, 1994). The challenge of associating tones with other parts of phonological structure is further exacerbated by the fact that one could individually associate the pitch level components of a tone with different units (or even different tiers). Thus, if one assumes the Mandarin falling tone exists as the pitch target sequence HL, the H could be associated with a nucleus and the L with a coda. Deciding how tones are associated with other tiers in phonological structure is important because one's association assumptions have consequences for what the tone inventory of a given language looks like. For example, although Mandarin has four lexically contrastive tones, perhaps that tone inventory is itself composed of only two of what are termed autosegments (H and L). A full consideration of this puzzle is well beyond the scope of the thesis, however, and it is put aside for now.

While a phoneme inventory for a given language comprises the contrastive speech sounds in that language, a tone inventory comprises a language's contrastive tones. As was stated at the beginning of this section, the most common tone inventory is composed of two level tones, a low and a high tone (Maddieson, 1978; Odden, 1995). From that baseline, tone inventories diversify significantly in both the number of contrastive tones and their shapes. Generally, tone inventories are small (2-5 contrastive tones), but some languages have been documented with nine (Yaohong and Guoqiao, 1998) or even 14 contrastive tones[9] (Bearth and Link, 1980). With respect to tone shape (the f0 contour), there are generally assumed to be three classes: level, simple contour and complex contour tones (Maddieson, 2013b). A level tone exists on a single pitch level (e.g. the 5-5 tone of Mandarin); a simple contour tone starts and ends on different pitch levels (e.g. the 5-1 tone of Mandarin); and a complex contour tone changes levels two or more times (e.g. the 2-1-4 tone of Mandarin). Again, the examples just given exemplify these tone classes only under a five-pitch-level analysis. For the purposes of this work, however, I do not distinguish tone classes.

[9] For those interested in learning more about tone inventories around the globe, I recommend visiting the World Atlas of Language Structures (Maddieson, 2013b).

In this thesis, I assume that a tone is associated with either the syllable, the syllable nucleus or the syllable rime, and I use the term syllable-frame as a catchall for all three options. Given this, the duration parameter identified in §2.1.2.1 is calculated as the length of time of a syllable-frame. Necessarily, only one type of syllable-frame is selected for a given language, to ensure duration values are comparable within a dataset. The choice is made explicit in each language case study of Chapter 4. It should be noted that the restriction of tone duration to that of a syllable-frame was motivated by phonological theory[10], in spite of evidence for the importance of coarticulation across syllables (Xu, 1997) and potential computational gains from considering larger temporal spans (Qian et al., 2007). See §3.2.1.1 for additional discussion.

[10] Exemplified in items on a tone tier being associated with a unit in a syllable in Autosegmental Phonology.
2.1.3 The linguistic side: Summary statement

Putting all of the pieces together, this thesis investigates whether the phonological units of tone, for a given language, emerge from the acoustic parameters of f0, d1 and the duration of a syllable-frame. A positive result would constitute a significant piece of evidence in support of Emergent Phonology.

2.1.3.1 An important caveat

There is one caveat in this investigation that has been glossed over but ought to be addressed – the notion of contrastiveness. Phonological categories are primarily motivated by contrastiveness, which requires the notion of meaning. Since the acoustic parameters used herein are naive to lexicality (i.e. there is no notion of meaning), tones that are shown to emerge cannot be said to be contrastive. They will, however, be distinct in their phonetic properties. This may actually be an interesting point for future consideration because, if contrast is not needed to identify tones, perhaps linguists' reliance on contrastiveness is overdone. Nonetheless, the incorporation of lexical meaning is left for future research.

2.2 The computational side: Machine learning

This section aims to answer the relevant questions from Table 2.1, restated here for convenience:

    Subject           Question
    Machine Learning  What is machine learning?
                      What is unsupervised learning?

In simplest terms, machine learning (most often) refers to a machine autonomously learning the implementation of some, oft-unknown, function. Mathematically, a function maps one set of inputs to another set of outputs. As an example, consider flagColours(), defined as a function that outputs the component colours of a country's flag. As such, flagColours(Canada) would return (Red, White); flagColours(Greece) would return (Blue, White); flagColours(Italy) would return (Red, White, Green); and flagColours(Hong Kong) would return (Red, White). In contrast to flagColours(), which is a deterministic and well-understood function, many functions that the human brain performs are stochastic and not fully understood.

Consider the following analogy. When asked whether there is a cat in a given image, nearly all typical humans will perform at ceiling with minimal effort. Despite this, the criteria used for deciding whether an image contains a cat or something else (e.g. a human, a dog, a mouse, a drawing of a cat, etc.) are hard to express. This is, in part, because human perception is a gestalt, integrating past experience with what is seen (Koffka, 1922). Additionally, humans generally base their interpretations on qualitative, not quantitative, criteria. For example, one might say that a cat has 'pointy ears' without defining what qualifies as 'pointiness' – a required angle? the inverse of a roundness criterion? Or, perhaps there is a cat with only one ear, which thus has the quality of 'pointy ear' and not 'pointy ears' – is it still a cat? These considerations illustrate why computer vision is a challenging area of research – the functions of visual classification, e.g. mapping an image to a hasCat quality, are not fully understood.

However, in spite of not being fully understood, humans still embody the imageHasCat() function, which outputs true if an input image contains a cat and false otherwise. This human ability enables researchers to construct datasets that consist of a set of images and the set of corresponding true or false values. These datasets can then be used to train machine learning algorithms in what is called supervised learning; a schematic example of such a dataset is given below.
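Schematically, a supervised dataset is nothing more than a collection of feature vectors paired with labels. The sketch below uses random arrays in place of real images purely to stay self-contained; it also previews the train/test split discussed in §2.2.1.2.

    import numpy as np

    rng = np.random.default_rng(0)
    features = rng.random((100, 784))      # 100 stand-in "images", 784 values each
    labels = rng.integers(0, 2, size=100)  # 1 = hasCat, 0 = otherwise

    train_x, test_x = features[:80], features[80:]  # seen during training
    train_y, test_y = labels[:80], labels[80:]      # held out for evaluation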
2.2.1 Supervised Learning

In supervised learning, there exists a dataset of inputs, termed features, and corresponding outputs, termed labels. Historically, these datasets were created by human researchers for use in training computers (e.g. Garofolo, 1993). This process – training machines on human-created datasets – has revolutionized machine performance in many areas, image/facial recognition (Russakovsky et al., 2015) and automatic speech recognition (Chiu et al., 2017; Toshniwal et al., 2018) in particular.

There is a wide variety of statistical models used for supervised learning. Some popular methods include support-vector machines (Ben-Hur et al., 2001), regression analysis (Lindley, 1990), discriminant analysis (Lachenbruch and Goldstein, 1979) and artificial neural networks (McCulloch and Pitts, 1943). For the remainder of this section, I review only neural networks, for two reasons: (1) neural networks are used in this thesis, and (2) neural networks are arguably the most active area of machine learning research right now, particularly in the area of deep learning (LeCun et al., 2015). I begin with a brief overview of neural networks and then explicate how supervised learning is implemented within a neural network using the popular example of the MNIST digit-recognition task (Deng, 2012).

2.2.1.1 A brief overview of neural networks

Artificial neural networks attempt to emulate how neurons function in the human brain. The typical structure of a neuron is shown in Figure 2.6. Neurons communicate with each other as follows: (1) dendrites receive electro-chemical signals from other neurons and transfer those signals to the soma, or cell body; (2) the soma processes the received signals and, if appropriate, propagates an electrical signal down its axon; finally, (3) the axon branches transfer the charge to the dendrites of other neurons.

Figure 2.6: An image of a neuron and its dendrites and axon.
Reproduced under a CC BY 3.0 license from https://en.wikipedia.org/wiki/Neuron#/media/File:Blausen_0657_MultipolarNeuron.png

An artificial neural network (ANN/NN) works in much the same way as networks of neurons in the brain. Figure 2.7 shows a simple ANN. In it, cell bodies are replaced with nodes, and axons/dendrites are replaced with mathematical connections to other nodes. In the brain, neurons send and receive electro-chemical signals, with the cell body determining whether the signals received from the dendrites should propagate forward down its axon. In an ANN, nodes receive and then propagate numerical values. The numerical value of a node is termed its activation. A node's activation is calculated by its activation function, which takes as input the sum of all connected nodes' activations multiplied by their respective connection weights, as sketched below.
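In code, a single node's behaviour is one line: a weighted sum of the incoming activations, passed through an activation function. The use of tanh below is only a common illustrative choice; the activation functions used in this thesis are specified in Chapter 3.

    import numpy as np

    def node_activation(incoming_activations, connection_weights, f=np.tanh):
        # Sum of connected nodes' activations, each scaled by its
        # connection weight, passed through the activation function f.
        return f(np.dot(incoming_activations, connection_weights))

    print(node_activation(np.array([0.5, -1.0, 0.25]),
                          np.array([0.8, 0.1, 2.0])))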
Figure 2.7: A simple ANN.
Reproduced under a CC BY-SA 3.0 license from https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg

To my knowledge, the basic mathematical description of ANNs comes from McCulloch and Pitts (1943). Before the 1970s, neural networks were not widely used because weights were not automatically learned; they needed to be hard-coded. At the time, it made little sense to implement a function using NNs because manually calculating weights provided little insight and was not computationally efficient. This changed in 1974, when Werbos (1974) first described (what would later become) learning via the backward propagation of errors (i.e. back-prop). Back-prop enabled the efficient and autonomous learning of weights, effectively resulting in the dawn of supervised learning.

2.2.1.2 Supervised learning in neural networks

Back-prop provides the computational mechanism needed to make use of supervised learning datasets, which contain corresponding input features and output labels. Back-prop works as follows: (1) a NN is initialized with (often random) weights; (2) input features are fed into the NN input layer; (3) the input features flow through the NN via the network's weights, resulting in an output for that input; (4) the NN output is compared to the known correct output label for those input features using a loss function, often the sum-squared error, which calculates the error; (5) the error is then propagated back through the network, assigning a portion of the error to each neuron and updating the weights accordingly. The result of this process is that the next time the same input is fed into the NN, the NN output will be slightly closer to the known correct output label (i.e. the error will be less). These steps are repeated many times with many different input-output pairs, resulting in weights that minimize error across the entire dataset.

As a rough illustration, we will consider the popular MNIST dataset (Deng, 2012). The MNIST dataset contains 70,000 handwritten digits (0-9) represented as 28x28 pixel images (i.e. 784 values); example digits are shown in Figure 2.8.

Figure 2.8: Example MNIST images.
Reproduced under a CC BY-SA 4.0 license from https://en.wikipedia.org/wiki/MNIST_database#/media/File:MnistExamples.png

Each digit image also has a corresponding numerical label 0-9, represented as a one-hot 10x1 vector. This means that a value of 1 at a given index in the output vector corresponds to the label – e.g. a '1' would be represented by the vector [0 1 0 0 0 0 0 0 0 0] and a '6' would be represented by the vector [0 0 0 0 0 0 1 0 0 0]. Thus, with MNIST, our problem space is the mapping of a vector of 784 pixel values (28x28) to a vector of 10 values (one-hot digit labels). Figure 2.9 presents a visualization of what a neural network designed to learn the appropriate mapping function might look like.

Figure 2.9: A simple neural network design for learning MNIST digits.
Reproduced from https://mmlind.github.io/Simple_1-Layer_Neural_Network_for_MNIST_Handwriting_Recognition/

This network comprises only two fully-connected layers, meaning each input feature is directly connected to each node of the output layer. Generally, neural networks comprise input and output layers as well as hidden layers – layers of nodes between the input layer and output layer – but hidden layers are not needed for our illustration. I now walk through the steps of back-prop with respect to a given training exemplar and the network of Figure 2.9:

    Time step  Action
    T1         Network weights are randomly initialized.
    T2         A digit image is provided to the network, represented as a
               784x1 vector of pixel values.
    T3         Pixel values propagate forward via matrix multiplication
               with the weights.
    T4         The value of each output node is calculated via its
               activation function, given the results of T3.
    T5         The output values are compared to the known label for the
               input image using a loss function, often the sum-squared
               error; the output of the loss function is the error.
    T6         The gradient of the loss function is used to determine how
               the weights should change to reduce the error for the
               current exemplar.
    T7         The weights are updated according to the results of T6.

The consequence of this process (after training on all data points and reaching a minimum error) is that the weights of the network come to implement a good approximation of the function that maps input features to output labels – in this case, mapping 784 pixel values to a 10x1 vector whose highest-valued index corresponds to the digit class. It is important to note that the network will not learn the function perfectly, meaning it does not produce perfect one-hot vectors (i.e. nine 0s and one 1). Instead, the index of the output node with the highest value approximates the one-hot location.

Also, when training a model for supervised learning, it is important to separate the dataset into training and testing subsets. The training set is used, as the name implies, to train the network. In contrast, the testing set is not seen by the network during training; this allows it to be used to evaluate the generalizability of the network. The reasoning here is fairly intuitive. The network learns to minimize errors based on the exemplars it sees. As such, it is not surprising for the network to have a low error on those exemplars. A better evaluation is to test the network on how it generalizes to exemplars it has not seen. If the network produces low errors for unseen data, it has likely learned a good approximation of the desired function.
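The T1-T7 walkthrough above can be condensed into a few lines of NumPy. This is a hedged sketch of the two-layer network of Figure 2.9 (sigmoid outputs, sum-squared error), with random values standing in for actual MNIST pixels so the snippet stays self-contained:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(0.0, 0.01, size=(784, 10))  # T1: random initialization
    b = np.zeros(10)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_step(x, y_onehot, lr=0.5):
        y_hat = sigmoid(x @ W + b)               # T2-T4: forward pass
        error = np.sum((y_hat - y_onehot) ** 2)  # T5: sum-squared error
        # T6: gradient of the loss with respect to the weights
        delta = 2.0 * (y_hat - y_onehot) * y_hat * (1.0 - y_hat)
        W[:] -= lr * np.outer(x, delta)          # T7: weight update
        b[:] -= lr * delta
        return error

    x = rng.random(784)           # a stand-in "digit image"
    y = np.zeros(10); y[3] = 1.0  # one-hot label for the digit 3
    print(train_step(x, y), train_step(x, y))  # error shrinks on the repeat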
In fact, one of the major limiting factors in the field now is the requirement of sufficiently large human-annotated datasets, which brings us to unsupervised learning.

2.2.2 Unsupervised Learning

In an unsupervised learning task, there are no labels. The goal, then, is not to learn an explicit function, but to learn something useful about the patterns and structure present in a dataset. These patterns are often utilized for tasks like dimensionality reduction or clustering. Dimensionality reduction, as the name implies, reduces the size of a data point, such as compressing an image from, say, 784 pixel values to something much smaller. The most obvious benefit of reducing the dimensionality of a data point is that it drastically reduces the computer memory needed to store it. For example, Google uses unsupervised learning models to compress images for its Google Photos app and for web searches (Toderici et al., 2017). Clustering refers to grouping data points based on similarity, often defined using a distance metric between the points. Clustering has many practical uses, including sorting humans into archetypes for better advertising or sorting articles into fake or real news (Miller et al., 2014; Bentolila et al., 2011).

One of the most notable results from unsupervised learning came out of research at Google Brain. Researchers applied an autoencoder (an unsupervised machine learning model) to thousands of static images from YouTube and were able to demonstrate that the network identified high-level features corresponding to human faces, body parts and even cat faces (Le, 2013). This result is quite gobsmacking; the network had not been given any sort of labels, yet activations in the network were seen to be sensitive to a hasCat property in much the same way as humans embody the imageHasCat function. In addition, the scale of this network was immense: it consisted of 1000 computers linking 16000 CPUs that trained on the dataset for 3 days. Fortunately, with the refinement of GPUs, such vast CPU arrays are not needed for most current research.

The Le (2013) article serves as part of the inspiration for the present project, particularly the result that a hasCat property was naturally conceptualized in an unsupervised learning system. This, of course, mirrors the current hypothesis: that the phonological category of tone will naturally take shape in an unsupervised learning system applied to acoustic parameters of natural speech. The unsupervised learning system used herein comprises an autoencoder and hierarchical clustering (with relevant evaluations of performance). I elucidate the components here.

2.2.2.1 Autoencoders

Autoencoders are a neural network model used to reduce the dimensionality of their input. To do this, they mimic a supervised learning task in which the output labels are set to the input features themselves. The trick is that the networks are designed with an hourglass shape, such that an intermediate hidden layer is of lower dimensionality than the input/output layers. Training occurs in the same way as in a supervised learning task: an input is provided, activations flow through the network and an error is calculated at the output layer. The difference here is that this error is not an error of classification; it is a reconstruction error, capturing how well the network reconstructed the input after passing through the lower-dimensional space.
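As a minimal sketch of this idea, assuming scikit-learn (with tanh standing in for the arc-tangent activations used later in this thesis, and toy data in place of a real corpus), an autoencoder is simply a regressor whose targets are its own inputs:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.random.rand(1000, 14)  # placeholder dataset of 14-dimensional vectors

# An hourglass-shaped network trained to reproduce its input through a
# 2-unit bottleneck: the output "labels" are the input features themselves.
ae = MLPRegressor(hidden_layer_sizes=(10, 2, 10), activation='tanh',
                  solver='adam', max_iter=2000, random_state=0)
ae.fit(X, X)

def encode(x, model):
    """Forward pass through the first two weight matrices, stopping at the
    2-unit bottleneck; the result is the low-dimensional code for x."""
    h = np.tanh(x @ model.coefs_[0] + model.intercepts_[0])
    return np.tanh(h @ model.coefs_[1] + model.intercepts_[1])

codes = encode(X, ae)  # shape (1000, 2): one latent code per data point
```

Once trained, only the first half of the network is needed to map new data points into the bottleneck space, which is the property exploited throughout this thesis.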
Figure 2.10 provides a schematic for a simple autoencoder.

Figure 2.10: A simple autoencoder design with labels for the encoding portion (to the latent code) and the decoding portion (from a latent code to reconstruction).
Reproduced under a CC BY-SA 4.0 License from https://en.wikipedia.org/wiki/Autoencoder#/media/File:Autoencoder_structure.png

After an autoencoder has minimized error across its dataset, it acts as a powerful dimensionality-reduction function, reducing a higher-dimensional input into the lower-dimensional space. The lower-dimensional space is termed the latent space, and the mapping of an input to this space returns a latent code (or encoding) for that input. This latent code acts as an abstraction, or, in the language of Le (2013), a high-level representation of the input. A trained autoencoder can be used to return the latent code of every exemplar in a dataset, providing a new dataset to study. This new dataset is of much lower dimensionality than the original, and as such it is much better suited to clustering analyses. This is the case because the distance metrics used in clustering become increasingly uninformative in high-dimensional spaces (Nousi and Tefas, 2018; Beyer et al., 1999). I employ hierarchical clustering in this project to cluster the latent codes returned from a trained autoencoder.

2.2.2.2 Hierarchical clustering

Like most clustering algorithms, hierarchical clustering groups data points based on their proximity to neighbouring data points. This is implemented in hierarchical clustering as follows:

1. The distance between each data point and all other data points in the dataset is calculated
2. The two closest data points globally are linked as a cluster, described by its centroid (there are other methods to describe a cluster, some of which incorporate variance as well; the centroid is used here because it can be reconstructed through the decoder network, §3.4.2)
3. The distance between each data point, each linked cluster and all other data points and linked clusters is calculated
4. The two globally closest data points or linked clusters are linked into a cluster, described by its centroid
5. Steps 3 and 4 are repeated until all data points are linked

The result of this is a dendrogram of linked data points/clusters, such as the one visualized using Scipy (Virtanen et al., 2019) in Figure 2.11.

Figure 2.11: A dendrogram of linked data points/clusters. Each arch corresponds to the distance needed to connect two points or clusters. These data are from the Fungwa case study in §4.3.

The full dendrogram has all points linked under one cluster. To separate the data into more clusters, a distance threshold can be chosen, and clusters that are connected by that distance (or larger) are separated. As the threshold is lowered, the number of clusters increases. There are also standard metrics to assess which number of clusters best fits the data (discussed in §3.4.1). An example of a cut dendrogram is presented in Figure 3.4.
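As an illustration of this procedure, the following sketch links two well-separated groups of toy 2-d points and cuts the resulting tree at a distance threshold; the data and threshold are placeholders, not values from any corpus in this thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Two well-separated groups of toy 2-d points.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                    rng.normal(3.0, 0.3, (20, 2))])

Z = linkage(points, method='ward')  # repeated pairwise linking of points/clusters
dendrogram(Z)                       # plot of the linkage tree, as in Figure 2.11

# Cutting the tree at a distance threshold separates it into clusters;
# lowering the threshold yields progressively more clusters.
labels = fcluster(Z, t=2.0, criterion='distance')
```

With the threshold of 2.0, fcluster returns two cluster labels for these toy data; lowering the threshold separates the tree into progressively more clusters.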
2.2.3 The computational side: Summary

Machine learning has revolutionized machine performance on many tasks (LeCun et al., 2015). The advancements are primarily thanks to supervised learning models trained on human-annotated datasets. Nonetheless, the field is changing to make use of unsupervised learning models and datasets that require minimal human annotation (such as sequence-to-sequence models). This thesis utilizes a combination of two unsupervised learning methods, autoencoders and clustering, to investigate whether the phonological category of tone can be created purely from continuous acoustic parameters.

2.3 The overlap: Combining computation and linguistics

The remaining question from Table 2.1 is: what can unsupervised learning tell us about Emergent Phonology? The answer, from my perspective, is that it can tell us whether Emergent Phonology, at its most rudimentary, is tractable, which is a valuable answer for researchers studying the acquisition of phonology or language acquisition as a whole (e.g. Dresher, 2015; Archangeli and Pulleyblank, 2015, 2012; Mielke, 2008).

Emergent Phonology proposes that phonological categories emerge (through general cognitive abilities) from phonetic experience. This requires that the variable, continuous phonetic realizations of speech naturally chunk into invariant, discrete phonological units. While machine learning cannot speak to the general abilities of humans that may lead to phonology emerging (such as memory, a sensitivity to frequency and a notion of similarity; Archangeli and Pulleyblank, 2017, p. 3), it can demonstrate that there is sufficient patterning in the phonetics to give rise to phonological categories. In fact, a positive result in this project would be quite conservative: the phonetic parameters have been restricted to f0 and duration (humans are able to perceive other acoustic-phonetic cues to tone, such as voice quality), and the model has no knowledge of lexicality (word contrasts cannot be used as cues to identify a separation of tone categories). If unsupervised learning can discern phonological categories in the current investigation, there is a strong case that a language-acquiring infant will also be able to discern such patterns (albeit via an alternative implementation).

Chapter 4 follows the methodology laid out next.

Chapter 3

Methodology: Implementation and Explication

This chapter details the precise computational method that I have developed to demonstrate that machine learning is a useful analysis tool for theoretical linguists. As noted in the thesis statement (§1.3), I do this by providing support for the theory of Emergent Phonology, showing that the lexical tones of a language arise from an unsupervised learning model trained on the acoustic-phonetic parameters of fundamental frequency, its differential and syllable-frame duration. To begin, a summary of the method is provided; this is followed by the specifics of implementation. Development was done using TensorFlow version 1.9 and Python version 3.5; the code is available on GitHub (https://github.com/mdfry).

3.1 Method Summary

The method of this thesis can be broken down into three sequential stages: (1) data preprocessing, (2) dimensionality reduction and (3) clustering. In the data preprocessing stage, acoustic parameters are extracted from syllable-frame-aligned, discretized chunks of audio; the background for this stage was provided in §2.1.2. Next, the acoustic parameters for each extracted chunk are reduced to a lower dimension, a latent code, using an autoencoder that has been trained on the same data; the background for this was provided in §2.2.2.1. Finally, the latent codes for all data points in the training data are clustered and the clusterings are evaluated for fit using several metrics; the background for this was provided in §2.2.2.2. The final goal of the method (i.e. after clustering evaluation) is to generate a hypothesized tone inventory for a language.
Ideally, a hypothesized tone inventory comprises both the optimal number of tones and the proposed shapes for those tones; however, the result for the optimal number of tones is often unclear, for reasons discussed in §3.4.1. Tone shapes are visualized as f0 contours. Table 3.1 provides an overview of these steps, and this chapter is organized along the same lines.

Stage                          Processes
(1) Data preprocessing:        Demarcate syllable-frames
                               Extract acoustic parameters
(2) Dimensionality reduction:  Train autoencoder
                               Generate latent codes of acoustic parameters
(3) Clustering:                Cluster latent codes
                               Evaluate clusters
                               Select and visualize optimal clusters

Table 3.1: Method Summary

3.2 Data Preprocessing

In this project, data preprocessing encompasses demarcating syllable-frames in audio recordings (from a corpus) and extracting acoustic parameters. The goal of these steps is to generate meaningful, uniformly formatted data (representing the acoustic-phonetic realization of exemplar tones) to be input into an unsupervised machine learning model. Meaningful data is critical in any machine learning project; no matter how powerful a learning model is, it will not perform its task if the data it is given is not meaningful for the task (this is what is meant by the common machine learning adage, "garbage in, garbage out"). In this project, apropos of tone, meaningful data is interpreted as audio chunked into syllable-frames (§2.1.2.2) that has been processed into the acoustic parameters of f0 and duration (§2.1.2.1). It is important to note that the data have been purposefully structured in a way that enables the generation of hypothetical tone inventories by the learning model. Crucially, the model has not had to learn a concept like the syllable, nor that the appropriate acoustic dimension to attend to for tone is f0. Future work will investigate how to expand the model so that such preprocessing is not necessary, ultimately working towards a system that can hypothesize tone inventories from raw acoustics (§5.3). One promising path may be to utilize the amplitude envelope as a cue for phonological unit demarcation (cf. Leong and Goswami, 2015).

3.2.1 Syllable Demarcation

Demarcated syllable-frames are attained by one of two alternatives in this project. The first alternative is simply to use timestamps, often annotated by humans, that are already available in a given corpus. The Buckeye Speech corpus (Pitt et al., 2005), used in §4.4, is an example of such a corpus. To extract syllable-frames in this way, a simple script extracts audio based on time-marked annotations. The second alternative is via forced alignment.

Forced alignment refers to the process in which "speech and its corresponding orthographic transcription are automatically aligned at the word and phone level, given a way to map graphemes to [phones] and a statistical model of how phones are realized" (McAuliffe et al., 2017, p. 1). In other words, forced alignment automatically aligns text to speech, given models of the acoustics of speech sounds in the relevant language and corresponding audio recordings and transcripts. As I have not developed my own method to do forced alignment, I do not detail the technical aspects of implementation. When forced alignment was necessary in this project, it was done with the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017).
Figure 3.1 is an example of the output generated by the MFA for the Cantonese corpus used in §4.2. (Readers knowledgeable in phonetics will notice that the boundaries identified by the MFA are not precise, but they have been sufficient for the purposes of the project, as shown in the case studies of Chapter 4.)

Figure 3.1: MFA output for Cantonese. The result is a Praat TextGrid (Boersma et al., 2002) that has syllables and sound segments aligned to audio.

Once the audio and transcriptions have been force-aligned, syllable-frames can be extracted from the generated timestamps in the same way as with human-annotated timestamps.

3.2.1.1 A word on syllable-frames

As stated at the end of §2.1.2.2, the choice of restricting training exemplars to syllable-frames was motivated solely by phonological theory (e.g. a tone unit on a tone tier is associated with a unit in the syllable; see §2.4). This restriction goes against conventions established in the tone classification literature, where classification improves when temporal spans larger than the syllable are considered (Qian et al., 2007; Surendran, 2007) due to known variation of tones in different environments (cf. Xu, 2001, 1997). The primary reason for the restriction is to align the current investigation of Emergent Phonology with standard assumptions of phonology, in this case, that tone is associated with some part of the syllable. This is also reasonable because classification accuracy is not a goal of this thesis.

3.2.2 Acoustic Parameters

The second component of data preprocessing in this project is acoustic parameter extraction. As outlined in §2.1.2.1, the acoustic parameters used are: fundamental frequency (f0), the differential of the fundamental frequency (d1), and syllable-frame duration.

To measure f0 from the speech signal, I combine f0 estimates (generated at 5 ms intervals) from both Google's REAPER software (https://github.com/google/REAPER) and Praat (Boersma et al., 2002). Both REAPER and Praat have been demonstrated to perform f0 estimation of high quality (Strömbergsson, 2016; Jouvet and Laprie, 2017), and they were chosen because they complement each other in terms of their implementations: REAPER identifies glottal pulses in the frequency domain, whereas Praat standardly does autocorrelational analysis in the time domain. REAPER uses dynamic programming to track f0 estimates by glottal pulse, minimizing the cost of a given trajectory through the f0-by-pulse space. Praat, alternatively, uses an advanced autocorrelational analysis (there are other methods for estimating f0 in Praat, but the autocorrelational analysis is the default). I take points of agreement between the two methods (within 10Hz) to be highly accurate f0 estimates. Syllable-frames that have mismatched f0 estimates are discarded from the dataset; the percentage of data lost because of f0 estimation mismatches is reported in the individual language case studies of Chapter 4. By discarding mismatched f0 estimates, it is largely assured that the autoencoder will receive a clean input signal to learn from. This ensures less noise in the learning process, but it does mean that other potentially useful information within syllable-frames is lost. For example, the mismatched f0 estimates may result from non-modal phonation, because f0 estimation is challenging in such contexts.
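For concreteness, the agreement check just described can be sketched as follows. The sketch assumes parallel arrays of Praat and REAPER estimates sampled at the same 5 ms intervals; how agreeing estimates are combined is not specified in the text, so averaging them is an assumption of the sketch.

```python
import numpy as np

def consensus_f0(f0_praat, f0_reaper, tol_hz=10.0):
    """Return consensus f0 estimates for one syllable-frame, or None if the
    frame must be discarded. Inputs are parallel arrays of estimates (Hz)
    sampled at the same 5 ms intervals by Praat and REAPER."""
    if np.any(np.abs(f0_praat - f0_reaper) > tol_hz):
        return None                      # mismatched frame: discard it
    return (f0_praat + f0_reaper) / 2.0  # assumed combination: the mean
```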
Non-modal phonation, however, is relevant for tone identification, so understanding the root cause of the f0 mismatches and their impact on the machine learning process is an important task; investigating such cases will likely be fruitful, but it is left for future work.

After f0 values are estimated for a syllable-frame, they are transformed in several ways. First, the f0 values are downsampled to 7 samples per syllable-frame. This is done by fitting a cubic polynomial function to the f0 estimates and selecting 7 evenly spaced points throughout the function. The choice of 7 samples was based on the results of Yu (2017), which reports the best performance of a tone classification model in Cantonese with 7 samples per syllable-frame. As Cantonese's tone inventory is considered to be quite complex (with 6 or 7 contrastive tones; see §4.2), it is assumed that this resolution will generalize to simpler tone inventories as well. After the syllable-frame f0 values are downsampled, they are log-transformed and, finally, z-scored on a per-talker basis (following the method of Yu (2011) and Rose (1987)). A simple log-transformation was used to model the nonlinearities of the human perceptual system (a metric like ERBs, a nonlinear transformation that better matches the behaviour of the basilar membrane, would arguably have been a better choice). Z-scoring is used as a means of normalization.

The requirement for normalization in this project has two sources, one linguistic and one computational. Linguistically, normalization is needed to "separate the Accentual and Linguistic content of the acoustic stimulus from the components determined by the individual speaker" (Rose, 1987, p. 343). Speakers' voices vary considerably on multiple dimensions (e.g. age, gender, emotional state, environment), but most are not relevant to linguistic functionality. For example, in the present case of tone discrimination, the absolute pitch of a speaker's voice is inconsequential; it is the relative pitch within a speaker's range, high or low, that is critical. A perceiver must, therefore, normalize a talker's absolute pitch into a relative pitch based on properties of that talker's voice. This ensures that perceptions of different speakers' voices are perceptually comparable.

Computationally, normalization is also needed so that the data fed into a machine learning model propagate through the learning model congruously. If the data were not normalized, extreme value differences in the training data (e.g. the f0 of a male voice versus the f0 of a child's voice) could result in the model learning nothing at all (or learning something unintended).

As previously stated, to normalize f0 values in this project, they are z-scored on a per-talker basis. Z-scoring is a statistical method to convert a value within a dataset to its corresponding distance from the mean of that dataset (i.e. a fraction of the standard deviation from the mean). The practical result of this is that each talker's f0 values become normalized as a proportion of the standard deviation around a mean of 0; this allows f0 values across talkers to be compared. Finally, the z-scores are further compressed linearly into the range [-1,1].
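The per-frame f0 transformation can be sketched as follows, ordering the operations as in Algorithm 1 below. The compression constant z_max is an assumption: the text states that z-scores are linearly compressed into [-1,1] but does not give the constant, so a dataset-wide maximum absolute z-score is assumed here.

```python
import numpy as np

def normalize_f0(times, f0_hz, talker_mean, talker_std, z_max):
    """Transform one syllable-frame's f0 estimates into 7 normalized samples.
    talker_mean and talker_std are the talker's log-f0 statistics; z_max is
    the assumed dataset-wide compression constant."""
    z = (np.log2(f0_hz) - talker_mean) / talker_std  # log-transform and z-score
    z = z / z_max                                    # linear compression into [-1, 1]
    coeffs = np.polyfit(times, z, deg=3)             # fit a cubic polynomial
    t7 = np.linspace(times[0], times[-1], 7)
    return np.polyval(coeffs, t7)                    # 7 evenly spaced samples
```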
Compressing the values into [-1,1] ensures that the autoencoder will be able to faithfully reconstruct the training data (i.e. the normalized f0 values) with the arc-tangent activation functions used in the model.

With the transformation of the f0 values complete, their differential can be calculated. Numerical differentiation is a process that calculates the rate of change (direction and magnitude) of a function by taking the numerical difference between adjacent points of its output. In the present case, the normalized, downsampled f0 values are the function that is differentiated. The result of differentiating the 7 f0 values is 6 values for d1. These d1 values track how the f0 contour changes over time (i.e. whether it rises, falls, or is level).

The last acoustic parameter calculated is the duration of the syllable-frame, measured in seconds and calculated from timestamps in the corpus data.

Ultimately, data preprocessing results in a 14x1 numerical vector that captures a meaningful, acoustic-phonetic representation of tone for a syllable-frame: the acoustic parameter set comprises 7 f0 values, 6 d1 values and 1 duration value.

3.2.2.1 Implementation

Algorithm 1 presents the pseudocode for the acoustic parameter extraction in data preprocessing. This algorithm applies after speech has already been separated into syllable-frames by talker. These steps result in a vector representation of the speech acoustics, which is subsequently input into the learning model.

Algorithm 1: Algorithm to extract acoustic parameters

for each speaker in corpus do
    x = []                               ▷ Initialize variable to store f0 data
    for each utterance do
        for each syllable-frame do
            a ← estimateF0()             ▷ Calculate f0 throughout syllable-frame
            b ← log2(a)                  ▷ Take logarithm of f0
            x ← [x; b]
        end for
    end for
    xmean ← mean(x)
    xstd ← std(x)
    for each utterance do
        for each syllable-frame do
            c ← estimateF0()             ▷ Calculate f0 throughout syllable-frame
            d ← log2(c)                  ▷ Take logarithm of f0
            e ← z-score(d, xmean, xstd)  ▷ z-score with all of the talker's f0 values
            f ← downsample(e, 7)
            g ← differentiate(f)
            h ← duration()
            save [f g h]
        end for
    end for
end for

Walking through the algorithm in prose, the acoustic parameter representation of tone per syllable-frame per talker is generated by: (1) calculating all f0 values of that talker; (2) calculating the mean and standard deviation of the log-f0 values; (3) re-traversing each syllable-frame for the talker and z-scoring the log-f0 values; (4) downsampling the z-scored log-f0 values to 7 values per syllable-frame; (5) calculating the rate of change of the downsampled, z-scored log-f0 values using numerical differentiation; (6) recording the duration of the syllable-frame; and (7) compiling a vector of the downsampled, z-scored log-f0 values, the rate of change of those values, and the duration.
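The first pass of Algorithm 1, pooling a talker's f0 estimates to obtain the statistics used for z-scoring, can be sketched as follows (the per-frame transformation itself was sketched above; the data here are placeholders):

```python
import numpy as np

def talker_stats(f0_by_frame):
    """First pass of Algorithm 1: pool every f0 estimate produced for one
    talker (a list of per-syllable-frame arrays, in Hz) and return the mean
    and standard deviation of the pooled log-f0 values."""
    pooled = np.log2(np.concatenate(f0_by_frame))
    return pooled.mean(), pooled.std()

# Illustrative use with placeholder data for one talker:
frames = [np.array([180.0, 185.0, 190.0]), np.array([210.0, 200.0, 195.0])]
mean_log_f0, std_log_f0 = talker_stats(frames)
```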
3.3 Dimensionality Reduction

Once the acoustic parameters for all syllable-frames in the corpus are calculated, they form a derived dataset from the original corpus. This dataset is subsequently used to train an autoencoder. The autoencoder learns to reduce the dimensionality of the acoustic parameters, which results in high-level, low-dimensionality abstractions of them (cf. Schuster et al., 2016; Le, 2013). The abstractions are more effectively clustered than the acoustic parameters themselves (Beyer et al., 1999; Nousi and Tefas, 2018); Appendix A provides further evidence of this by comparing clustering done on the acoustic parameters themselves to clustering of latent codes.

In this project, an autoencoder architecture known as an adversarial autoencoder is used (Makhzani et al., 2015). An adversarial autoencoder combines a vanilla autoencoder (§2.2.2.1) with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014). GANs pit two networks against each other during learning, hence the name adversarial. The first network, known as the generator, aims to generate good-quality fakes of some input data from a known (usually normal) distribution. The second network, known as the discriminator, aims to differentiate between fake data generated by the generator and real data from the dataset. As the network trains, the generator learns to produce better fakes, which makes it harder for the discriminator to distinguish fake data from real data. Simultaneously, the discriminator learns to discriminate between fake data and real data better, making it harder for the generator to 'trick' it with fake data. The objective of the training process is to have the generator generate fake data so well that the discriminator's final performance is at chance. When the network converges (i.e. finishes training), the system can generate very convincing (fake) data. The generative capabilities of this model have practical value for future applications such as speech synthesis, but those are not incorporated into this thesis.

In an adversarial autoencoder, the GAN is used to constrain the latent code learned by the vanilla autoencoder. To do this, samples drawn from a normal distribution via the generator (in this case, the generator simply generates normally distributed random numbers) are treated as 'real' data, and the discriminator tries to distinguish them from the 'fake' latent codes produced by the encoder. In contrast to a standard GAN, however, in an adversarial autoencoder the generator's weights are not updated to create more convincing fakes; rather, the encoder's weights are updated to create latent codes that look more similar to the output of the generator. The result of this is the generation of latent codes that are similar to the distribution of the generator (i.e. latent codes become more normally distributed). Although having normally distributed latent codes seems opposed to the goal of clustering (cf. Mukherjee et al., 2019), the results of the case studies in this thesis suggest a different story. As shown in Appendix A, the pressure to have normally distributed latent codes improves the tones hypothesized. One possible explanation for this is that the adversarial autoencoder is not only learning to generate latent codes with a normal distribution, it is also learning to minimize the reconstruction error of the autoencoder. The balancing of these two tasks may force the latent space to be used more efficiently, such that similar tone realizations are closer together in the latent space and thus clustered more effectively. Without the pull of a normal distribution, a vanilla autoencoder is free to use any part of the hidden space for any purpose, meaning similar tone realizations may be spread out (if that is how to best reduce error in the system), which makes interpretability of the latent space a challenge and may make clustering less effective.

3.3.1 Training an adversarial autoencoder

Figure 3.2 shows a diagram of the adversarial autoencoder used in this project, generated by Tensorboard (Mané et al., 2015).

Figure 3.2: A visualization of the Adversarial Autoencoder model generated using Tensorboard (Mané et al., 2015). The autoencoder, like that described in §2.2.2.1, is outlined in blue.
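The training procedure walked through below minimizes three loss functions. As a schematic rendering of them (a NumPy sketch with illustrative names, not the thesis TensorFlow code, which appears in Figure 3.3):

```python
import numpy as np

def aae_losses(x, x_hat, d_prior, d_latent, eps=1e-8):
    """Schematic adversarial autoencoder losses. x and x_hat are a batch of
    inputs and their reconstructions; d_prior and d_latent are the
    discriminator's output probabilities of 'real' for samples from the
    normal prior and for the encoder's latent codes, respectively."""
    recon = np.sqrt(np.mean(np.sum((x - x_hat) ** 2, axis=1)))             # autoencoder loss
    disc = -np.mean(np.log(d_prior + eps) + np.log(1.0 - d_latent + eps))  # discriminator loss
    gen = -np.mean(np.log(d_latent + eps))   # generator loss: updates only the
                                             # encoder, pulling codes toward the prior
    return recon, disc, gen

# Illustrative call with placeholder values:
x = np.random.rand(8, 14)
losses = aae_losses(x, x + 0.01, np.full(8, 0.7), np.full(8, 0.4))
```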
Three sections of Figure 3.2 are highlighted: the autoencoder, the generator and the discriminator. Training occurs in three simultaneous steps. In the first step, the autoencoder is trained as per normal: the input is fed through the network, compressed and reconstructed, and the reconstruction error is minimized. This step has the practical consequence of having the autoencoder learn how to optimally compress the higher-dimensional acoustic-parameter representation of tones from the data preprocessing stage. In the second step, the discriminator is trained to map the latent code from the autoencoder to a 'fake' label and the code produced by the generator (i.e. points sampled from a normal distribution) to a 'real' label. The practical consequence of this is to create a discriminator that treats normally distributed data as 'real.' In the third step, the discriminator is again fed the latent code, this time with the goal of generating a 'real' output label. This creates an error that is used to update the weights of only the encoder, so that latent codes are shifted to be normally distributed (crucially, the discriminator does not update its weights to accept latent codes as 'real').

In terms of technical implementation, an adversarial autoencoder minimizes three loss functions: the autoencoder loss, the discriminator loss and the generator loss. These are defined in the code snippet in Figure 3.3 on lines 204, 209 and 212 respectively. The autoencoder loss is the same as the reconstruction error: the root-mean of the sum-squared error between the reconstruction and the original. The discriminator loss is the cross-entropy of the discriminator mapping normally distributed data to 1 and latent codes to 0. Similarly, the generator loss is the cross-entropy of the discriminator mapping latent codes to 1 (with the aim of having the encoder change its weights to create latent codes that are normally distributed). As is seen on line 235, the overall objective of the learning model is to optimize the sum of the loss functions.

Figure 3.3: Adversarial Autoencoder loss functions.

3.3.2 Model Parameters

As with any machine learning model, there is an abundance of flexibility in model configuration. In the present model, the autoencoder has 5 layers with sizes [14 10 2 10 14]. The model was trained with a batch size of 64 data points, and an AdamOptimizer function was used with a learning rate of 0.001 and a beta of 0.9. Further, l2 regularization and arc-tangent activation functions are used within the autoencoder. The discriminator also uses l2 regularization, along with both sigmoidal and rectified linear unit activation functions. These parameters were selected empirically throughout prototyping of the method.

3.3.3 Evaluating training

After each learning batch, the model was evaluated on a test set of unseen data. Model performance was gauged using the reconstruction error on the test set. The reconstruction error for each data point in the test set was the root-mean of the sum-squared error between the reconstruction and the original data point. The error was averaged over all data points in the test set and recorded as a measure of the learning model's performance. After each batch, the error was compared to the model's performance at previous time steps.
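One way to implement such a stopping check is sketched below; the patience and threshold values are assumptions for illustration, as they are not reported in the text.

```python
def has_converged(test_errors, patience=5, min_delta=1e-4):
    """Assumed early-stopping heuristic: report convergence when the
    test-set reconstruction error has not improved by at least min_delta
    over the last `patience` evaluations."""
    if len(test_errors) <= patience:
        return False
    best_before = min(test_errors[:-patience])
    return min(test_errors[-patience:]) > best_before - min_delta
```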
The network was considered to have converged when there was minimal reduction of the error over several training epochs.

3.3.4 Reducing the dimensionality of acoustic parameters

Once the network converged, training was discontinued. The trained adversarial autoencoder was then taken as a wholesale function to reduce the dimensionality of the acoustic parameters generated in the data preprocessing stage. This resulted in a derived dataset comprising the latent codes of all data points from the training set.

3.4 Clustering

Once created, the dataset of latent codes was clustered using hierarchical clustering (Johnson, 1967). Hierarchical clustering groups data points based on inverse distance. Scikit-learn's agglomerative clustering algorithm (Pedregosa et al., 2011) is utilized in this project, with the distance function set to Euclidean distance and the linkage done using the Ward criterion. The steps of hierarchical clustering occur as follows:

1. the two most similar (least distant) points of a dataset are linked together to form a cluster
2. the next two most similar points or clusters are linked together to create a new or a larger cluster
3. step (2) is repeated until the final linkage between clusters has been made

After clustering is complete, the task becomes identifying the optimal number of clusters for the data. Ideally, this would be achieved by using evaluation metrics for clustering.

3.4.1 Evaluation Metrics

Determining the optimal number of clusters for a dataset is incredibly challenging. This is because clustering is entirely unsupervised: there are no labels to help distinguish whether one clustering is a better representation of the structure of a dataset than another. This is why, in his comprehensive survey of clustering techniques, Jain (2010) writes that "the validation of clustering structures is the most difficult and frustrating part of cluster analysis" (Jain, 2010, p. 222). Nonetheless, there are several heuristics that researchers can utilize to evaluate the overall fit of a clustering analysis.

One commonly used heuristic is achieved by visualizing the results of hierarchical clustering on a dendrogram. A dendrogram shows the distances needed to link two points/clusters on the y-axis and the points/clusters themselves, in arbitrary order, on the x-axis. For example, in the visualization of Figure 3.4, the distance between the final two clusters is just above 5 units of distance.

Figure 3.4: A dendrogram of linked data points/clusters. The upper image presents just the dendrogram. The lower image presents the same dendrogram with the longest distance cut, suggesting the optimal clustering for these data is four clusters.

Dendrograms are potentially useful for identifying the optimal number of tonal categories because each successive link has an associated distance. Thus, the distance needed to go from n clusters to n+1 clusters is quantifiable. Unfortunately, interpretation of a dendrogram is largely subjective. One technique to determine the optimal clustering on a dendrogram is to cut the longest distance and count the resultant number of clusters. An example of this is shown in Figure 3.4, where cutting the longest distance suggests the optimal number of clusters is four. The rationale for this heuristic is that the longest distance marks where clusters are least cohesive, and as such, those clusters should be kept separate.

Given that interpreting a dendrogram is subjective, most researchers rely on indices that combine intracluster cohesion and inter-cluster dispersion.
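Both the dendrogram heuristic and the indices discussed next are straightforward to compute with the tools already cited. The following sketch uses random points as a stand-in for the latent codes clustered in this project; the largest-gap rule used here is one common operationalization of "cutting the longest distance".

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

codes = np.random.rand(500, 2)      # placeholder for the 2-d latent codes

# Dendrogram heuristic: cut the tree where the linkage distance grows most.
Z = linkage(codes, method='ward')
gaps = np.diff(Z[:, 2])                              # growth in merge distance
n_optimal = len(codes) - (int(np.argmax(gaps)) + 1)  # clusters left after the cut
print('dendrogram cut suggests', n_optimal, 'clusters')

# Variance-based indices over a range of candidate cluster counts:
for k in range(2, 10):
    labels = AgglomerativeClustering(n_clusters=k, linkage='ward').fit_predict(codes)
    print(k,
          calinski_harabasz_score(codes, labels),  # higher is better
          davies_bouldin_score(codes, labels),     # lower is better
          silhouette_score(codes, labels))         # higher is better
```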
The reasoning behind such indices is that an optimal clustering for a dataset should have clusters that have little variance within themselves and are well separated from other clusters. Unfortunately, in real-world scenarios there is inevitably considerable overlap between clusters, and it is often challenging to find an ideal balance between intracluster and inter-cluster variance. Consequently, these indices are again more of a heuristic for identifying the optimal clustering for some data. In this thesis, I use three common evaluation metrics for optimal clustering (with unlabeled data): the Davies-Bouldin Index (Davies and Bouldin, 1979), the Calinski-Harabasz Index (Caliński and Harabasz, 1974), and the Silhouette Index (Rousseeuw, 1987). The Davies-Bouldin Index evaluates a clustering by comparing each cluster's intracluster scatter to its similarity with other clusters. The Calinski-Harabasz Index evaluates the clusters by comparing inter-cluster dispersion and intracluster dispersion. The Silhouette Index evaluates clusters by calculating how similar each data point is to the data points within its cluster relative to the data points in other clusters. As I am neither a clustering expert nor a statistician, these metrics are largely taken off the shelf, given their implementations in Scikit-learn (Pedregosa et al., 2011). Importantly, higher values for the Calinski-Harabasz Index and Silhouette Index indicate better clustering, while lower values for the Davies-Bouldin Index indicate better clustering. Again, however, the unsupervised nature of clustering means these metrics may not accurately reflect the optimal clustering for a given dataset.

3.4.2 Visualizing Results

Once the data are clustered (regardless of whether an optimal clustering can be determined), each cluster can be visualized as an f0 contour. This is achieved by reconstructing data points from a cluster in the latent space via the adversarial autoencoder. Specifically, the decoding portion of the autoencoder (i.e. the decoder) can be used to generate a reconstruction from each data point in the latent space. The decodings are of the same type as the input into the autoencoder: vectors of 14 values (7 f0 values, 6 d1 values and 1 duration value). The f0 and duration values for a set of latent codes within a cluster can be averaged (i.e. a centroid of the cluster) and then plotted, producing a graph like that shown in Figure 3.5.

Figure 3.5: Hypothesized Tones

This graph presents the prototypical f0 contours of two clusters identified in the latent space. The x-axis shows time and the y-axis shows normalized frequency. This graph could be interpreted as a tone inventory comprising two tones that contrast in level (e.g. high and low). The majority of results presented in Chapter 4 are prototypical f0 contours that correspond to clusters identified in the latent space of the unsupervised learning model. As figures like this can be generated for clusterings of any number, hypothesized tone inventories for a language can be generated with an arbitrary number of tones. Thus, we can compare the tone inventories hypothesized by the method with the standard tone analysis of a language even if the evaluation metrics do not provide a clear indication of the optimal number of clusters.
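On one reading of this procedure (averaging in the latent space and decoding the centroid, as anticipated in §2.2.2.2), a prototypical tone can be produced as follows; `decoder` is assumed to stand for the trained decoder half of the adversarial autoencoder, and the stand-in used below is only for illustration.

```python
import numpy as np

def prototype_tone(cluster_codes, decoder):
    """Reconstruct one cluster's prototypical tone: average the cluster's
    latent codes into a centroid, decode it back to the 14-d acoustic
    parameter vector, and split that vector into its components."""
    centroid = np.mean(cluster_codes, axis=0)
    recon = decoder(centroid)                 # 14 values: 7 f0, 6 d1, 1 duration
    f0, d1, duration = recon[:7], recon[7:13], recon[13]
    # Plot f0 against np.linspace(0, duration, 7) to draw the contour.
    return f0, d1, duration

# Example with a stand-in linear decoder (a real decoder would be the
# trained network's decoding portion):
demo_decoder = lambda c: np.tile(c.sum(), 14)
f0, d1, dur = prototype_tone(np.random.rand(50, 2), demo_decoder)
```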
3.5 Chapter Summary

This chapter has detailed the method developed for the current investigation. Specifically, it has outlined the necessary data preprocessing and the implementation of the unsupervised machine learning model used to determine whether phonological units emerge from phonetic data. The following chapter reports the results of this method applied to several languages: Mandarin, Cantonese, Fungwa and English.

Chapter 4

Case Studies

This chapter reports the results of the method outlined in Chapter 3 (henceforth, the method) as applied to several languages. The chapter is organized as a series of case studies, with each case corresponding to a language that the method was applied to. The primary goal of each case study is to see whether the method hypothesizes a tone inventory that aligns with the tone inventory standardly reported for that language in the literature. If the hypothesized tone inventory matches the standard inventory, I consider it a clear demonstration of machine learning's usefulness as an analysis tool for linguists, as the learning system will have automatically derived a practical linguistic analysis. Additionally, such a result is argued to be evidence for the theory of Emergent Phonology, as tones (i.e. a part of phonology) will have emerged from parameterized acoustic data (i.e. phonetic experiences).

Each case study introduces a language and its tone inventory, then provides a brief motivation for the language's inclusion in the present project, and finally reports the results of the method for that language. The results are presented as a series of progressively larger (in number) hypothesized tone inventories for the language. The size of a hypothesized tone inventory corresponds to a preset number of clusters for a given clustering implementation. It is the role of the clustering evaluation metrics to determine the optimal number of clusters. Considering the tone inventories as they increase in number of tones has practical value. For instance, if the evaluation metrics fail to identify a single optimal clustering, each progressively larger tone inventory can still be compared to the standard tone analysis of the language. Additionally, there may be relationships between tones in a language, such as in-progress tone mergers (e.g. Yu, 2007), that manifest in the progression from fewer to more hypothesized tones.

Finally, following the discussion of visualizations provided in §3.4.2, each cluster is visualized as an f0 contour with a defined duration. Each f0 contour corresponds to a cluster identified in the latent space and is therefore the hypothesized tone corresponding to that cluster. As the method is ignorant of word meaning, the hypothesized tones should be thought of as distinct acoustic patterns in an acoustic-phonetic space and should not be thought of as contrastive in the phonological sense. There are four case studies in total, one each for Mandarin, Cantonese, Fungwa and English.

4.1 Case Study I: Mandarin

4.1.1 Mandarin Tones

Mandarin, also known as Standard Chinese, is a Sino-Tibetan language with approximately 1.3 billion first-language speakers. There are four contrastive tones in Mandarin: high-level, rising, falling-rising and falling (Chao, 1965; Xu, 1997).
An example of these contrasts is presented in Table 4.1.

Character             媽          麻         馬           罵
Pinyin                mā          má         mǎ           mà
Gloss                 "mother"    "hemp"     "horse"      "scold"
Pitch levels          5-5         3-5        2-1-4        5-1
Pitch pattern (IPA)   ˥           ˧˥         ˨˩˦          ˥˩
Description           high        rising     fall-rise    falling
Tone Label            Tone 1      Tone 2     Tone 3       Tone 4

Table 4.1: Mandarin tones, adapted from Xu (1997)

For Tones 1, 2 and 4, there is a largely felicitous match between phonetic realizations and their corresponding phonological analysis (e.g. using pitch level representations of 5-5, 3-5 and 5-1) (Duanmu, 2007). This is not the case, however, for Tone 3, which has several notable variants. First, Tone 3 is often (but not necessarily) realized with creaky voice quality (Gårding et al., 1986). Additionally, Tone 3 surfaces phonetically with a low-falling pitch in non-utterance-final positions. This variant is better described using pitch levels 2-1 (Xu, 1997). As the speech used in this case study is continuous, the majority of Tone 3s will be non-utterance-final. This, in turn, means that the majority of Tone 3s will occur as the low-falling variant in this dataset.

In addition to the four contrastive tones of Mandarin, a fifth neutral tone that occurs in unstressed syllables is regularly posited. The neutral tone generally surfaces on discourse particles or the second syllable of a reduplicant. For example, the discourse particle de (/də/) and the second syllable in 媽媽 (/ma55.ma/, 'mother') are analyzed as neutral tones. While the neutral tone does not have a prescribed pitch (i.e. pitch is not the most informative cue for the neutral tone (Surendran, 2007)), it generally occurs in the middle of a speaker's pitch range (i.e. lower than 5-5, higher than 2-1) and varies depending on the previous tone (Wang, 2004; Linge, 2015).

In order to practically evaluate the success of the method in Mandarin, a 'correct' tone inventory needs to be assumed. There are multiple ways to achieve this, such as using citation-form f0 contours of a single talker, using average f0 contours of a set of tones for a single talker, or using normalized, average f0 contours of a group of talkers. What is more, any 'correct' inventory will undoubtedly vary depending on the circumstances in which tones are elicited. To generate the most directly relevant 'correct' inventory for this case study, I have used the normalized, mean f0 contours of all talkers in the corpus of this case study, as marked with ground-truth labels. This ensures that the results of the method are appropriately evaluated. Figure 4.1 presents this visualization.

Figure 4.1: Mean f0 contours of the four contrastive Mandarin tones and the neutral tone. These contours were derived from the corpus of this case study using the ground-truth tone labels.

For my purpose, the tones of Figure 4.1 are assumed to be the 'correct' tones of Mandarin in the acoustic-phonetic space that corresponds to phonologically contrastive tones. By comparing the tones hypothesized by the method to the inventory in Figure 4.1, we can assess the performance of the method. As stated at the beginning of the chapter, if there is a match, it will have been shown that unsupervised machine learning can both (1) provide an analysis of a language that is similar to that of a human linguist, and (2) provide empirical support for Emergent Phonology.

4.1.2 Motivation for inclusion

Mandarin was chosen as the initial case study for the method for several reasons. First, it is a widely known tone language. Second, it is a language with prescriptively standardized pronunciation (Duanmu, 2007).
Given this, the assumed 'correct' tone inventory in Mandarin is well motivated in both the number of tones and their respective shapes. Finally, there is an abundance of data for Mandarin. In combination, these points make Mandarin an ideal choice for a proof-of-concept demonstration of the method.

4.1.3 Corpus Data

The Mandarin speech data used in this case study are from the Mandarin Chinese Phonetic Segmentation and Tone corpus (Yuan et al., 2015). This corpus comprises 7849 utterances derived from the Mandarin Broadcast News Speech corpus (Huang et al., 1998), totalling approximately 30 hours of continuous news-broadcaster read speech. The corpus contains 20 speakers, 13 male and 7 female. All recordings are single channel with a sampling rate of 16kHz.

4.1.4 Data Preprocessing

The Mandarin Chinese Phonetic Segmentation and Tone corpus (MCPST) contains phonetically aligned transcripts of syllable onsets and syllable rimes (known as initials and finals in Pinyin). The alignment was done using the LDC Forced Aligner (Yuan et al., 2014). Given the time labels, the only data preprocessing needed was the extraction and normalization of the acoustic parameters outlined in §3.2. For this dataset, duration was calculated over syllable rimes. As previously stated, utterances with f0 estimations that did not match between Praat and REAPER (within 10Hz) were discarded. In this corpus, that amounted to ≈15% of f0 samples being discarded (as f0 samples were discarded prior to aligning with text, I cannot report distributional information on which tone tokens were discarded). Thereafter, the corpus was divided randomly into 90% training data and 10% testing data. Once broken down into syllable-frame audio chunks, the training data consisted of 71260 syllable-frames with a tone category distribution of 16921 T1, 17621 T2, 10255 T3, 23989 T4 and 2474 neutral. While the method remains naive to labels, the distribution of ground-truth tones in an identified cluster can be calculated; these distributions are reported in Appendix A. Such visualizations are left for the Appendix because classification is not the present goal of this project.

4.1.5 Results

4.1.5.1 Adversarial Autoencoder Performance

The adversarial autoencoder was trained using two NVIDIA GTX 1070s, and training took approximately 3 hours. The adversarial autoencoder converged after ≈180 epochs through the dataset. This is seen in the reduction of root-mean-squared error on the testing set, shown in Figure 4.2.

Figure 4.2: Adversarial autoencoder convergence for the Mandarin corpus data, as shown in the reduction of reconstruction error on the test set.

A visual assessment of the model is also possible by comparing ground-truth f0 contours with those reconstructed from the autoencoder. Example reconstructions are shown in Figure 4.3.

Figure 4.3: Reconstructed Mandarin tones represented as normalized f0 contours. The top row shows ground-truth exemplars; the bottom row shows corresponding reconstructions.

The f0 contours are notably smoothed in the reconstructions. This is likely because a smooth line more optimally fits large batches of ground-truth data that have both positive and negative perturbations.

4.1.5.2 Hypothesized Tone Inventories

Once the adversarial autoencoder converged, each data point (training and testing data) was passed through the adversarial autoencoder to generate its corresponding 2-dimensional latent code (§3.3). All latent codes were then compiled and clustered using hierarchical clustering (§3.4).
Clustering was done for preset numbers of clusters ranging from two to nine. This range was chosen given hindsight of the results of the clustering metrics. By comparing tone inventories of varying sizes, the hope is that the clustering evaluation metrics can determine which clustering is ideal.

The hypothesized tone inventories of Mandarin are presented in three sets. The sets differ in terms of their visualization, but they are derived from the same data. In the first set, Figure 4.4, hypothesized tones are shown side by side with error bars corresponding to the range of variability seen in the cluster that is visualized as that tone.

Figure 4.4: Hypothesized tones as generated by the method for Mandarin, visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones (two through nine). Error bars represent variability around the median f0 values.

In the second set, shown in Figure 4.5, all reconstructions corresponding to a cluster identified in the latent space are visualized. Each frame provides a visualization of the variability seen in its associated cluster in Figure 4.4. Keeping with the discussion in §3.4.2, each line is a reconstructed f0 contour.

Figure 4.5: Visualization of the variability of each tone cluster identified by the method for Mandarin for a preset number of tones (two through nine). Each plotted line corresponds to an f0 contour within the identified cluster.

While visualizing the variability of an identified tone is useful, it is simplest to consider the mean of each cluster (i.e. a hypothesized tone) overlaid on a single graph (for a given number of clusters). Such visualizations are likely more familiar to researchers who work on tone languages, as they provide snapshots of the tone inventory of the language. The assumed tone inventory of Mandarin, shown in Figure 4.1, is an example of such a visualization. Figures 4.6-4.8 present mean f0 contours from the data presented in Figure 4.5, with mean duration also incorporated. That said, duration does not appear to vary significantly. This is likely because it is an average across a large dataset. One possible way to avoid this in the future could be to normalize syllable-frame duration with respect to speech rate, but that is left for future refinements of the method.

Figure 4.6 presents the results for Mandarin as analyzed with two and three tones. The lefthand graph presents the result of the method for Mandarin as analyzed with two acoustically distinct tones; the tone inventory consists of a high-level tone and a low-level tone. The righthand graph presents the results for Mandarin as analyzed with three tones; this tone inventory could be interpreted as comprising three level tones (high, mid, low), or perhaps a low tone and two diverging high contour tones.

Figure 4.6: Hypothesized tones for Mandarin with an inventory comprising two tones (left) and three tones (right).

The analysis of Mandarin with four and five tones is presented in Figure 4.7.
The results of the method for four tones include a high-level, a low-level, a rising and a falling tone. With five tones, there are a high-level, a mid-level, a low-level, a falling and a rising tone. Impressionistically, these results match well with the Mandarin tones assumed in Figure 4.1. The assumed tones of Mandarin also comprise a high-level (Tone 1), a mid-level (neutral), a low-level (Tone 3), a falling (Tone 4) and a rising tone (Tone 2).

Figure 4.7: Hypothesized tones for Mandarin with an inventory comprising four tones (left) and five tones (right).

Moving to six, seven, eight and nine tones, there is a consistent pattern of additional variants of level tones being identified (shown in Figure 4.8). With six tones, the mid-level tone from the five-tone analysis appears to have separated into two mid-level tones, while the other tones remain largely unchanged. Similarly, with seven tones, the high-level tone from the six-tone analysis appears to have separated into two high-level tones. With eight tones, the low-mid level tone from the seven-tone analysis appears to have separated into a low-mid falling tone and a low-mid level tone. Finally, with nine tones, the high-mid level tone from the seven-tone analysis appears to have separated into two variants.

Figure 4.8: Hypothesized tones for Mandarin with an inventory comprising six through nine tones.

4.1.5.3 Cluster Evaluation

While the tone inventory visualizations above are interesting, a goal of this work is to identify the correct number of tones for a given language without supervision or ascribing to prior knowledge. Unfortunately, the clustering metrics do not provide a consistent answer. As mentioned in §3.4.1, identifying the optimal number of clusters is challenging given the unsupervised nature of the task. Still, the application of metrics as heuristics may provide some value to researchers.

The evaluation metrics are presented in two parts. In the first, Figure 4.9, the dendrogram is analyzed via the heuristic method described in §3.4.1 (which is briefly recapitulated here).

Figure 4.9: Dendrogram evaluation of Mandarin tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises five tones.

To identify the correct number of clusters in a dendrogram, a cut is made at the longest distance needed to join two clusters. The cut is shown by the black line in Figure 4.9. The dendrogram for Mandarin indicates that the optimal number of clusters is five.

The second set of metrics, shown in Figure 4.10, compares within- and between-cluster variance (as outlined in §3.4.1).

Figure 4.10: Variance evaluations of Mandarin tone clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.

As previously stated, a high value is optimal for the CH-Index and the Silhouette Index, and a low value is optimal for the DB-Index. Thus, the CH-Index, DB-Index and Silhouette Index all indicate that the optimal number of clusters (i.e. tones) for Mandarin is two. Given the assumed inventory for Mandarin in Figure 4.1, the optimal number of clusters for Mandarin should be five. In this case study, then, the dendrogram appears to be the only evaluation metric to have identified the correct number of tones. That said, if one were to follow Gussenhoven et al.
(2004)'s analysis of Mandarin, which consists of only two tones (H and L), it appears the variance-based evaluation metrics were the ones to identify the correct number of tones. Regardless, the dendrogram evaluation does not match the variance-based evaluations, so the picture is not entirely clear here.

4.1.6 Discussion

The goal of this project is to demonstrate that machine learning is a useful analysis tool for theoretical linguists. To do this, I aimed to support the theory of Emergent Phonology by demonstrating that lexical tones emerge from the acoustic parameterization of speech in the language (§1.3). Despite the evaluation metrics not providing a consistent result, this goal has largely been achieved, because the method did hypothesize a tone inventory that is consistent with the standard tone analysis of Mandarin. This is shown in Figure 4.11.

Figure 4.11 presents a summary of the method for Mandarin, comprising four visualized tone inventories. The first inventory is the assumed correct inventory for Mandarin, which was generated using ground-truth labels for the corpus data. The subsequent three are all hypothesized tone inventories generated by the method. In order from left to right, the first hypothesized inventory matches the prescribed number of tones for Mandarin, five. Next, the hypothesized inventory identified as optimal by the variance-based metrics is presented. Finally, the hypothesized inventory identified as optimal by the dendrogram is presented.

Figure 4.11: A comparison of the standard analysis of Mandarin tones (left; five tones) with hypothesized tone inventories generated by the method. The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language (five); (b) the optimal number of tones as determined by the variance metrics (two); and (c) the optimal number of tones as determined by the dendrogram (five).

This result demonstrates that the correct tone inventory of Mandarin can arise solely from considering the surface phonetics of the language, and thus provides empirical support for the theory of Emergent Phonology (the neutral tone is located in a different location, but that is unsurprising given that duration is a better indicator of the neutral tone in Mandarin than f0 (Surendran, 2007)). That said, the disagreement among the evaluation metrics in determining the optimal number of tones for Mandarin means the method has not achieved the supplementary goal of this thesis: to create a grammaticus ex machina, a linguist (grammarian) from the machine. This may, however, be an understandable failure, because the method is naive to word meaning, while human linguists motivate their analyses by phonological contrasts. A larger discussion of this is provided in §5.4.

4.2 Case Study II: Cantonese

4.2.1 Cantonese Tones

Cantonese is a Sino-Tibetan language spoken primarily in Hong Kong, Macau, GuangDong and GuangXi. Cantonese encompasses several mutually intelligible dialects that share similar phonology and morphosyntax (Matthews and Yip, 2013).
The standard analysis of Cantonese contains six contrastive tones, shown in Table 4.2.

Table 4.2: Cantonese tones, adapted from Lam et al. (2016).

Character:     詩          史           試          時         市          是
Jyutping:      si1         si2          si3         si4        si5         si6
Gloss:         "poetry"    "history"    "to try"    "time"     "market"    "to be right"
Pitch levels:  5-5         3-5          3-3         2-1        1-3         2-2
Pitch (IPA):   ˥˥/˥˩       ˧˥           ˧˧          ˨˩         ˩˧          ˨˨
Description:   high-level  high-rise    mid-level   low-fall   low-rise    low-level
Tone label:    Tone 1      Tone 2       Tone 3      Tone 4     Tone 5      Tone 6

The tone inventory of Cantonese is more complex than that of Mandarin. In particular, Cantonese has tones that contrast both in terms of level (T1-T3-T6; T2-T5) and in terms of contour shape (e.g. T1-T2). In addition to the six standard tones of Cantonese, there are another three checked tones that correspond to level tones (T1, T3, T6) and occur in syllables that end with a stop (Yu, 2007). Given the acoustic parameters used in this thesis, checked tones would primarily be marked by a shortened duration of the vocalic portion of a syllable. However, as the syllable-frame used for Cantonese is the entire syllable,² the duration of the vocalic portion of the syllable is not reliable and checked tones are not distinguished from their non-checked counterparts. There is, however, an additional tonal variant of Tone 1 that is assumed in this project. The reason for this is that the data (Adrus et al., 2016) used in this case study comprise recordings of speakers of GuangZhou Cantonese, and there are two variants of Tone 1 reported in this dialect, a high-level and a high-falling tone (Ou, 2012). Shi (2004) reports that younger generations produce both variants in free variation, which means that the high-falling variant will be present in the data used herein. Henceforth, Tone 1a will refer to the high-level variant of Tone 1, and Tone 1b will refer to the high-falling variant.

2 The reason for this is to avoid inaccuracies of individual segment boundaries that occurred during the forced alignment process.

As with Mandarin, in order to practically evaluate the success of the method in Cantonese, a 'correct' tone inventory needs to be assumed. Figure 4.12 presents mean f0 contours for each tone, generated using the ground-truth labels of the corpus, except for the high-falling variant of Tone 1. The high-falling variant of Tone 1 was not annotated in the corpus data, so it was added manually (an average of raw exemplars manually extracted from the corpus) to this visualization. As with the Mandarin f0 contours, these f0 contours have been normalized for duration.

Figure 4.12: Mean f0 contours of the six contrastive Cantonese tones and the high-falling variant of Tone 1. These contours were derived from the corpus of this case study using the ground-truth tone labels. Note: the high-falling variant of Tone 1 was added manually (an average of raw exemplars manually extracted from the corpus) because it was not annotated in the corpus data.
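Duration normalization of this kind can be done by resampling every contour onto a common time base before averaging. The sketch below is one minimal way to do so; the use of linear interpolation and the choice of 30 samples per contour are illustrative assumptions on my part, not necessarily the procedure of §3.2.

```python
import numpy as np

def normalize_duration(f0: np.ndarray, n_points: int = 30) -> np.ndarray:
    """Resample an f0 contour to a fixed number of samples so that
    contours of different raw durations can be averaged and compared."""
    old_time = np.linspace(0.0, 1.0, num=len(f0))
    new_time = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(new_time, old_time, f0)

# e.g., averaging two tokens of one ground-truth tone category
tokens = [np.array([110.0, 115.0, 122.0, 130.0]),
          np.array([112.0, 118.0, 125.0, 133.0, 138.0])]
mean_contour = np.mean([normalize_duration(t) for t in tokens], axis=0)
```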
4.2.1.1 Cantonese Tone Mergers

In tone languages, the contrast between two acoustically similar tones may be obscured diachronically (with perceivers relying more on contextual information than on the acoustic realization of the tone itself). When this occurs, it is called a tone merger. The process of two tones merging is not discrete and often occurs gradually over successive generations of new speakers. In Cantonese, several tone mergers are considered to be underway (Mok and Wong, 2010; Mok et al., 2013; Lam, 2018; Ou, 2012; Bauer et al., 2003). For GuangZhou Cantonese specifically, Ou (2012) identifies mergers for T3-T6 and T4-T6, and also notes perceptual confusion between T2 and T5. While it is unclear what effects these mergers will have on the method, it is likely that they will have some influence on the hypothesized tone inventories.

4.2.2 Motivation for Inclusion

Cantonese was selected as a language for this project for several reasons. First, it is a well-known tone language that has been studied extensively by the linguistic community (Matthews and Yip, 2013; Bauer and Benedict, 2011; Yip, 2002; Silverman, 1992). Second, the tone inventory of Cantonese is more complex than that of Mandarin, so it provides a more challenging test of the method's ability to hypothesize accurate tone inventories. Third, there is an abundance of data available for Cantonese (Adrus et al., 2016; Leung and Law, 2001; Lee et al., 2002). And finally, applying the method to Cantonese provides a unique opportunity to test whether it can provide evidence of the in-progress tone mergers of Cantonese.

4.2.3 Corpus Data

The Cantonese speech data used in this thesis are from the IARPA Babel Cantonese Language Pack (Adrus et al., 2016). It comprises 215 hours of conversational and scripted speech. The dataset contains talkers from GuangDong and GuangXi. All audio was recorded with a sampling rate of 8kHz. Only the scripted conversations were used in this work.

4.2.3.1 Data Preprocessing

The IARPA corpus contains audio recordings and transcriptions. There were no time markings in the transcripts. The scripted dataset consists of 16243 utterances, ranging in length from 1 syllable to 29 syllables. In order to extract syllable-frame chunks of audio, the transcriptions were force-aligned to the audio using the Montreal Forced Aligner (McAuliffe et al., 2017) and spot-checked for accuracy. The acoustic speech sound models of the aligner were trained simultaneously on the dataset itself. Before performing the forced alignment, the audio was resampled to 16kHz as required by the MFA. Thereafter, the acoustic parameters were estimated and normalized in accordance with §3.2. As previously stated, utterances with f0 estimations that did not match between Praat and REAPER (within 10Hz) were discarded. In this corpus, that amounted to ≈ 22% of f0 samples being discarded (as f0 samples were discarded prior to aligning with text, I cannot report distributional information on what tone tokens were discarded). The syllable-frame for the Cantonese data was each whole syllable itself. Once broken down into syllable-frame audio chunks, the training data consisted of 68375 syllable-frames with a tone category distribution of 19905 T1,³ 7732 T2, 11015 T3, 9605 T4, 5061 T5, and 15057 T6. The corpus was then divided randomly into 90% training data and 10% testing data. The model was naive to tone category labels.

3 The corpus does not annotate T1a and T1b separately.
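The cross-tracker agreement check used in preprocessing lends itself to a simple vectorized filter. The sketch below assumes the Praat and REAPER tracks have already been sampled at matching time points, and it arbitrarily retains the Praat estimate for agreeing samples; both choices are assumptions rather than details taken from the thesis.

```python
import numpy as np

def keep_agreeing_f0(praat_f0: np.ndarray, reaper_f0: np.ndarray,
                     tol_hz: float = 10.0) -> np.ndarray:
    """Discard f0 samples where the two trackers disagree by more than
    tol_hz; agreement serves as a proxy for tracking reliability."""
    mask = np.abs(praat_f0 - reaper_f0) <= tol_hz
    return np.where(mask, praat_f0, np.nan)   # NaN marks discarded samples

praat = np.array([120.0, 180.5, 95.0, 210.0])
reaper = np.array([121.0, 160.0, 96.5, 212.0])
print(keep_agreeing_f0(praat, reaper))  # second sample dropped (>10 Hz apart)
```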
4.2.4 Results

4.2.4.1 Adversarial Autoencoder Performance

The adversarial autoencoder was trained using two NVIDIA GTX 1070s and training took approximately 3 hours. The adversarial autoencoder converged after ≈ 150 epochs through the dataset. This is seen in the reduction of the root-mean-squared error on the testing dataset shown in Figure 4.13.

Figure 4.13: Adversarial autoencoder convergence for the Cantonese corpus data as shown in the reduction of reconstruction error on the test set.

A visual assessment of the quality of the trained adversarial autoencoder can be achieved by comparing original f0 contours with their corresponding reconstructions. Several reconstructed examples are shown in Figure 4.14. Like the reconstructions of the adversarial autoencoder trained on Mandarin, there is smoothing seen in the reconstructions.

Figure 4.14: Reconstructed Cantonese f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.

4.2.4.2 Hypothesized Tone Inventories

Once the adversarial autoencoder converged, each data point (training and testing) was passed through the system to generate its corresponding 2-dimensional latent code (§3.3). The latent codes were compiled and then clustered using hierarchical clustering (§3.4).
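This compress–cluster–decode loop can be summarized in a few lines. In the sketch below, the trained encoder and decoder are replaced by random linear maps (and the 31-dimensional acoustic parameterization is likewise a stand-in) purely so the example runs end to end; only the overall shape of the pipeline is meant to mirror the method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Stand-ins for the trained encoder/decoder halves of the adversarial
# autoencoder (hypothetical: random linear maps keep the sketch runnable).
W = rng.normal(size=(31, 2))
encode = lambda x: x @ W          # (n, 31 acoustic params) -> (n, 2) codes
decode = lambda z: z @ W.T        # (n, 2) -> (n, 31) acoustic space

frames = rng.normal(size=(1000, 31))   # stand-in syllable-frame parameters
codes = encode(frames)                 # 2-D latent codes
Z = linkage(codes, method="ward")
labels = fcluster(Z, t=7, criterion="maxclust")   # preset number of tones

# Decode each cluster centroid back into the acoustic space; the decoded
# centroids are the hypothesized tones that get visualized.
centroids = np.stack([codes[labels == k].mean(axis=0) for k in range(1, 8)])
hypothesized_tones = decode(centroids)   # one contour-like row per tone
```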
The hypothesized tone inventories of the method for Cantonese are also presented in three sets. In the first set, Figure 4.15, hypothesized tones are shown side by side with error bars corresponding to the range of variability seen in the cluster that is visualized as that tone. Clustering was done for preset numbers of clusters ranging from two to eleven. This range was chosen given hindsight of the results of the evaluation metrics for Cantonese.

Figure 4.15: Hypothesized tones as generated by the method for Cantonese (two through eleven tones), visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.

In the second set, Figure 4.16, all reconstructions corresponding to a cluster identified in the latent space are visualized. Each frame provides a visualization of the variability seen in its associated cluster.

Figure 4.16: Visualization of the variability of each tone cluster identified by the method for Cantonese for a preset number of tones (two through eleven). Each plotted line corresponds to an f0 contour within the identified cluster.

The hypothesized tone inventories for Cantonese are now considered as mean f0 contours overlaid on a single graph. Figure 4.17 presents the hypothesized tone inventories with two and three tones. Like Mandarin, the analysis of Cantonese with two tones consists of a high-level and a low-level tone. With three tones, the method hypothesizes a tone inventory comprising a high-level, a low-level, and a rising tone.

Figure 4.17: Hypothesized tones for Cantonese with an inventory comprising two tones (left) and three tones (right).

With four tones, the method hypothesizes a tone inventory consisting of a high-level, a mid-level, a low-level and a rising tone. With five tones, the inventory maintains four tones that are very similar to those of the four-tone analysis and adds a falling tone.

Figure 4.18: Hypothesized tones for Cantonese with an inventory comprising four tones (left) and five tones (right).

With six tones, the low-level tone from the five-tone analysis appears to have separated into a low-falling and a low-level tone while the other four tones remain largely unchanged. Moving to seven tones, the rising tone from the six-tone analysis has separated into two distinct rising contours, a low-rising and a high-rising tone. Impressionistically, the tone inventory with seven tones appears quite similar to the assumed tones of this dialect shown in Figure 4.12. The assumed tones also comprise three level tones, a low-falling tone and two rising tones.

Figure 4.19: Hypothesized tones for Cantonese with an inventory comprising six tones (left) and seven tones (right).

With eight tones, the mid-level tone from the seven-tone analysis appears to have separated into a slightly falling mid-level tone and a level mid-level tone with the other tones remaining unchanged. With nine tones, a very low level tone is added to the hypothesized tone inventory. With ten tones, the low-falling tone from the nine-tone analysis appears to have separated into two variants of low-falling tones. Finally, with eleven tones, the high tone from the previous ten-tone analysis appears to have separated into two high-level tones while the other tones remain largely unchanged.

Figure 4.20: Hypothesized tones for Cantonese with an inventory comprising eight through eleven tones.

4.2.4.3 Cluster Evaluation

The series of hypothesized tone inventories for Cantonese presented above provides an interesting view of how the Cantonese tone inventory may be structured, but again one of the goals of this work is to identify the correct number of tones. Unfortunately, as with Mandarin, the metrics do not provide a consistent answer. The dendrogram, shown in Figure 4.21, indicates the optimal number of tones for Cantonese is ten.

Figure 4.21: Dendrogram evaluation of Cantonese tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises ten tones.

Figure 4.22 presents the results of the cluster evaluation metrics that balance within- and between-cluster variance. Again, a high value for the CH-Index and Silhouette Index indicates an optimal clustering, and a low value for the DB-Index indicates an optimal clustering.

Figure 4.22: Variance evaluations of Cantonese tone clusterings. The CH-Index (left) indicates the optimal number of clusters is nine (although five is quite close); the DB-Index (centre) indicates five; and the Silhouette Index (right) also indicates five.

For Cantonese, the CH-Index indicates that the optimal clustering would be nine (although five is quite close); the DB-Index indicates that the optimal clustering is five; finally, the Silhouette Index indicates the optimal clustering is five.

Given the tones assumed for this variety of Cantonese, presented in Figure 4.12, the assumed number of tones should be seven. None of the evaluation metrics has indicated such a clustering. However, there may be sensible explanations for the results of the evaluation metrics.
As there are three checked tones that were not included in the assumed inventory of Cantonese, if checked tones were added, the correct number of tones would be ten, which matches the dendrogram result. That said, I am hesitant to accept this explanation given the messiness of the hypothesized tone inventory that comprises ten tones.

Additionally, the variance-based metrics indicate five is the optimal number of tones, which aligns well with what Maddieson (1978) assumes to be the number of pitch levels in a language with a complex tonal system (which Cantonese does have). However, given that the hypothesized tone inventory comprising five tones is not composed of five level tones, it is hard to accept this specific justification. There is, however, an alternative to assuming that the five tones would need to be pitch levels. If the tones of the five-tone analysis are taken as autosegments, the inventory comprises high (H), mid (M), low (L), rising (R), and falling (F). These five tones could, in combination, elegantly describe the standardly reported Cantonese tone inventory as follows: for T1a, 5-5 = H; for T1b, 5-1 = HF; for T2, 3-5 = MR; for T3, 3-3 = M; for T4, 2-1 = LF; for T5, 1-3 = LR; for T6, 2-2 = L.
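For concreteness, that decomposition can be written out as a small lookup table. The tuple encoding below (a register autosegment plus an optional contour autosegment) is my own illustrative convention, not notation taken from this thesis.

```python
# The seven reported GuangZhou Cantonese tones expressed as combinations
# of the five hypothesized autosegments (H, M, L, R, F), following the
# mapping given in the text above.
CANTONESE_AS_AUTOSEGMENTS = {
    "T1a": ("H",),        # 5-5 high-level
    "T1b": ("H", "F"),    # 5-1 high-falling
    "T2":  ("M", "R"),    # 3-5 high-rise
    "T3":  ("M",),        # 3-3 mid-level
    "T4":  ("L", "F"),    # 2-1 low-fall
    "T5":  ("L", "R"),    # 1-3 low-rise
    "T6":  ("L",),        # 2-2 low-level
}
```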
Another interpretation of the identification of five as the optimal number of tones may relate to tone mergers. If T2-T5 and T4-T6 are collapsed into single categories, the optimal number of tones for this dialect of Cantonese would be five. This may be a tempting prospect, because language change is often overlooked in order to align one's results with historically established analyses of a language.

4.2.5 Discussion

The results for the Cantonese case study are quite similar to those of the Mandarin one. There is again a clear correspondence between the standardly assumed tone inventory for the language and the hypothesized tone inventory that comprises the same number of tones, which again provides support for Emergent Phonology. However, there is disagreement between the evaluation metrics. Figure 4.23 presents a summary of the method for Cantonese. It presents four visualized tone inventories. The first inventory is the assumed correct inventory for Cantonese, which was generated using ground-truth labels for the corpus data. The subsequent three are all hypothesized tone inventories generated by the method. In order from left to right, the first hypothesized inventory matches the prescribed number of tones for Cantonese, seven. Next, the hypothesized inventory identified as optimal by the variance-based metrics is presented. Finally, the hypothesized inventory identified as optimal by the dendrogram is presented.

Figure 4.23: A comparison of the standard analysis of Cantonese tones (left) with hypothesized tone inventories (generated by the method). The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language (seven); (b) the optimal number of tones as determined by variance metrics (five); and (c) the optimal number of tones as determined by the dendrogram (ten).

Again, the evaluation metrics did not consistently identify the optimal number of tones for Cantonese. What is more, no metric identified seven as the optimal number of tones. The dendrogram identified ten tones as the optimal number and the variance-based metrics identified five tones as the optimal number.

There is, however, a finding in this case study that is worth remarking on – the interesting parallel between the on-going tone mergers of Cantonese and the way in which the hypothesized tone inventories develop as they get larger. Ou (2012) identifies three pairs of confusable tones in GuangZhou Cantonese, T4-T6, T3-T6 and T2-T5. The T4-T6 merger is one between a low-level and a low-falling tone. This observation is paralleled as the hypothesized tone inventories expand from five to six tones, shown in Figure 4.24.

Figure 4.24: Hypothesized tones for Cantonese with an inventory comprising five tones (left) and six tones (right). The focus here is on the separation of the low-level tone in the five-tone analysis into a low-level and a low-falling tone in the six-tone analysis. This pattern mirrors the on-going tone merger of T4 and T6 in Cantonese.

Additionally, the T2-T5 merger is one between a low-rising and a high-rising tone. This observation is paralleled as the hypothesized tone inventories expand from six to seven tones, shown in Figure 4.25.

Figure 4.25: Hypothesized tones for Cantonese with an inventory comprising six tones (left) and seven tones (right). The focus here is the separation of the low-rising tone in the six-tone analysis into a low-rising and a high-rising tone in the seven-tone analysis. This pattern mirrors the on-going tone merger of T2 and T5 in Cantonese.

4.3 Case Study III: Fungwa

Fungwa is a Benue-Congo language spoken in Nigeria. It is spoken by approximately 1000 speakers along the Pandogari-Alawa road, in Rafi Local Government Area (LGA), Niger state. The language is, as yet, largely unstudied except for the recent work of my colleague Samuel Akinbo (Akinbo, 2018, 2019). Akinbo's preliminary analysis is that the language has two contrastive tones, a high and a low tone.⁴ The Fungwa data were recorded by Akinbo firsthand during a series of fieldwork trips between 2015 and 2018.

Keeping in line with Akinbo's initial analyses, the assumed 'correct' tone system of Fungwa comprises two tones, high and low. This inventory is presented in Figure 4.26, generated by estimating the f0 of two tokens from a single male talker of Fungwa. This was done because ground-truth tone labels were not incorporated into the forced-alignment process (due to data sparsity) and as such were not available for use in generating averaged f0 contours.

Figure 4.26: F0 contours of the two tones in Fungwa. These contours are exemplars of high and low tones taken from a single male speaker in the corpus of this case study.

4 Akinbo notes that a falling tone can occur on a word-final syllable, but it is not a contrastive tone.

4.3.1 Motivation for Inclusion

Fungwa was chosen because it is a largely unanalyzed language and, as such, it provides a unique opportunity to consider the practical usefulness of the method for a linguist doing fieldwork.

4.3.2 Corpus Data

The data were recorded from 42 speakers (19 females and 23 males) over a period of 10 months across 3 years (Akinbo, 2018). The data were recorded using a Rode NGT2 super-cardioid condenser shotgun microphone with a sampling rate of 48kHz.
The recordings consist of both elicited (i.e. wordlist) and spontaneous speech.

4.3.2.1 Data Preprocessing

Much like the data in the Cantonese case study, the Fungwa data comprise audio recordings and corresponding transcriptions. There were no time markings in the transcriptions, so forced alignment was used to demarcate speech sounds. This was again done using the Montreal Forced Aligner (McAuliffe et al., 2017), with the acoustic speech sound models of the aligner being trained simultaneously on the dataset itself. Alignment was spot-checked by the author. Thereafter, the acoustic parameters were estimated and normalized in accordance with §3.2. As previously stated, utterances with f0 estimations that did not match between Praat and REAPER (within 10Hz) were discarded. In this corpus, that amounted to ≈ 8% of f0 samples being discarded (aligned text did not have tone labels in this corpus, so I cannot report distributional information on what tone tokens were discarded). The syllable-frame for the Fungwa data was the syllable nucleus. Once broken down into syllable-frame audio chunks, the training data consisted of 1398 syllable-frames. The corpus was then divided randomly into 90% training data and 10% testing data.

4.3.3 Results

4.3.3.1 Adversarial Autoencoder Performance

The adversarial autoencoder was trained using two NVIDIA GTX 1070s and training took approximately 20 minutes. The adversarial autoencoder converged after ≈ 90 epochs through the dataset, as seen in the reduction of the root-mean-squared error on the testing data, shown in Figure 4.27.

Figure 4.27: Adversarial autoencoder convergence for the Fungwa corpus data as shown in the reduction of reconstruction error on the test set.

A visual assessment of the adversarial autoencoder is also possible by comparing ground-truth f0 contours with those reconstructed from the autoencoder. These are shown in Figure 4.28. Like the reconstructions of Cantonese and Mandarin, there is notable smoothing.

Figure 4.28: Reconstructed Fungwa f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.

The hypothesized tone inventories are again presented in three sets. In the first set, Figure 4.29, hypothesized tones are shown side by side with error bars corresponding to the range of variability seen in the cluster that is visualized as that tone. Clustering was done for preset numbers of clusters ranging from two to nine. This range was chosen given hindsight of the results of the evaluation metrics.

Figure 4.29: Hypothesized tones as generated by the method for Fungwa (two through nine tones), visualized as f0 contours. Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.

In the second set, Figure 4.30, all reconstructions corresponding to a cluster identified in the latent space are visualized. Each frame provides a visualization of the variability seen in its associated cluster.

Figure 4.30: Visualization of the variability of each tone cluster identified by the method for Fungwa for a preset number of tones (two through nine). Each line corresponds to an f0 contour within an identified cluster.

The hypothesized tone inventories for Fungwa are now presented as mean f0 contours overlaid on a single graph.
With two hypothesized tones, the analysis for Fungwa looks quite similar to those of the previous case studies, an inventory comprising two level tones, a high and a low level tone. With three tones, the inventory comprises three level tones.

Figure 4.31: Hypothesized tones for Fungwa with an inventory comprising two tones (left) and three tones (right).

The next set of clusterings of Fungwa (for four, five and six tones) presents an interesting pattern. With four tones hypothesized, the low-level tone from the three-tone analysis appears to have separated into a low-level and a low-falling tone, but the high-level and mid-level tones remain visually unchanged. Next, with five tones hypothesized, the mid-level tone from the four-tone analysis appears to separate in a similar fashion to the low-level tone previously, into a mid-level and a mid-falling tone. Finally, the pattern repeats a third time with six hypothesized tones, whereby the high-level tone from the five-tone analysis appears to have separated into two tones, a high-level and a high-rising tone. This sequence (in which three level tones appear to separate in a similar fashion) is only observed in the Fungwa case study.

Figure 4.32: Hypothesized tones for Fungwa with an inventory comprising four through six tones.

With seven tones, the method hypothesizes a low-level tone, a mid-low level, a low-falling, a mid-level, a mid-falling, a high-level and a high-rising tone. With eight tones, the mid-level tone of the seven-tone analysis appears to have separated into two mid-level variants with the other tones remaining largely unchanged. Finally, with nine tones, the low-falling tone from the seven- and eight-tone analyses appears to separate into two variants of a low-falling tone.

Figure 4.33: Hypothesized tones for Fungwa with an inventory comprising seven through nine tones.

4.3.3.2 Cluster Evaluation

As with the other case studies, the series of hypothesized tone inventories for Fungwa may provide useful information, but one of the goals of this work is to identify the correct number of tones. This goal is of more import in the current case study because these data come from a fieldworker who has been sorting out an analysis. Again, however, the metrics do not provide a consistent answer for the optimal number of tones for Fungwa. The dendrogram, shown in Figure 4.34, indicates the optimal number of tones is four.

Figure 4.34: Dendrogram evaluation of Fungwa tone clusterings. By cutting the longest distance of the dendrogram, the optimal clustering comprises four tones.

Figure 4.35 presents the results of the cluster evaluation metrics that balance within- and between-cluster variance. In this case study, as with the Mandarin case study, all three variance-based metrics are in agreement that the optimal number of tones for Fungwa is two.

Figure 4.35: Variance evaluations of Fungwa tone clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.

Given the analysis of Akinbo, the assumed number of tones in Fungwa is two. This matches the result of all three variance-based metrics; however, it does not match the result of the dendrogram, which indicated the optimal number of clusters is four.
In light of the mismatches seen in the evaluation metrics of the previous case studies, it is difficult to interpret these results.

4.3.4 Discussion

The results of this case study mirror those of the previous case studies. In particular, a tone inventory (comprising two tones) for Fungwa that matches the analysis put forth by Akinbo (2018) was achieved. However, the evaluation metrics did not consistently identify that inventory. Figure 4.36 presents a summary of the method's analyses for Fungwa. It presents four visualized tone inventories. The first inventory is the assumed correct inventory for Fungwa, which was generated using ground-truth labels for the corpus data. The subsequent three are all hypothesized tone inventories generated by the method. In order from left to right, the first hypothesized inventory matches the prescribed number of tones for Fungwa, two. Next, the hypothesized inventory identified as optimal by the variance-based metrics is presented. Finally, the hypothesized inventory identified as optimal by the dendrogram is presented.

Figure 4.36: A comparison of the standard analysis of Fungwa tones (left) with hypothesized tone inventories (generated by the method). The hypothesized inventories contain: (a) the same number of tones as is standardly reported for the language (two); (b) the optimal number of tones as determined by variance metrics (two); and (c) the optimal number of tones as determined by the dendrogram (four).

As there is little previous research on Fungwa, it is challenging to contextualize these results. When more information becomes available, it may be fruitful to reconsider the tone inventories hypothesized by the method for Fungwa. One possible investigation could be to consider the phonological process of downstep, the successive lowering of high-tone pitch targets (see Chapter 2 in Pulleyblank, 1986).

4.4 Case Study IV: English

English is an Indo-European language spoken widely across the world both natively and as a second language. As English is not a tone language, this case study is fundamentally different from the previous three. Nonetheless, English speakers do make extensive use of pitch in both intonation and stress (Silverman et al., 1992; Gussenhoven and Wright, 2015; Gussenhoven, 2008), so it is unclear what the method will hypothesize when it is applied to English language data. Several outcomes are imaginable: (1) the method may hypothesize a tone inventory comprising two tones (high and low), which would match well with the acoustic cues of stress in English; (2) the method may hypothesize multiple level tones, which may match with how pitch varies throughout the intonation of a sentence; or (3) the method may hypothesize tones that are largely incomprehensible. As English is not a tone language, there is no assumed 'correct' ground-truth inventory.

4.4.1 Motivation for Inclusion

English was selected as a loose control for the method. While it is not the case that pitch has no function in English, it is the case that its function is not to lexically or grammatically distinguish words. Thus, this case study is motivated by a desire to see how the method hypothesizes tones for a language that does not have lexical or grammatical tone. Ideally, the results for English will be markedly different in some way from the previous three case studies.

4.4.2 Corpus Data

There are several time-aligned corpora of English data available to researchers.
Taking advantage of this fact, in this case study the method is applied to two English language corpora: the Buckeye Speech Corpus (Pitt et al., 2005) and the TIMIT speech corpus (Garofolo, 1993). The Buckeye Speech Corpus contains spontaneous speech that was elicited in an interview. The corpus contains approximately 300,000 words from 40 speakers of English from Columbus, Ohio. The TIMIT corpus contains read speech from 630 speakers, each speaking 10 sentences. The sentences were constructed to be phonetically compact, meaning there was good coverage of all phonemes of English. As the two corpora contain different kinds of speech (spontaneous versus read), analyzing both provides a unique opportunity to investigate the method's robustness and consistency (i.e. to see whether it returns similar results for variable types of speech).

4.4.2.1 Data Preprocessing

As both corpora used in this case study are time-aligned, the only preprocessing needed was to extract and normalize the acoustic parameters as discussed in §3. As stated, f0 estimates that were mismatched (beyond 10Hz) between Praat and REAPER were discarded. For TIMIT, this resulted in ≈ 16% of f0 samples being discarded. The syllable-frame for TIMIT was the syllable nucleus and the total training set size was 16718 syllable-frames. For Buckeye, ≈ 28% of f0 samples were discarded. The syllable-frame for Buckeye was also the syllable nucleus and the total training set size was 47297 syllable-frames. The results of the method for TIMIT are presented first, followed by the results for Buckeye. Discussion is left until after the hypothesized tone inventories for both corpora have been presented.

4.4.3 TIMIT Results

4.4.3.1 Adversarial Autoencoder Performance

The adversarial autoencoder was trained using two NVIDIA GTX 1070s and training took approximately 1.5 hours. The adversarial autoencoder converged after 120 epochs through the dataset. This is evident in the reduction of the root-mean-squared error of the reconstructions on the test data, shown in Figure 4.37.

Figure 4.37: Adversarial autoencoder convergence for the TIMIT corpus (English) data as shown in the reduction of reconstruction error on the test set.

A visual assessment of the model is also possible by comparing ground-truth f0 contours with those reconstructed from the autoencoder. These are shown in Figure 4.38.

Figure 4.38: Reconstructed English (TIMIT) f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.

4.4.3.2 Hypothesized Tone Inventories

As with previous case studies, once the adversarial autoencoder converged, each data point (training and testing) was passed through the system to generate its corresponding 2-dimensional latent code. The latent codes were compiled and then clustered using hierarchical clustering.

The hypothesized tone inventories are again presented in three sets. In the first set, shown in Figure 4.39, hypothesized tones are shown side by side with error bars corresponding to the range of variability seen in the cluster that is visualized as that tone. Clustering was done for preset numbers of clusters ranging from two to nine. This range was chosen given hindsight of the results of the evaluation metrics.

Figure 4.39: Hypothesized tones as generated by the method for English (TIMIT), visualized as f0 contours (two through nine tones). Each pane corresponds to the set of hypothesized tones for a preset number of tones.
Error bars represent variability around the median f0 values.

In the second set, shown in Figure 4.40, all reconstructions corresponding to a cluster identified in the latent space are visualized. Each frame provides a visualization of the variability seen in its associated cluster.

Figure 4.40: Visualization of the variability of each tone cluster identified by the method for English (TIMIT) for a preset number of tones (two through nine). Each line corresponds to an f0 contour within an identified cluster.

The hypothesized tone inventories are now considered using single graphs with overlaid f0 contour means. With the two- and three-tone analyses for English in the TIMIT corpus, the method hypothesizes tone inventories that comprise level tones.

Figure 4.41: Hypothesized tones for English (TIMIT) with an inventory comprising two tones (left) and three tones (right).

With four tones, the method hypothesizes four level tones, a high, a mid-high, a mid-low and a low level tone. With five tones, the hypothesized inventory comprises four level tones and a fifth high-falling tone.

Figure 4.42: Hypothesized tones for English (TIMIT) with an inventory comprising four tones (left) and five tones (right).

When analyzed with six tones, the method hypothesizes an inventory that comprises five level tones and a high-falling tone similar to that in the five-tone analysis. With seven tones, the inventory comprises five level tones, a falling tone and a rising tone. With eight tones, the mid-high level tone of the seven-tone analysis appears to separate into two variants with the other six tones remaining largely unchanged. Finally, with nine tones, the previously identified falling tone appears to separate into two variants that contrast in steepness/slope.

Figure 4.43: Hypothesized tones for English (TIMIT) with an inventory comprising six through nine tones.

4.4.3.3 Cluster Evaluation

The cluster evaluation metrics for the TIMIT English data again do not provide a consistent result for the optimal number of clusters. Figure 4.44 presents the dendrogram evaluation, which indicates that the optimal number of clusters for the TIMIT data is five.

Figure 4.44: Dendrogram evaluation of English (TIMIT) clusterings. By cutting the longest distance of the dendrogram, the optimal number of clusters is five.

The second set of metrics, shown in Figure 4.45, compares within- and across-cluster variance. In this case study, as with the Mandarin and Fungwa case studies, all three variance-based metrics are in agreement that the optimal number of tones for the TIMIT data is two.

Figure 4.45: Variance evaluations of English (TIMIT) clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.

4.4.4 Buckeye Results

4.4.4.1 Adversarial Autoencoder Performance

The adversarial autoencoder was trained using two NVIDIA GTX 1070s and training took approximately 3.5 hours. The adversarial autoencoder converged after ≈ 200 epochs through the Buckeye corpus data.
This is seen in the reduction of the root-mean-squared error of the reconstructions, shown in Figure 4.46.

Figure 4.46: Adversarial autoencoder convergence for the Buckeye corpus (English) data as shown in the reduction of reconstruction error on the test set.

A visual assessment of the model is possible by comparing ground-truth f0 contours with those reconstructed from the autoencoder. These are shown in Figure 4.47.

Figure 4.47: Reconstructed English (Buckeye) f0 contours. The top row presents ground-truth exemplars; the bottom row presents corresponding reconstructions.

4.4.4.2 Hypothesized Tone Inventories

Once the adversarial autoencoder converged, each data point (training and testing) was passed through the system to generate its corresponding 2-dimensional latent code. The latent codes were compiled and then clustered using hierarchical clustering.

The hypothesized tone inventories are again presented in three sets. In the first set, Figure 4.48, hypothesized tones are shown side by side with error bars corresponding to the range of variability seen in the cluster that is visualized as that tone. Clusterings were done for preset numbers of clusters ranging from two to nine (as with the TIMIT data).

Figure 4.48: Hypothesized tones as generated by the method for English (Buckeye), visualized as f0 contours (two through nine tones). Each pane corresponds to the set of hypothesized tones for a preset number of tones. Error bars represent variability around the median f0 values.

In the second set, Figure 4.49, all reconstructions corresponding to a cluster identified in the latent space are visualized. Each frame provides a visualization of the variability seen in its associated cluster.

Figure 4.49: Visualization of the variability of each tone cluster identified by the method for English (Buckeye) for a preset number of tones (two through nine). Each line corresponds to an f0 contour within an identified cluster.

The hypothesized tones are now considered as overlaid f0 means on a single graph. With two tones hypothesized for the Buckeye data, the method hypothesized a high-level and a low-level tone. With three tones, the inventory comprises three level tones.

Figure 4.50: Hypothesized tones for English (Buckeye) with an inventory comprising two tones (left) and three tones (right).

The tone inventory hypothesized for the Buckeye data with four tones comprises four level tones. It is worth noting that this is the same result as that of the TIMIT data. With five tones, the method hypothesizes a high-falling tone in addition to four level tones. This is again consistent with the fifth tone hypothesized for the TIMIT data, although the shape of the high-falling tone contour is different. In fact, this high-falling f0 contour is the least regular of all tones identified by the method in any case study.

Figure 4.51: Hypothesized tones for English (Buckeye) with an inventory comprising four tones (left) and five tones (right).

As the number of hypothesized tones increases to six and seven, the number of level tones also increases in parallel.
With eight tones, an additional rising tone is hypothesized. Finally, with nine tones, the previously second-highest level tone appears to have separated into two variants with the other tones remaining largely unchanged.

Figure 4.52: Hypothesized tones for English (Buckeye) with an inventory comprising six through nine tones.

4.4.4.3 Cluster Evaluation

With the Buckeye data, the dendrogram evaluation, presented in Figure 4.53, indicates that the optimal number of clusters is four.

Figure 4.53: Dendrogram evaluation of English (Buckeye) clusterings. By cutting the longest distance of the dendrogram, the optimal number of clusters is four.

The metrics that balance within- and between-cluster variance are shown in Figure 4.54. They are all in agreement that the optimal number of clusters for the Buckeye data is two.

Figure 4.54: Variance evaluations of English (Buckeye) clusterings. The CH-Index (left) indicates the optimal number of clusters is two; the DB-Index (centre) indicates two; and the Silhouette Index (right) also indicates two.

4.4.5 Discussion

There are a few notable aspects of the results for English from both the TIMIT data and the Buckeye data. One of the more critical findings is that the results from both datasets are fairly consistent with each other. This is particularly true for clusterings up to six tones, in which the method initially hypothesizes four level tones, followed by a high-falling tone, and finally an additional high level tone. What is more, the evaluation metrics between the two datasets are quite similar. In both TIMIT and Buckeye, the variance-based metrics indicate that the optimal number of clusters for English is two. The dendrogram results vary slightly, but not drastically, with the TIMIT dendrogram indicating the optimal number of clusters is five and the Buckeye dendrogram indicating four. For consideration, the two-cluster hypothesized tone inventories from both datasets and the five- and four-cluster hypothesized tone inventories from TIMIT and Buckeye, respectively, are presented in Figure 4.55.

Figure 4.55: Hypothesized tones for English given the optimal clusterings identified by the evaluation metrics: two tones (TIMIT) and two tones (Buckeye), top row; five tones (TIMIT) and four tones (Buckeye), bottom row.

This comparison suggests that the method is at least somewhat robust to speech style. Further, following up on the possible results hypothesized at the beginning of this case study, it seems the method is identifying a tone inventory of English that may be consistent with the stressed/unstressed patterning of English words. Finally, it is worth noting that the hypothesized tone inventories of English are the only ones in which the first four hypothesized tones are level tones. This result seems to set English apart from the other three languages.

4.5 Cross-Language Comparison

Before transitioning to the general conclusion of Chapter 5, it is worthwhile to briefly compare the case studies. To that effect, Table 4.3 presents the final results (i.e. the determined optimal number of tones) of the method for each case study.
It also includes the standardly reported number of tones for each language for comparison.

Table 4.3: A comparison of the optimal number of tones for a language as determined by the method with that standardly reported in the literature.

Language:                          Mandarin   Cantonese   Fungwa   English
Standard tone inventory size:      5          7           2        N/A
Hypothesized size (dendrogram):    5          10          4        5/4
Hypothesized size (CH-Index):      2          9/5         2        2
Hypothesized size (DB-Index):      2          5           2        2
Hypothesized size (Silhouette):    2          5           2        2

While it is immediately clear that the evaluation metrics did not consistently identify the 'correct' number of tones for the languages, there is still one clear pattern. Specifically, the metrics indicate that Cantonese is the language with the most complex tone inventory. This is seen in the fact that the values in the Cantonese column are consistently higher than the others. This is a nice indication that the evaluation metrics are at least on the right track towards helping achieve human-like language analyses.

A final observation worth remarking on is the similarity of hypothesized tone inventories across all languages. For all languages, the two-tone analyses match a low/high tone system. This may be a significant result given that the most common tone inventory typologically comprises a high and a low tone (Maddieson, 1978). As tone inventories get larger, there is often the addition of one or more level tones. Studying how inventories change from fewer to more tones with other language data or dummy data may be an interesting investigation in the future.

Chapter 5

General Discussion, Future Directions and Conclusion

This chapter provides a summary of the thesis, discusses possible next steps for the research program originated in this thesis, and then provides concluding remarks. The summary reiterates the original goals of this work and considers how well they were achieved. The next steps are separated into three sections, each with a specific aim. The first section (§5.2) investigates a pair of presently available uses of the method for phonological research; specifically, it considers the method's sensitivity to allotones and its ability to contrast individual speakers' tone inventories. The second section (§5.3) discusses several areas of on-going phonological research in which the method may be a useful tool. The third section (§5.4) outlines a list of refinements that should be implemented to improve the utility of the method in the future. The chapter ends with a brief conclusion.

5.1 Summary of the project and results

This thesis has argued that machine learning is a valuable analysis tool that should be used by linguists to address theoretical questions. By demonstrating that unsupervised machine learning is able to posit tone inventories from an acoustic parameterization of speech, the central tenet of Emergent Phonology, that phonology emerges from phonetics, was supported. The method that was used to generate the tone inventories comprised three stages: preprocessing raw speech into the acoustic parameters of f0 and syllable-frame duration; using an adversarial autoencoder to reduce the dimensionality of the acoustic parameters while simultaneously extracting higher-level features (in the sense of levels of abstraction); and clustering the higher-level features using hierarchical clustering. The centroids of the clusters were then reconstructed, via the autoencoder, back into the acoustic space and subsequently visualized.
The visualized tones were then compared to the standardly reported tones for three languages: Mandarin, Cantonese and Fungwa. The method also attempted to determine the optimal number of clusters for a given dataset using several clustering metrics and a dendrogram of the dataset.

Although the method did not consistently identify an optimal number of clusters, it was shown that the hypothesized tone inventories (of the same size as is standardly assumed for a language) matched fairly well with the standardly reported analyses of each language. This result was repeated in all three tone languages of the case studies. Given this, the method has shown that a language's tone inventory (part of its phonology) can emerge solely from its phonetics. This can be taken as a strong piece of evidence for the theory of Emergent Phonology and is a clear demonstration of machine learning's value as a tool to address theoretical questions in linguistics. Additionally, the application of the method to the understudied language of Fungwa suggests the method may, after refinements, have future use for fieldworkers.

The supplementary goals of this thesis were to strengthen the impact of computational linguistics on classical areas of linguistics (phonetics and phonology specifically), and to provide a first step towards a grammaticus ex machina – a linguist (grammarian) from the machine. The former goal has tacitly been achieved by the existence of this thesis, which has applied machine learning techniques to research on the phonetics-phonology interface. The latter goal may have been satisfactorily achieved as a first step, but there is still much to do. In particular, the method did not autonomously determine the optimal number of tones for a given language, and this is a crucial component of any linguist's analysis. However, the method was naive to the meaning of words, and contrasts of meaning are a fundamental piece of evidence used by linguists when determining the number of contrastive units in a language. A discussion of how meaning could be incorporated into the method is provided in §5.4. I now consider possible next steps for the method.

5.2 Presently available uses of the method

The computational nature of the method developed in this thesis allows it to be repurposed relatively straightforwardly for related research questions. Two examples of such repurposings are considered here. First, an investigation of allotones in Mandarin is presented; this is followed by a comparison of individual speakers' tone inventories in Cantonese.

5.2.1 Allotones in Mandarin

In the Mandarin case study (§4.1.1), the falling-rising tone of Mandarin (2-1-4) was assumed to be realized phonetically as a low-falling tone (2-1). This assumption was made because the corpus data used was continuous speech, and in non-utterance-final position the falling-rising tone surfaces as a low-falling one (Xu, 1997). These two variants, as they are not lexically contrastive and occur in complementary distribution (utterance-final versus non-utterance-final), are known as allophones. Allophones are "contextual variants" of a phoneme that are "non-contrastive" (De Lacy, 2007, p. 139). To provide an additional example, consider the two allophones of the phoneme /l/ in North American English. In syllable-initial position, /l/ surfaces phonetically as [l] (as in the word lit); in syllable-final position, however, the back of the tongue is raised and /l/ surfaces as the velarized [ɫ] (as in the word till) (Giles and Moll, 1975).
When allophones are variants of a phonological tone, they are often referred to as allotones (e.g. Yip, 2002, p. 120). Thus, the 2-1-4 variant and the 2-1 variant of the falling-rising tone in Mandarin are allotones.¹

1 The neutral tone in Mandarin may also be analyzed as an allotone given its dependence on preceding tones (Wang, 2004).

Given the existence of allotones, a sensible question to ask is whether the method identifies them. As the method already hypothesizes tone inventories of varying sizes (i.e. it is able to hypothesize tones well beyond the standard number reported in a language), this question becomes one of discerning whether the hypothesized tones that are not part of the standard tone inventory of a language are, in fact, allophonic variants. Ideally, the clustering evaluation metrics could be used to answer this question (i.e. clusterings that match the language data well are more likely to be ones that capture allophonic variation); unfortunately, as the metrics were somewhat puzzling in the case studies, an alternative is needed.

One option to determine whether the method identifies allotones is to use the ground-truth labels² from the dataset itself for evaluation. This option is only possible for corpora that have ground-truth labels available, but it is worth exploring for our current purpose. Figures 5.1 and 5.2 present a series of hypothesized tone inventories of Mandarin with the distribution of ground-truth tones below them. These hypothesized tone inventories are the same as the ones reported in the case study of §4.1.1. The corresponding distribution under each hypothesized tone represents the percentage of ground-truth tones (T1, T2, T3, T4, T0 (neutral)) that occur within the cluster that the tone corresponds to. As before, the first tone of the six-tone analysis of Figure 5.1 can be interpreted as a hypothesized high-level tone. Recall that that tone was reconstructed using a cluster identified in the latent space, and as such the ground-truth labels for the points of that cluster are available. For the high-level tone then, the corresponding distribution (underneath it) shows that approximately 50% of all ground-truth Tone 1s in the corpus are part of that cluster. Graphs containing the distribution of ground-truth tone labels for all clustering analyses of Mandarin are provided in Appendix A.

2 However, as the labels do not distinguish allophonic variants based on their phonological context, the labels may not be as informative as intended.

Figure 5.1: Hypothesized tones for Mandarin (six and seven tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to each tone. The green squares highlight tones that have a significant portion of ground-truth Tone 3s (>18%).

In an effort to identify the allotones of the falling-rising tone in Mandarin (Tone 3), the hypothesized tones that contain large proportions (>18%) of ground-truth Tone 3s are highlighted with green squares.

Figure 5.2: Hypothesized tones for Mandarin (eight and nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to each tone. The green squares highlight tones that have a significant portion of ground-truth Tone 3s (>18%).
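The distributions plotted in Figures 5.1 and 5.2 amount to a cross-tabulation of cluster assignments against ground-truth tone labels, column-normalized so that each cell gives the share of a tone category falling within a cluster. A minimal sketch, using stand-in labels rather than the actual corpus data:

```python
import numpy as np
import pandas as pd

# labels: cluster assignment per syllable-frame (e.g., from fcluster);
# truth: ground-truth tone label per frame (from the corpus annotations).
labels = np.array([1, 1, 2, 2, 2, 3, 3, 1])   # stand-in values
truth = np.array(["T1", "T3", "T3", "T3", "T0", "T4", "T4", "T1"])

table = pd.crosstab(pd.Series(labels, name="cluster"),
                    pd.Series(truth, name="tone"))
share = table / table.sum(axis=0)   # share of each tone per cluster

# Clusters holding a significant portion (>18%) of ground-truth Tone 3s,
# the criterion used for the green squares in Figures 5.1 and 5.2.
print(share["T3"] > 0.18)
```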
The hypothesized tone inventories comprising six and seven tones do not appear to provide much indication of the known allotones of Tone 3 in Mandarin (i.e. there is no rising variant). The eight- and nine-tone analyses, however, seem to provide some indication. In particular, they both contain four hypothesized tones with a significant portion of ground-truth Tone 3s. The variants comprise a low-level tone, a mid-level tone, a low-falling tone, and a low-rising tone. The low-rising f0 contour here is beginning to show similarities to the falling-rising tone of utterance-final Tone 3s in Mandarin. While this informal investigation into allophones leaves much room for improvement, it suggests there may be some merit in using the method to study allotones.

5.2.2 Speaker differences in Cantonese

A second potential use of the method in its current formulation is the visualization of individual speakers' tone inventories. These visualizations can be generated (once the adversarial autoencoder has been trained) by clustering the data of a single speaker. The process described in §3.3 and §3.4.2 is followed as normal, whereby data is compressed, clustered and reconstructed. The result is a hypothesized tone inventory for a single speaker. Four such inventories are presented in Figure 5.3. These inventories were generated for two female speakers and two male speakers from the Cantonese corpus used in §4.2 (Adrus et al., 2016); the choice of speakers was largely random. Given that GuangZhou Cantonese contains seven contrastive tones, only the inventories comprising seven tones are visualized. Nonetheless, as the method is able to hypothesize inventories of all sizes, comparing individual speakers' tone inventories of varying sizes is also possible. Doing so may well provide insight into, for example, the classification of speakers with respect to tone mergers (i.e. how much of a merger has taken place for a given speaker). The value of these visualizations lies in the fact that they were constructed without requiring labeled tone data. It is true that one could simply plot the 'average' of tones for a given speaker if all labels were known, but the unsupervised nature of the method means they are not necessary.

Figure 5.3: Hypothesized tone inventories for four speakers of GuangZhou Cantonese (Female Speakers 12631 and 76944; Male Speakers 76733 and 40123). Each inventory comprises seven tones because GuangZhou Cantonese contains seven tones.
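Producing such single-speaker inventories requires nothing more than filtering the latent codes by speaker before clustering. A minimal sketch, assuming a speaker ID accompanies each syllable-frame (the IDs and codes below are stand-ins):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
codes = rng.normal(size=(5000, 2))    # latent codes for all talkers
speaker_ids = rng.choice(["12631", "76944", "76733", "40123"], size=5000)

# Restrict to one talker's tokens, then cluster as usual; the decoded
# centroids of these clusters form that talker's hypothesized inventory.
speaker_codes = codes[speaker_ids == "12631"]
Z = linkage(speaker_codes, method="ward")
labels = fcluster(Z, t=7, criterion="maxclust")   # seven tones assumed
```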
Figure 5.3: Hypothesized tone inventories for four speakers of GuangZhou Cantonese (Female Speakers 12631 and 76944; Male Speakers 76733 and 40123). Each inventory comprises seven tones because GuangZhou Cantonese contains seven tones.

Although a detailed analysis is not provided here, there are a few observations from Figure 5.3 that warrant comment. First, there appears to be a difference in how the two rising tones of Cantonese are realized across speakers. The rising tones for the male speakers seem to originate at a similar f0 location and then separate. For the female speakers, however, the two tones appear to be separated throughout their realizations. Second, the phonetic spaces of Female Speaker 76944 and Male Speaker 40123 appear to be more compact (smaller frequency range) than those of the other two speakers. While there is no additional information describing the speakers in the corpus, one could imagine such compactness being correlated with other qualities of a speaker, such as intelligibility (cf. McCloy et al., 2015). Lastly, these figures may provide some insight into how tone inventories are structured within a speaker's phonology. The impression is completely anecdotal, but there appears to be structure in where a speaker's tones begin and end in their f0 range. For example, if we consider Female Speaker 76944, shown in Figure 5.4, there seem to be four f0-onset and four f0-offset positions.

Figure 5.4: An observation of the structure seen in the f0 locations at which Female Speaker 76944's tones begin and end.

Although the three observations just discussed are impressionistic, they provide a brief demonstration of how the method may be used for individual speaker comparisons. What is more, observations like these could be used to generate new research questions. As suggested, one could investigate how tone-inventory compactness interfaces with intelligibility, given that pitch range has been shown to positively correlate with intelligibility (McCloy et al., 2015).

Finally, there is one comment regarding the individual speaker comparisons that I would like to make – the fact that these comparisons do not need to be restricted to individual speakers at all. Tone inventories could be hypothesized for geographically adjacent speech communities or for speech recorded across generations. One could also hypothesize inventories for a single speaker at different points in their life, essentially creating snapshots of the longitudinal development of their speech. Such data may already be available, such as the Origins of New Zealand English (ONZE) corpus (Gordon et al., 2007), which has tracked speech in New Zealand for more than 100 years.

5.3 Future applications of the method in phonological research

While the above section considered potential uses of the method that are presently available, this section considers how the method may be updated and used in the future. Specifically, it discusses three areas of ongoing phonological research in which the method may be valuable: (1) providing additional support for Emergent Phonology by demonstrating the emergence of phonological patterns; (2) making predictions about the acquisition of phonology; and (3) classifying languages (i.e., language typology).

5.3.1 The emergence of phonological patterns

In §2.1.1.4, it was stated that this thesis considered Emergent Phonology reduced to its most rudimentary form – a demonstration of the emergence of a single class of phonological units (i.e., tones). Phonology, however, encompasses much more than just units. In fact, the vast majority of phonological research focuses on the patterns/processes that the units take part in. The reason for the simplification in this thesis comes from the fact that "phonological systems do not occur in isolation" (Archangeli and Pulleyblank, 2017, p. 1) and, to investigate patterns, researchers must consider "the interface with phonetics and with morphology" (Archangeli and Pulleyblank, 2017, p. 1). I chose to exclude morphology from the current investigation to simplify my problem space, given the interdisciplinary nature of this project.

Morphology is the study of morphemes, which Haspelmath and Sims (2013) define as "[t]he smallest meaningful constituents of words that can be identified" (p. 3). For example, the English words cat and dog denote entities that are, in fact, a cat or a dog. Further, the plural morpheme /-s/ denotes the concept of 'more than one.' Thus, cat-s and dog-s denote groups of more than one entity that are, in fact, cats or dogs. The plural morpheme of English also provides a useful demonstration of how phonology interfaces with morphology. As noted in §1.2, the phonetic realization of the /-s/ morpheme in cat-s is actually different from that of dog-s. The plural morpheme of the former is a voiceless [s] and the plural morpheme of the latter is a voiced [z]. In fact, the plural morpheme /-s/ in English comprises three phonologically distinct allomorphs (variants of a morpheme): [s], [z] and [ɪz] (as in walrus-es). Crucially, the phonological patterning seen in the [s], [z], [ɪz] variants is only meaningful because of its association with plurality. If meaning is incorporated into the method, then demonstrating the emergence of phonological patterns may be achievable. Additionally, as was previously alluded to, it may be the case that incorporating meaning provides the missing puzzle piece needed to allow the method to determine the optimal number of tones for a language (just as linguists use contrastiveness to guide their analyses). The most straightforward way to incorporate meaning into the model is to train the autoencoder with additional features alongside the acoustic parameters. Two possible features could be numeric indices that denote lexical identity or a word2vec representation of the relevant word (Rong, 2014).
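One way to picture this is as simple feature concatenation before training. The sketch below is only illustrative: the function and array names (`acoustic_params`, `word_vectors`, `lexical_ids`) are hypothetical, and the per-block standardization is just one of several reasonable scaling choices.

```python
import numpy as np

def build_training_matrix(acoustic_params, word_vectors, lexical_ids):
    """Concatenate acoustic parameters with 'meaning' features (a word2vec
    vector and a numeric lexical index) into one input matrix for the
    autoencoder. Each block is standardized so that no one feature type
    dominates the reconstruction loss."""
    def z(x):
        x = np.asarray(x, dtype=np.float32)
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    blocks = [z(acoustic_params),
              z(word_vectors),
              z(np.asarray(lexical_ids).reshape(-1, 1))]
    return np.concatenate(blocks, axis=1)
```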
After meaning is incorporated into the method, one practical goal could be to demonstrate tone sandhi. Tone sandhi is a phonological process in which the phonetic realization of a tone changes due to its surrounding context.³ For example, in Mandarin, the falling-rising tone surfaces phonetically as a rising tone when it precedes another falling-rising tone (Shih, 1997), such as 你好 (/ni˨˩˦hao˨˩˦/ – 'hello') surfacing as /ni˧˥hao˨˩˦/. The current implementation of the method would not be able to learn this tone sandhi pattern (˨˩˦ → ˧˥) because there is no lexical information to associate the two realizations of the word 你 (/ni/ – 'you').

³Note that tone sandhi is distinct from the phonetic process of tonal coarticulation (Shen, 1992).

5.3.2 Comparison to language acquisition

In the case studies, the results of the method were presented after the adversarial autoencoder had converged (i.e., after it finished learning). This is sensible given that the goal was to determine whether the method learned something like the tones of a language, but it is not the only way that the method could be used. For example, snapshots could be taken as the model trains, with clustering occurring throughout the learning process. Such snapshots could then be compared to stages of tone acquisition in children, or perhaps used to make predictions on their own.

Speculatively, one possible way to consider the time-course of learning in the machine could be to look at the stability of regions in the latent space. Given the structure of the autoencoder, in which each point in the latent space can be decoded, we can calculate the Jacobian matrix of the decoder at any point in the latent space. The Jacobian matrix uses differentiation to quantify how stable/dynamic a point in one space is by comparing how perturbations in that space affect its corresponding decoded space (Gale and Nikaido, 1965). One hypothesis is that areas of the latent space that stabilize first will be easier for a language-acquiring child to learn.
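A framework-agnostic way to estimate this is with finite differences, as in the sketch below. The `decode` callable is a hypothetical stand-in for the trained decoder applied to a single latent vector; an automatic-differentiation library would give the same matrix exactly rather than approximately.

```python
import numpy as np

def decoder_jacobian(decode, z, eps=1e-4):
    """Finite-difference Jacobian of the decoder at latent point z:
    J[i, j] ~ d(output_i) / d(latent_j)."""
    z = np.asarray(z, dtype=np.float64)
    base = np.asarray(decode(z)).ravel()
    J = np.zeros((base.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps  # perturb one latent dimension at a time
        J[:, j] = (np.asarray(decode(z + dz)).ravel() - base) / eps
    return J

def latent_stability(decode, z):
    """Smaller norm = perturbing the latent point barely moves the decoded
    f0 contour, i.e., a more stable region of the latent space."""
    return np.linalg.norm(decoder_jacobian(decode, z), ord="fro")
```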
Additionally, and in a similar vein, the method could be used to make predictions about the time-course of second language acquisition. For example, the adversarial autoencoder could be trained on data from a first language and then applied to data from a second language. One possible investigation would be to measure how much second-language data is needed to shift the hypothesized tone space learned from the first language.

5.3.3 Language typology

A third area in which the method may find future use is in the classification of languages. Language typology is an active area of linguistic research that, in part, groups languages based on some quality of the language. For example, one commonly used quality is the order of subjects, objects and verbs in the language (Bickel, 2007). As the method is able to generate tone inventories for a language, inventories can be compared between languages to calculate, say, how 'tonal' a language is. Perhaps the method could even be used to predict whether a tonal contrast will emerge in a language (i.e., tonogenesis).

5.4 Refinements for the method

While the method has had moderate success in achieving the goals of this thesis, there are many ways in which it can be improved (even if the aims remain the same as in this project). For one, additional acoustic-phonetic features that are known to interact with f0 (with respect to tone) could be incorporated into the method. Additionally, improved clustering evaluation metrics are needed to help identify optimal clusterings for a language. Finally, there are additional steps that could be taken to reduce the user's role while applying the method, ultimately leading to a fully unsupervised implementation.

5.4.1 Incorporating additional acoustic-phonetic features

While f0 is the primary acoustic correlate of pitch (and ultimately tone), there are other aspects of the acoustic signal that interact with f0 and contribute to the perception of tone, for example: phonation type, declination and phonological context (including downstep).

Phonation type, also known as voice quality, has been shown to interact with the realizations of tones. Common types of voice quality include breathy, modal or creaky voice (Johnson, 2004). Creaky voice, for example, has been argued to play a significant role in the identification of Tone 4 (low-falling) in Cantonese (Yu and Lam, 2014) and Tone 3 (falling-rising) in Mandarin (Duanmu, 2007). A challenge to incorporating acoustic measures of voice quality into the model is that, while voice quality contrasts are perceptually salient, the acoustic dimensions that contribute to the auditory impressions vary considerably cross-linguistically.

Declination refers to the "tendency [of f0] to decline gradually during the course of utterances" (Ladd, 1984, p. 53). In other words, the pitch of a speaker's voice will, generally, lower from the beginning to the end of an utterance. Declination has been shown to have practical consequences for pitch perception in both non-tone and tone languages (Leroy, 1984; Shih, 2000; Yuen, 2007). In particular, declination results in a high tone at the beginning of an utterance being higher than a high tone at the end. While human listeners are aware of this at some level (evidenced by how they compensate for it (Ladd, 1984)), machines are not. Thus, utilizing a parameter such as place-in-utterance is a sensible addition to the method.

Finally, downstep is a process similar to declination (in that pitch lowers over time), but instead of being a general phonetic trend, it occurs based on phonological context (normally successive high tones) (Laniran and Clements, 2003). This provides additional motivation to incorporate a measure like place-in-utterance into the method.
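Such a feature is easy to derive from the forced-alignment time-markings the method already uses. A minimal sketch follows; the argument names are hypothetical, and the resulting value would simply be appended to each syllable's acoustic parameter vector.

```python
import numpy as np

def place_in_utterance(syllable_onsets, utt_start, utt_end):
    """Normalize each syllable's onset time to [0, 1] within its utterance,
    giving the model a handle on declination (and downstep-like lowering):
    0.0 = utterance-initial, 1.0 = utterance-final."""
    onsets = np.asarray(syllable_onsets, dtype=np.float64)
    span = max(utt_end - utt_start, 1e-8)  # guard against zero-length spans
    return (onsets - utt_start) / span
```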
5.4.2 New clustering evaluation metrics to discern optimal clustering

Determining the optimal number of clusters in unsupervised clustering is a challenging problem (Jain, 2010). Given that I am neither a statistician nor a machine learning expert, there is little I am able to say in terms of practical next steps here. It is nonetheless clear, however, that a more informative evaluation metric is needed if the method is to find broader use in linguistic research.

5.4.3 Achieving a fully unsupervised method

Finally, the method used in this thesis is not fully unsupervised. In particular, time-markings were used to restrict the input of the adversarial autoencoder to syllable-frames, and transcripts were used to generate those syllable-frames via forced alignment. Ideally, these processes would be replaced with something like automatic syllable identification and chunking, without resorting to transcripts or forced alignment. There is ongoing research in this direction, such as that by Leong and Goswami (2015) or Räsänen et al. (2018), in which amplitude is used to autonomously identify syllable- or syllable-rime-sized units of audio. If amplitude can be used in this way, then it may be possible to demarcate syllable chunks using amplitude alone, feed those chunks into the method, and have the method generate a hypothesized tone inventory.
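As a toy illustration of the general idea (far cruder than the cited approaches), syllable-sized chunks can be approximated by treating minima of a smoothed amplitude envelope as boundaries. The cutoff frequency and minimum-gap values below are illustrative guesses, not values from the cited work.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def syllable_chunks(waveform, sr, env_cutoff_hz=10.0, min_gap_s=0.1):
    """Low-pass the rectified signal to get an amplitude envelope, locate
    envelope minima, and treat the spans between successive minima as
    candidate syllable-sized chunks (returned as sample-index pairs)."""
    b, a = butter(2, env_cutoff_hz / (sr / 2), btype="low")
    envelope = filtfilt(b, a, np.abs(waveform))
    # Valleys of the envelope serve as candidate syllable boundaries.
    valleys, _ = find_peaks(-envelope, distance=int(min_gap_s * sr))
    bounds = np.concatenate(([0], valleys, [len(waveform)]))
    return [(int(bounds[i]), int(bounds[i + 1]))
            for i in range(len(bounds) - 1)]
```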
5.5 Conclusion

This thesis has been an interdisciplinary investigation that aimed to establish that machine learning is a valuable tool for linguists to use to address theoretical questions. Given the results reported herein, I believe this has been established. That said, this work has fallen short of producing a grammaticus ex machina – a linguist (grammarian) from the machine – because the method was unable to consistently identify the correct number of tones for a given language. Nonetheless, this work provides a starting point for what will certainly become impressive collaborations between machine learning scientists and linguists in the future.

Bibliography

Adrus, T., Dubinski, E., Fiscus, J., Gillies, B., Harper, M., Hazen, T., Hefright, B., Jarrett, A., Lin, W., Ray, J., Rytting, A., Shen, W., Tzoukermann, E., and Wong, J. (2016). IARPA Babel Cantonese language pack IARPA-babel101b-v0.4c LDC2016S02.

Akinbo, S. (2018). Documentation of Cifungwa folktales. Endangered Languages Archive, ELAR.

Akinbo, S. (2019). Minimality and onset conditions interact with vowel harmony in Fungwa. In Proceedings of the Annual Meetings on Phonology, volume 7.

Archangeli, D. and Pulleyblank, D. (2012). Emergent phonology: evidence from English. Issues in English Linguistics.

Archangeli, D. and Pulleyblank, D. (2015). Phonology without universal grammar. Frontiers in Psychology, 6:1229.

Archangeli, D. and Pulleyblank, D. (2017). Phonology as an emergent system. The Routledge Handbook of Phonological Theory, pages 476–503.

Bauer, R. S. and Benedict, P. K. (2011). Modern Cantonese Phonology, volume 102. Walter de Gruyter.

Bauer, R. S., Kwan-Hin, C., and Pak-Man, C. (2003). Variation and merger of the rising tones in Hong Kong Cantonese. Language Variation and Change, 15(2):211–225.

Bearth, T. and Link, C. (1980). The tone puzzle of Wobe. Studies in African Linguistics, 11(2):147–207.

Ben-Hur, A., Horn, D., Siegelmann, H. T., and Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2(Dec):125–137.

Bentolila, I., Zhou, Y., Ismail, L. K., and Humpleman, R. (2011). System, method, and software application for targeted advertising via behavioral model clustering, and preference programming based on behavioral model clusters. US Patent 8,046,797.

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is "nearest neighbor" meaningful? In International Conference on Database Theory, pages 217–235. Springer.

Bickel, B. (2007). Typology in the 21st century: Major current developments. Linguistic Typology, 11(1):239–251.

Bird, S. and Lee, H. (2014). Computational support for early elicitation and classification of tone. Language Documentation & Conservation, 8:453–461.

Boersma, P. et al. (2002). Praat, a system for doing phonetics by computer. Glot International, 5.

Brentari, D. (2019). Sign Language Phonology. Cambridge University Press.

Buhrmester, M., Kwang, T., and Gosling, S. D. (2011). Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5.

Caliński, T. and Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods, 3(1):1–27.

Chao, Y. R. (1930). A system of tone letters. Le Maître Phonétique, 45:24–27.

Chao, Y. R. (1965). A grammar of spoken Chinese.

Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., Kannan, A., Weiss, R. J., Rao, K., Gonina, K., et al. (2017). State-of-the-art speech recognition with sequence-to-sequence models. arXiv preprint arXiv:1712.01769.

Chomsky, N. (2007). Approaching UG from below. Interfaces + Recursion = Language, 89:1–30.

Chomsky, N. et al. (2006). On cognitive structures and their development: A reply to Piaget. Philosophy of Mind: Classical Problems/Contemporary Issues, pages 751–755.

Chomsky, N. and Halle, M. (1968). The sound pattern of English.

Clements, G. N. (1985). The geometry of phonological features. Phonology, 2(1):225–252.

Colburn, T. and Shute, G. (2007). Abstraction in computer science. Minds and Machines, 17(2):169–184.

Coupe, A. R. (2014). Strategies for analyzing tone languages. Language Documentation & Conservation, 8:462–489.

Dallos, P. and Fay, R. R. (2012). The Cochlea, volume 8. Springer Science & Business Media.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, (2):224–227.

De Lacy, P. (2007). The Cambridge Handbook of Phonology. Cambridge University Press.

Decker, D. M. et al. (1999). Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press.

Deepmind, G. (2017). AlphaGo Zero: Learning from scratch.

Den Dikken, M., Bernstein, J. B., Tortora, C., and Zanuttini, R. (2007). Data and grammar: Means and individuals. Theoretical Linguistics, 33(3):335–352.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142.

Dresher, B. E. (2015). The arch not the stones: Universal feature theory without universal features. Nordlyd, 41(2):165–181.

Duanmu, S. (2007). The Phonology of Standard Chinese. Oxford University Press.

Edelman, S. and Christiansen, M. H. (2003). How seriously should we take minimalist syntax? Trends in Cognitive Sciences, 7(2):60–61.

Edmondson, J. A. and Esling, J. H. (2006). The valves of the throat and their functioning in tone, vocal register and stress: laryngoscopic case studies. Phonology, 23(2):157–191.

Eimas, P. D., Miller, J. L., and Jusczyk, P. W. (1987). On infant speech perception and the acquisition of language.
Esling, J. H. and Harris, J. G. (2005). States of the glottis: An articulatory phonetic model based on laryngoscopic observations. A Figure of Speech: A Festschrift for John Laver, pages 347–383.

Ewen, C. J. and Van der Hulst, H. (2001). The Phonological Structure of Words: An Introduction. Cambridge University Press.

Featherston, S. (2005). Universals and grammaticality: Wh-constraints in German and English. Linguistics, 43(4):667–711.

Fry, M. D. (2018). It's time to collaborate: What human linguists can learn from machine linguists. The Journal of the Acoustical Society of America, 144(3):1804–1804.

Furl, N., Phillips, P. J., and O'Toole, A. J. (2002). Face recognition algorithms and the other-race effect: computational mechanisms for a developmental contact hypothesis. Cognitive Science, 26(6):797–815.

Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(1):52–59.

Gagliardi, A. and Lidz, J. (2014). Statistical insensitivity in the acquisition of Tsez noun classes. Language, pages 58–89.

Gale, D. and Nikaido, H. (1965). The Jacobian matrix and global univalence of mappings. Mathematische Annalen, 159(2):81–93.

Gårding, E., Kratochvil, P., Svantesson, J.-O., and Zhang, J. (1986). Tone 4 and Tone 3 discrimination in Modern Standard Chinese. Language and Speech, 29(3):281–293.

Garofolo, J. S. (1993). TIMIT acoustic phonetic continuous speech corpus. Linguistic Data Consortium, 1993.

Gauthier, B., Shi, R., and Xu, Y. (2007). Learning phonetic categories by tracking movements. Cognition, 103(1):80–106.

Giles, S. B. and Moll, K. L. (1975). Cinefluorographic study of selected allophones of English /l/. Phonetica, 31(3-4):206–227.

Glasberg, B. R. and Moore, B. C. (1990). Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1-2):103–138.

Goldsmith, J. A. (1976). Autosegmental Phonology, volume 159. Indiana University Linguistics Club, Bloomington.

Goldsmith, J. A., Riggle, J., and Yu, A. (1995). The Handbook of Phonological Theory. Wiley Online Library.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Gordon, E., Maclagan, M., and Hay, J. (2007). The ONZE corpus. In Creating and Digitizing Language Corpora, pages 82–104. Springer.

Gordon, M. (2001). A typology of contour tone restrictions. Studies in Language. International Journal sponsored by the Foundation "Foundations of Language", 25(3):423–462.

Graves, A. (2012). Supervised sequence labelling. In Supervised Sequence Labelling with Recurrent Neural Networks, pages 5–13. Springer.

Gussenhoven, C. (2008). Types of focus in English. In Topic and Focus, pages 83–100. Springer.

Gussenhoven, C. et al. (2004). The Phonology of Tone and Intonation. Cambridge University Press.

Gussenhoven, C. and Teeuw, R. (2008). A moraic and a syllabic H-tone in Yucatec Maya. Fonología instrumental: Patrones fónicos y variación, pages 49–71.

Gussenhoven, C. and Wright, J. (2015). Suprasegmentals. In Wright, J. D., editor, International Encyclopedia of the Social & Behavioral Sciences, volume 23, pages 714–721.

Haspelmath, M. and Sims, A. (2013). Understanding Morphology. Routledge.

Hauser, M. D., Chomsky, N., and Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579.
Hayes, B. (1995). Metrical Stress Theory: Principles and Case Studies. University of Chicago Press.

Hayes, B. (2011). Introductory Phonology, volume 32. John Wiley & Sons.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Hock, H. H. (1986). Compensatory lengthening: in defense of the concept 'mora'. Folia Linguistica, 20(3-4):431–460.

Hopper, P. (1987). Emergent grammar. In Annual Meeting of the Berkeley Linguistics Society, volume 13, pages 139–157.

Huang, S., Liu, J., Wu, X., Wu, L., Yan, Y., and Qin, Z. (1998). 1997 Mandarin Broadcast News Speech (HUB4-NE) LDC98S73. Web download.

Hyman, L. (1985). A Theory of Phonological Weight, volume 19. Walter de Gruyter GmbH & Co KG.

Hyman, L. (2014). How to study a tone language. Language Documentation & Conservation, 8:525–562.

Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666.

Jakobson, R., Fant, C. G., and Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates.

Johnson, K. (2004). Acoustic and auditory phonetics. Phonetica, 61(1):56–58.

Johnson, M. et al. (2011). How relevant is linguistics to computational linguistics? Linguistic Issues in Language Technology, 6(7).

Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3):241–254.

Jouvet, D. and Laprie, Y. (2017). Performance analysis of several pitch detection algorithms on simulated and real noisy speech data. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1614–1618. IEEE.

Jusczyk, P. W. (1995). Language acquisition: Speech sounds and the beginning of phonology.

Kaskari, S. M., Mohan, A. K., Fry, M. D., and Neumann, D. W. (2017). Generation of phoneme-experts for speech recognition. US Patent 9,792,900.

Kenstowicz, M. J. (1994). Phonology in Generative Grammar, volume 7. Blackwell, Cambridge, MA.

Koffka, K. (1922). Perception: an introduction to the Gestalt-theorie. Psychological Bulletin, 19(10):531.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243.

Krishnan, S. and Gonzalez, J. L. U. (2015). Google Compute Engine. In Building Your Next Big Thing with Google Cloud Platform, pages 53–81. Springer.

Kuhl, P. K., Conboy, B. T., Padden, D., Nelson, T., and Pruitt, J. (2005). Early speech perception and later language development: Implications for the "critical period". Language Learning and Development, 1(3-4):237–264.

Kutsch Lojenga, C. (1994). Ngiti: a Central-Sudanic language of Zaire.

Lachenbruch, P. A. and Goldstein, M. (1979). Discriminant analysis. Biometrics, pages 69–85.

Ladd, D. R. (1984). Declination: a review and some hypotheses. Phonology, 1:53–74.

Ladefoged, P. and Johnson, K. (2014). A Course in Phonetics. Nelson Education.

Lam, W. M. (2018). Perception of lexical tones by homeland and heritage speakers of Cantonese. PhD thesis, University of British Columbia.

Lam, Z., Hall, K. C., and Pulleyblank, D. (2016). Temporal location of perceptual cues for Cantonese tone identification. In 3rd Workshop on Innovations in Cantonese Linguistics (WICL-3), The Ohio State University.

Laniran, Y. O. and Clements, G. N. (2003). Downstep and high raising: interacting factors in Yoruba tone production. Journal of Phonetics, 31(2):203–250.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

Lass, R. (1984). Phonology: An Introduction to Basic Concepts. Cambridge University Press.

Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.

Lee, T., Lo, W. K., Ching, P., and Meng, H. (2002). Spoken language resources for Cantonese speech processing. Speech Communication, 36(3-4):327–342.

Leong, V. and Goswami, U. (2015). Acoustic-emergent phonology in the amplitude envelope of child-directed speech. PLoS One, 10(12):e0144411.

Leroy, L. (1984). The psychological reality of fundamental frequency declination. Antwerp Papers in Linguistics, Wilrijk, (40):1–102.

Leung, M.-T. and Law, S.-P. (2001). HKCAC: the Hong Kong Cantonese adult language corpus. International Journal of Corpus Linguistics, 6(2):305–325.

Lindblom, B. (1999). Emergent phonology. In Annual Meeting of the Berkeley Linguistics Society, volume 25, pages 195–209.

Lindley, D. (1990). Regression and correlation analysis. In Time Series and Statistics, pages 237–243. Springer.

Linge, O. (2015). Understanding the neutral tone in Mandarin. https://blog.skritter.com/2015/01/understanding-the-neutral-tone-in-mandarin/.

Maddieson, I. (1978). Universals of tone. Universals of Human Language, 2:335–365.

Maddieson, I. (2013a). Consonant inventories. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Maddieson, I. (2013b). Tone. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Maddieson, I. (2013c). Vowel quality inventories. In Dryer, M. S. and Haspelmath, M., editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Mané, D. et al. (2015). TensorBoard: TensorFlow's visualization toolkit.

Marcus, M., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank.

Matthews, S. and Yip, V. (2013). Cantonese: A Comprehensive Grammar. Routledge.

Maye, J., Weiss, D. J., and Aslin, R. N. (2008). Statistical phonetic learning in infants: Facilitation and feature generalization. Developmental Science, 11(1):122–134.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In Proceedings of Interspeech, pages 498–502.

McCloy, D. R., Wright, R. A., and Souza, P. E. (2015). Talker versus dialect effects on speech intelligibility: A symmetrical study. Language and Speech, 58(3):371–386.

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133.

Mielke, J. (2008). The Emergence of Distinctive Features. Oxford University Press.

Miller, Z., Dickinson, B., Deitrick, W., Hu, W., and Wang, A. H. (2014). Twitter spammer detection using data stream clustering. Information Sciences, 260:64–73.
Mok, P. P., Zuo, D., and Wong, P. W. (2013). Production and perception of a sound change in progress: Tone merging in Hong Kong Cantonese. Language Variation and Change, 25(3):341–370.

Mok, P. P.-K. and Wong, P. W.-Y. (2010). Perception of the merging tones in Hong Kong Cantonese: Preliminary data on monosyllables. In Speech Prosody 2010 – Fifth International Conference.

Mukherjee, S., Asnani, H., Lin, E., and Kannan, S. (2019). ClusterGAN: Latent space clustering in generative adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4610–4617.

Munson, B., Edwards, J., and Beckman, M. E. (2011). Phonological representations in language acquisition: Climbing the ladder of abstraction. Handbook of Laboratory Phonology, pages 288–309.

Nagel, T. (1974). What is it like to be a bat? The Philosophical Review, 83(4):435–450.

Ngan, M., Grother, P. J., and Ngan, M. (2015). Face Recognition Vendor Test (FRVT): Performance of automated gender classification algorithms. US Department of Commerce, National Institute of Standards and Technology.

Nguyen, N. (Collected on October 9th, 2018). A lot of apps sell your data. Here's what you can do about it.

Nickolls, J., Buck, I., Garland, M., and Skadron, K. (2008). Scalable parallel programming with CUDA. In ACM SIGGRAPH 2008 Classes, page 16. ACM.

Nousi, P. and Tefas, A. (2018). Evolving Systems, pages 1–14.

Odden, D. (1995). Tone: African languages. The Handbook of Phonological Theory, 1:444–75.

Ortega, L. (2014). Understanding Second Language Acquisition. Routledge.

Ou, J. (2012). Tone merger in Guangzhou Cantonese. PhD thesis, The Hong Kong Polytechnic University.

Parker, S. T. and Gibson, K. R. (1977). Object manipulation, tool use and sensorimotor intelligence as feeding adaptations in cebus monkeys and great apes. Journal of Human Evolution, 6(7):623–641.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Phillips, C. (2009). Should we impeach armchair linguists? Japanese/Korean Linguistics, 17:49–64.

Pierrehumbert, J. (1990). Phonological and phonetic representation. Journal of Phonetics, 18(3):375–394.

Pierrehumbert, J. B. (2003). Phonetic diversity, statistical learning, and acquisition of phonology. Language and Speech, 46(2-3):115–154.

Pitt, M. A., Johnson, K., Hume, E., Kiesling, S., and Raymond, W. (2005). The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45(1):89–95.

Pulleyblank, D. (1986). Tone in Lexical Phonology, volume 4. Springer Science & Business Media.

Pulleyblank, D. (1994). Underlying mora structure. Linguistic Inquiry, 25(2):344–353.

Qian, Y., Lee, T., and Soong, F. K. (2007). Tone recognition in continuous Cantonese speech using supratone models. The Journal of the Acoustical Society of America, 121(5):2936–2945.

Räsänen, O., Doyle, G., and Frank, M. C. (2018). Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171:130–150.

Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Rose, P. (1987). Considerations in the normalisation of the fundamental frequency of linguistic tone. Speech Communication, 6(4):343–352.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252.

Saffran, J. R., Aslin, R. N., and Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928.

Samuels, B. D. (2009). The structure of phonological theory. Harvard University, Cambridge, MA.

Schuknecht, H. F. (1993). Pathology of the Ear, volume 1. Lea & Febiger, Philadelphia.

Schuster, M., Johnson, M., and Thorat, N. (2016). Zero-shot translation with Google's multilingual neural machine translation system. Google Research Blog.

Senders, J. W. and Moray, N. P. (1995). Human Error: Cause, Prediction, and Reduction.

Shen, X. S. (1992). On tone sandhi and tonal coarticulation. Acta Linguistica Hafniensia, 25(1):83–94.

Shi, Q. S. (2004). Yi bai nian qian Guangzhou hua de yin ping diao. Fangyan, 1:34–46.

Shih, C. (1997). Mandarin third tone sandhi and prosodic structure. Linguistic Models, 20:81–124.

Shih, C. (2000). A declination model of Mandarin Chinese. In Intonation, pages 243–268. Springer.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.

Silverman, D. (1992). Multiple scansions in loanword phonology: evidence from Cantonese. Phonology, 9(2):289–328.

Silverman, D. (2006). A Critical Introduction to Phonology: Of Sound, Mind, and Body. A&C Black.

Silverman, K. E., Beckman, M. E., Pitrelli, J. F., Ostendorf, M., Wightman, C. W., Price, P., Pierrehumbert, J. B., and Hirschberg, J. (1992). ToBI: a standard for labeling English prosody. In ICSLP, volume 2, pages 867–870.

Smolensky, P. and Prince, A. (1993). Optimality theory: Constraint interaction in generative grammar. Optimality Theory in Phonology, page 3.

Steedman, M. (2011). Romantics and revolutionaries. Linguistic Issues in Language Technology, 6(11):1–20.

Stevens, K. and Halle, M. (1971). A note on laryngeal features. MIT-RLE Quarterly Progress Report, 101:198–213.

Strömbergsson, S. (2016). Today's most frequently used f0 estimation methods, and their accuracy in estimating male and female pitch in clean speech. In INTERSPEECH, pages 525–529.

Surendran, D. R. (2007). Analysis and automatic recognition of tones in Mandarin Chinese. The University of Chicago.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Taylor, P. (2009). Text-to-Speech Synthesis. Cambridge University Press.

Toderici, G., Vincent, D., Johnston, N., Hwang, S. J., Minnen, D., Shor, J., and Covell, M. (2017). Full resolution image compression with recurrent neural networks. In CVPR, pages 5435–5443.

Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., and Rao, K. (2018). Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908. IEEE.

Unni, K. (2018). Decrypting convolution neural network using simple images. https://towardsdatascience.com/convolution-neural-network-decryption-e323fd18c33. Accessed: 2019-08-16.
Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. In SSW, page 125.

van Oostendorp, M., Ewen, C. J., Hume, E. V., and Rice, K. (2011). The Blackwell Companion to Phonology, 5 Volume Set, volume 1. John Wiley & Sons.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., et al. (2019). AlphaStar: Mastering the real-time strategy game StarCraft II. DeepMind Blog.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Jarrod Millman, K., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors (2019). SciPy 1.0 – fundamental algorithms for scientific computing in Python. arXiv e-prints, page arXiv:1907.10121.

Wang, J. (2004). The neutral tone in trisyllabic sequences in Chinese dialects. In International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.

Werbos, P. (1974). Beyond regression: new tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.

Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1):49–63.

Whalen, D. H. and Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica, 49(1):25–47.

Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25(1):61–83.

Xu, Y. (2001). Fundamental frequency peak delay in Mandarin. Phonetica, 58(1-2):26–52.

Yaohong, L. and Guoqiao, Z. (1998). The Dong language in Guizhou Province. Trans. from Chinese by D. Norman Geary. Dallas and Arlington: Summer Institute of Linguistics, and University of Texas at Arlington.

Yip, M. (2002). Tone. Cambridge University Press.

Yu, A. (2007). Understanding near mergers: The case of morphological tone in Cantonese. Phonology, 24(1):187–214.

Yu, K. M. (2011). The learnability of tones from the speech signal. PhD thesis, University of California, Los Angeles.

Yu, K. M. (2017). The role of time in phonetic spaces: Temporal resolution in Cantonese tone perception. Journal of Phonetics, 65:126–144.

Yu, K. M. and Lam, H. W. (2014). The role of creaky voice in Cantonese tonal perception. The Journal of the Acoustical Society of America, 136(3):1320–1333.

Yuan, J., Ryant, N., and Liberman, M. (2014). Automatic phonetic segmentation in Mandarin Chinese: boundary models, glottal features and tone. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2539–2543. IEEE.

Yuan, J., Ryant, N., and Liberman, M. (2015). Mandarin Chinese Phonetic Segmentation and Tone LDC2015S05. Linguistic Data Consortium.

Yuen, I. (2007). Declination and tone perception in Cantonese. Tones and Tunes: Experimental Studies in Word and Sentence Prosody, pages 63–78.
Appendix A

Supplementary Figures

A.1 Distribution of ground-truth labels in clusters (hypothesized tones) identified using the method for Mandarin

Figure A.1: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using the adversarial autoencoder described in the thesis.

A.2 Hypothesized tones for Mandarin by clustering latent codes from a vanilla autoencoder

Figure A.2: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using a vanilla autoencoder (in contrast to the adversarial autoencoder described in the thesis).

A.3 Hypothesized tones for Mandarin by clustering acoustic parameters without an autoencoder

Figure A.3: Hypothesized tones for Mandarin (two through nine tones) and the proportion of ground-truth tone labels that occur within the cluster corresponding to that tone. These results were generated using only acoustic parameterization and no abstraction from an autoencoder.
