Bayesian models of learning and generating inflectional morphology

by

Blake H. Allen

A.B., Harvard University, 2011
A.M., Harvard University, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Linguistics)

The University of British Columbia
(Vancouver)

October 2016

© Blake H. Allen, 2016

Abstract

In many languages of the world, the form of individual words can undergo systematic variation in order to express concepts including tense, gender, and relative social status. Accurate models of these inflectional systems, such as verb conjugation and noun declension systems, are indispensable for purposes of both language research and language technology development.

This dissertation presents a theoretical framework for understanding and predicting native speakers’ use of their languages’ inflectional systems. I propose a probabilistic interpretation of the task that speakers face when inferring unfamiliar inflected forms, and I argue in favor of a Bayesian approach to modeling this task. Specifically, I develop the theory of sublexical morphology, which augments the Bayesian approach with intuitive methods for calculating necessary probabilities. Sublexical morphology also possesses the virtue of computational implementability: this dissertation defines all data structures used in sublexical morphology, and it specifies the procedures necessary to use a model for morphological inference. I provide along with this dissertation a Python package that implements all the classes and methods necessary to perform inference with a sublexical morphology model. I also describe an implemented learning algorithm that allows induction of sublexical morphology models from labeled but unparsed training data.

As empirical support for my core claims, I describe the outcomes of two behavioral experiments.
Evidence from a test of Icelandic speakers’ inflection of novel words demonstrates that speakers are able to additively make use of information from multiple provided inflected forms of a word, and evidence from a similar test on Polish speakers suggests that speakers may be limited to this additive way of combining such pieces of information. In clear support of a Bayesian interpretation of morphological inference, both experiments additionally demonstrate that prior probabilities, understood as reflecting lexical frequencies of different groupings of words, play a major role in speakers’ use of their inflectional systems. This is shown to be true even when influence from prior probabilities results in speakers apparently deviating from exceptionless lexical patterns in those systems.

Preface

This dissertation is the original intellectual product of the author, Blake Allen. Data about the grammar and lexicon of Icelandic were compiled with assistance from Gunnar Ó. Hansson, who also provided guidance as a native speaker of Icelandic when I was designing the Icelandic experiment. Paulina Lyskawa provided guidance as a native speaker of Polish when I was designing the Polish experiment. The learning algorithm for sublexical phonology grammars, which is the basis of the PyParadigms learning algorithm described in chapter 2, was developed in collaboration with Michael Becker.

Parts of chapters 3 and 4 were presented in their preliminary versions at the International Morphology Meeting (Feb. 2016 in Vienna) and the Germanic Linguistics Annual Conference (May 2016 in Reykjavík).

All projects and associated methods were approved by the University of British Columbia’s Research Ethics Board [certificate #H14-01142].

Table of Contents

Abstract
Preface
Table of Contents
List of Figures
Glossary
Acknowledgments
1 Goals and motivations
  1.1 Why model the paradigm cell filling problem?
    1.1.1 Theoretical linguistics
    1.1.2 Natural language processing
  1.2 Model desiderata
    1.2.1 Accuracy and precision
    1.2.2 Computational implementability
    1.2.3 Learnability
  1.3 Structure of dissertation
2 Bayesian morphology and sublexical morphology
  2.1 Formalizing the paradigm cell filling problem
  2.2 Bayesian morphology
  2.3 Sublexical morphology
    2.3.1 Theoretical core of sublexical morphology
    2.3.2 Data structures of sublexical morphology
  2.4 Derivative inference in sublexical morphology
    2.4.1 Calculating probabilities of bases
    2.4.2 Calculating prior probabilities
    2.4.3 Normalization
    2.4.4 Generating the candidate set
    2.4.5 Bringing everything together
  2.5 Learning in sublexical morphology
    2.5.1 Learning algorithm inputs
    2.5.2 Learning mapping sublexicons
    2.5.3 Learning paradigm sublexicons
    2.5.4 Learning gatekeeper grammar weights
  2.6 Relatedness to other theories
3 Inference from multiple bases
  3.1 Single-base hypotheses
    3.1.1 Motivations for the single surface base hypothesis
    3.1.2 A probabilistic single surface base hypothesis
  3.2 Icelandic nouns and multiple bases
    3.2.1 Icelandic noun inflection
    3.2.2 Predictors of the Icelandic AccPl
  3.3 Falsifying the single-base restriction: an Icelandic experiment
    3.3.1 Methodology
    3.3.2 Results
    3.3.3 Discussion of Icelandic experiment
  3.4 Base independence
    3.4.1 The base independence hypothesis
    3.4.2 Testing the base independence hypothesis
  3.5 Summary and discussion
4 Empirical priors
  4.1 Priors in inflection
  4.2 Assessing prior influence in Icelandic
    4.2.1 Lexical frequencies in Icelandic
    4.2.2 Evidence for empirical priors in Icelandic
  4.3 Assessing prior influence in Polish
    4.3.1 Lexical frequencies in Polish
    4.3.2 Evidence for empirical priors in Polish
  4.4 Summary and discussion
5 Conclusion
  5.1 Summary of proposals and evidence
  5.2 Other applications of sublexical morphology
    5.2.1 Paradigm leveling
    5.2.2 Paradigmatic gaps
    5.2.3 Paradigm entropy
  5.3 Limitations and future directions
Bibliography
Appendix: supplementary materials

List of Figures

Figure 1.1 A tabular representation of part of the paradigm for the lexeme TO LOVE in normative European Spanish.
Figure 2.1 Tableau showing example weights and violation profiles for four hypothetical gatekeeper grammar constraints, as well as the forms’ harmony scores.
Figure 2.2 Example weights of various constraints in the “-ar”, “-er”, and “-ir” sublexicons of normative European Spanish.
Figure 2.3 Constraint output/violation profiles for the inputs comprising Inf and 3Sg forms of TO LOVE for the three example sublexicons.
Figure 2.4 Example morphological operations for an “-ar” sublexicon in normative European Spanish.
Figure 2.5 The three steps of the PyParadigms learning algorithm for sublexical morphology models.
Figure 2.6 Inputs to the PyParadigms learning algorithm.
Figure 2.7 Base–derivative cell pairs among the present indicative cells in Spanish verbs.
Figure 2.8 Tableaux showing the training data for the Spanish “-ar” sublexicon with their observed and predicted frequencies.
Figure 3.1 Examples of three classes of nouns in Middle High German, in the NomSg and NomPl.
Figure 3.2 A fully connected inflection graph.
Figure 3.3 An inflection graph under the single surface base hypothesis, with cell a as the privileged base.
Figure 3.4 An inflection graph under a weakened version of the single surface base hypothesis, assuming that each cell can be generated from some cell.
Figure 3.5 The four cases and two numbers of Icelandic nouns, as well as their abbreviations.
Figure 3.6 Representative words and their suffix paradigms from six inflectional classes associated with the feminine gender.
Figure 3.7 Four AccPl suffixes of Icelandic nouns, their usual genders, and their stem vowels.
Figure 3.8 Raw counts (and head noun-based counts) of Icelandic noun forms grouped by their AccPl, GenSg, and NomPl suffixes.
Figure 3.9 Raw counts (and head noun-based counts) of Icelandic noun forms grouped by their AccPl, GenSg, and NomPl suffixes, but with GenSg-based and NomPl-based groupings performed separately.
Figure 3.10 Schematic of four possible AccPl suffixes in Icelandic, with their typical lexical correspondences to GenSg and NomPl suffixes.
Figure 3.11 A screenshot of one trial frame in the Icelandic experiment.
Figure 3.12 The four presentation conditions of the Icelandic experiment.
Figure 3.13 The suffixes of the four inflectional classes into which novel lexeme stems were randomly distributed.
Figure 3.14 Participants’ proportions of “correct” responses in the Icelandic experiment by presentation condition.
Figure 3.15 A GLMM with maximum likelihood coefficients predicting whether a participant’s selected AccPl corresponded to the correct AccPl.
Figure 3.16 Results of likelihood ratio tests for Icelandic.
Figure 3.17 A schematic of an inflectional system which would be able to make use of cross-base constraint conjunctions.
Figure 3.18 The two cases of immediate interest and two numbers of Polish nouns, as well as their abbreviations.
Figure 3.19 The full inflectional paradigms of nouns MAP map- and BORDER, LIMIT granic-, representative of the hard feminine and soft feminine inflectional classes, respectively.
Figure 3.20 The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.
Figure 3.21 The four presentation conditions of the Polish experiment.
Figure 3.22 Participants’ proportions of “correct” responses in the Polish experiment by presentation condition.
Figure 3.23 The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.
Figure 3.24 Participants’ proportions of “correct” (-a NomPl) responses for neuter-class items in the Polish experiment by presentation condition.
Figure 3.25 A GLMM with maximum likelihood coefficients predicting whether a participant’s selected NomPl corresponded to the correct NomPl.
Figure 3.26 GLMMs with maximum likelihood coefficients predicting, for each subset of the data of a particular gender class, which NomPl suffix participants selected.
Figure 4.1 A toy nominal inflectional system, showing the “singular” and “plural” forms of nouns in three classes.
Figure 4.2 Counts of Icelandic lexemes whose AccPl forms take each of the four target endings.
Figure 4.3 Visualized counts of Icelandic lexemes whose AccPl forms take each of the four target endings.
Figure 4.4 Frequencies of participants in the Icelandic experiment selecting an AccPl with each of the four possible suffixes.
Figure 4.5 Kullback-Leibler divergences of hypothetical prior distributions over AccPl endings from the observed response distribution from the Icelandic study in the presentation condition providing only DatPl forms.
Figure 4.6 The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.
Figure 4.7 Counts of Polish lexemes whose NomPl forms end in each of the target characters.
Figure 4.8 Visualized counts of Polish lexemes whose NomPl forms end in each of the target characters.
Figure 4.9 Frequencies of participants in the Polish experiment selecting a NomPl with each of the two possible suffixes.
Figure 4.10 Kullback-Leibler divergences of hypothetical prior distributions over NomPl endings from the observed response distribution in the DatPl-only condition of the Polish study.
Figure 5.1 A schematic of Old Latin and Golden Age Latin NomSg and GenSg forms relevant to the leveling of HONOR-like words.
Figure 5.2 The morphological operations deriving NomSg and GenSg forms from each other in the paradigm sublexicons of Old Latin.

Glossary

lexeme
a unit of meaning with an associated syntactic category, which in languages exhibiting inflection of that syntactic category, specifies a particular phonological form only when associated with a set of morpho-syntactic/semantic features
• Examples: CAT, JUMP

(morpho-syntactic/semantic) feature
a category of “auxiliary meanings” which specifies one dimension in an inflectional paradigm
• Examples: person, number, tense, mood

(feature) value
a meaning specifying one possible semantic referent sub-category for a feature
• Examples: 3rd (person), singular (number), preterite (tense), subjunctive (mood)

(word) form
a phonological shape corresponding to the pairing (combination) of a lexeme and a full set of feature specifications
• Example: in English the plural (number) form of CAT is [kæts]

cell
a full set of features which together could specify a form of any lexeme in an inflectional system
• Example: the dative (case) singular (number) cell in the Icelandic noun inflectional system

paradigm
the set of all cells in an inflectional system, or a particular lexeme’s set of all inflected forms
• Example: in English the paradigm of a noun includes only a singular form and a plural form; the paradigm of CAT is {singular: [kæt], plural: [kæts]}

candidate
one of the forms that could conceivably express a particular combination of a lexeme and a set of feature values
• Example: [dʒəræfs] and [dʒərævz] are candidates for the plural form of GIRAFFE

base
in the context of a derivation/inference task, a known form of the target lexeme which can be used to infer unknown forms (derivatives) of that lexeme
• Example: a Spanish speaker may use the 1st person singular present and 3rd person singular present forms of the lexeme LOVE, [amo] and [ama] respectively, as bases when attempting to infer that lexeme’s 1st person plural present form
• Note: can also be used to refer to a specific cell, abstracted away from any particular lexeme, e.g.
“speakers tend to use the infinitive as a base”

derivative
in the context of a derivation/inference task, an unknown form of some lexeme which must be inferred
• Example: a Spanish speaker may use known forms of the lexeme LOVE to infer a derivative form of that lexeme such as its 1st person plural present form
• Note: can also be used to refer to a specific cell, abstracted away from any particular lexeme, e.g. “speakers used the singular base forms provided to infer plural derivatives”

constraint
a function that assesses whether a form meets some criterion; equivalent to a feature (in the Machine Learning sense, not the phonological sense) or an operationalized independent variable (in the statistics sense, e.g. as part of a linear model)
• Example: the constraint [3Sg: a#] evaluates to 1 if applied to a 3rd person singular form ending in [a] and evaluates to 0 otherwise

Acknowledgments

Even from before my first day as a PhD student, I have been profoundly fortunate to have the academic and personal support of many of the finest people I have ever known. Thanks in large part to these bonds, I have mercifully been spared most of the hardships usually associated with completing a doctoral degree program.

First, I cannot imagine a department more worthy of being a source of pride for me than UBC’s department of linguistics. The hours, thought, and care afforded me by my department-internal dissertation committee members (Gunnar, Kathleen, and especially Doug) have far exceeded even my optimistic expectations from six years ago. It has been a complete pleasure and honor working with all of you as teacher/student and as collaborators, and I sincerely hope that we will be able to continue these relationships even now that I have left Vancouver. At the absolute least, you will always remain, in my thoughts and my heart, the greatest linguists I could have selected to serve on my committee.
Speaking of my department, I can honestly say that every professor and staff member there has been a positive influence on me, especially Carla, Bryan, Martina, Eric, Shaine, Strang, and Edna. I am so grateful to all of you and the rest of the department for your role in my life these past five years.

Many mentors outside my department have also played crucial roles in helping me achieve my doctoral degree. Out of all these I first want to thank Michael for helping me take my initial steps toward becoming a fully fledged computational linguist over breakfast at the 2012 LSA meeting in Portland (not to mention his exemplary work as my AM advisor). I am also so fortunate to have spent time discussing my ideas with Alex (the ideal statistician to serve on a linguistics PhD committee) and with Paul, whose sabbatical at UBC was crucial in solidifying my understanding of the math underlying some of my proposals. I would like to give my thanks to Bruce, Robert, and Naomi for their thoughtful comments and teachings at various points in my program as well, and especially to Adam for the same and for serving as my external examiner. I also acknowledge the friendly help that Paulina provided as I was constructing the methodology for my Polish experiment, and the excellent work of Bosung and the rest of the eNunciate team.

Of course, I would not have made it through my program so happily without the other graduate students who have become such close friends of mine and have shared so many unforgettable hours with me. Zoe, Kamila, Somaye, Erin, Kevin, Andrei, Natalie, Michael (all three of you), and everyone else who has enriched my daily life: thank you! Aside from my fellow UBC graduate students, I cannot go without thanking Sarah for her irreplaceable role in my life: past, future, and kairos.
And finally, I am tremendously grateful for the unyielding love, enthusiastic support, and exemplary kindness I have received from Ben these past three and a half years.

I will close by noting my greatest appreciation to the two people who have influenced and cared for me the most throughout my entire life: my parents. Mom, Dad, there are no two people in the world whose deep love I have felt so keenly for so long, nor any two who I imagine could have provided as much care and inspiration as the two of you have. Thank you.

Chapter 1

Goals and motivations

Systems of inflectional morphology, including the systems of verb conjugation and noun declension widespread among the world’s languages, exemplify the power and complexity of natural language. These systems encode a bidirectional mapping between sound and meaning, and like natural language more generally, they are characterized by their productivity: knowledge of an inflectional system gives speakers the ability to generate and comprehend previously unknown sound–meaning mappings.

Unlike the syntax of a language, however, inflectional morphology permits a clearly demarcated, if still infinite, range of meanings. Inflected words typically have a core meaning with an associated syntactic category, such as CAT (noun) or RUN (verb), along with additional elements of meaning which narrow the word’s range of denotations, e.g. by specifying a noun as plural in number or a verb as present in tense. In this dissertation, I refer to the core meanings as lexemes and the additional elements of meaning as morpho-syntactic/semantic features.
Each feature can take various values; for example, the inflectional system of nouns in English includes a number feature which can take two values: singular or plural.¹ Inflectional systems are systems by which the form (spoken or written) of a lexeme can change depending on its feature values.

¹ The glossary starting on page x contains a full list of related terms and their definitions as used in this dissertation.

The uniquely restricted nature of inflectional systems emerges from a noteworthy distinction between lexemes and features. In any inflectional system, the set of lexemes is open, meaning that new lexemes (like TO GOOGLE) can be introduced and assimilated into the inflectional system at any time. Conversely, the set of features and the sets of their respective values are closed; except in cases of language change, they cannot be augmented or reduced. Therefore the maximum number of different phonological or orthographical forms a lexeme can take is equal to the number of combinations of the feature values its inflectional system includes.²

As a result of the semi-closed nature of inflectional morphology, descriptions of inflectional systems often depict them in a tabular format. Each feature corresponds to a dimension (e.g. the horizontal dimension ranging across columns or the vertical dimension ranging across rows), and each value of a feature corresponds to a particular column/row/etc. of the feature’s dimension. Figure 1.1 exemplifies such a table for a small subset of the forms of a single verb in normative European Spanish, using phonological representations of segmental information (Alarcos Llorach, 1994). The horizontal dimension displays the possible values for the person feature, and the vertical dimension within each sub-table displays the possible values for the number feature.
The top and bottom sub-tables should be considered a third dimension, showing two possible values, present and imperfect, for the tense feature.

[present]     1st        2nd        3rd
singular      amo        amas       ama
plural        amamos     amais      aman

[imperfect]   1st        2nd        3rd
singular      amaba      amabas     amaba
plural        amabamos   amabais    amaban

Figure 1.1: A tabular representation of part of the paradigm for the lexeme TO LOVE in normative European Spanish.

² Strictly speaking, the set of phonologically distinct forms can exceed this number in cases of variation, e.g. smelled and smelt would occupy the same cell in a paradigm. The number of sets of forms with inflectionally distinct meanings, however, is fixed.

If we consider only the feature values shown above, then this table exhausts all of the possible meanings of verbs with the lexeme TO LOVE, as well as all of their canonical forms. It is likely that an adult native speaker of this variety of Spanish will have heard all of these forms at one point or another, and as a result, the ability of such a speaker to produce [amamos] with the intent of conveying the meaning “we love” could be attributed to a feat of memory.

Suppose, however, that a particular native speaker of this variety of Spanish has never encountered the second person plural imperfect of TO LOVE. Because of the productivity of inflectional morphology, this speaker, if pressed by conversational context to produce the appropriate form of this lexeme, would most likely be able to infer from the inflectional system as a whole and from her/his known forms of the lexeme TO LOVE that this unknown form should be [amabais]. Moreover, a third party who has also never heard this form before would likely be able to infer the form’s meaning, independent of its context.
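
The tabular view in Figure 1.1 can be sketched in code as a mapping from cells to forms. The Python sketch below is purely illustrative and does not reproduce the Python package that accompanies this dissertation; the names FEATURES, all_cells, known_forms, and missing_cells are hypothetical. It enumerates cells as combinations of feature values and then lists the cells for which a hypothetical speaker has no stored form, as in the scenario of the unheard second person plural imperfect.

```python
from itertools import product

# Feature values shown in Figure 1.1 (a small subset of the Spanish verb system).
FEATURES = {
    "tense": ["present", "imperfect"],
    "number": ["singular", "plural"],
    "person": ["1st", "2nd", "3rd"],
}

def all_cells(features):
    """Each cell is one combination of a value for every feature."""
    return list(product(*features.values()))

# Forms of TO LOVE stored by a hypothetical speaker, keyed by
# (tense, number, person). The 2nd person plural imperfect is absent.
known_forms = {
    ("present", "singular", "1st"): "amo",
    ("present", "singular", "2nd"): "amas",
    ("present", "singular", "3rd"): "ama",
    ("present", "plural", "1st"): "amamos",
    ("present", "plural", "2nd"): "amais",
    ("present", "plural", "3rd"): "aman",
    ("imperfect", "singular", "1st"): "amaba",
    ("imperfect", "singular", "2nd"): "amabas",
    ("imperfect", "singular", "3rd"): "amaba",
    ("imperfect", "plural", "1st"): "amabamos",
    ("imperfect", "plural", "3rd"): "amaban",
}

def missing_cells(features, forms):
    """Cells of the paradigm for which no form is stored."""
    return [cell for cell in all_cells(features) if cell not in forms]

cells = all_cells(FEATURES)
gaps = missing_cells(FEATURES, known_forms)
print(len(cells))  # 2 tenses x 2 numbers x 3 persons
print(gaps)        # the cells this speaker would have to infer
```

Representing cells as tuples of feature values makes the multiplicative growth of the cell set explicit: adding a feature with k values multiplies the number of cells by k.
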
Given that the set of lexemes is boundless, and that the set of cells grows multiplicatively with the number of features in an inflectional system, the need for both types of inference is likely common in languages with non-trivial inflectional systems and/or frequent coinage of neologisms.

The first of these inferences, inferring and producing an unknown form in an inflectional system, has been called the paradigm cell filling problem (Ackerman et al., 2009; Malouf & Ackerman, 2010). This name draws on the tabular metaphor introduced above: each combination of a lexeme and a set of compatible feature values specifies a single cell in such a table, and when a speaker does not explicitly know the form that belongs in any one of these cells, she or he must infer a form to fill that gap. Here I define a paradigm as the set of all the forms associated with a particular lexeme. Inference of an unknown form represents a “problem” in two senses: first, when a speaker needs to produce an inflected form, if a form with that meaning has not been heard before then the speaker cannot pull it from memory and must instead solve the problem of determining what that form should be; and second, this phenomenon of speakers generating unknown forms presents a problem for linguists because the mechanism by which it occurs is not well understood.

This dissertation develops a formal and computationally implemented model of the paradigm cell filling problem as it is faced by native speakers of a language with inflectional morphology. The next section describes why a model of the paradigm cell filling problem is both interesting for theoretical linguists and useful for language-related tasks in the real world. Section 1.2 then proposes a set of specific criteria that such a model should meet.
The final section provides a summary of the structure of this dissertation, including the next chapter in which I present a framework for modeling the paradigm cell filling problem.

1.1 Why model the paradigm cell filling problem?

Empirically valid formal models of the paradigm cell filling problem can greatly benefit theoretical linguists’ understanding of the faculty of language, and computational implementations of these models are useful for a range of natural language processing tasks. This section briefly explains why readers from both fields should be interested in learning about and participating in the development of such models.

1.1.1 Theoretical linguistics

Theoretical linguists rarely describe their research as dedicated to modeling a specific linguistic task or linguistic behavior of speakers; instead, it is common to focus on modeling the faculty of language or grammar itself, with the mostly implicit understanding that an accurate model of the grammatical system itself will naturally explain specific linguistic behaviors related to it. Even so, since at least the rise of generative grammar (Chomsky & Halle, 1968; Chomsky, 1956, 1957), one type of linguistic behavior has claimed central importance as the object of study: a native speaker’s inference (or generation) of wellformed linguistic units in her or his language. Optimality theory (McCarthy & Prince, 1993; Prince & Smolensky, 2008) and minimalist syntax (Chomsky, 1995), for example, are primarily theories of derivations of utterances from the properties of a grammar, and the practical machinery of these theories is crafted for this purpose. However, inference (or generation or derivation) of wellformed utterances is not the only natural language task humans must perform.
Such robust, practical machinery for modeling some of these other tasks, such as learning a first language (Pater & Tessier 2003, Hudson Kam & Newport 2005 among others) and judging wellformedness (Coleman & Pierrehumbert 1997, Hayes & Wilson 2008 among others), has been developed to a degree by communities of linguists, but other tasks, such as identifying the semantics of an encountered sentence/form, have received little attention within theoretical linguistics.

Studying the paradigm cell filling problem is tantamount to studying the familiar topic of language generation or derivation in the domain of inflectional morphology. In this sense, despite the mild unorthodoxy of explicitly describing my research as focusing on a single language-related task, this dissertation has the same goal as most of the existing linguistic theory literature: to explain how speakers use their grammars productively to create wellformed but novel utterances. Making the limited scope of my investigation clear this way therefore does little to actually reduce the breadth of my domain of inquiry, while it does assist in developing a precise theory by creating a laser-like focus on a single linguistic phenomenon.

Moreover, the generation of inflectional morphology in particular warrants study. Inflectional morphology lies at one interface among phonology, semantics, and syntax, and so at the very least, empirical investigations of inflection and a formal understanding of its limitations both serve the purposes of all these sub-fields of linguistics. Largely because of the uniquely semi-closed nature of inflectional morphology described above, I agree with the likes of Matthews (1972), Zwicky (1985), Spencer (1991), Anderson (1992), Aronoff (1994), and Beard (1995) that researchers can arrive at useful, illuminating conclusions about language by investigating inflectional morphology on its own rather than only as it relates to phonology or syntax.
This dissertation also presents questions about inflectional morphology which research on syntax or phonology would not be likely to ask, and indeed I show that these questions have answers not found in existing literature.

1.1.2 Natural language processing

The paradigm cell filling problem, as I use the term, is equivalent to the task of natural language generation in the domain of inflectional morphology. While the inflectional morphology of English may be simple, the proliferation of computers and internet access across the globe has created a need for language technologies that cover other languages: not only those with large speaker populations, but also those with fewer speakers, e.g. for detecting natural disasters based on social media data (Gales et al., 2014; Ji et al., 2014; Mortensen et al., in review). Most of these languages are more inflectionally complex than English (Stump & Finkel, 2013).

For these inflectionally complex languages, and even for English, merely compiling a database of inflected forms based on dictionaries or corpora does not suffice for supporting natural language processing systems. There are several reasons that such an approach cannot succeed. The set of lexemes in a language is not fixed, and neologisms that must be integrated into inflectional systems arise frequently, stymying efforts to create a comprehensive database of inflected forms. Moreover, because native speaker use of an inflectional system can differ from prescriptively “correct” forms found in reference texts, and because speaker use of an inflectional system can also change over time, such a database would need to draw on an ever-changing corpus of speaker productions.
However, given the roughly Zipfian distribution of lexeme frequencies (Wyllys, 1981), continually expanding such a corpus in order to fill in previously unattested forms is likely to continually add new lexemes (each with, perhaps, only a single form attested), meaning that a corpus-based database of all inflected forms in a language is untenable. In addition, when memory limits are strict, the ability of generative models to produce an infinite number of possible forms based on a constant-sized grammar may be useful. Considering these challenges, computational models of inflectional morphology capable of generating novel forms are essential to natural language generation, especially for inflectionally complex languages.

Many natural language processing tasks, including machine translation and question answering, require not (only) the generation of inflected forms from a semantic input, but the identification of the semantics of provided inflected forms. Whereas a model that parses the semantics of inflected forms cannot generate novel forms, generative models of the paradigm cell filling problem can, indirectly, serve as models of the opposite task. With a training set of inflected forms and a generative model based on those forms, one can predict all unattested inflected forms for observed lexemes, and these predictions can be matched to inflected forms in some new source text in order to determine the forms’ ranges of possible interpretations. In this sense, models of the paradigm cell filling problem can constitute general-purpose models of a variety of inflectional morphology tasks.

1.2 Model desiderata

While models of the paradigm cell filling problem stand to benefit both theoretical linguistics and language technology, these benefits will be maximized only if such models are designed conscientiously, with an eye toward ways they might be used. 
This section sketches some of the key criteria for a successful model of the paradigm cell filling problem.

1.2.1 Accuracy and precision

It is universally true that useful models must be as accurate as possible, in the sense that for any set of inputs, the outcomes predicted by a model should ideally be identical to the outcomes observed in the system being modeled. I contend that models should also be as precise as possible, in the sense that for any set of inputs, the range of outcomes compatible with a model’s predictions should be maximally narrow. The importance of this property derives from the fact that more precise models are easier to falsify and therefore easier to improve upon. Even better, a model should make predictions at various levels of precision so that the specific level at which it errs can be identified.

Assessing the accuracy of a model of the paradigm cell filling problem is not as straightforward as ensuring that native speakers produce the same single inflected form as the model predicts for a particular paradigm cell filling problem/query. Speaker behavior in morpho-phonological tasks exhibits widespread but principled variability and gradience (Batchelder 1999; Becker et al. 2012; Becker & Gouskova 2013; Hayes & Londe 2006; Hayes et al. 2009; Eddington et al. 2013; among others). An accurate model therefore cannot predict only a single output or response for each query; it must define a set of possible outputs/responses and a measurement of how likely speakers are to produce each one. Probability theory naturally fits this need: creating probabilistic models allows the assignment of probabilities to particular patterns of outputs given a model, as well as sampling from a model in order to simulate speaker behavior. 
Given this property of probability theory, as well as the fact that its mathematical bases are well understood and the fact that implementations of various probabilistic model families are widely available, I suggest that the most accurate models of the paradigm cell filling problem must be probabilistic ones.

The accuracy of models of the paradigm cell filling problem should be assessed primarily using behavioral experiments, i.e. wug tests (Berko, 1958; Kawahara, 2011, 2016), which elicit judgments about novel inflected forms to determine speaker knowledge of morphological patterns. Ultimately, the system being modeled is the speaker’s grammar, that is, her or his ability to use an inflectional system productively, applying it to produce inflected forms that have not previously been encountered. The problem with measuring the accuracy of a model against frequencies (which can be viewed as proportional to probabilities) of forms in a corpus of inflected forms, even in cases where such corpora exist, is that speakers producing inflected forms may simply be retrieving them from memory after having encountered them before. Wug tests provide a convenient and established methodology for avoiding this confound. Wug tests can also be carried out at large scales, for example using internet-based methods, yielding a large sample of responses which can be used to estimate response distributions comparable to the probability distributions predicted by a model.

In the context of the paradigm cell filling problem, a maximally precise model would predict exactly the same inflected form given a grammar and a set of inputs (i.e. base forms of the target lexeme) as speakers with that grammar and those inputs would produce. Given the gradient nature of linguistic behavior, as stated above, the model should actually predict a probability distribution over speaker productions. 
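The comparison between a predicted distribution and a sample of wug-test responses can be made concrete with a small scoring routine. The following is only a minimal sketch and is not part of PyParadigms: the candidate forms, the predicted probabilities, and the response counts are invented for illustration, and log-likelihood is just one reasonable choice of score.

```python
import math


def empirical_distribution(counts):
    """Turn raw response counts into relative frequencies."""
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}


def log_likelihood(predicted, counts):
    """Log probability of the observed responses under a predicted distribution."""
    return sum(n * math.log(predicted[form]) for form, n in counts.items())


# Hypothetical model predictions for one paradigm cell filling query.
predicted = {"blogear": 0.9, "bloger": 0.1}
# Hypothetical wug-test responses from 50 participants.
responses = {"blogear": 42, "bloger": 8}

observed = empirical_distribution(responses)
score = log_likelihood(predicted, responses)
```

A higher (less negative) score indicates predictions closer to the observed responses; the relative frequencies of the responses themselves give the best score attainable on that sample.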
This level of precision in predictions differs from that of, for example, a model which simply generates the set of likely inflected forms. Coarser granularities of predictions are still useful, however. As long as a model produces predictions at this maximal level of precision, it can also predict, for example, that some factor $x$ should influence speaker behavior, or that factors $y$ and $z$ should interact in influencing behavior. The nature of these coarser predictions depends largely on the formalism used for a model, but can include, as I address in this dissertation, the model taking a stance on the influence of lexical frequencies on speaker behavior.

1.2.2 Computational implementability

Whereas most theoretical models of inflectional morphology have been defined only in terms of prose descriptions and formal notation, researchers have recently begun implementing their models computationally, i.e. formalizing them in a body of source code. These computationally implemented models include Network Morphology (Brown & Hippisley, 2012), the generative component of the Minimal Generalization Learner (Albright & Hayes, 2002), and sublexical phonology (Allen & Becker, in review; Gouskova & Newlin-Łukowicz, 2013). No such approach has yet explicitly modeled the paradigm cell filling problem in the domain of entire inflectional systems, but I propose that computational implementability is essential in models of this phenomenon as well.

Creating a computational implementation along with a traditional prose and notational description of a model family confers several benefits. For one, a computational implementation minimizes model ambiguity: other researchers can investigate any aspect of a model by looking at its implementation, which must be defined clearly enough for a computer to run it. 
The need to achieve this level of clarity also benefits the originator of the theory, sometimes bringing to light ambiguities that would have otherwise gone unnoticed and unaddressed.

Computational implementations also make it far easier to test a model’s predictions about a given dataset, even for large datasets, as compared to needing to create a model’s representations and work through a model’s processes by hand. Consequently, it is possible to test models on much more wide-ranging datasets, both for the researcher developing a model and for others testing it on their own data. (Responses to this need for machine-readable data files also improve the ecosystem of data available to the community of researchers.) Quantitative and probabilistic models in particular virtually require computational implementations, as the price paid for their flexibility and power is a proliferation of mathematical operations necessary when evaluating model predictions.

1.2.3 Learnability

As described above, a precise model of the paradigm cell filling problem will predict a probability distribution over inflected derivative forms given a grammar and a set of base forms. However, quantitative models tend to have very large hypothesis spaces. Unlike an Optimality Theory grammar (McCarthy & Prince, 1993; Prince & Smolensky, 2008), for example, which has $n!$ possible configurations, one for each ranking of its $n$ constraints, a grammar with $n$ weighted constraints such as a Maximum Entropy harmonic grammar (Goldwater & Johnson, 2003; Hayes & Wilson, 2008) or other flavor of harmonic grammar (Legendre et al., 1990; Pater, 2009) has an essentially infinite number of possible configurations, since each constraint can take any real number as its weight. 
One appealing way to cope with this problem of massive hypothesis spaces is to also model the learning of models or grammars from sets of input data: defining a procedure for determining a model’s parameter values tightly limits the space of valid model configurations for a given set of data. These limitations free the analyst from needing to consider the space of implausible model parameterizations, and can result in more accurate models by including predictors or interactions among predictors that may be difficult for human analysts to notice or formalize (Hayes & Wilson, 2008; Hayes & White, 2013). Moreover, when the input to a model is a set of observables (e.g. the word forms a speaker knows) as opposed to abstract parameters of a grammar, the model can be used more easily for practical purposes, as there is less need for an expert analyst with knowledge of how to tune model parameters.

Beyond these pragmatic reasons, the call for morphological and phonological theory grounded in considerations of learnability by humans is stronger than ever (Albright & Hayes, 2011; Archangeli & Pulleyblank, 2012; Moreton & Pater, 2012), and I find these calls as justified for the paradigm cell filling problem as for other domains of language. One goal of theoretical linguistics is to study the mental linguistic systems of humans, and so since humans learn language from data in our environments, unlearnable grammar formalisms are unlikely to accurately model human language. The domain of learnability has also proved worthy of study in its own right, since some notable aspects of natural language may only become apparent when studied from the standpoint of learnability (Hudson Kam & Newport, 2005; Jesney & Tessier, 2009; McMullin, 2016).

1.3 Structure of dissertation

In the next chapter, I introduce sublexical morphology, a Bayesian framework for modeling the paradigm cell filling problem which is computationally implemented and comes equipped with a learning algorithm. 
The remainder of the dissertation substantiates the claims that make up the sublexical morphology proposal. Specifically, the content of the following chapters can be summarized as follows.

Chapter 2 lays out my modeling proposals. It starts with a probabilistic interpretation of the paradigm cell filling problem and then introduces a Bayesian view of how speakers “solve” this “problem”. Finally and most substantially, the chapter details the sublexical morphology framework, which grounds the abstract Bayesian approach in concrete methods for generating inflected forms and inferring their probability distributions.

Chapter 3 presents two experimental investigations into questions of how native speakers use knowledge of base forms when inferring unknown derivative forms, questions whose answers bear directly on the validity of sublexical morphology. First, I use evidence from a behavioral experiment on Icelandic speakers to show that speakers are able to combine discrete pieces of information from multiple inflected base forms when performing inference. Second, I discuss results from a similar experiment on Polish speakers which suggest that speakers may be restricted to linear (additive) combinations of such pieces of information.

Chapter 4 revisits the Icelandic and Polish experimental results, performing post-hoc analyses suggesting that speakers are strongly influenced by raw lexical frequencies of morphological exponents. According to these results, these influences can even cause speakers to fail to productively apply otherwise exceptionless morphological patterns. In the context of sublexical morphology and Bayesian morphology in general, these results support the hypothesis that prior probabilities of morphological exponents play a central role in determining linguistic behavior.

Chapter 5 concludes the dissertation. First it summarizes the claims of previous chapters, interspersing proposals and their empirical support. 
The chapter then describes how theoretical linguists could fruitfully apply the theory of sublexical morphology to research topics other than the paradigm cell filling problem per se, such as paradigm leveling and paradigmatic gaps. It ends by mentioning limitations of the theory as it stands now and proposing follow-up research that could address them and extend the theory’s applicability.

Chapter 2

Bayesian morphology and sublexical morphology

This chapter develops the primary claims of the dissertation. First, in section 2.1, I use the language of probability theory to establish a formal description of the paradigm cell filling problem. Section 2.2 then introduces the concept of a surface-oriented, Bayesian view of morphological inference. The most fundamental proposal I make in this dissertation is that such a Bayesian account of the paradigm cell filling problem is both valid and useful, and so this section sets up specific claims that I substantiate in chapters 3 and 4. Because the framework of Bayesian morphology does not by itself constitute a concrete, computationally implementable theory, in the remainder of the chapter, I propose a specific flavor of Bayesian morphology: sublexical morphology. Section 2.3 describes the architecture of sublexical morphology models, and section 2.4 illustrates how such a model can perform the inference necessary to solve the paradigm cell filling problem. The sublexical morphology proposal introduces further empirical claims which are addressed in chapters 3 and 4. Section 2.5 details a learning algorithm for sublexical morphology models. Finally, section 2.6 compares sublexical morphology to a selection of other theories of inflectional morphology.

As a supplement to this dissertation, I have created a Python (van Rossum & Drake, 1995) implementation of sublexical morphology, which is publicly available under the project name PyParadigms. 
This implementation includes a learning algorithm for sublexical morphology models (described in section 2.5) as well as various command-line utilities for using learned models, e.g. for performing paradigm cell filling problem queries and investigating formal properties of an inflectional system. This software is provided freely on an open source basis and is available for download at https://github.com/bhallen/pyparadigms.

In this chapter and the remainder of the dissertation, I use a standard set of typographical conventions as laid out here. Names of lexemes like CAT are typeset in small caps, and verb lexeme names are normally written with a preceding “to” as in TO LOVE, but they may sometimes be written simply as LOVE when part of speech is clear from context. Transcriptions are provided in square brackets using IPA symbols, e.g. [kæt], and reflect phonological forms (broad transcriptions) except where otherwise indicated. Orthographical forms like vivir are written in italics. Conventions for labeling cells in a paradigm are introduced as necessary throughout, but in general I use compounds of abbreviated names of a cell’s morpho-syntactic features, e.g. 1SgPresIndic for the first singular present indicative cell.

Throughout this chapter, I will use the system of verbal inflection in normative European Spanish (Alarcos Llorach, 1994) to illustrate key concepts. Specifically, I draw examples from the present indicative cells of the three “regular” classes of verbs: “-ar” verbs (e.g. amar TO LOVE), “-er” verbs (e.g. temer TO FEAR), and “-ir” verbs (e.g. partir TO SPLIT/DEPART). Having limited the domain of examples this way, I use person-number labels like 1Sg and 3Pl to indicate these cells, abstracting away from their tense and mood.

2.1 Formalizing the paradigm cell filling problem

The paradigm cell filling problem is the problem of how to predict the form in an unfamiliar cell of a familiar lexeme’s paradigm. 
I assume that the lexeme must be familiar in the sense that at least one of its forms is known to the speaker or provided to the model.[1] In the context of a probabilistic grammar, this task can be re-interpreted as one of first inferring a probability distribution $p(D)$ over the candidates $D$ for this unfamiliar (derivative) form of the familiar lexeme $\ell$, and then sampling from that distribution to select a single form to utter.

[1] The related question of how a speaker might initially fit some new word into an inflectional paradigm lies outside the scope of this thesis. However, in many cases, this process has been described by other authors. In Spanish, for example, new verbs can enter the inflectional system as infinitives by concatenating the basic pronunciation of the referent with a standard [ear] suffix, e.g. faxear TO FAX and bloguear TO BLOG (Honrubia et al., 2011). Similarly, in Japanese, a novel verb’s dictionary form can be created by replacing the final mora of the base word with the suffix [ru] and enforcing relevant verbal phonotactics, as in [jafuː] YAHOO! → [jafuru] TO SEARCH USING YAHOO! and [guːguru] GOOGLE → [guguru] TO SEARCH USING GOOGLE (Tsujimura & Davis, 2011).

In order to infer a distribution over derivative form candidates, a speaker must have encountered at least one “base” form of the same lexeme $\ell$. These base forms serve two purposes. First, familiarity with at least one such form makes the speaker aware of the lexeme’s existence, a logical precursor to solving the paradigm cell filling problem as conceived here. Second, it is the phonological shapes of these base forms that provide information the speaker can then use to infer a specific distribution over derivative forms. For example, a speaker of Spanish who knows of the lexeme TO BLOG only that its first person singular present indicative form is [blogeo] and that its third person singular present indicative form is [blogea] can infer from the shapes of these forms that the probability of this lexeme’s infinitive being [blogear] is much higher than its probability of being [bloger]. Of course, to make productive use of such implicational relationships, the speaker must also have a grammar that encodes them; the general idea of Bayesian morphology does not require any specific formalism for this grammar, and the question of what a grammar compatible with Bayesian morphology might look like is addressed in discussions of sublexical morphology starting in section 2.3.

Just as $D$ represents the set of derivative form candidates $d$ for a lexeme $\ell$, I use $B$ to signify the set of possible base forms $b$ in a single cell for a lexeme $\ell$. Because forms associated with multiple base cells can conceivably be used in inferring a single derivative form, I employ a subscript to indicate the base form sets of individual base cells. In the abstract, then, we can indicate a speaker’s inferred probability that the derivative form of some lexeme $\ell$ is $d$, given the observed forms in $n$ base cells, as shown in 2.1.

$$p(D = d \mid B_1 = b_1, B_2 = b_2, \ldots, B_n = b_n) \tag{2.1}$$

Hereafter, I will make use of the shorter notational convention shown in 2.2 where contextually appropriate. By this convention, the probability $p(D = d)$ that the discrete variable $D$ takes the value $d$ (e.g. that the plural form of a lexeme has some particular phonological form) is abbreviated as $p(d)$, with equivalent abbreviations for other variables $B_1$ etc.

$$p(d \mid b_1, b_2, \ldots, b_n) = p(D = d \mid B_1 = b_1, B_2 = b_2, \ldots, B_n = b_n) \tag{2.2}$$

For example, either of these formats can represent the Spanish TO BLOG example introduced above, as shown in equation 2.3.

$$p([\text{blogear}] \mid [\text{blogeo}], [\text{blogea}]) = p(\text{infinitive} = [\text{blogear}] \mid \text{1Sg} = [\text{blogeo}], \text{3Sg} = [\text{blogea}]) \tag{2.3}$$

In reality, there is no restriction to a single phonological form for any lexeme-cell combination. 
There may be multiple phonologically distinct forms in common use, for example [kæktəsɪz] and [kæktaɪ] for the plural of CACTUS in English.[2] The probabilistic framework I have laid out is compatible with this complexity: instead of conditioning derivative candidate distributions on base form variables which must each take a single value, as in 1Sg=[blogeo], we can condition them on observation counts of base forms. Sections 2.3 through 2.5 provide more detail about how such counts are used in sublexical morphology. The notation shown in equation 2.3 is more compact and approximates the relevant facts for forms with no phonological variation, and so I continue to use it for expositional purposes. Wherever the simpler notation is used, it can be treated as a shorthand for relevant distributions of frequency counts.

[2] In this dissertation I abstract away from predictable allophonic detail in inflected forms, for example the presence or absence of aspiration on the initial [k] of [kæktaɪ]. I assume that all mental calculations are carried out on phonological forms that have been accurately inferred from interlocutor speech.

2.2 Bayesian morphology

The previous section defined the task of generating an unfamiliar derivative form (that is, solving an instance of the paradigm cell filling problem) as the inference of a conditional probability distribution $p(D \mid B_1, B_2, \ldots, B_n)$ followed by sampling from that distribution. This definition alone, however, falls far short of a useful model of the phenomenon, as there is no clear way to directly infer such a distribution from base forms. Moreover, to directly calculate derivative probabilities this way would require summing over derivative probabilities conditioned on all possible combinations of base forms in order to arrive at the constant used to normalize probabilities so that they sum to 1. 
Because any concatenation of phonological units is a possible form of a given lexeme in any specific cell, this set of combinations of possible base forms is infinite, and summing over it would require commensurate computational resources (although see Hayes & Wilson 2008 for an approach that approximates this normalization constant at substantial but finite computational expense). If such a model were learned from a set of training data based on counts of base form combinations, sparsity would also pose a substantial problem, as only a tiny fraction of possible combinations of base forms would be likely to be represented in the data.

Fortunately, by applying Bayes’s theorem, it is possible to decompose the conditional distribution $p(D \mid B_1, B_2, \ldots, B_n)$ into sub-parts more amenable to direct calculation. As shown in equation 2.4, the probability of a derivative candidate given a set of base forms is proportional to the following quantity: the probability of those base forms given the derivative candidate, times the prior probability of the derivative candidate.

$$p(D \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid D)\, p(D) \tag{2.4}$$

Proportionality of the right-hand side to the left-hand side in this case means that the left-hand side is equal to the right-hand side divided by a normalization constant $Z$. This constant sums the right-hand side of equation 2.4 across all possible derivative candidates. The set of derivative forms to sum over is still infinite, as is the set that would need to be summed over in order to calculate $p(D \mid B_1, B_2, \ldots, B_n)$ directly. However, given a finite approximation of the set of all possible forms in an arbitrary cell, the size of this set for a single derivative cell is smaller than the size of the set of all combinations of base cells by a factor of the number of cells in an inflectional system. 
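Spelled out with the constant $Z$, the proportionality in equation 2.4 can be restated as follows. This is only a restatement of Bayes’s theorem in the abbreviated notation of equation 2.2; the primed symbol $d'$, which ranges over all derivative candidates, is introduced here for illustration.

```latex
p(d \mid b_1, b_2, \ldots, b_n)
  = \frac{p(b_1, b_2, \ldots, b_n \mid d)\, p(d)}{Z},
\qquad
Z = \sum_{d'} p(b_1, b_2, \ldots, b_n \mid d')\, p(d')
```

Dividing by $Z$ guarantees that the probabilities of all derivative candidates sum to 1, and the cost of computing $Z$ grows with the number of candidates $d'$ in the sum.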
More importantly for the purposes of this dissertation, one can posit constraints that restrict the set of derivative candidates in particular to a small set, such that summing over them is trivial; this is the approach that I take in the following section.

I define Bayesian morphology as the proposal that the behavior of native speakers faced with the paradigm cell filling problem can be predicted using this application of Bayes’s theorem to the probability theoretic definition of the paradigm cell filling problem. This general framework predicts, for example, that knowledge of the prior probabilities of derivative forms (as defined in some way) is indispensable in predicting morphological behavior. But while Bayesian morphology has the computational and empirical advantages described here, it still lacks the specificity necessary to constitute a practical theory of the paradigm cell filling problem. In order to achieve this research goal, it is necessary to create a theory that builds on Bayesian morphology by adding mechanisms for calculating the distributions $p(B_1, B_2, \ldots, B_n \mid D)$ and $p(D)$.

2.3 Sublexical morphology

Sublexical morphology is a specific flavor of Bayesian morphology that provides an intuitive, computationally implementable, and efficient way to calculate the distributions $p(B_1, B_2, \ldots, B_n \mid D)$ and $p(D)$ and therefore the paradigm cell filling problem objective distribution $p(D \mid B_1, B_2, \ldots, B_n)$. The sublexical morphology framework also provides an algorithm for learning models of inflectional systems from sets of training data, form–meaning pairs which approximate the information that human learners could plausibly have when learning their morphological grammars. 
Sublexical morphology is based in part on sublexical phonology (Allen & Becker, in review; Becker & Gouskova, 2013; Gouskova & Newlin-Łukowicz, 2013), from which it inherits the spirit of concepts like sublexicons and gatekeeper grammars, although sublexical phonology lacks the explicitly Bayesian character of sublexical morphology which is a major focus of this dissertation.

This section introduces the central claims specific to sublexical morphology, explaining how they build on the general claims of Bayesian morphology. This exposition includes both an introduction to the core theoretical claims of the framework and the data structures that the theory of sublexical morphology uses to represent an inflectional system. Section 2.4 then details the mechanism by which a sublexical morphology model can be used to perform morphological inference, that is, how it can solve the paradigm cell filling problem. Section 2.5 follows up on these descriptions of sublexical morphology models by explaining the algorithm by which such models can be learned from a set of training data. Finally, section 2.6 compares sublexical morphology to some other influential theories of inflectional morphology.

2.3.1 Theoretical core of sublexical morphology

The theory of sublexical morphology can be characterized by the claim that an inflectional system is composed of a set of paradigm sublexicons, each of which is a set of lexemes with identical morphological behavior. These paradigm sublexicons resemble the traditional concept of inflectional classes, but have more clearly defined internal structures and roles in derivation. Cases in which a lexeme exhibits multiple attested forms in any particular cell are the principled exception to this rule, and such a lexeme may belong to multiple paradigm sublexicons; such cases are discussed further later in this section. 
In general, however, a language’s paradigm sublexicons can be thought of as a partition of the lexicon into by-lexeme (not by-cell) subparts, each of which is homogeneous with respect to the language’s inflectional morphology. Note that except where contrasting paradigm sublexicons with the related concept of mapping sublexicons in section 2.5, I will refer to paradigm sublexicons simply as sublexicons.

When I describe a sublexicon as homogeneous in its morphological behavior, I mean that for every lexeme in a sublexicon, for each pair of cells, there is a single morphological operation (which may include multiple changes, e.g. stem vowel mutation and suffixation) that takes as input the phonological form of that lexeme in one of those cells and outputs the phonological form of that lexeme in the other cell. These operations deal only in surface-level phonological forms, not abstract underlying representations or roots; for a discussion of how this approach contrasts with other theories of inflectional morphology and how I justify it, see section 2.6. For example, in normative European Spanish, there might be one sublexicon (the “-ar” sublexicon) with morphological operations like those shown in 2.5. The # symbol here indicates the right edge of an inflected form.

$$\text{morphological operations}: \begin{cases} \text{1Sg} \to \text{2Sg}: [\text{o\#}] \to [\text{as\#}] \\ \text{1Sg} \to \text{3Sg}: [\text{o\#}] \to [\text{a\#}] \\ \ldots \\ \text{3Pl} \to \text{1Pl}: [\text{n\#}] \to [\text{mos\#}] \\ \text{3Pl} \to \text{2Pl}: [\text{n\#}] \to [\text{is\#}] \end{cases} \tag{2.5}$$

The central reason for positing sublexicons is that when at least one base form of a lexeme is known, division of the lexicon into sublexicons sets up a direct mapping from sublexicon to derivative candidate, and this mapping can be used in morphological inference. Sublexical morphology allows many-to-one mappings from sublexicons to derivative candidates, i.e. different sublexicons that happen to generate the same derivative candidate, but it explicitly disallows one-to-many mappings, meaning that the choice of a sublexicon fully determines the choice of a derivative candidate. 
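The way a choice of sublexicon determines a derivative candidate can be sketched with the operations in 2.5 treated as rewrites of a form's right edge. This is a minimal illustration rather than the PyParadigms implementation: the function names and the dictionary encoding are hypothetical, only pure suffix changes are handled (not, for example, stem vowel mutation), and [ablan] is assumed here as the 3Pl form of SPEAK.

```python
def apply_operation(form, old_ending, new_ending):
    """Rewrite the right edge of a surface form; return None if it does not match."""
    if not form.endswith(old_ending):
        return None
    return form[: len(form) - len(old_ending)] + new_ending


# Hypothetical encoding of part of the "-ar" sublexicon's operations in 2.5.
# Keys are (base cell, derivative cell); values are (old ending, new ending).
AR_OPERATIONS = {
    ("1Sg", "2Sg"): ("o", "as"),
    ("1Sg", "3Sg"): ("o", "a"),
    ("3Pl", "1Pl"): ("n", "mos"),
    ("3Pl", "2Pl"): ("n", "is"),
}


def derive(form, base_cell, derivative_cell, operations):
    """Generate the single derivative candidate that a sublexicon predicts."""
    old_ending, new_ending = operations[(base_cell, derivative_cell)]
    return apply_operation(form, old_ending, new_ending)


print(derive("ablo", "1Sg", "2Sg", AR_OPERATIONS))   # ablas
print(derive("ablan", "3Pl", "1Pl", AR_OPERATIONS))  # ablamos
```

Because each pair of cells has exactly one operation per sublexicon, `derive` returns at most one candidate, which mirrors the ban on one-to-many mappings.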
This property also extends to a probabilistic setting: establishment of a probability distribution over sublexicons fully determines a probability distribution over derivative candidates. In general, the probability of a derivative form is equal to the probability of the sublexicon that generates that derivative form, as shown in equation 2.6. This equation constitutes perhaps the most central proposal of this dissertation. Note that there is a trivial exception to this simple equality when multiple sublexicons generate the same derivative form; since their probabilities only need to be added together, for now I abstract away from these edge cases to avoid baroque notational conventions.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(S \mid B_1, B_2, \ldots, B_n) \tag{2.6}$$

This equality reduces the task of inferring a probability distribution over derivative candidates to a classification task: the speaker or model needs only to assess how similar the target lexeme (the one whose derivative form is being inferred) is to each sublexicon, and this probabilistic classification suffices to arrive at a distribution over derivative candidates. However, because the distribution over sublexicons is still conditioned on the joint distribution over all sets of base forms, an infinite set of sets, direct calculation of conditional distributions over sublexicons is still infeasible for the same reason as discussed in section 2.2.

Fortunately, the equality set up between derivative candidates and sublexicons leaves the distribution over the latter equally amenable to application of Bayes’s theorem. 
By applying the theorem just as shown in section 2.2, but with distributions $S$ over sublexicons substituted for distributions $D$ over derivative forms, we arrive at equation 2.7.

$$p(S \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid S)\, p(S) \tag{2.7}$$

The advantage of this interpretation of the paradigm cell filling problem is that both quantities from which derivative probabilities emerge, $p(B_1, B_2, \ldots, B_n \mid S)$ and $p(S)$, have intuitive, computationally tractable methods of calculation. Finally, therefore, this equation is the end result of all the manipulations necessary to describe sublexical morphology, since it achieves the goal of relating easily computable quantities to the quantities of central import, i.e. the probabilities of derivative candidates. The term $p(B_1, B_2, \ldots, B_n \mid S)$, which I call the likelihood term because it indicates likelihood of the attested base forms given a particular sublexicon, can be calculated by using sublexicons’ gatekeeper grammars, log-linear models based on phonological constraints. The prior probabilities of sublexicons $p(S)$ correspond to the “sizes” of the various sublexicons in terms of how many lexemes are associated with them.

The following subsection takes these theoretical claims and specifies how they are implemented in a set of formal, pseudo-computational data structures. This half of the section largely serves to provide a concrete grounding to the abstract claims set up so far, as well as to detail how various special cases are handled. Note that chapters 3 and 4 provide empirical evidence for the likelihood term and the prior term, respectively.

2.3.2 Data structures of sublexical morphology

Within the sublexical morphology framework, a model of a particular inflectional system consists of a set of (paradigm) sublexicons. 
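As a bridge from equation 2.7 to these data structures, the arithmetic can be illustrated for a three-sublexicon system. The sketch below is not the PyParadigms implementation and rests on stated assumptions: the sublexicon sizes and likelihood values are invented, the likelihoods are supplied by hand rather than computed by gatekeeper grammars, and the query assumes a known 1Sg form [temo] with 2Sg as the target cell. The second function adds together sublexicons that generate the same candidate, the edge case noted for equation 2.6.

```python
from collections import defaultdict


def posterior_over_sublexicons(sizes, likelihoods):
    """Equation 2.7: posterior proportional to likelihood times prior (prior = relative size)."""
    total_size = sum(sizes.values())
    unnormalized = {
        name: likelihoods[name] * sizes[name] / total_size for name in sizes
    }
    z = sum(unnormalized.values())  # finite sum: one term per sublexicon
    return {name: value / z for name, value in unnormalized.items()}


def posterior_over_candidates(sublexicon_posterior, candidate_of):
    """Add up the probabilities of sublexicons that generate the same candidate."""
    result = defaultdict(float)
    for name, prob in sublexicon_posterior.items():
        result[candidate_of[name]] += prob
    return dict(result)


# All numbers below are hypothetical.
sizes = {"-ar": 600, "-er": 250, "-ir": 150}        # lexemes per sublexicon
likelihoods = {"-ar": 0.2, "-er": 0.5, "-ir": 0.5}  # p(base forms | sublexicon)
candidate_of = {"-ar": "temas", "-er": "temes", "-ir": "temes"}

sublexicon_posterior = posterior_over_sublexicons(sizes, likelihoods)
candidates = posterior_over_candidates(sublexicon_posterior, candidate_of)
```

With these invented values the large “-ar” sublexicon keeps a sizeable share of the probability through its prior even though its likelihood is lowest, which is the kind of prior-driven effect the framework predicts.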
Each sublexicon has three components: a set of associated lexemes (or, more properly, associated forms of lexemes), a set of morphological operations that map from forms in one cell to forms in another, and a gatekeeper grammar that assesses the likelihood of a set of base forms given the sublexicon. The structure of a model, and the structures of its sublexicons, are schematized in 2.8. In this subsection I describe each of these parts in turn.

$$\text{model}: \begin{cases} \text{paradigm sublexicon}: \begin{cases} \text{associated forms} \\ \text{morphological operations} \\ \text{gatekeeper grammar} \end{cases} \\ \text{paradigm sublexicon}: \begin{cases} \text{associated forms} \\ \text{morphological operations} \\ \text{gatekeeper grammar} \end{cases} \\ \ldots \end{cases} \tag{2.8}$$

Associated forms

Each paradigm sublexicon is associated with a morphologically homogeneous subset of the lexemes in the language’s inflectional system. In the case of the normative European Spanish verbs example, ignoring for now complicating phenomena like diphthongization and velar insertion (Albright, 2002) which would increase the number of sublexicons, the inflectional system can be split into three sublexicons that parallel the three canonical inflectional classes of Spanish verbs:

$$\text{model}: \begin{cases} \text{paradigm sublexicon 1 (“-ar”)} \\ \text{paradigm sublexicon 2 (“-er”)} \\ \text{paradigm sublexicon 3 (“-ir”)} \end{cases} \tag{2.9}$$

Each form among the associated forms of a sublexicon is stored along with a label for its lexeme, which cell it belongs to, and its frequency, as shown in the abstract in 2.10 and for a hypothetical Spanish “-ar” sublexicon in 2.11.
Interpreting sublexical morphology as a theory of human use of inflectional morphology, these frequency counts indicate the number of times each phonological form has been heard by a speaker; when a sublexical morphology model is learned by a computational implementation of its learning algorithm, frequencies represent the frequencies of those forms in the training data provided to the learning algorithm. Either way, I assume in this chapter that forms are stored as phonological representations of uttered inflected forms. Note however that sublexical morphology is not strictly limited to the domain of phonological representations, and can operate over orthographical representations where they approximate phonological representations, as used in chapters 3 and 4.

$$\text{associated forms}: \begin{cases} \text{cell, lexeme, form: frequency} \\ \text{cell, lexeme, form: frequency} \\ \ldots \end{cases} \tag{2.10}$$

$$\text{associated forms}: \begin{cases} \text{1Sg, SPEAK, [ablo]: 800} \\ \text{2Sg, SPEAK, [ablas]: 200} \\ \ldots \\ \text{2Pl, COOK, [kosinais]: 100} \\ \text{3Pl, COOK, [kosinan]: 700} \end{cases} \tag{2.11}$$

When each lexeme–cell combination in a language has only one attested phonological form, all forms of a particular lexeme will be associated with just a single sublexicon. Section 2.5, which describes the learning algorithm for sublexical morphology models, makes it clear why this is the case. When there is any variability in the form of a particular lexeme in any cell, however, that lexeme will be associated with one sublexicon for each of its variants. For example, the English plural of CACTUS varies between [kæktəsɪz] and [kæktaɪ], and so the former form would be associated with an “add final -ɪz to pluralize” sublexicon and the latter with an “[əs#]→[aɪ#] to pluralize” sublexicon. It is convenient, though, to think of sublexicons roughly as partitions of the lexicon by lexeme, since describing them as partitions “by form” is less clear about the criteria for the partitioning (and would, e.g.
be consistent with mistakenly thinking of each sublexicon as containing a subset of the cells in an inflectional system). Note as well that because sublexical morphology lacks any mechanism for performing derivational morphology, even a compound which one could argue “contains” multiple lexemes, like English BLACKBOARD or SUNLIGHT, is itself treated as a single lexeme.

Morphological operations

The forms within a particular sublexicon are morphologically homogeneous with respect to each other, and this property of homogeneity is encoded in a sublexicon’s morphological operations. A single morphological operation is defined as the changes that must be applied to the inflected form in one cell to produce its inflected form in a different cell. Therefore there are as many morphological operations associated with a paradigm sublexicon as there are ordered pairs of cells in the inflectional system, a total of $n^2 - n$ operations for $n$ cells. The schematic in 2.12 shows the abstract form of a sublexicon’s morphological operations, and 2.13 repeats the earlier example of morphological operations for the “-ar” sublexicon in normative European Spanish. As before, the # symbol is used to indicate the right edge of an inflected form.

$$\text{morphological operations}: \begin{cases} \text{base cell} \to \text{derivative cell: operation} \\ \text{base cell} \to \text{derivative cell: operation} \\ \ldots \end{cases} \tag{2.12}$$

$$\text{morphological operations}: \begin{cases} \text{1Sg} \to \text{2Sg: [o\#]} \to \text{[as\#]} \\ \text{1Sg} \to \text{3Sg: [o\#]} \to \text{[a\#]} \\ \ldots \\ \text{3Pl} \to \text{1Pl: [n\#]} \to \text{[mos\#]} \\ \text{3Pl} \to \text{2Pl: [n\#]} \to \text{[is\#]} \end{cases} \tag{2.13}$$

The morphological homogeneity of a sublexicon can be formalized in terms of such operations.
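As a rough illustration of the operations in 2.13, the Python sketch below stores each operation as a pair of suffixes keyed by an ordered pair of cells. This is not the PyParadigms operation formalism, which is richer; the names `CELLS`, `AR_OPERATIONS`, `apply_operation`, and `count_operations` are hypothetical, and the six-cell inventory is an assumption made only for the example.

```python
from itertools import permutations

# Hypothetical cell inventory, assumed only for illustration.
CELLS = ["1Sg", "2Sg", "3Sg", "1Pl", "2Pl", "3Pl"]

# Operations keyed by (base cell, derivative cell). Each value is a
# (suffix_to_remove, suffix_to_add) pair: a simplification of the
# "[o#] -> [as#]" notation, where "#" marks the right edge of the form.
AR_OPERATIONS = {
    ("1Sg", "2Sg"): ("o", "as"),
    ("1Sg", "3Sg"): ("o", "a"),
    ("3Pl", "1Pl"): ("n", "mos"),
    ("3Pl", "2Pl"): ("n", "is"),
}


def apply_operation(form, operation):
    """Rewrite the right edge of `form` according to `operation`."""
    old, new = operation
    if not form.endswith(old):
        raise ValueError(f"{form!r} does not end in {old!r}")
    return form[: len(form) - len(old)] + new


def count_operations(cells):
    """Number of ordered pairs of distinct cells, i.e. n^2 - n."""
    return len(list(permutations(cells, 2)))


if __name__ == "__main__":
    print(apply_operation("ablo", AR_OPERATIONS[("1Sg", "2Sg")]))
    print(count_operations(CELLS))
```

Under these assumptions, the 1Sg→2Sg entry maps [ablo] to [ablas], and a six-cell system has 6² − 6 = 30 ordered cell pairs, in line with the n² − n count given above.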
Each operation can be thought of as a function that takes an inflected base form as its input and yields an inflected derivative form as its output. For a sublexicon to be morphologically homogeneous, then, the following must hold: for any lexeme in that sublexicon, for any base cell and derivative cell, the sublexicon’s morphological operation from that base cell to that derivative cell must generate (one of) the lexeme’s attested form(s) in the derivative cell when provided (one of) the lexeme’s attested base form(s).³ For example, the operations for the Spanish sublexicon shown in 2.13 are valid for all pairs of forms that would be associated with that sublexicon, e.g. TO TOUCH 1Sg [toko] ∼ 2Sg [tokas] and TO SPEAK 3Pl [ablan] ∼ 2Pl [ablais].

³ The parenthetical additions here accommodate the fact that a lexeme may be associated with multiple sublexicons. In such cases, for each sublexicon that a lexeme is associated with, there must be some pair of a base form and a derivative form such that the appropriate morphological operation in that sublexicon correctly generates the derivative form from the base form.

Each morphological operation can include multiple individual changes. In a model of the inflectional morphology of Arabic nouns, there might be a sublexicon whose morphological operations include one indexed to a cell Singular as the base cell and Plural as the derivative cell and comprising the following changes: mutate the first vowel to [a], insert [aː] after the second consonant, and mutate the last vowel to [i]. Such an operation would map between inflected form pairs like LOCUST Singular [dʒundub] ∼ Plural [dʒanaːdib] (Childs, 2003).

Sublexical morphology does not commit to any one particular computational implementation of the functions constituting these morphological operations. They could conceivably be implemented as finite state transducers (Karttunen & Beesley, 2005), at least for regular morpho-phonological relations. The PyParadigms implementation of sublexical morphology follows the work of Allen & Becker (in review) in using an operation formalism designed specifically for encoding morpho-phonological changes in a way that mirrors the cross-linguistic typology of such changes. This formalism makes it possible for operation positions to be stated, for example, in terms of a word’s final syllable nucleus. While the formalism used for these operations plays an undeniably large role in the utility and learnability of sublexical morphology models, the task of making empirical claims about the nature of these operations beyond their basic nature as mappings from inflected form to inflected form lies outside the scope of this dissertation. Allen & Becker (in review) includes a discussion of considerations related to this issue.

Gatekeeper grammar

Finally, each sublexicon includes a gatekeeper grammar. This component of a sublexicon assigns a probability to a set of provided base forms; intuitively, this probability indicates how well-formed the base forms are as members of that sublexicon. Gatekeeper grammars have the formal structure of Maximum Entropy (MaxEnt) harmonic grammars (Goldwater & Johnson, 2003; Hayes & Wilson, 2008; Wilson, 2006), meaning that they are constraint-based grammars in which constraints have real-valued weights rather than a set ranking. Outside the domain of theoretical linguistics, they can be described as log-linear models (Knoke & Burke, 1980). A gatekeeper grammar is fully parameterized by its set of constraints and their weights.

Notably, while I borrow use of the term constraint from the MaxEnt harmonic grammar literature, these objects are not constraints in the usual phonological sense of the word. Each constraint acts as an indicator function, meaning that it evaluates to 1 if a particular structure is present in the input and otherwise evaluates to 0.
In the terminology of Optimality Theory (Prince & Smolensky, 2008), evaluating to 1 equates to assigning a violation. However, a constraint may either serve to detect a structure whose presence increases the probability of the input base forms (if the constraint has a positive weight) or to detect a structure whose presence decreases that probability (if it has a negative weight). It is also possible to restrict grammars to using only positive or negative weights, or only weights within arbitrary ranges, as desired; see Pater (2009) and Daland (2015) for discussions of some implications of various weight conventions.

Evaluation of a constraint can be usefully thought of as a two-step process, one reflected in the notational convention I use for constraints and in the data structure used to represent them. A constraint first needs a label for the cell whose forms it evaluates; having extracted the input form in that cell (if one is provided), the remainder of the constraint provides a description of the structure whose presence in that form will result in the constraint evaluating to 1. There are no restrictions on which cells a sublexicon’s constraints can refer to. As with the morphological operations, sublexical morphology is agnostic as to the specific details of how this structure-detecting component is implemented. PyParadigms uses regular expressions supplemented with rules capable of expanding featural descriptions (if compatible with a provided phonological feature system) into sets of characters amenable to regular expression matching. The abstract template for a gatekeeper grammar and an example of the grammar for the Spanish “-ar” sublexicon are shown in 2.14 and 2.15, respectively.
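The two-step evaluation just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the PyParadigms code: the names `Constraint`, `evaluate`, and `harmony` are hypothetical, featural expansion is omitted, and the weights are the example values used for the Spanish “-ar” grammar in 2.15 and Figure 2.1.

```python
import re
from typing import Dict, List, NamedTuple


class Constraint(NamedTuple):
    cell: str      # cell whose form the constraint inspects
    pattern: str   # target sequence, with "#" marking the right word edge
    weight: float  # real-valued weight (may be positive or negative)


def evaluate(constraint: Constraint, bases: Dict[str, str]) -> int:
    """Indicator function: 1 if the target structure is present, else 0."""
    form = bases.get(constraint.cell)      # step 1: pick out the cell's form
    if form is None:                       # no form supplied for this cell
        return 0
    regex = constraint.pattern.replace("#", "$")
    return 1 if re.search(regex, form) else 0   # step 2: detect the structure


def harmony(grammar: List[Constraint], bases: Dict[str, str]) -> float:
    """Weighted sum of constraint outputs over all provided bases."""
    return sum(c.weight * evaluate(c, bases) for c in grammar)


# Example weights as in 2.15 / Figure 2.1 (hypothetical values).
AR_GRAMMAR = [
    Constraint("3Sg", "a#", 4.0),
    Constraint("3Sg", "e#", -2.5),
    Constraint("3Pl", "an#", 2.2),
    Constraint("3Pl", "en#", -1.0),
]
BASES = {"3Sg": "ama", "3Pl": "aman"}

if __name__ == "__main__":
    print(round(harmony(AR_GRAMMAR, BASES), 3))
```

Keying each constraint to a cell means that a constraint simply contributes nothing when no base is available in its cell, which matches the "if one is provided" proviso above.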
Again, the symbol # indicates the right edge of a word, equivalent to $ in a regular expression.

$$\text{gatekeeper grammar}: \begin{cases} \text{cell: target sequence} & \text{weight} \\ \text{cell: target sequence} & \text{weight} \\ \ldots \end{cases} \tag{2.14}$$

$$\text{gatekeeper grammar}: \begin{cases} \text{3Sg: a\#} & 4 \\ \text{3Sg: e\#} & -2.5 \\ \text{3Pl: an\#} & 2.2 \\ \ldots \end{cases} \tag{2.15}$$

The example below provides a tabular representation of how the Spanish “-ar” sublexicon’s grammar operates on two Spanish forms. This tableau represents a hypothetical case of the paradigm cell filling problem in which two forms of the verb TO LOVE, 3Sg [ama] and 3Pl [aman], have been provided to the gatekeeper grammar, as one step in the process of establishing a probability distribution over candidates for this lexeme’s form in some other cell. Columns in the central part of the table correspond to constraints, whose weights (w) are given in the uppermost row. Each of the two rows below the horizontal rule indicates how a single inflected form has been evaluated by constraints in the grammar. The harmony scores H of each base form, as well as their total harmony, are shown in the rightmost column. Harmony scores are weighted sums of forms’ violation profiles.

| | freq. | 3Sg: a# (w = 4) | 3Sg: e# (w = -2.5) | 3Pl: an# (w = 2.2) | 3Pl: en# (w = -1) | Form H |
|---|---|---|---|---|---|---|
| 3Sg: ama | 1 | 1 | 0 | 0 | 0 | 4 |
| 3Pl: aman | 1 | 0 | 0 | 1 | 0 | 2.2 |
| Total harmony of bases | | | | | | 6.2 |

Figure 2.1: Tableau showing example weights and violation profiles for four hypothetical gatekeeper grammar constraints, as well as the forms’ harmony scores.
In this example, two bases for a verb have been provided to this grammar as inputs, and the grammar has produced a total harmony score of 6.2 for them.

Because this section focuses only on describing the formal structure of sublexicons, the procedure by which gatekeeper grammars’ harmony scores are used to evaluate wellformedness is covered in detail in the section dedicated to inference in sublexical morphology, section 2.4, and the learning algorithm for these grammars will be given a similar treatment in 2.5 along with the other aspects of the overall sublexical morphology learning algorithm.

As a final piece of exposition about the structure of gatekeeper grammars, I note that there is currently no way to explicitly encode dependencies between sublexicons or between their gatekeeper grammars. If the associated forms for two sublexicons have very similar evaluation profiles for the constraints used in their gatekeeper grammars, then the weights of those constraints will be similar in those two sublexicons’ grammars. These similarities are therefore not “accidental” because they are derived from and grounded in the phonological similarities of the different sublexicons, but there is no mechanism for grammars to share weights with each other or otherwise directly interact.

2.4 Derivative inference in sublexical morphology

I turn now from a description of the formal structure of sublexical morphology models to a description of the process by which such a model can solve the paradigm cell filling problem. I call this process derivative inference, since it results in a conditional probability distribution over candidates for the form of the specified lexeme in the specified derivative cell.
Recall that sublexical morphology arrives at this distribution via the equalities shown in equation 2.16, which, for reasons that will soon become clear, explicitly includes division by the normalization constant $Z$ instead of leaving this division implicit by using the proportionality symbol $\propto$ as shown earlier in e.g. equation 2.7.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(S \mid B_1, B_2, \ldots, B_n) = \frac{p(B_1, B_2, \ldots, B_n \mid S)\, p(S)}{Z} \tag{2.16}$$

To summarize this equation, the probability of a derivative candidate $d$ given the base forms $b_1$ through $b_n$ can be calculated from the following quantities:

1. $p(b_1, b_2, \ldots, b_n \mid s)$, the likelihood of $b_1$ through $b_n$ given the sublexicon $s$ which generates $d$
2. $p(s)$, the prior probability of the aforementioned sublexicon $s$
3. the normalization constant $Z$

The subsections of this section walk through these three elements of the equation part by part, using an example paradigm cell filling problem query based on Spanish to illustrate relevant mechanics. An additional subsection then addresses the topic of how derivative candidates are generated in sublexical morphology, showing also how a distribution over sublexicons is then used to arrive at a distribution over derivative forms. The final subsection shows how these values are brought together by equation 2.16 for the Spanish example.

2.4.1 Calculating probabilities of bases

The term $p(b_1, b_2, \ldots, b_n \mid s)$ indicates the likelihood (probability) of the observed distributions over base forms given (assuming) that the lexeme in question is a member of paradigm sublexicon $s$. This likelihood can be usefully thought of as a kind of comparative phonotactic wellformedness rating (cf. Hayes, to appear) which expresses how much the phonological shapes of the observed bases match the phonological regularities among forms associated with a sublexicon.
In sublexical morphology, these likelihood values are calculated by sublexicons’ gatekeeper grammars: the joint probability of the observed bases of the target lexeme given a particular paradigm sublexicon $s$ out of the set of paradigm sublexicons $S$ is determined by applying the gatekeeper grammar of $s$ to the observed bases.

In order to unpack this description, I begin by setting up a hypothetical paradigm cell filling problem task (for a speaker) or query (to a model), focusing again on normative European Spanish. Suppose that there exists a speaker of this language who, despite otherwise normal fluency in the language, has limited experience with the lexeme TO LOVE. Particularly, this speaker has only ever heard the infinitive (Inf) of this lexeme, [amar], and its third person singular (3Sg) form, [ama], and each only once. In some conversational setting, this speaker finds a need to express the lexeme TO LOVE in its first person plural (1Pl) form. In other words, the speaker needs to use her/his knowledge of the inflectional system, [amar], and [ama] in order to infer a probability distribution over the candidates for the 1Pl form of TO LOVE from which she/he can sample part of an utterance. Alternatively, a sublexical morphology implementation of the Spanish verbal inflection system in a computer could be provided the following query: what is the distribution over 1Pl forms for a verb given that its Inf form is [amar] and its 3Sg form is [ama]?

Having established the nature of the example paradigm cell filling problem task/query, consider the three hypothetical sublexicons of regular verbs in normative European Spanish, repeated below as 2.17.

$$\text{model}: \begin{cases} \text{paradigm sublexicon 1 (“-ar”)} \\ \text{paradigm sublexicon 2 (“-er”)} \\ \text{paradigm sublexicon 3 (“-ir”)} \end{cases} \tag{2.17}$$

Each of these sublexicons has its own gatekeeper grammar, the constraint weights of which reflect the phonological regularities of each sublexicon.
For example, among lexemes in the “-ar” sublexicon, infinitive forms always end in [-ar], while infinitives in the “-er” and “-ir” sublexicons invariably end in [-er] and [-ir], respectively. 3Sg forms in the “-ar” sublexicon end in [-a], while those in the other two sublexicons end in [-e]. Such categorical regularities, in addition to numerous gradient patterns that distinguish sublexicons, are encoded in their weights. Example weights in the three sublexicons of constraints relevant to these patterns are shown in figure 2.2.

| | Inf: ar# | Inf: er# | Inf: ir# | 3Sg: a# | 3Sg: e# |
|---|---|---|---|---|---|
| -ar sublexicon | **2.3** | -1.1 | -1.4 | **1.2** | -0.7 |
| -er sublexicon | -0.3 | **2.1** | -1.3 | -0.8 | **1.3** |
| -ir sublexicon | -0.9 | -1.1 | **2.4** | -1.0 | **1.1** |

Figure 2.2: Arbitrary example weights of various constraints in the “-ar”, “-er”, and “-ir” sublexicons of normative European Spanish. Inf is used as an abbreviation for the infinitive cell of the paradigm. Positive constraint weights are bolded for ease of visual parsing.

At this point, inference proceeds by having each sublexicon assign a likelihood to the provided base forms Inf: [amar] and 3Sg: [ama]. Strictly speaking, the likelihood of a set of bases given a sublexicon $s$ is defined as its likelihood given that sublexicon’s gatekeeper grammar $g$:

$$p(B_1, B_2, \ldots, B_n \mid s) = p(B_1, B_2, \ldots, B_n \mid g) \tag{2.18}$$

Because all gatekeeper grammars $g$ are Maximum Entropy (MaxEnt) harmonic grammars, the method of assessing probabilities conditional on them is well defined. According to the definition of a MaxEnt harmonic grammar (Goldwater & Johnson, 2003; Hayes & Wilson, 2008; Wilson, 2006), the probability of a phonological object $x$ conditional on a grammar $g$ is equal to the constant $e$ to the power of that object’s harmony score $H_{x,g}$, divided by a normalization constant indexed to that grammar $Z_g$, as shown in equation 2.19.

$$p(x \mid g) = \frac{e^{H_{x,g}}}{Z_g} \tag{2.19}$$

In the case of a gatekeeper grammar, the phonological object being evaluated is the set of all $n$ provided base forms $b_1, b_2, \ldots, b_n$.
Their cumulative harmony score is equal to the sum of their individual harmony scores. The harmony score of a single base form is equal to the weighted sum of its constraint output profile, i.e. its constraint “violations”. Equation 2.20 presents this definition by using $c_{g,i}$ to refer to the output (1 or 0) of the $i$th constraint in the grammar $g$ and using $w_{g,i}$ to refer to the weight of the $i$th constraint. $b_j$ indicates the base form in the $j$th base cell.

$$H_{b_1, b_2, \ldots, b_n, g} = \sum_i \sum_j w_{g,i}\, c_{g,i}(b_j) \tag{2.20}$$

In equation 2.19, $Z_g$ is the sum of $e^H$ over all possible forms in the inflectional system. Because of the difficulties inherent in estimating the phonological properties of this infinite set of forms, sublexical morphology follows the example set by sublexical phonology (Allen & Becker, in review) by approximating $Z_g$ as the sum of $e^H$ over all forms available to the model.⁴ Concretely, then, the value of the normalization constant $Z_g$ is given by the equality in 2.21, where $b$ ranges across all the base forms in a sublexicon $s$ or in the data provided as part of the paradigm cell filling problem task/query and $H_{b,g}$ is the harmony of base $b$ given the grammar $g$ of $s$.

⁴ See, however, Hayes & Wilson (2008) for a way to more accurately estimate the phonological properties of an infinite space of forms.

$$Z_g = \sum_s \sum_b e^{H_{b,g}} \tag{2.21}$$

Intuitively, the normalization constant $Z_g$ serves to ensure that the probabilities assigned by $g$ to the set of possible base forms constitute a proper probability distribution by summing to 1.0. This purpose is similar to that of the distinct normalization constant $Z$ described in subsection 2.4.3, which normalizes probabilities of paradigm sublexicons rather than those of base forms.

Returning to the Spanish example, in order to determine the likelihood of the provided bases under each of the three sublexicons in 2.17, each sublexicon must assign a probability to the provided forms Inf: [amar] and 3Sg: [ama].
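The calculation defined by equations 2.19 and 2.20 can be sketched in Python for this Spanish query. The weights are the example values from figure 2.2; the function names are hypothetical, and the normalization constant is passed in as a fixed value of 100 for every sublexicon purely for illustration, rather than being estimated from available forms as in equation 2.21.

```python
import math
import re

# Example weights from figure 2.2; keys are (cell, regular expression).
WEIGHTS = {
    "-ar": {("Inf", "ar$"): 2.3, ("Inf", "er$"): -1.1, ("Inf", "ir$"): -1.4,
            ("3Sg", "a$"): 1.2, ("3Sg", "e$"): -0.7},
    "-er": {("Inf", "ar$"): -0.3, ("Inf", "er$"): 2.1, ("Inf", "ir$"): -1.3,
            ("3Sg", "a$"): -0.8, ("3Sg", "e$"): 1.3},
    "-ir": {("Inf", "ar$"): -0.9, ("Inf", "er$"): -1.1, ("Inf", "ir$"): 2.4,
            ("3Sg", "a$"): -1.0, ("3Sg", "e$"): 1.1},
}

BASES = {"Inf": "amar", "3Sg": "ama"}


def cumulative_harmony(weights, bases):
    """Equation 2.20: sum of weights of constraints that evaluate to 1."""
    total = 0.0
    for (cell, pattern), weight in weights.items():
        form = bases.get(cell)
        if form is not None and re.search(pattern, form):
            total += weight
    return total


def likelihood(weights, bases, z_g):
    """Equation 2.19: e to the power of the harmony, divided by Z_g."""
    return math.exp(cumulative_harmony(weights, bases)) / z_g


if __name__ == "__main__":
    Z_G = 100.0  # assumed constant, for illustration only
    for name, weights in WEIGHTS.items():
        h = cumulative_harmony(weights, BASES)
        print(name, round(h, 1), round(likelihood(weights, BASES, Z_G), 3))
```

With these assumed values, the three cumulative harmonies come out at roughly 3.5, −1.1, and −1.9, and the corresponding likelihoods round to 0.331, 0.003, and 0.001.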
Figure 2.3 shows, for each sublexicon, the constraint output/violation profiles for these forms, as well as their individual and cumulative harmony scores and their overall likelihood. For the weights of these constraints in the different sublexicons, see Figure 2.2.

-ar sublexicon

| | Inf: ar# | Inf: er# | Inf: ir# | 3Sg: a# | 3Sg: e# | H |
|---|---|---|---|---|---|---|
| Inf: [amar] | 1 | 0 | 0 | 0 | 0 | 2.3 |
| 3Sg: [ama] | 0 | 0 | 0 | 1 | 0 | 1.2 |

Cumulative harmony: 3.5
Likelihood: 0.331

-er sublexicon

| | Inf: ar# | Inf: er# | Inf: ir# | 3Sg: a# | 3Sg: e# | H |
|---|---|---|---|---|---|---|
| Inf: [amar] | 1 | 0 | 0 | 0 | 0 | -0.3 |
| 3Sg: [ama] | 0 | 0 | 0 | 1 | 0 | -0.8 |

Cumulative harmony: -1.1
Likelihood: 0.003

-ir sublexicon

| | Inf: ar# | Inf: er# | Inf: ir# | 3Sg: a# | 3Sg: e# | H |
|---|---|---|---|---|---|---|
| Inf: [amar] | 1 | 0 | 0 | 0 | 0 | -0.9 |
| 3Sg: [ama] | 0 | 0 | 0 | 1 | 0 | -1.0 |

Cumulative harmony: -1.9
Likelihood: 0.001

Figure 2.3: Constraint output/violation profiles for the input comprising Inf: [amar] and 3Sg: [ama], for the three example sublexicons. Constraint weights are given in figure 2.2. It is assumed that all three sublexicons have a normalization constant $Z_g$ of 100, although this constant will normally vary from sublexicon to sublexicon.

Once the likelihood of the provided base forms given each sublexicon has been calculated, the role of the gatekeeper grammars has ended. These values are stored until the sublexicons’ prior probabilities and their normalization constants are calculated, if they have not been calculated already, so that these values can be combined afterwards.

2.4.2 Calculating prior probabilities

In sublexical morphology, calculating the prior probability of a particular paradigm sublexicon is intuitive and computationally simple. This probability corresponds to the “relative size” of a paradigm sublexicon, which is a function of the frequencies of the forms in each paradigm sublexicon. In the PyParadigms implementation, the relevant frequencies are assumed to be type rather than token frequencies, i.e.
the cardinality of the set of a paradigm sublexicon’s associated forms.

As an illustration of this principle, 2.22 shows an example of the forms that could be associated with the hypothetical “-ar” sublexicon, repeated with slight modification from 2.11. The overall type frequency of this sublexicon is 4, since it contains a total of four forms. If the “-er” and “-ir” sublexicons contained 5 and 7 distinct forms, respectively, then the prior probability of the “-ar” sublexicon would be $4/(4+5+7) = 0.25$.

$$\text{associated forms}: \begin{cases} \text{1Sg, SPEAK, [ablo]: 800} \\ \text{2Sg, SPEAK, [ablas]: 200} \\ \text{2Pl, COOK, [kosinais]: 100} \\ \text{3Pl, COOK, [kosinan]: 700} \end{cases} \tag{2.22}$$

More generally, if the frequency of a sublexicon is $|s|$, then its prior probability is given by equation 2.23, where the sum in the denominator ranges over all sublexicons $s'$ in the model.

$$p(s) = \frac{|s|}{\sum_{s'} |s'|} \tag{2.23}$$

One notable result of the Bayesian character of sublexical morphology is its prediction that in the absence of any known base forms for a lexeme, or if only provided base forms that contain no useful information about which sublexicon might give the lexeme a higher probability, a speaker solving the paradigm cell filling problem will simply sample from the prior distribution over sublexicons in order to infer a derivative form for that lexeme. When base forms that contain information relevant to sublexicon choice are available, they will pull speakers’ predicted distributions away from the prior distribution, but in general the prior distribution will always play a significant role in shaping their a posteriori distributions, i.e. the distributions the speaker arrives at by combining base likelihoods and prior probabilities. Chapter 4 confirms these predictions and discusses how regularization of gatekeeper grammar weights can be used to modulate the relative importance of likelihood terms and prior terms.

2.4.3 Normalization

According to the definition of probability, the individual sublexicon probabilities that make up the distribution $p(S \mid b_1, b_2, \ldots, b_n)$ must sum to 1.0.
This property is guaranteed by dividing each sublexicon’s numerator $p(b_1, b_2, \ldots, b_n \mid s)\, p(s)$ by a normalization constant $Z$. As with the constant $Z_g$ from subsection 2.4.1, which normalizes the distribution $p(b_1, b_2, \ldots, b_n \mid s)$ by summing across the exponentiated harmony values of (an approximation of) all possible base forms, this constant $Z$ constitutes a sum. Specifically, $Z$ is the sum of the products of each sublexicon’s $p(b_1, b_2, \ldots, b_n \mid s)$ and $p(s)$, as expressed in 2.24.

$$Z = \sum_s p(b_1, b_2, \ldots, b_n \mid s)\, p(s) \tag{2.24}$$

In the case of calculating $Z$, the terms $p(b_1, b_2, \ldots, b_n \mid s)$ and $p(s)$ are calculated in the same manner as described in the preceding subsections for the numerator in equation 2.7.

2.4.4 Generating the candidate set

While the attested forms of the bases $b_1$ through $b_n$ are provided to the model, the set of candidates $D$ must itself be inferred. In sublexical morphology, this process is straightforward. Any arbitrary provided base $b$ of the lexeme in question is provided in turn to each sublexicon, and each paradigm sublexicon uses one of its morphological operations to generate a derivative candidate from that base. The forms generated by this process constitute the derivative candidate set.

To examine how candidates are generated, recall that each sublexicon contains a set of morphological operations, each indexed to a base cell and a derivative cell. In an inference task, the derivative cell is set, and so only the operations whose derivative index accords with that derivative cell are relevant.

At this point, the candidate generation process depends on which base forms are available. Any arbitrary base form can be selected from among this set, since the sublexicon contains a morphological operation from each base cell to the target derivative cell. To generate the derivative candidate, the paradigm sublexicon simply applies this properly base-indexed operation to the selected base form.
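A minimal Python sketch of this candidate generation step follows, assuming operations that simply rewrite a word-final sequence. The operation inventory, the `strict` flag, and the function names are all hypothetical; the non-strict behaviour (replacing whatever segments occupy the targeted position) is modeled on the default that this subsection attributes to PyParadigms, not taken from its source code.

```python
# Hypothetical 1Pl-targeting operations, keyed by sublexicon and then by
# (base cell, derivative cell). Values are (old_suffix, new_suffix).
OPERATIONS = {
    "-ar": {("Inf", "1Pl"): ("ar", "amos"), ("3Sg", "1Pl"): ("a", "amos")},
    "-er": {("Inf", "1Pl"): ("er", "emos"), ("3Sg", "1Pl"): ("e", "emos")},
    "-ir": {("Inf", "1Pl"): ("ir", "imos"), ("3Sg", "1Pl"): ("e", "imos")},
}


def apply_operation(form, operation, strict=False):
    """Rewrite the right edge of `form`.

    Non-strict: replace the final len(old) segments whatever they are.
    Strict: return None unless the form actually ends in `old`.
    """
    old, new = operation
    if strict and not form.endswith(old):
        return None
    return form[: len(form) - len(old)] + new


def generate_candidates(bases, base_cell, target_cell, strict=False):
    """One derivative candidate per sublexicon, from the chosen base."""
    return {
        name: apply_operation(bases[base_cell],
                              ops[(base_cell, target_cell)], strict)
        for name, ops in OPERATIONS.items()
    }


if __name__ == "__main__":
    bases = {"Inf": "amar", "3Sg": "ama"}
    print(generate_candidates(bases, "Inf", "1Pl"))
    print(generate_candidates(bases, "3Sg", "1Pl"))
```

In this sketch either base yields the same candidate within each sublexicon, and a sublexicon whose operation fails to apply in strict mode returns `None`, which a caller could treat as a sublexicon with zero probability.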
According to the definition of a sublexicon, regardless of which base cell is chosen, as long as the appropriate morphological operation is used on that base, the same derivative candidate will obtain.

For example, suppose that a paradigm sublexicon for normative European Spanish contains the operations shown in figure 2.4. Note again that as stated in section 2.3, the exact nature of these operations depends on the formalism used to express them and the learning algorithm; the ones shown here are provided as examples.

Inf→1Pl: change final [ar] to [amos]
3Sg→1Pl: change final [a] to [amos]

Figure 2.4: Example morphological operations for an “-ar” sublexicon in normative European Spanish.

Recall the earlier example inference task of determining a distribution over the 1Pl forms of the lexeme TO LOVE when out of its inflected forms only its Inf [amar] and its 3Sg [ama] are known. Either of these forms can be used to generate this paradigm sublexicon’s 1Pl derivative candidate. Assuming that the 3Sg [ama] is chosen (arbitrarily), then the 3Sg→1Pl operation is applied to this form, yielding the derivative candidate [amamos]. This result would have been the same if the Inf form [amar] and Inf→1Pl operation were chosen instead.

It is possible in some cases that a morphological operation will be unable to apply to a given base form. Behavior in these situations depends, in the PyParadigms implementation, on user-specified parameter values. By default, for example, the operation change final [ar] to [amos] would be able to apply to a base like [asir] by ignoring the operation’s mention of the specific sequence [ar] and instead replacing the segments in the same position (word-final, in this example), with the sequence [amos], yielding the derivative [asamos].
Users may specify instead that the operation should require material in the base to match that specified in the operation, in which case this example would not yield a derivative form but instead instantly assign the sublexicon a probability of zero. This outcome would be the same if, for example, an operation altering the second-to-last syllable nucleus of a base were applied to a base with only one syllable. Sublexicons given a probability of zero are effectively treated as non-existent for the remainder of a derivation.

2.4.5 Bringing everything together

The previous subsections have shown, in the abstract and for a running example from Spanish, how sublexical morphology models determine the different values necessary to perform derivative inference. Equation 2.16, repeated here as 2.25, shows how a base likelihood $p(b_1, b_2, \ldots, b_n \mid s)$, a sublexicon prior probability $p(s)$, the normalization constant $Z$, and each sublexicon’s derivative candidate can be combined to yield a distribution over derivative candidates.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(S \mid B_1, B_2, \ldots, B_n) = \frac{p(B_1, B_2, \ldots, B_n \mid S)\, p(S)}{Z} \tag{2.25}$$

As a concrete demonstration of how these values relate, consider once again the Spanish example of inferring a distribution over 1Pl forms of TO LOVE given the Inf [amar] and the 3Sg [ama]. The same three sublexicons that have been referred to throughout this section still make up the model: the “-ar” sublexicon, the “-er” sublexicon, and the “-ir” sublexicon. As shown in the subsection about base likelihood, the likelihood of the base forms Inf [amar] and the 3Sg [ama] given each of these sublexicons might be 0.331, 0.003, and 0.001, respectively. Suppose that the respective prior probabilities of these sublexicons are 0.4, 0.4, and 0.2. The term $p(b_1, b_2, \ldots, b_n \mid s)\, p(s)$, i.e. the pre-normalization numerator, for each of the three sublexicons would therefore be 0.1324, 0.0012, and 0.0002, respectively. $Z$ is equal to the sum of these values: 0.1338.
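The arithmetic of this running example, including the final division by Z, can be checked with the short Python sketch below. The function names are hypothetical, and the candidate forms [amamos], [amemos], and [amimos] are assumed here to be the 1Pl candidates generated by the three sublexicons; probabilities of sublexicons that generate the same candidate are summed.

```python
def posterior(likelihoods, priors):
    """Equation 2.25: likelihood times prior, divided by their sum Z."""
    numerators = {s: likelihoods[s] * priors[s] for s in likelihoods}
    z = sum(numerators.values())
    return {s: n / z for s, n in numerators.items()}


def candidate_distribution(sublexicon_probs, candidates):
    """Transfer sublexicon probabilities to their derivative candidates,
    adding together the probabilities of homophonous candidates."""
    distribution = {}
    for s, p in sublexicon_probs.items():
        form = candidates[s]
        distribution[form] = distribution.get(form, 0.0) + p
    return distribution


# Values from the running Spanish example.
LIKELIHOODS = {"-ar": 0.331, "-er": 0.003, "-ir": 0.001}
PRIORS = {"-ar": 0.4, "-er": 0.4, "-ir": 0.2}
# Assumed 1Pl candidates, one per sublexicon.
CANDIDATES = {"-ar": "amamos", "-er": "amemos", "-ir": "amimos"}

if __name__ == "__main__":
    post = posterior(LIKELIHOODS, PRIORS)
    for form, p in candidate_distribution(post, CANDIDATES).items():
        print(form, round(p, 3))
```

Rounded to three decimal places, the sketch gives approximately 0.990 for [amamos], 0.009 for [amemos], and 0.001 for [amimos].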
Dividing the numerator for each sublexicon by $Z$ gives the a posteriori probability of each sublexicon, that is, its probability taking both likelihood and prior probabilities into account: 0.990, 0.009, and 0.001 for the “-ar”, “-er”, and “-ir” sublexicons.

At this point, we have calculated the distribution on the middle level of equation 2.25. All that remains is to assign these sublexicon probabilities to their appropriate derivative candidates. Suppose that in this case the three sublexicons’ morphological operations produce the following three candidates for the 1Pl form: [amamos], [amemos], and [amimos]. None of these derivative forms are homophonous, and so sublexicon probabilities can be assigned to them without any need for summing probabilities of same-candidate sublexicons. Consequently, the speaker or model would assign a probability distribution of {0.990, 0.009, 0.001} to the three candidates {[amamos], [amemos], [amimos]}. Sampling a single form from this distribution, a speaker solving this particular instance of the paradigm cell filling problem would therefore be very likely to produce [amamos] as the 1Pl form of TO LOVE.

2.5 Learning in sublexical morphology

Sublexical morphology models possess the virtue of demonstrable learnability. Moreover, in contrast with e.g. Network Morphology (Brown & Hippisley, 2012), these models can be learned from phonological forms of words without analyst-provided morph divisions, making their learning inputs more similar to those presumably encountered by human learners. This section describes the learning algorithm for sublexical morphology models implemented in PyParadigms as an example of one practical approach to learning these models. The learning algorithm described here proceeds in three sequential steps, as shown in figure 2.5.

Overview of the learning algorithm:
1. Learning the mapping sublexicons to and from each cell
2. Learning the paradigm sublexicons of the inflectional system
3.
Learning the gatekeeper grammars for the paradigm sublexicons

Figure 2.5: The three steps of the PyParadigms learning algorithm for sublexical morphology models.

This section describes the learning algorithm step-by-step. The first subsection introduces the inputs assumed by the algorithm, and the following three explain the three steps of learning, following the order set forth in figure 2.5. Runtime analyses are provided as appropriate throughout.

2.5.1 Learning algorithm inputs

The learning algorithm for sublexical morphology models takes as inputs the data shown in figure 2.6.

Learning inputs:
• a training set of form transcriptions and their cell labels
• a set of base-indexed constraints
• settings for various parameters that control, e.g., speed vs. accuracy of learning
• (optionally) a set of phonological feature specifications

Figure 2.6: Inputs to the PyParadigms learning algorithm.

The training data consist of a set of forms along with their cell and lexeme labels (and, optionally, their frequencies). These form-cell-lexeme-frequency bundles are equivalent in form to those shown as sublexicons’ associated forms in 2.10 and 2.11; 2.26 and 2.27 below show this form in the abstract case and for the Spanish data from 2.11. This similarity is due to the fact that sublexicons’ sets of associated forms are simply these training data divided up among the sublexicons. Frequency is set apart from the cell-lexeme-form tuple here simply to show that it is optional; if unspecified, it defaults to a frequency of 1. The training data can be supplied in a column-delimited text file.

$$
\text{training data}:
\begin{cases}
\text{cell, lexeme, form: frequency} \\
\text{cell, lexeme, form: frequency} \\
\ldots
\end{cases}
\tag{2.26}
$$

$$
\text{training data}:
\begin{cases}
\text{1Sg, SPEAK, [ablo]: 800} \\
\text{2Sg, SPEAK, [ablas]: 200} \\
\ldots \\
\text{2Pl, COOK, [kosinais]: 100} \\
\text{3Pl, COOK, [kosina]: 700}
\end{cases}
\tag{2.27}
$$

The PyParadigms learning algorithm assumes that the user has provided the phonological constraints used in a model’s sublexicons’ gatekeeper grammars.
The algorithm currently has no ability to induce constraints that might be useful in gatekeeper grammars, and so having the constraint set be part of the training data allows gatekeeper grammars to be constructed. For now the same set of constraints is used by all gatekeeper grammars (although, of course, with potentially different weights including 0 weights), but this limitation is not an inherent part of sublexical phonology. These provided constraints must have the form of gatekeeper grammar constraints previously specified in 2.14 and exemplified for Spanish in 2.15, except that constraints provided to the learning algorithm do not have pre-specified weights. Below, 2.28 shows this format and 2.29 provides an example from Spanish. Note that as the target sequences are regular expressions, the $ symbol here matches the end of a string, similar to a final # in the more abstract notation used before. Cell names must match those provided in the training data, and target sequences must be interpretable as regular expressions. These constraints can be supplied in a column-delimited text file.

$$
\text{input constraints}:
\begin{cases}
\text{cell: target sequence} \\
\text{cell: target sequence} \\
\ldots
\end{cases}
\tag{2.28}
$$

$$
\text{input constraints}:
\begin{cases}
\text{3Sg: a\$} \\
\text{3Sg: e\$} \\
\text{3Pl: an\$} \\
\ldots
\end{cases}
\tag{2.29}
$$

The PyParadigms learning algorithm also allows the user to specify numerous other boolean and real-valued parameters that affect learning in one way or another, e.g. regularization coefficients for gatekeeper grammars. For further details about these parameters, see the PyParadigms documentation at https://github.com/bhallen/pyparadigms.

Finally, PyParadigms allows the user to provide a set of phonological feature values. If available, these phonological features enable additional functionality. First, these features can be used to specify morphological operations that mutate feature values, potentially allowing the algorithm to create fewer, more general sublexicons, e.g.
by unifying an [e]→[i] operation and an [o]→[u] operation into a single [+syll,-high,-low]→[+high] operation. Second, phonological features allow the user to specify feature-based constraints, which are then translated into regular expressions for evaluation.

2.5.2 Learning mapping sublexicons

When using the PyParadigms learning algorithm, the process of learning a sublexical morphology model begins by learning all of the mapping sublexicons in the training data. As originally mentioned in section 2.3, I have so far mostly been using the term sublexicon to refer to the paradigm sublexicons which constitute the lexical partitions that form the basis of sublexical morphology. However, a given set of training data not only corresponds to a set of paradigm sublexicons, but also to a set of sets of mapping sublexicons, and determining these mapping sublexicons is the first step in learning a model. This subsection explains how mapping sublexicons differ from paradigm sublexicons and why learning them is essential to the PyParadigms algorithm.

Recall that a paradigm sublexicon is a set of inflected forms that is morphologically homogeneous: within a paradigm sublexicon, for any pair of forms of the same lexeme in that paradigm sublexicon, there is a single morphological operation that can take one of those forms as input and yield the other form. Notably, the forms in a paradigm sublexicon are able to belong to any cell in the inflectional system, and each paradigm sublexicon has a morphological operation for every ordered pair of cells in the system.

A mapping sublexicon can be thought of as similar to a paradigm sublexicon except in that its base cell and derivative cell are fixed. Therefore a mapping sublexicon is indexed to a single base cell and a single derivative cell. A mapping sublexicon includes only one morphological operation: that from its base cell to its derivative cell. Its associated forms are all in either its base cell or its derivative cell.
In sublexical morphology, there is no need for a mapping sublexicon to include a gatekeeper grammar, since mapping sublexicons are only used in order to learn paradigm sublexicons. For instance, 2.30 shows an example mapping sublexicon for Spanish present tense verbs, with 1Sg as the base cell and 3Sg as the derivative cell. One useful way to conceptualize the relationship between mapping sublexicons and paradigm sublexicons is that if a set of paradigm sublexicons share the same operation from some cell x to some other cell y, then all their associated forms/lexemes will be associated with the same x → y mapping sublexicon.

$$
\text{mapping sublexicon}:
\begin{cases}
\text{Base cell: 1Sg} \\
\text{Derivative cell: 3Sg} \\
\text{Associated forms:}
\begin{cases}
\text{1Sg, SPEAK, [ablo]} \\
\text{1Sg, LOVE, [amo]} \\
\text{3Sg, SPEAK, [abla]} \\
\text{3Sg, LOVE, [ama]}
\end{cases} \\
\text{Morphological operation: [o\#]} \rightarrow \text{[a\#]}
\end{cases}
\tag{2.30}
$$

Mapping sublexicons in sublexical morphology are based on and formally similar to the sublexicons of sublexical phonology (Allen & Becker, in review; Gouskova & Newlin-Łukowicz, 2013), a framework that uses sublexical divisions of pairs of forms (e.g. singular–plural pairs) to encode the phonological subregularities relevant to morphological differences in those form pairs. While the sublexicons in sublexical phonology include gatekeeper grammars and are themselves the end state of learning, the PyParadigms learning algorithm for sublexical morphology uses these mapping sublexicons only as a convenient intermediate step in the process of learning paradigm sublexicons.

When a set of training data includes forms belonging to more than two cells, the data can be parsed into multiple sets of mapping sublexicons. Specifically, a set of mapping sublexicons can be constructed for each ordered pair of cells represented in the training data: a set from every base cell to every derivative cell.
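The idea of one set of mapping sublexicons per ordered pair of cells can be made concrete with a short enumeration. The sketch below assumes the six present indicative cells of the Spanish running example; the variable names are hypothetical and do not come from PyParadigms.

```python
from itertools import permutations

# Assumed cell inventory: the six Spanish present indicative cells.
cells = ["1Sg", "2Sg", "3Sg", "1Pl", "2Pl", "3Pl"]

# Every ordered (base cell, derivative cell) pair with base != derivative.
cell_pairs = list(permutations(cells, 2))

n = len(cells)
print(len(cell_pairs))   # number of sets of mapping sublexicons to learn
print(cell_pairs[:3])    # first few base-derivative pairs
```

For n cells this yields n² − n ordered pairs, since each cell can serve as a base for every cell other than itself.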
For a set of training data comprising forms from all present indicative cells in the normative European Spanish verbal system, for example, a set of mapping sublexicons can be constructed for each of the ordered pairs shown in 2.7.

1Sg → 2Sg
1Sg → 3Sg
1Sg → 1Pl
1Sg → 2Pl
1Sg → 3Pl
2Sg → 1Sg
2Sg → 3Sg
. . .
3Pl → 2Pl

Figure 2.7: Base–derivative cell pairs among the present indicative cells in Spanish verbs.

In order to learn the paradigm sublexicons of this inflectional system, the PyParadigms learning algorithm first learns a set of mapping sublexicons for each of these base–derivative cell pairs. For an inflectional system with $n$ cells, $n^2 - n$ sets of mapping sublexicons must be learned, one for each ordered pair of cells, resulting in a runtime on the order of $O(n^2)$. Each pair’s set of mapping sublexicons is learned through a procedure nearly identical to the learning algorithm described by Allen & Becker (in review) except for its omission of the learning of mapping sublexicon gatekeeper grammars. Since the scope of this dissertation covers sublexical morphology rather than sublexical phonology, I will not recapitulate the details of the sublexical phonology learning algorithm here, but will instead focus in the next subsection on how the learned mapping sublexicons are combined to establish a set of paradigm sublexicons.

I do note, however, that the set of sets of mapping sublexicons (and therefore the set of paradigm sublexicons) learned from a collection of forms will depend on the morphological operations learned to map between pairs of sets of forms. While the learning algorithm arrives at these operations by postulating numerous hypotheses about the possible operations and then paring them down to the most parsimonious ones jointly able to account for the available data, numerous parameters of the algorithm determine exactly what restrictions are placed on these operation hypotheses and thus what sets will be learned.
For a discussion of many of these parameters and their influence on learned sublexicons, I refer readers to Allen & Becker (in review).

2.5.3 Learning paradigm sublexicons

Once the mapping sublexicons for an inflectional system have been determined, it is simple to determine the inflectional system’s paradigm sublexicons. Recall that a sublexicon is defined as a subset of inflected forms with uniform morphological behavior. Whereas a set of mapping sublexicons needs only to enforce this uniformity with respect to a single base–derivative cell pair, paradigm sublexicons must enforce it with respect to all cell pairs. Consequently, determining paradigm sublexicons will generally divide a language’s lexemes into a greater number of sublexicons than would determining the mapping sublexicons for those lexemes for a pair of cells; any base–derivative mapping can introduce a distinction between lexemes that must be incorporated into the lexical partitions of a paradigm sublexicon, but the mapping sublexicon for a particular cell pair may have no need for a division required for some other cell pair’s mapping sublexicons.

As an example of this principle, consider again the Spanish present indicative verbal system. We have already established that (ignoring most types of irregularities for expositional purposes) there are three paradigm sublexicons: one for “-ar” verbs, one for “-er” verbs, and one for “-ir” verbs. However, the mapping sublexicons for some base–derivative cell pairs would not need to make this three-way distinction. For both “-er” verbs and “-ir” verbs, 3Pl ends in [-en], contrasting with “-ar” verbs’ [-an], and so depending on the rule formalism being used, there would need to be at most two 3Sg→3Pl mapping sublexicons for this cell pair: one for inserting [-en] and one for inserting [-an].
However, the mapping from 1Sg forms, which always end in [-o], to 1Pl forms, which end in [-amos], [-emos], or [-imos], requires a three-way division, and so for the set of paradigm sublexicons to enforce sublexical homogeneity, lexemes in the three classes must be separated into three paradigm sublexicons.

The procedure for learning paradigm sublexicons from a set (one for every ordered cell pair) of sets of mapping sublexicons follows naturally from this principle. Once all sets of mapping sublexicons are learned, the algorithm considers each lexeme in turn. In the simplest case, in which the training data include exactly one form for each lexeme in each cell, each lexeme’s paradigm sublexicon is defined as the combination of its forms’ mapping sublexicons for each cell-to-cell mapping, where each mapping sublexicon is represented by a label including its cell pair and morphological operations. After such a record has been made for each lexeme, each lexeme has been assigned to a paradigm sublexicon: the distinctive label for a paradigm sublexicon is the set of the cell-pair-and-operations labels for each of its associated mapping sublexicons. The overall inflectional system’s set of paradigm sublexicons is therefore simply the set of all paradigm sublexicons corresponding to at least one of these lexemes. Alternatively, the set of paradigm sublexicons can be thought of as essentially the Cartesian product of mapping sublexicons; specifically, the subset of that product that is attested in the training data.
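The grouping step just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PyParadigms implementation: the lexemes, the restriction to two cell pairs, and the operation strings inside each label are all hypothetical, chosen so that the 3Sg→3Pl mapping does not separate the three classes while the 1Sg→1Pl mapping does.

```python
from collections import defaultdict

# Hypothetical result of learning mapping sublexicons: for each lexeme, the
# labels (base cell, derivative cell, operation) of the mapping sublexicons
# it belongs to. Only two cell pairs are shown, and the operation strings
# are illustrative placeholders.
mapping_labels = {
    "SPEAK": {("1Sg", "1Pl", "o# -> amos#"), ("3Sg", "3Pl", "# -> n#")},
    "LOVE":  {("1Sg", "1Pl", "o# -> amos#"), ("3Sg", "3Pl", "# -> n#")},
    "EAT":   {("1Sg", "1Pl", "o# -> emos#"), ("3Sg", "3Pl", "# -> n#")},
    "LIVE":  {("1Sg", "1Pl", "o# -> imos#"), ("3Sg", "3Pl", "# -> n#")},
}

# A paradigm sublexicon is identified by the full set of mapping labels;
# lexemes sharing that set are placed in the same paradigm sublexicon.
paradigm_sublexicons = defaultdict(list)
for lexeme, labels in mapping_labels.items():
    paradigm_sublexicons[frozenset(labels)].append(lexeme)

for label_set, lexemes in paradigm_sublexicons.items():
    print(sorted(label_set), "->", lexemes)
```

Using the label set itself as the dictionary key means that only attested combinations of mapping sublexicons ever give rise to a paradigm sublexicon, which matches the "attested subset of the Cartesian product" characterization.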
The runtime of this step in the learning algorithm is therefore proportional to the number of lexemes $m$ times the number of cell pairs $n^2 - n$, although the operation required for these $mn^2$ checks is trivial, amounting only to the addition of a listed value to a set of mapping sublexicon labels.

Note that as of the time of writing, the PyParadigms procedure for learning mapping sublexicons’ morphological operations is not guaranteed to produce operations that fit the definition of a paradigm sublexicon: the operations may not “converge” by being guaranteed to produce the same derivative candidate regardless of which base cell is chosen. Addressing such cases is the most pressing issue at hand for this project, although I anticipate the solution will be trivial: such convergence will likely be guaranteed if the learning of mapping sublexicons is skipped entirely in favor of learning paradigm sublexicons directly using an equivalent procedure. It may also be possible to achieve this goal by post-processing of mapping sublexicons. In any case, the remainder of this dissertation assumes paradigm sublexicons that do have convergent morphological operations.

2.5.4 Learning gatekeeper grammar weights

The gatekeeper grammar of each paradigm sublexicon is parameterized by a set of weights, one for each of its constraints. The creation of a gatekeeper grammar therefore amounts to the setting of its constraint weights, specifically to values that maximize the likelihood of the training data in that gatekeeper grammar’s paradigm sublexicon while minimizing the probabilities of data outside that paradigm sublexicon.
In this way, the learned constraint weights for a particular paradigm sublexicon allow its grammar to assign a high probability to forms phonologically similar to its associated forms while assigning a low probability to forms that drastically differ from them.

In order to arrive at weights that express phonological generalizations about a paradigm sublexicon, the PyParadigms algorithm treats the task of weight setting as a numerical optimization problem. This characterization generally follows the standard definition of a Maximum Entropy harmonic grammar developed by Goldwater & Johnson (2003), Wilson (2006), and Hayes & Wilson (2008). Each sublexicon’s gatekeeper grammar is learned independently from the others, and so the total runtime is on the order of $O(n^2)$ for $n$ cells.

Optimization of a gatekeeper grammar’s constraint weights begins with the weights initialized to values sampled independently from a Gaussian distribution with mean 0 and variance 1. Optimization is performed using an iterative process in the gradient descent family of algorithms, which progressively adjusts initial weights so as to continually increase (and ultimately maximize) some objective function. The objective function used here serves to maximize the likelihood of the training data subject to some amount of regularization (cf. section 4.1 for further discussion of regularization).

Figure 2.8 shows two visual representations of the training data and state of the gatekeeper grammar for the “-ar” sublexicon in Spanish: one before the gradient descent algorithm has begun optimizing weights, and one in the midst of optimization. While the objective function actually outputs the likelihood of the data given a set of weights, this figure shows a more intuitive proxy. Specifically, these tableaux show the observed frequencies of various forms within a sublexicon on the left side, and the rightmost column shows the counts predicted by the current grammar weights.
Minimizing the sum of the absolute values of differences between the various forms’ observed and expected counts is tantamount to maximizing the likelihood of the data, and so the figure shows observed and predicted counts, their differences, and the sum of the differences’ absolute values. Note also that the zero-valued initial weights are a simplification of the Gaussian-distributed weights actually used by the PyParadigms implementation.

Before optimization:

             observed    3Sg: a#   3Sg: e#   3Pl: an#   3Pl: en#   predicted
             frequency   w=0       w=0       w=0        w=0        frequency
3Sg: ama     400         1         0         0          0          160
3Pl: aman    240         0         0         1          0          160
3Sg: kome    0           0         1         0          0          160
3Pl: komen   0           0         0         0          1          160

Sum of absolute values of differences: 640

Mid-optimization:

             observed    3Sg: a#   3Sg: e#   3Pl: an#   3Pl: en#   predicted
             frequency   w=0.6     w=-0.6    w=0.3      w=-0.3     frequency
3Sg: ama     400         1         0         0          0          261.38
3Pl: aman    240         0         0         1          0          193.63
3Sg: kome    0           0         1         0          0          78.72
3Pl: komen   0           0         0         0          1          106.27

Sum of absolute values of differences: 369.98

Figure 2.8: Tableaux showing the training data for the Spanish “-ar” sublexicon with their observed and predicted frequencies. The tableau on top shows predicted frequencies with all weights set to 0, approximating their state before learning, while the bottom tableau shows the result of some amount of iterative optimization.

The probabilities of the various sublexicons in an inflectional system are combined after the fact into a multinomial distribution over sublexicons (comparable to a softmax function; Bishop 2006). Because of this fact, and because I approximate the space of all possible forms using the set of all training and testing forms, the weights of a sublexicon’s gatekeeper grammar are actually optimized so as to maximally distinguish that sublexicon’s associated forms from the associated forms of other sublexicons.
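The predicted frequencies in figure 2.8 can be reproduced with a short script, which may help make the tableaux concrete. The sketch rests on two assumptions that are not spelled out in the text: that a form's probability is proportional to the exponential of the summed weights of the constraints matching it, and that predicted counts are these probabilities scaled by the total observed frequency. The function and variable names are hypothetical and are not the PyParadigms API.

```python
import math
import re

# (cell, form, observed frequency), as in figure 2.8.
candidates = [("3Sg", "ama", 400), ("3Pl", "aman", 240),
              ("3Sg", "kome", 0), ("3Pl", "komen", 0)]

# (cell, target regular expression); $ matches the end of the form.
constraints = [("3Sg", "a$"), ("3Sg", "e$"), ("3Pl", "an$"), ("3Pl", "en$")]

def predicted_counts(weights):
    """Predicted frequency of each candidate under the given weights."""
    scores = []
    for cell, form, _ in candidates:
        # Sum the weights of constraints indexed to this cell whose
        # target sequence is found in the form.
        scores.append(sum(w for w, (c_cell, pattern) in zip(weights, constraints)
                          if c_cell == cell and re.search(pattern, form)))
    z = sum(math.exp(s) for s in scores)
    total = sum(obs for _, _, obs in candidates)
    return [total * math.exp(s) / z for s in scores]

def abs_error(weights):
    """Sum of absolute differences between observed and predicted counts."""
    preds = predicted_counts(weights)
    return sum(abs(obs - p) for (_, _, obs), p in zip(candidates, preds))

for weights in ([0, 0, 0, 0], [0.6, -0.6, 0.3, -0.3]):
    rounded = [round(p, 2) for p in predicted_counts(weights)]
    print(weights, rounded, round(abs_error(weights), 2))
```

Under these assumptions the two weight settings give the predicted columns and error sums of the two tableaux, which suggests that the figure uses the same exponential form.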
This is why the observed frequencies are as shown in figure 2.8: observed frequencies of forms associated with the sublexicon whose grammar is being learned are set to their actual observed frequencies, while the frequencies of forms associated only with a different sublexicon are set to zero. As the figure exemplifies, then, constraints which evaluate to 1 only on forms outside the current sublexicon will tend toward lower, negative weights, while weights of constraints whose outputs are 1 on forms in the current sublexicon will generally rise accordingly with the frequency of those forms. Notably, however, unlike in a naïve Bayes classifier (Lewis, 1998), constraint weights are sensitive to each other at every step in the iterative learning process, allowing the emergence of complex interactions of constraints.

2.6 Relatedness to other theories

Having concluded this chapter’s introduction to the theory of sublexical morphology (and the more general theory of Bayesian morphology), I now turn to the question of how this approach compares to other similar theories. According to the typology of inflectional morphology theories set out by Stump (2001), sublexical morphology is an inferential-realizational theory: inferential essentially because it makes use of morphological operations rather than morphemes, and realizational essentially because the semantic features of a derivative form (i.e. its cell label) are what determine its phonological form rather than vice-versa.

This categorization puts sublexical morphology in the company of several other theories of inflectional morphology.
The tradition of inferential-realizational theories of inflectional morphology largely corresponds to the word-and-paradigm view of inflectional systems, namely one in which entire words (the forms of sublexical morphology) and cells comprising an inflectional paradigm are the central units of structure and computation, as opposed to being treated as epiphenomenal. Proposals reflecting inferential-realizational or word-and-paradigm approaches include Matthews (1972), Zwicky (1985), Spencer (1991), Anderson (1992), Aronoff (1994), Beard (1995), and Blevins (2006). To contrast the general flavor of these theories with sublexical morphology, I will look in greater depth at one that has been particularly influential: Paradigm Function Morphology (Bonami & Boyé, 2007; Stump, 2001).

Paradigm Function Morphology (Stump, 2001) (see also Bonami & Boyé 2007) posits that any inflectional system can be defined in terms of a paradigm function which takes as its inputs the phonological form of a root and a set of morphosyntactic features (i.e. a cell label) and outputs the inflected form for that root in the specified cell through the application of some number of rules. In this sense, Paradigm Function Morphology, like sublexical morphology, models the paradigm cell filling problem, since a paradigm function is able to generate unfamiliar inflected forms. In principle there is no requirement that this function behave deterministically, and so Paradigm Function Morphology is compatible with the observation that there may be multiple viable forms for a particular root in a particular cell. However, this flexibility is distinct from the explicitly probabilistic nature of sublexical morphology, which both learns from and predicts principled yet noisily gradient morphological behavior.

In other respects, although the differences between sublexical morphology and Paradigm Function Morphology may be clear, their comparative merits are more difficult to evaluate.
As mentioned before, the starting point of a derivation in Paradigm Function Morphology is a root form and a target cell (set of morphosyntactic features), whereas in sublexical morphology a derivation requires at least one inflected base form instead of a root. From the standpoint of learnability, sublexical morphology has the advantage: while base forms can simply be observed, roots must be inferred, adding an extra complication to learning. Sublexical morphology also allows a model to tailor its predictions depending on the exact set and shapes of the available base forms, rather than predicting the same outputs for a particular root regardless of the encountered forms themselves. However, one could argue that sublexical morphology is less parsimonious in the sense that it requires inflected forms to be stored in the lexicon for later use in derivations, while Paradigm Function Morphology stores only a root for each lexeme.

Similarly, sublexical morphology and Paradigm Function Morphology differ fundamentally in the nature of their mechanisms for transducing output forms from base/root forms. In sublexical morphology, there is only a single morphological operation for each sublexicon which performs the entire modification from a stored base form to a derivative candidate; paradigm functions allow each derivation to proceed in steps, from one block of rules to the next, in order to gradually modify the root until it reaches its output form. In addition to the fact that sublexical morphology’s operations are demonstrably learnable, the actual derivation process of applying an operation in sublexical morphology (as distinct from calculating its probability) is marked by formal simplicity. However, this simplicity comes at a cost.
Because there is a rule for every ordered pair of cells, the number of stored rules can quickly multiply for even a moderately sized inflectional system, and moreover, there is currently no way for these morphological operations to capture generalizations like one set of cells’ forms building procedurally off of another set’s forms, which would be easily captured in the rule blocks of Paradigm Function Morphology. In Japanese, for example, the formation of -tara conditionals from past-tense forms uses the same morphological change (insertion of a final [-ra]) regardless of whether the base past tense form is negative or affirmative, is causative or not, etc. Since past negative and past affirmative are as distinct as any two other cells in sublexical morphology, there is no way to ensure only a single operation is used for all derivations of -tara conditionals from past tense forms.

It is the computational learnability and implementability of sublexical morphology that most sets it apart from other theories. To my knowledge, the only other computationally implemented model of the paradigm cell filling problem with its own learning algorithm is Network Morphology (Brown & Hippisley, 2012). This theory is similar to Paradigm Function Morphology, but with abstract nodes from which intermediate and output word forms inherit inflectional properties instead of blocks of rules. Network Morphology is implemented using the DATR formal language, and its creators have provided an algorithm for learning DATR representations of inflectional systems.
But in addition to its lack of compatibility with noisy or probabilistic inputs or outputs, its learning algorithm requires training forms to be annotated by the analyst with boundaries between roots and affixes.

The use of operations or rules that take inflected forms rather than roots as inputs is rare, but in addition to sublexical morphology, the model of inflectional morphology assumed by the Minimal Generalization Learner (Albright & Hayes, 2002) also takes this approach, as does the theory of sublexical phonology (Allen & Becker, in review; Gouskova & Newlin-Łukowicz, 2013) which inspired sublexical morphology. However, neither of these approaches models entire inflectional systems; instead, both only produce models of individual base–derivative cell pair relationships.

Research in natural language processing has also touched on the paradigm cell filling problem, most notably in the work of Dreyer & Eisner (2011). While their proposal resembles the approach I take here in that it models productivity in inflectional morphology as sampling from inferred probability distributions over inflected forms, numerous other aspects set it apart from sublexical morphology. Most notably, the Dreyer & Eisner model performs learning in a “mostly-unsupervised” (p. 616) manner, starting from inflected forms without labels for their morpho-syntactic/semantic features, meaning that it is incompatible with sublexical morphology in terms of inputs both to learning and to derivation queries. The graphical structure used to express relationships among cells in inflectional paradigms is also fundamentally different from (and more complex than) the one I propose.
Additionally, it is not clear that their framework would be able to explicitly model the prior probabilities of surface exponents (see chapter 4) that are crucial to sublexical morphology.

Chapter 3
Inference from multiple bases

The sublexical morphology proposal relies crucially on the equation in 2.7, an instantiation of Bayes’s theorem, repeated below as 3.1. Recall that $S$ indicates a variable ranging over sublexicons $s$, while each $B_{cell}$ indicates a variable ranging over the possible forms or shapes $b$ of a lexeme in a particular cell.

$$
p(S \mid B_1, B_2, \ldots B_n) \propto p(B_1, B_2, \ldots B_n \mid S)\, p(S)
\tag{3.1}
$$

This chapter and the following one each target a particular, potentially contentious aspect of this equation, supporting its inclusion in 3.1 using novel experimental findings. These results are drawn from wug tests (Berko, 1958) on Icelandic and Polish which break new ground by adding a novel element, multiple base forms, to the traditional wug test paradigm.

This chapter serves to empirically validate the inclusion of bases $B_1$ through $B_n$ in the conditional probabilities in 3.1, and also to probe the limits of speakers’ abilities to combine information provided by these multiple bases. Intuitively, use of the expression $B_1, B_2, \ldots B_n$ here denotes that the probability of a particular derivative candidate $d$ (by way of the sublexicon $s$ that generates it; cf. 2.4) depends on the observed shapes of multiple other base forms $B_1, B_2, \ldots B_n$ of that derivative’s lexeme. In other words, according to this hypothesis, speakers can productively use information from multiple known forms of a lexeme when inferring unknown forms of that lexeme.
Returning to the normative European Spanish example from the previous chapter, this expression might be used to state that the probability the grammar gives to some unobserved 1SgPresIndic verb candidate (say, [pwento]) as opposed to some other candidate for that form (say, [ponto]) can differ depending on whether the speaker knows the 2SgPresIndic form of that lexeme, or the 3SgPresIndic form of that lexeme, or both.

This position contrasts with a view of morphological inference in which an inflectional system has a single privileged base cell such that speakers’ inferences about other inflected forms can only make use of information in that privileged cell’s base form(s). If only information in the privileged base is used in predicting a derivative form, then the derivative is conditionally independent of the non-privileged bases given the privileged base. Using the definition of conditional independence, this restriction can be written as shown in equation 3.2, where $B_1, B_2, \ldots B_n$ includes $B_{privileged}$. I use $D$ rather than $S$ here to signify that this restriction in no way depends on the assumption of sublexical morphology.

$$
p(D \mid B_1, B_2, \ldots B_n) = p(D \mid B_{privileged})
\tag{3.2}
$$

In other words, this view assumes that the probability of a derived form of a lexeme given any of its other forms is equal to the probability of that derived form given only the form of its privileged base. Less formally, this equation states that no information contained in forms of a lexeme other than the privileged base form can exert any influence on probabilities given to that lexeme’s derived forms.

While this more restricted hypothesis about inflectional systems may appear unnecessarily conservative, it is based on a related hypothesis that has proved useful for understanding and modeling mechanisms of historical morphological change: Albright’s (2002 et seq.) single surface base hypothesis.
Because of the empirical successes of the single surface base hypothesis, and because the sublexical view of morphology derives from a similar set of assumptions about inflectional morphology, I consider it worthwhile to test whether the single surface base hypothesis is tenable in a probabilistic model of inflectional morphology.

This chapter proposes a probabilistic interpretation of the single surface base hypothesis in section 3.1, adapting it to the language and formalisms used in this dissertation. Using data from Icelandic (section 3.2), I then conclude on the basis of experimental evidence presented in section 3.3 that contrary to the single surface base hypothesis, multiple base forms must be available and usable in inflectional inference even in the inference of a single derivative form. This conclusion supports the inclusion of the expression $B_1, B_2, \ldots B_n$ in equation 3.1. Following up on these results, section 3.4 addresses the question of whether the expression $p(B_1, B_2, \ldots B_n \mid S)$ can be decomposed to facilitate learning and inference, and it presents an experimental study of Polish noun declensions designed to answer this question. These results tentatively suggest that speakers may be systematically limited in how they can combine information from multiple bases, resulting in a picture of speaker capabilities that simplifies the modeling task for linguists. Overall, then, the chapter’s empirical results support the inclusion of multiple bases in equation 3.1 and shed light on the modes of interaction among these bases.

3.1 Single-base hypotheses

Albright (2002, p. 11) defines the single surface base hypothesis as a proposal that:

...for one form in the paradigm (the [privileged] base), there are no rules that can be used to synthesize it, and memorization is the only option [for speakers to be able to produce this form].
    Other forms in the paradigm may be memorized or may be synthesized, but synthesis must be done via operations on the [privileged] base form...

For the purposes of this dissertation, the single surface base hypothesis can be summarized as comprising two claims: (a) that any unknown inflected form of a lexeme must be generated only from the memorized form of that lexeme in a single cell, and (b) that the particular privileged cell used for this generation process is the same across all lexemes in the language and all possible derivative cells. The cell privileged in this way is the cell whose forms have the fewest neutralizations of morpho-phonological contrasts among inflected forms, i.e. the cell with the highest predictive value as a base from which other forms can be inferred.

As an example of this hypothesis at work, consider one locus of historical change in inflection in High German as described by Albright (2008). Figure 3.1 shows three classes of nouns in Middle High German. The top class exhibits a [x] ∼ [∅] alternation attributable to intervocalic loss of the ancestor of [x]. This change resulted in a neutralization in the NomPl of the top class and non-alternating middle class. The bottom class has [x] at the end of NomSg forms and at the end of the stem in NomPl forms, which is due to a sound change of [k] to [x] after the elimination of intervocalic [x]. This second change resulted in a neutralization of the bottom and top classes in the NomSg. The patterns of vowel umlaut are not relevant to this example.

    NomSg    NomPl      Gloss
    flox     flœ:e      ‘flea’
    rex      re:(j)e    ‘deer’
    ku:      ky:e       ‘cow’
    we:      we:(j)e    ‘woe’
    kox      kœ:xe      ‘cook’
    pex      pexe       ‘pitch’

Figure 3.1: Examples of three classes of nouns in Middle High German, in the NomSg and NomPl. Examples are reconstructions of historical forms, transcribed using the IPA.

In modern High German, however, the top and middle classes have been neutralized in the NomSg, with all such forms being vowel-final, e.g.
[flo:] ‘flea’. According to the single surface base hypothesis, this neutralization, a form of paradigm leveling, results from the fact that speakers are only able to make inferences from the NomPl form, which best maintains class contrasts overall (aside from the neutralizations shown here) and therefore is the privileged base. Therefore, any inference about the shape of a lexeme’s other forms must rely only on information present in its NomPl form, and so contrasts which are neutralized in that form, such as that between the top and middle classes here, are at risk of being lost across the entire paradigm. This restriction illustrates part (a) of the single surface base hypothesis. Moreover, according to part (b) of the hypothesis, the privileged cell must be the cell that neutralizes the fewest contrasts in the inflectional system overall (the NomPl in High German), even if it is not especially predictive of forms in the particular derivative cell whose form is being inferred, as in this example. Indeed, these predictions are borne out in the modern NomSg forms of words like [flo:] ‘flea’, where the loss of contrast found in the NomPl has been extended to NomSg forms, showing the explanatory power of the two parts of the single surface base hypothesis together.

Note that part (b) of the single surface base hypothesis further restricts part (a). In other words, while it is possible to propose a weaker version of the hypothesis including part (a) but excluding part (b), i.e. a view in which only one base form can be used for generating derivatives but in which the choice of base form can vary, part (b) logically requires part (a).[1] In this section, I will review the motivations for both parts of the hypothesis.
I will then consider how the hypothesis might be adapted to a probabilistic setting.

[1] One could also propose versions of the single surface base hypothesis with an arbitrary number of privileged cells, e.g. one in which there are exactly two privileged cells whose forms must be memorized and which are the only two cells whose forms can be used for inference. This generalization shades into the concept of principal parts (Stump & Finkel, 2013). Due largely to considerations related to my experimental methodology, I focus here on the most restrictive hypothesis, one with only a single privileged base.

3.1.1 Motivations for the single surface base hypothesis

Assuming the validity of the single surface base hypothesis benefits a word-based theory of inflectional morphology in two ways, both of which are relevant to the aims of this dissertation. First, restricting the ways that novel forms can be generated may reduce the computational and representational complexity of the morphological grammar. Second, the hypothesis accurately predicts patterns of paradigm leveling in multiple languages. The remainder of this section focuses on the first of these (the way that having a single privileged base can reduce model complexity), using this opportunity to introduce a graph-theoretic interpretation of inference in word-based inflectional morphology. The phenomenon of paradigm leveling, which I used in the previous subsection to explain the single surface base hypothesis, is covered in more detail in section 5.2, where I demonstrate that sublexical morphology is able to account for such patterns even without the single surface base hypothesis.

In a word-based model of inflectional morphology, specifically one like that assumed by Albright (2002) in which novel wordforms must be generated through the application of morphological operations to other wordforms, the grammar must contain information about how the speaker can generate a wordform in any one cell from the same lexeme’s wordform in any other cell. A primary purpose of the single surface base hypothesis is to simplify the information about an inflectional system that must be stored in the speaker’s grammar by restricting the “paths” through which a novel form can be derived, reducing the grammar’s formal complexity and presumably facilitating learnability. (Note, however, that the single surface base hypothesis itself provides no explanation for how speakers would learn the identity of the privileged base cell.)

To observe these two effects of the hypothesis, we can contrast two views of how an unknown inflected form is derived from other known forms of the same lexeme, one adhering to the single surface base hypothesis and the other taking the opposite extreme. For both of these, I will make use of a graph-theoretic representation of inflectional systems in which labeled nodes represent (wordforms in) cells in the system and edges (directed, i.e. one-way, or undirected, i.e. bi-directional) represent the “paths” via which a wordform in one cell can be derived from that in another cell. Further extending the spirit of the single surface base hypothesis into this graph-theoretic idiom, I assume that the process of a derivation can traverse only a single edge, i.e. that there are no intermediate representations.

One such graph structure for an inflectional system with four cells is a complete graph with undirected edges, as shown in Figure 3.2.

Figure 3.2: A fully connected inflection graph (nodes a, b, c, and d). Each form in the inflectional system can be used to generate each other form.

Such a view of the paths of derivation in inflectional morphology is maximally unconstrained. For any given target cell (a, b, c, or d in figure 3.2), the form in any other cell can be used to generate the form in the target cell.
Each edge in the graph represents a function that generates a form in one cell, or a set of candidates for that form, from the form in another cell.

However, this freedom comes at the cost of substantial computational complexity. For an inflectional system with $n$ nodes, the number of edges in its complete graph is $n(n-1)/2$. Moreover, each edge represents two mappings: one from node/cell x to node/cell y, and one from y to x. Thus the number of morphological operations deriving one cell’s form from another cell’s form is double this amount, $n(n-1)$. In the absence of disconfirming evidence, it would be computationally and theoretically preferable to use a less information-dense representation for the grammar.

According to Albright’s (2002, 2008 et seq.) single surface base hypothesis, the number of morphological operations (edges in a graphical representation) deriving one inflected form from another amounts to only $n-1$, where $n$ is again the number of cells in the inflectional system. Figure 3.3 gives a graphical representation of the possible derivations in a four-cell inflectional system under the single surface base hypothesis; note that cell a is assumed to be the privileged base, and that edges are directed because the a form must be memorized and cannot be generated from other forms.

Figure 3.3: An inflection graph (nodes a, b, c, and d) under the single surface base hypothesis, with cell a as the privileged base. The form in a must be memorized, and other forms must be derived only from a.

Albright (2002, p. 9) also discusses an intermediate alternative according to which “each [cell] in the paradigm must be derived from at most one unique base, but different [cells] may be derived using different bases.” Figure 3.4 shows an example of this kind of system. While this proposal lacks the strictness of the single surface base hypothesis, it still predicts that any particular cell’s form can be generated only from a single other cell’s form.
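To make the operation counts for the graph structures above concrete, the following is a minimal Python sketch. It is only an illustration: the function names and the representation of a mapping as a (base, derivative) pair are assumptions introduced here and are not part of the accompanying Python package.

```python
from itertools import permutations


def complete_graph_mappings(cells):
    """All ordered (base, derivative) pairs: any cell may derive any other cell."""
    return list(permutations(cells, 2))


def single_base_mappings(cells, privileged):
    """Only the privileged base cell may serve as the source of a derivation."""
    return [(privileged, cell) for cell in cells if cell != privileged]


# The four-cell system used in the figures of this section.
cells = ["a", "b", "c", "d"]
n = len(cells)

full = complete_graph_mappings(cells)
single = single_base_mappings(cells, privileged="a")

assert len(full) == n * (n - 1)             # n(n-1) directed mappings
assert len(full) // 2 == n * (n - 1) // 2   # n(n-1)/2 undirected edges
assert len(single) == n - 1                 # n-1 mappings under a single base

print(len(full), len(single))  # 12 3
```

For the eight-cell Icelandic noun system discussed later in this chapter, the same functions would give 56 mappings for the complete graph against 7 under a single privileged base.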
This intermediate version of the hypothesis amounts to an assumption of its part (a) and a rejection of its part (b).

Figure 3.4: An inflection graph (nodes a, b, c, and d) under a weakened version of the single surface base hypothesis, assuming that each cell can be generated from some cell.

The constant thread across all versions of the single surface base hypothesis proposed by Albright remains the restriction that wordforms in any particular cell can only be generated from a single other cell (or else memorized). Despite the advantages in formal complexity that these hypotheses afford, I conclude in this chapter that ultimately (and unfortunately, from the standpoint of computational and formal complexity) the only empirically valid model of inference in inflectional morphology violates these hypotheses, necessitating a grammar that looks more like the complete graph in Figure 3.2 than any of the simpler graphs in this section.

3.1.2 A probabilistic single surface base hypothesis

The single surface base hypothesis itself deals only with the question of which base cell’s forms can be used to synthesize, i.e. generate, candidates for inflected forms. In literature on the single surface base hypothesis (Albright, 2002, 2008; Albright & Hayes, 2003), determining which of these candidates will be uttered relies on the machinery of the Minimal Generalization Learner, a framework for learning and applying sets of morphological rules for inflectional morphology. As Albright & Hayes (2002) in particular makes clear, while the confidence measures used by the Minimal Generalization Learner allow quantitative comparison of rules and therefore of candidates, the Minimal Generalization Learner does not learn probabilistic grammars.
In order to test the spirit of the single surface base hypothesis in the specific context of a probabilistic view of the paradigm cell filling problem, it is necessary to formulate an explicitly probabilistic version of the hypothesis.

In the language of probability theory, we can consider the form of a derived inflected word to be a discrete random variable $D$, such that $p(D)$ indicates a distribution of probability mass across a finite set of candidate wordforms. For example, 3.3 shows a possible set of candidate wordforms for the plural form of the English lexeme GIRAFFE and a possible probability distribution over them. According to the definition of probability, the probabilities across all candidates must sum to 1.

$$p(D): \begin{cases} p(D = [\text{dʒɪræfs}]) = 0.7 & \text{(cf. LAUGH)} \\ p(D = [\text{dʒɪrævz}]) = 0.297 & \text{(cf. CALF)} \\ p(D = [\text{dʒɪræfɪz}]) = 0.002 & \text{(cf. CLASS)} \\ p(D = [\text{dʒɪrævɪz}]) = 0.001 & \text{(cf. HOUSE)} \end{cases} \tag{3.3}$$

I propose that the appropriate probabilistic interpretation of the single surface base hypothesis is one that prohibits non-privileged base forms from affecting the calculation of probability distributions over derivative candidates. As described in the introduction to this section, then, we can use the definition of conditional independence to arrive at the probability-theoretic version of the single surface base hypothesis given by the equation in 3.2, repeated below as 3.4.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(D \mid B_{\text{privileged}}) \tag{3.4}$$

In other words, the probabilistic version of the hypothesis predicts that the probability of a derived form of a lexeme given all of its other forms equals the probability of that derived form given only the form of its privileged base.
Less formally, this equation states that no information contained in forms of a lexeme other than the base form can exert any influence on probabilities given to that lexeme’s derived forms.

Note that this probabilistic formulation of the single surface base hypothesis encompasses only its part (a) as defined at the start of this section: the limitation of making use of only one base form. The definition in equation 3.4 is agnostic with respect to the question of whether that privileged base cell must be the same across all lexemes and all derivative cells in an inflectional system, i.e. part (b) of the definition from the beginning of this section. The requirement for an inflectional system to have only one cell ever used as the privileged base needs no probabilistic reinterpretation and can still be applied unchanged in cases where the grammar predicts a probability distribution over derivative candidates. Moreover, it is still the case in a probabilistic setting that this restriction (b) logically requires part (a), i.e. equation 3.4, and so by falsifying this equation, the limitation to one privileged base across an entire inflectional system can also be falsified.

This discussion brings us to the question of how to falsify the probabilistic version of the single surface base hypothesis. As usual in hypothesis testing, it is necessary to determine what falsifiable predictions the hypothesis makes.
To this end, we can rewrite equation 3.4 to specifically indicate that the privileged base is one of the bases in $B_1, B_2, \ldots, B_n$, making it clear that the conditioning factor on the right hand side of the equation is a proper subset of those on the left:

$$p(D \mid B_{\text{privileged}}, B_{\text{other}}, \ldots, B_n) = p(D \mid B_{\text{privileged}}) \tag{3.5}$$

In other words, it is possible to test the probabilistic single surface base hypothesis by determining whether there exist inflectional systems such that bases other than the privileged base can affect the probability of a derivative candidate. This type of conjecture (proving that some equality does not hold, especially in cases where one condition is a superset of another) lends itself to scrutiny by experiment. The following section describes an experiment designed to test the very equality in 3.5.

3.2 Icelandic nouns and multiple bases

In previous sections, this chapter has made a case for the usefulness of the single surface base hypothesis, and it has also proposed a way that the hypothesis might be adapted to a probabilistic model of inflectional morphology. In these two final sections, however, I present evidence that the single surface base hypothesis as described here cannot hold true for all languages. This discussion focuses on a particular pattern in Icelandic noun declensions that appears to require information from multiple bases in the production of a single derivative form. To expand this argument from the domain of abstract properties of an inflectional system to concrete facts of speaker behavior, I report experimental evidence that Icelandic speakers do indeed make use of multiple base forms in such situations, counter to the predictions of the single surface base hypothesis.

3.2.1 Icelandic noun inflection

Nouns in Icelandic inflect for case and for number (Einarsson, 1949; Kress, 1982). Having four cases and two numbers, the inflectional system of nominals comprises a total of eight cells.
These eight case–number combinations are shown in Figure 3.5 along with their abbreviations.

                Nominative    Accusative    Dative    Genitive
    Singular    NomSg         AccSg         DatSg     GenSg
    Plural      NomPl         AccPl         DatPl     GenPl

Figure 3.5: The four cases and two numbers of Icelandic nouns, as well as their abbreviations.

The phonological exponents of case and number information primarily consist of suffixes which jointly express case and number, for example, a DatPl suffix written as -um and pronounced [ʏm], or an orthographically and phonologically null NomSg suffix. Icelandic nouns sometimes also exhibit stem vowel alternations, e.g. NEED NomSg þörf [θœrv] ∼ NomPl þarf-ir [θarvir]. While such vowel alternations present no issues for the models I advocate, for the sake of simplicity, I focus here on suffix patterns. Note in addition that nouns can be marked for definiteness, e.g. HOUSE DatPl definite hús-unum [hu:sʏnʏm] ∼ DatPl indefinite hús-um [hu:sʏm], but that the domain of immediate interest is limited to indefinite forms. For the remainder of this discussion of Icelandic, I will use orthographic representations rather than transcriptions, as the experimental data I will present relate directly only to the former. Orthographic representations can be fairly viewed as roughly corresponding to equivalent IPA symbols, and more fine-grained details of the Icelandic writing system and its relation to segmental phonology are not relevant to this discussion.

Most case-number combinations exhibit a variety of suffixal exponents. These suffixes combine to form what have traditionally been called inflectional classes, i.e. sets of lexemes with identical inflection across the entire paradigm (Müller, 2005). These classes also interact with grammatical gender: inflectional class and gender (masculine, feminine, or neuter) are strongly correlated.
For example, Figure 3.6 shows the suffix co-occurrence patterns of six inflectional classes whose member lexemes are exclusively (except for one or two proper names of the STANZA class; Gunnar Ó. Hansson, p.c.) of the feminine gender.

             CHIP       NYMPH     BAY       HEATH      MOVEMENT       STANZA
    NomSg    flís       dís       vík       heið-i     hreyfing       vís-a
    AccSg    flís       dís       vík       heið-i     hreyfing-u     vís-u
    DatSg    flís       dís       vík       heið-i     hreyfing-u     vís-u
    GenSg    flís-ar    dís-ar    vík-ur    heið-ar    hreyfing-ar    vís-u
    NomPl    flís-ar    dís-ir    vík-ur    heið-ar    hreyfing-ar    vís-ur
    AccPl    flís-ar    dís-ir    vík-ur    heið-ar    hreyfing-ar    vís-ur
    DatPl    flís-um    dís-um    vík-um    heið-um    hreyfing-um    vís-um
    GenPl    flís-a     dís-a     vík-a     heið-a     hreyfing-a     vís-na

Figure 3.6: Representative words and their suffix paradigms from six inflectional classes associated with the feminine gender.

Information about an inflected form’s suffix provides information about what inflectional class that form’s lexeme could belong to and, more descriptively, provides information about what suffixes other inflected forms of that lexeme are likely to take. Knowledge of a lexeme’s gender can also provide information about its likely suffixes (and vice-versa). For example, according to the classes shown in Figure 3.6, if it is known that some arbitrary lexeme is feminine and that its NomPl takes the -ir suffix, then one can infer that its GenSg takes the -ar suffix (as a lexeme in the NYMPH inflectional class). Conversely, if it is known that a lexeme’s GenSg takes the -s suffix (common for masculines and neuters, but not used for feminines), then one can infer that it is not of the feminine gender. The primary goal of the experiment described in the next section is to determine whether native speakers of Icelandic in fact use knowledge of a lexeme’s suffixes to perform inference in this way.

3.2.2 Predictors of the Icelandic AccPl

For the purposes of setting up the experiment introduced in the next section, I focus now on the AccPl forms of Icelandic nouns.
In particular, the discussion will focus on the four AccPl suffixes described in Figure 3.7, as these suffixes’ distributions comprise a basis for testing the single surface base hypothesis.

    Suffix    Gender       Theme Vowel
    -a        masculine    a
    -i        masculine    i
    -ar       feminine     a
    -ir       feminine     i

Figure 3.7: Four AccPl suffixes of Icelandic nouns, their usual genders, and their stem vowels.

Of central interest is the question of how a noun’s AccPl suffix can be predicted given knowledge only of its GenSg and/or NomPl forms, that is, without any additional information about gender or other inflected forms. The GenSg and NomPl are classically considered the “principal parts” of Icelandic nouns, owing to their great predictiveness of other cells’ inflected forms, and are standardly listed in dictionary entries, e.g. Árnason (2007). The GenSg of a lexeme provides information about its gender and therefore also its AccPl: a GenSg with the -s suffix is compatible with the masculine AccPl suffixes -a and -i but not with -ar or -ir, while the -ar GenSg suffix is compatible primarily with the feminine AccPl suffixes -ar and -ir. A lexeme’s NomPl form provides complementary information: a NomPl with the -ar suffix is compatible only with the a-stem AccPl suffixes -a and -ar, while the -ir NomPl suffix is compatible only with the i-stem AccPl suffixes -i and -ir.

Figure 3.8 illustrates these patterns by showing counts (i.e. type frequencies) of Icelandic nouns with any one of the four GenSg-NomPl-AccPl suffix constellations described above, drawn from the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012) and grouped according to their AccPl, GenSg, and NomPl suffixes. These counts constitute all three forms from a total of approximately 182,000 lexemes. Numbers in parentheses provide more abstract estimates of how much of the lexicon falls into each category by conflating all compounds containing the same head noun, e.g. counting vorlaukur ‘spring onion’ (lit.
‘spring-onion’) and graslaukur ‘chives’ (lit. ‘grass-onion’) together as having a frequency of one.

                             AccPl -a           AccPl -i         AccPl -ar        AccPl -ir
    GenSg -s,  NomPl -ar     17,193* (1,880)    0 (0)            0 (0)            0 (0)
    GenSg -s,  NomPl -ir     0 (0)              3,289* (153)     0 (0)            0 (0)
    GenSg -ar, NomPl -ar     318 (14)           0 (0)            6,704* (476)     0 (0)
    GenSg -ar, NomPl -ir     0 (0)              3,983 (108)      0 (0)            14,274* (714)

Figure 3.8: Raw counts (and head noun-based counts) of Icelandic noun forms grouped by their AccPl, GenSg, and NomPl suffixes, taken from the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012).

As Figure 3.8 shows, knowing both a lexeme’s GenSg suffix and its NomPl suffix should allow a speaker to narrow down the four AccPl suffix options to just a single choice (marked with an asterisk in Figure 3.8) compatible with implicational relationships in the lexicon, either absolutely (when GenSg is -s) or with high certainty (when GenSg is -ar).

When only the GenSg suffix or only the NomPl suffix (but not both) is known, according to the same lexical patterns as above, one can perform useful inference about the AccPl suffix, narrowing its viable suffix options from four to fewer. However, knowing either the GenSg or NomPl alone does not suffice to pick out a single highly likely AccPl suffix. Figure 3.9 demonstrates the relevant lexical patterns in such cases of limited useful information.

                 AccPl -a          AccPl -i        AccPl -ar      AccPl -ir
    GenSg -s     17,193 (1,880)    3,289 (153)     0              0
    GenSg -ar    318 (14)          3,983 (108)     6,704 (476)    14,274 (714)

                 AccPl -a          AccPl -i        AccPl -ar      AccPl -ir
    NomPl -ar    17,511 (1,894)    0               6,704 (476)    0
    NomPl -ir    0                 7,272 (261)     0              14,274 (714)

Figure 3.9: Raw counts (and head noun-based counts) of Icelandic noun forms grouped by their AccPl, GenSg, and NomPl suffixes, but with GenSg-based and NomPl-based groupings performed separately, taken from the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012).

The schematic in Figure 3.10 provides an abstracted view of the lexical facts set forth so far.
It displays the four AccPl suffixes of interest as lying in a table with four sections. With high accuracy, knowledge of a lexeme’s GenSg predicts which column in the table should contain that lexeme’s AccPl suffix, while knowledge of its NomPl predicts which row should contain its AccPl suffix.

                     GenSg -s    GenSg -ar
    NomPl -ar        -a          -ar
    NomPl -ir        -i          -ir

Figure 3.10: Schematic of four possible AccPl suffixes in Icelandic, with their typical lexical correspondences to GenSg and NomPl suffixes.

This situation, in which two base forms provide complementary information conceivably usable in inferring a derivative form, provides an ideal case for testing the probabilistic formulation of the single surface base hypothesis. The next section describes an experiment designed to test whether information in both base forms is accessible to Icelandic speakers for use in inference (part b of the hypothesis) and whether speakers can combine information from the two forms in order to improve the accuracy of their inference (part a of the hypothesis).

3.3 Falsifying the single-base restriction: an Icelandic experiment

As the previous section describes, certain inflectional morphology tasks plausibly faced by native speakers of Icelandic (such as the prediction of a noun’s AccPl form in the absence of gender information) exhibit patterns of incomplete predictability which would require that an analyst considering the pattern look beyond the information in just a single base form. The purpose of this section is to determine whether native speakers of Icelandic adjust their judgments in such situations by considering, as analysts can, information from multiple base forms. The finding that native speakers can do so would directly contravene the single surface base hypothesis, and indeed this result obtains in the experiment presented here.

3.3.1 Methodology

A total of 191 Icelandic native speaker participants were recruited by posts on mailing lists and social media and through word of mouth.
Of these participants, 123 completed the entire experiment and described themselves as native speakers of Icelandic, and it is these 123 participants’ data which I analyze here. Among these participants, approximately 30% described themselves as having been born between 1948 and 1957, and approximately 31% as having been born between 1958 and 1967; the 1938–1947, 1968–1977, and 1978–1987 ranges each comprised approximately 10% of participants, with the remaining few participants either born before 1937, born after 1988, or declining to specify. Among these same 123, approximately 68% described their gender as female, approximately 26% as male, and approximately 1.6% as another gender, while the rest declined to specify.

Figure 3.11: A screenshot of one trial frame in the Icelandic experiment. This frame has presented the DatPl and NomPl forms of the GLEIT nonce lexeme and is now eliciting a choice for its AccPl.

Procedure

The experiment itself was carried out entirely online using the Experigen (Becker & Levine, 2012) framework. On the Experigen web interface, participants were given information about the experiment and were then provided a consent form to electronically sign. Each participant who consented engaged next in two non-randomized practice trials, after which she or he completed thirty-two test trials. Finally, all consenting participants filled out a demographic questionnaire asking for non-identifying personal information. All parts of the experiment were presented in Icelandic, as translated from English by a native speaker of Icelandic. The English translation of this questionnaire can be found in the appendix.

Each trial concerned a single novel lexeme designed to resemble existing Icelandic noun lexemes. A participant would first be exposed to some number of inflected forms of the novel lexeme in carrier sentences.
The task was then to select one preferred AccPl form of the lexeme out of a fixed set of four options, one created using each of the four AccPl suffixes discussed in the previous section. The order of presentation of these suffix options was randomized. The participant’s choice of AccPl form was recorded as the key experimental measure.

Primary manipulation

The information about a novel lexeme presented to a participant before the AccPl choice task varied according to each stimulus frame’s presentation condition. This variable ranged across the four options shown below, which correspond to the inflected forms shown to participants before they were asked to select an AccPl form. The DatPl form conveys no information about which AccPl suffix is most appropriate, because it takes an -um suffix for all nouns in the language (with the exception of a handful of monosyllabic vowel-final stems). The DatPl form is always provided to introduce participants to each novel lexeme and encourage them to think of it as an existing Icelandic noun. Some trials then also present either the GenSg or NomPl, which, according to the lexical patterns described in the previous section, provides limited useful information about what the AccPl form should be. Some other trials then provide the remaining base form, which, again according to the lexical patterns described in the previous section, should leave only one AccPl exponent likely.

    1. DatPl only [uninformative]
    2. DatPl, then GenSg [somewhat informative]
    3. DatPl, then NomPl [somewhat informative]
    4. DatPl, then GenSg, then NomPl [maximally informative]

Figure 3.12: The four presentation conditions of the Icelandic experiment.

This manipulation tests both parts of the single surface base hypothesis. By using these four presentation conditions, it was possible to evaluate whether both the GenSg form and the NomPl form were used by speakers in inference by comparing responses in conditions 2 and 3 to responses in condition 1.
This design also made it possible to evaluate whether the combined information from the GenSg and NomPl together in condition 4 was used to further modify judgments compared to conditions 2 and 3.

Stimuli

Novel lexeme stems, 32 in total, were assigned randomly to the four inflectional classes corresponding to the four AccPl suffixes forming the range of participants’ choices. For clarity, these inflectional classes are repeated in Figure 3.13. These stems were designed so as to minimally influence judgments about their appropriate inflectional class or gender, or at least to balance stems more likely to be judged as belonging to one class/gender with stems more likely to be judged otherwise. Specifically, all stems were of the shape ((C)C)CVC, with their vowels and final consonants drawn from the sets {e, ei, a, ó} and {p, t, m, n}, respectively. These choices of stem shape, vowels, and consonants are based largely on research by Hansson (2006), who shows experimentally that Icelandic speakers are sensitive to lexical correlations among stem shape, gender, and inflection reported by Jónsdóttir (1989, 1993). Two stems were generated from each combination of vowel and final consonant, yielding a total of 32 stems. The two stems in each pair with the same vowel-consonant combination were always assigned to different inflectional classes. To minimize the risk of stems being similar enough to existing stems that the existing lexemes’ inflection could unduly affect judgments about the novel stems, I implemented a script that rejected any stems with an edit distance of 1 from any existing stem in the Icelandic lexicon (Bjarnadóttir, 2012), and a linguistically savvy native speaker of Icelandic also performed a similar form of filtering using his own judgments of similarity.
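The edit-distance filter described above might look roughly like the following Python sketch. This is not the script actually used for the experiment: it assumes character-level Levenshtein distance over orthographic forms, it treats a distance of 0 (an exact match) as grounds for rejection as well, and the function names and the tiny stand-in lexicon are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_stems(candidates, existing_stems, min_distance=2):
    """Keep only candidates at least min_distance edits from every existing stem."""
    return [
        stem for stem in candidates
        if all(edit_distance(stem, known) >= min_distance for known in existing_stems)
    ]


# Stand-in lexicon built from stems cited in this chapter (not the full database).
existing = ["flís", "dís", "vík", "hús"]
print(filter_stems(["glís", "gleit"], existing))  # ['gleit']
```

Here the hypothetical candidate glís is rejected because it is one substitution away from flís, while gleit is retained.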
A complete list of stimuli can be found in the appendix.

                    GenSg    NomPl    AccPl
    1 (Masc-a)      -s       -ar      -a
    2 (Masc-i)      -s       -ir      -i
    3 (Fem-a)       -ar      -ar      -ar
    4 (Fem-i)       -ar      -ir      -ir

Figure 3.13: The suffixes of the four inflectional classes into which novel lexeme stems were randomly distributed. All lexemes took the DatPl suffix -um.

(Pseudo-)randomization of stimuli was performed in two ways. The order of novel lexemes was first itself randomized. The assignment of lexemes to presentation conditions also varied across participants, but rather than being randomized, Experigen cycled through four possible sets of pairings of lexeme and presentation condition from each participant to the next. Each of these sets of pairings distributed lexemes of various inflectional classes evenly across the four presentation conditions, such that there were always two lexemes in each inflectional class in each presentation condition. The two stems for each vowel-consonant combination were always presented in separate presentation conditions.

Carrier sentences

The DatPl, GenSg, and NomPl forms, when presented, were always given inside a carrier sentence designed to make it clear which paradigm cell the presented form belongs to. These sentences were carefully designed so as not to provide any additional information about gender. The AccPl was also elicited using a carrier sentence, which provided a long underscore (blank line) in the position of the requested AccPl form and which also provided no extra information about gender. After the presentation of each inflected form, participants pressed a button to trigger presentation of the next inflected form or finally the AccPl choice task, always leaving all previously presented inflected forms of the trial lexeme visible in their frame sentences. This button-based procedure was added to encourage participants to consider the information provided by each inflected form.
Inflected forms themselves were shown in boldface to make them stand out against the carrier sentences.

Together, all of the carrier sentences formed a short narrative about a person named Jón and his fondness for certain collectible objects, intended to allow participants to think of novel lexemes as the names of obscure trinkets. The narrative was coherent regardless of which sentences were included. The purpose of this design choice, in conjunction with the explicit instruction before the test trials to think of novel words as real but rare Icelandic nouns, was to encourage participants to use their knowledge of inflectional patterns in existing Icelandic words when performing the AccPl choice task. All presentation and elicitation was performed using orthographic representations, and no recordings were played or made at any time. Frame sentences are provided in Icelandic with English translations in the appendix.

3.3.2 Results

The Icelandic-speaking participants demonstrated, as a group, their ability to make fruitful use of information contained in both GenSg and NomPl forms, as well as their ability to combine information contained in these two base forms. Crucially to these conclusions, participants achieved a higher rate of success in selecting a lexeme’s associated AccPl form when provided either its GenSg form or its NomPl form, as compared to when provided only its DatPl form. Moreover, success rates were higher still when participants were provided both the GenSg and the NomPl form of a lexeme. Figure 3.14 summarizes these response patterns.

[Figure 3.14 (plot): x-axis, presentation condition (DatPl, DatPl+GenSg, DatPl+NomPl, DatPl+NomPl+GenSg); y-axis, proportion correct (0.00 to 1.00).]

Figure 3.14: Participants’ proportions of “correct” responses in the Icelandic experiment by presentation condition. Vertical bars show 95% confidence intervals, and horizontal bars show quartile values.
Color-coded probability density functions show kernel estimations of the underlying distributions.

In order to demonstrate the statistical validity of these conclusions, I fit linear models to the experimental data and evaluate the model parameters optimized to describe the data. To this end, I first introduce the linear model framework itself and then demonstrate that according to this model of the responses, neither part of the probabilistic single surface base hypothesis is consistent with the experimental data.

Linear models: set-up

Results from this experiment were analyzed in R (R Core Team, 2013) using the lme4 package (Bates et al., 2015b), which implements generalized linear mixed effects models (GLMMs). The dependent variable to be modeled is a binary variable indicating, for each frame/lexeme, whether the participant selected the AccPl form that was defined as the “correct” AccPl form of that lexeme. Consequently, analyses were performed using logistic regression as implemented by the glmer function with the argument family="binomial". Assuming that any information about GenSg and NomPl forms used by participants will potentially shift their judgments toward the AccPl form(s) that correspond to those GenSg and NomPl suffixes, this measure makes it possible to assess whether participants are using information in the provided GenSg and NomPl forms. In short, this approach allows us to estimate whether and how much knowing the GenSg or NomPl of a nonce word (or both) improves participants’ likelihood of selecting its proper AccPl form, and therefore whether participants make use of more than a single base.

The fixed effects under investigation (in other words, the predictors or experimentally manipulated independent variables) are the effects of knowing a lexeme’s GenSg or NomPl form. Each of these is a binary variable (i.e. a variable that takes either the value 1 or the value 0) indicating whether or not that form was included in the lexeme’s presentation frame.
In fitting a linear model, each fixed effect receives an estimate of its coefficient, that is, a real number which informally serves the purpose of indicating how strong an effect that predictor exhibits given the data. These variables, knew.gensg and knew.nompl, each evaluate to 0 when the GenSg or NomPl (respectively) was not presented in a trial, and to 1 when that form was presented. The interaction between these two variables was also examined as a fixed effect, the variable knew.nom:knew.gen, which evaluates to 1 only when both the GenSg and NomPl forms were presented. An intercept, which always evaluates to 1, is also included to express the baseline performance of participants when shown only the DatPl form. The log odds of selecting a lexeme’s proper AccPl form under this model therefore correspond to the sum of the coefficients of variables that evaluate to 1. For example, it predicts the success rate of a participant shown only the DatPl of a lexeme to be based on just the intercept’s coefficient, while it predicts the success rate when all three bases are provided to be based on the sum of the intercept’s coefficient, the coefficient of knowing the GenSg, the coefficient of knowing the NomPl, and the coefficient of knowing both the GenSg and NomPl (i.e. their interaction).

In GLMMs, random effects are included to account for variability within populations that have been sampled from. In the case of this experiment, participants have been sampled from the population of all Icelandic speakers, and stems have been sampled from the population of all possible Icelandic nonce stems. Whether any particular Icelandic speaker or stem behaves or is treated in some particular way is not within the scope of the research questions, but by adding random effects for these variables, it is possible to improve models’ ability to correctly assess the effects of fixed effect variables.
Random effect structures up to the maximal structure were considered, specifically random intercepts by subject and stem and random slopes for both for all three fixed effects, as recommended by Barr et al. (2013). However, because models including random slopes for the two predictor variables and their interaction failed to converge, I followed the recommendation of Bates et al. (2015a) by using only random intercepts for the two random effects, participants and items. This model suffered from no convergence problems.

Linear models: hypothesis testing

Part (a) of the probabilistic single surface base hypothesis states that in any inflectional inference task, only information from one base form can affect judgments about the distribution over potential derivative forms. Part (b) of the probabilistic single surface base hypothesis states that there is exactly one privileged cell in an inflectional system whose forms can be used as the basis of inferences about derived forms in other cells. In the context of a GLMM analysis of these experimental data, part (b) predicts that only one (or neither) variable out of knew.gensg and knew.nompl should have a significant effect, and part (a) predicts that even if both have a significant effect, their interaction knew.gensg:knew.nompl should have a significant negative coefficient, preventing the sum of the main effects’ coefficients from yielding a value higher than either one yields alone. The results of fitting a GLMM with the parameters described above to the experimental data, as shown in Figure 3.15, are inconsistent with these predictions.

| | Dependent variable: correct |
|---|---|
| knew.nom | 0.738∗∗∗ (0.120) |
| knew.gen | 0.423∗∗∗ (0.121) |
| knew.nom:knew.gen | 0.069 (0.166) |
| (Intercept) | −1.585∗∗∗ (0.303) |
| Observations | 3,934 |
| Log Likelihood | −1,864.399 |
| Akaike Inf. Crit. | 3,740.798 |
| Bayesian Inf. Crit. | 3,778.463 |

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Figure 3.15: A GLMM with maximum likelihood coefficients predicting whether a participant’s selected AccPl corresponded to the correct AccPl. Values listed for each predictor are their coefficient estimates, and associated values in parentheses are their standard errors. Table created using Stargazer (Hlavac, 2013).

Because the model uses logistic regression, coefficient estimates are log odds. The intercept estimate of -1.585 shown here, taking also into account the random effects, corresponds to a “success” rate of approximately 17% on trials when only the DatPl was presented. Knowing the GenSg (knew.gen) and NomPl (knew.nom) here both have significantly positive effects, yielding a predicted success rate of approximately 24.7% when the GenSg is known and 29.9% when the NomPl is known. The predicted success rate given both informative bases was 41.2%. These predicted success rates were calculated by exponentiating sums of model coefficients to arrive at odds and then converting these odds to probabilities.

Recall that hypothesis part (b) depends crucially on the truth of hypothesis part (a). By disproving part (a), then, it is possible to disprove the entire probabilistic single surface base hypothesis. Indeed, the fitted model parameters falsify part (a), the stipulation that a speaker can make use of information from only one base form. Note that as a linear model, the GLMM shown in Figure 3.15 is linearly additive: the model’s predictions are based on the sum of its predictors’ effects. Consequently, because both knew.gen and knew.nom have significantly positive effects and there is no significant negative interaction effect, it is possible to conclude that when participants are provided both the GenSg and the NomPl, they achieve a higher rate of correct responses than when provided only one base or the other, with the GenSg and NomPl coefficients adding together.
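The conversion from summed coefficients to predicted success rates described above can be sketched as follows. The coefficient values are the fixed-effect estimates from Figure 3.15; the function name is an illustrative assumption. Because this sketch uses only the fixed effects and ignores the random effects, its outputs need not match the percentages reported in the text exactly.

```python
import math

# Fixed-effect estimates (log odds) copied from Figure 3.15.
COEFFICIENTS = {
    "(Intercept)": -1.585,
    "knew.nom": 0.738,
    "knew.gen": 0.423,
    "knew.nom:knew.gen": 0.069,
}


def predicted_success(knew_nom: bool, knew_gen: bool) -> float:
    """Sum the coefficients of all predictors that evaluate to 1,
    exponentiate to get odds, then convert odds to a probability."""
    log_odds = COEFFICIENTS["(Intercept)"]  # the intercept always evaluates to 1
    if knew_nom:
        log_odds += COEFFICIENTS["knew.nom"]
    if knew_gen:
        log_odds += COEFFICIENTS["knew.gen"]
    if knew_nom and knew_gen:
        log_odds += COEFFICIENTS["knew.nom:knew.gen"]  # interaction term
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Fixed-effects-only predictions for the four presentation conditions.
rates = {
    "DatPl": predicted_success(False, False),
    "DatPl+GenSg": predicted_success(False, True),
    "DatPl+NomPl": predicted_success(True, False),
    "DatPl+NomPl+GenSg": predicted_success(True, True),
}
```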
In addition, these results are incompatible with the hypothesis that while each Icelandic speaker can make use of only one privileged base, some speakers use the GenSg while others use the NomPl. A significantly negative interaction effect would indicate that both GenSg and NomPl knowledge contribute significantly but participants with knowledge of both perform less well than would be expected from their joint knowledge; however, the interaction effect achieves a p-value of ≈0.60, and the estimated coefficient is dwarfed by those of the other effects. The fact that there is no significant interaction effect indicates that the additive nature of the knew.gensg and knew.nompl effects is legitimate. Therefore the judgments of even individual Icelandic speakers can be affected by both base forms together.

| Fixed effect | χ² | p | Degrees of freedom |
|---|---|---|---|
| knew.gen | 16.94 | < 0.001 | 2 |
| knew.nom | 17.82 | < 0.001 | 2 |
| 1 (only intercept) | 110.1 | < 0.0001 | 1 |

Figure 3.16: Results of likelihood ratio tests performed using the anova function in R (R Core Team, 2013) comparing the superset model shown in Figure 3.15 to three subset models.

It is also possible to explicitly test part (b) of the hypothesis. If there is only a single privileged base form, the GenSg or NomPl (or some other form), then we would expect a simpler GLMM with a fixed effect only for knowledge of the GenSg or for knowledge of the NomPl (or neither) to achieve as much predictive power as the model shown above with a superset of these predictors. Such simpler models were fitted to the experimental data, and likelihood ratio tests were performed to compare the models. Figure 3.16 shows the results of these tests.
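For readers unfamiliar with likelihood ratio tests, the arithmetic behind the p-values in Figure 3.16 can be sketched as follows. This is an illustration only, not the R anova call used in the analysis. The function names are assumptions, and the closed-form survival function below holds only for a chi-square distribution with 2 degrees of freedom, so it applies to the knew.gen and knew.nom rows but not to the intercept-only row.

```python
import math


def lrt_statistic(loglik_full: float, loglik_reduced: float) -> float:
    """Likelihood ratio statistic: twice the gain in log-likelihood of the
    superset model over the nested subset model."""
    return 2.0 * (loglik_full - loglik_reduced)


def chi2_sf_df2(x: float) -> float:
    """P(X > x) for a chi-square variable with exactly 2 degrees of freedom,
    which is an exponential distribution with mean 2."""
    return math.exp(-x / 2.0)


# Chi-square statistics reported in Figure 3.16 for the two df = 2 comparisons.
p_without_gen = chi2_sf_df2(16.94)
p_without_nom = chi2_sf_df2(17.82)
```

Both values fall below 0.001, in line with the p column of the table.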
The superset model predicts the experimental data better than any of the subset models under consideration, lending further evidence that there is no single privileged base form available for Icelandic speakers to use in inflectional inference, but rather that they make use of information from multiple bases.

Because participants were recruited largely through Icelandic social media forums and networks with large proportions of university-educated people, one question in the post-study demographic questionnaire asked participants whether they had taken any Icelandic language or linguistics classes at the post-secondary level. Such classes sometimes explicitly discuss the inflectional system of Icelandic (Gunnar Ó. Hansson, p.c.), potentially causing participants’ judgments to be informed in large part by conscious, explicit knowledge of grammatical patterns and prescriptive norms as opposed to only their intuitions as native speakers. Approximately one third of participants indicated that they had taken at least one of these advanced Icelandic language or linguistics classes (n=45 out of 123). As confirmed by adding a fixed effect encoding this binary difference among participants to the superset model, the significant main effects and lack of a negative interaction effect shown in Figure 3.15 still hold of both groups of participants. However, the main effect coefficient estimates are higher for the 45 participants who had taken an advanced Icelandic class, as might be expected if they have access to explicit prescriptive knowledge in addition to their implicit native speaker intuitions. Even so, these data suggest that the GLMM analysis results do not stem solely from participants’ knowledge of prescriptive linguistic norms.

3.3.3 Discussion of Icelandic experiment

The results from this experiment on native Icelandic speakers force a sound rejection of the probabilistic single surface base hypothesis as defined in 3.1.
Each base form provided to participants improved their success rate, indicating that both bases can be used in inference, even in conjunction with each other. Despite the formal, empirical, and learnability-theoretic advantages of models of inflectional morphology in which information from multiple bases cannot be combined, actual speakers do not behave so simply.

Despite the falsification of the single surface base hypothesis, it is not necessarily the case that information from any and all base forms can be combined freely. With an eye toward formulating as parsimonious a theory as is compatible with available behavioral data, the next section describes a successor to the single surface base hypothesis. This hypothesis shows promise in constraining the space of possible inflectional grammars while still maintaining compatibility with the Bayesian view adopted in this dissertation.

3.4 Base independence

Having established that an inflectional system is not limited to only a single privileged base cell, and also that native speakers can combine information from multiple base cells in their inferences about unknown inflected forms, I now turn to questions of the precise nature of how speakers combine information from multiple base forms. Specifically, I present a simplifying hypothesis of the potential interactions among information from different base forms, the base independence hypothesis, and I demonstrate that the Polish nominal inflection system presents a test case for this hypothesis.
An experimental investigation of Polish speakers’ knowledge of this part of their inflectional system, however, failed to demonstrate even the baseline behavior that was assumed for purposes of the experimental design, and so this experimental investigation of the base independence hypothesis provides no substantial evidence for or against it.

3.4.1 The base independence hypothesis

Because speakers are able to use multiple base forms in inference, the conditional probability distribution that their grammars generate is one over derivative forms (or, equivalently in sublexical morphology, sublexicons) conditional on all or at least multiple known base forms, i.e. $p(\text{derivative} \mid \text{base}_1, \text{base}_2, \ldots, \text{base}_n)$. Through application of Bayes’s theorem, it can be seen that this distribution is proportional to $p(\text{base}_1, \text{base}_2, \ldots, \text{base}_n \mid \text{derivative})\, p(\text{derivative})$, i.e. the joint probability distribution over all possible forms of the available bases conditioned on the derivative candidates, times the prior probability distribution of the derivative candidates. However, calculating a joint probability distribution over the forms in all base cells together may pose considerable difficulties, because it may require not only the calculation of form probabilities for all base cells individually, but also the calculation of form probabilities for all combinations of base cells.

For the purpose of determining whether the full joint probability distribution must indeed be calculated, I propose the base independence hypothesis, which states that for any lexeme in an inflectional system, the members of any proper subset of its forms are conditionally independent given some other form. Less formally, this hypothesis states that calculating the probabilities of individual base forms suffices to arrive at the probability of those base forms together, because these individual probabilities can simply be multiplied together.
For the purposes of determining a probability distribution over derivative forms given a set of base forms, it is possible to state this hypothesis mathematically by applying the definition of conditional independence, as shown in equation 3.6. Note again here that $B_{\text{cell}}$ indicates the set of a lexeme’s possible base forms in a particular cell, and $S$ indicates the set of sublexicons.

$$p(B_1, B_2, \ldots, B_n \mid S) = p(B_1 \mid S)\, p(B_2 \mid S) \cdots p(B_n \mid S) \tag{3.6}$$

As equation 3.6 makes clear, the base independence hypothesis would allow the joint distribution of multiple bases to be calculated by simply calculating the probability of each base separately and then multiplying these probabilities. If this hypothesis holds, then even with multiple bases available for inference, actually performing inference with multiple bases would likely still be tractable, as it would be equivalent to performing single-base inference for each available base and then combining these results afterwards.

It is worth considering at this point how to begin approaching the task of implementing the base independence hypothesis under the assumptions of sublexical morphology. The term $p(b_1, b_2, \ldots, b_n \mid s)$, i.e. the joint probability of the provided bases given a particular sublexicon, is calculated by having the gatekeeper grammar of sublexicon $s$ assign a probability to the base forms $b_1, b_2, \ldots, b_n$. A sublexicon’s gatekeeper grammar is parameterized by a set of weighted constraints. Each of these constraints is indexed to a specific base cell, e.g. a constraint 3Sg: a# which evaluates to 1 if a provided 3Sg base form ends in [a] and evaluates to 0 otherwise. However, each sublexicon has only one gatekeeper grammar, which can include constraints that refer to any and all cells in the inflectional system: mixed in with 3Sg: a# might be constraints like 3Pl: an# and 1Pl: amos#.
Consequently, a gatekeeper grammar evaluates the probability of all provided base forms together, yielding their joint probability.

However, if the base independence hypothesis is true, then there would be a different possible mechanism for calculating the joint probability of bases. In this case, a sublexicon could have one gatekeeper grammar for each base cell, with each grammar only containing constraints specific to its cell. To arrive at the joint probability of a set of bases, then, one could have each grammar evaluate the probability of its respective base and then multiply these probabilities together. Such a set-up may seem to needlessly complicate the machinery of sublexicons, but by investigating where the predictions of these two configurations differ, it can be shown that in fact this latter set-up tightly restricts the space of possible grammars.

Predictions of the base independence hypothesis

In attempting to falsify the base independence hypothesis, it is useful to consider what sorts of patterns a model of grammar abiding by that hypothesis would be fundamentally incapable of predicting. One such type of pattern would require cross-base constraint conjunctions, i.e. constraints in the gatekeeper grammar which conjoin conditions on different bases. One example might be a constraint that evaluates to 1 if and only if the provided NomSg form ends in [-o] and the provided GenSg form ends in [-i]. If the base independence hypothesis is implemented using a gatekeeper grammar that has a different “sub-grammar” for each base cell, and which then simply multiplies these sub-grammars’ probabilities of the provided base forms, then there is no way that a constraint referring to multiple base forms could be included in such a model, as it would be incapable of ever assessing a violation.

Such constraints are not necessarily required for a grammar to express the implicational relationships in Icelandic noun inflection covered in the previous sections.
For the Icelandic exponents under discussion, knowledge of a lexeme’s GenSg form and its NomPl form can combine additively, as shown in the linear model in Figure 3.15: there is no significant interaction term, either positive or negative, and so participants did indeed exhibit this purely additive behavior. In gatekeeper grammars, which are formally similar to logistic regression models in that they allow linear combinations of weights, the same logic would apply. Multiplying probabilities of independently evaluated bases would be equivalent to summing their weights together within a single grammar, and so the single-gatekeeper and one-gatekeeper-per-cell configurations would not differ substantially in their predictions.

But what sort of inflectional system might require such a cross-base constraint conjunction? Consider the toy system in Figure 3.17, which shows the suffixes characteristic of the 1st, 2nd, and 3rd person forms of verbs in three conjugation classes of a hypothetical language. The crucial aspect of this dataset is that each of the cells has only two possible suffixes, but the cell in which a class’s suffix differs from that of the other two classes varies across the three inflectional classes.

| | 1Sg | 2Sg | 3Sg |
|---|---|---|---|
| Class 1 | -a | -i | -an |
| Class 2 | -a | -e | -en |
| Class 3 | -o | -i | -en |

Figure 3.17: A schematic of an inflectional system which would be able to make use of cross-base constraint conjunctions. Suffixes shown in boldface are those referred to in the hypothetical inference task below.

To see why this dataset might require a cross-base constraint conjunction, we can suppose that a speaker is attempting to predict the 3Sg form of a lexeme from its 1Sg and 2Sg forms. The 1Sg and 2Sg forms known to the speaker take the suffixes -a and -i, respectively. Let us suppose also that the three classes are equally frequent in the lexicon.
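To an analyst, it would be clear given the inflectional system that this lexeme should belong to Class 1 and therefore should take the 3Sg suffix -an, and one might predict that native speakers would share a strong judgment to this effect.

However, applying the base independence hypothesis to this scenario predicts less certainty on the part of the speaker. Equations 3.7 and 3.8 below show calculations of each 3Sg suffix candidate’s pre-normalization conditional probability conditioned on the provided 1Sg and 2Sg base forms. These calculations even assume conservatively that the grammars generating conditional probabilities produce nearly categorical judgments (probabilities of 1.0, 0.5, or 0.0) by mirroring the implicational relationships seen in Figure 3.17. Because the three classes are equally frequent, the prior probabilities of the 3Sg suffixes -an and -en are $0.\overline{3}$ and $0.\overline{6}$ (i.e. 1/3 and 2/3), respectively.

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{an} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&\propto p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{3Sg}=\mathrm{an}) && \text{[Bayes's theorem]} \\
&= p(\mathrm{1Sg}=\mathrm{a} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{3Sg}=\mathrm{an}) && \text{[BIH]} \\
&= 1.0 \times 1.0 \times 0.\overline{3} \\
&= 0.\overline{3}
\end{aligned}
\tag{3.7}
$$

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{en} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&\propto p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{en})\; p(\mathrm{3Sg}=\mathrm{en}) && \text{[Bayes's theorem]} \\
&= p(\mathrm{1Sg}=\mathrm{a} \mid \mathrm{3Sg}=\mathrm{en})\; p(\mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{en})\; p(\mathrm{3Sg}=\mathrm{en}) && \text{[BIH]} \\
&= 0.5 \times 0.5 \times 0.\overline{6} \\
&= 0.1\overline{6}
\end{aligned}
\tag{3.8}
$$

These equations make it clear that a 3Sg -an suffix is more probable given the provided base forms if assuming the base independence hypothesis, but the difference in the probability assigned to a 3Sg -an suffix in equation 3.7 and that assigned to a 3Sg -en suffix in equation 3.8 is not substantial enough to recapitulate the analyst’s nearly categorical judgment that the 3Sg suffix in this case should be -an. In fact, as shown in equation 3.9, the probability assigned to a 3Sg -an suffix by a Bayesian model assuming the base independence hypothesis is only $0.\overline{6}$.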
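Note here that $Z$ is a normalization constant which ensures that the probabilities of the 3Sg candidates form a proper probability distribution by summing to 1; $Z$ is equal to the sum of the values calculated in equations 3.7 and 3.8, which themselves are not probabilities because they do not sum to 1.

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{an} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&= p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{3Sg}=\mathrm{an}) / Z && \text{[Bayes's theorem]} \\
&= 0.\overline{3} / (0.\overline{3} + 0.1\overline{6}) \\
&= 0.\overline{6}
\end{aligned}
\tag{3.9}
$$

However, this mismatch between intuitive and predicted probability distributions is crucially due to application of the base independence hypothesis, not due only to the use of Bayes’s theorem. (Note also that this model is not recapitulating the lexical frequencies of 3Sg exponents: it is -an that receives a probability of $0.\overline{6}$, not -en.) Equations 3.10 and 3.11 show the pre-normalization conditional probabilities of 3Sg -an and -en, respectively, without the base independence hypothesis but otherwise under the same assumptions. The probability of 3Sg taking the suffix -en in this case is predicted to be zero, meaning that the probability of the suffix -an is 1.0, matching the intuition of an analyst applying knowledge of the inflectional system in Figure 3.17 to the inference task. Equation 3.12 shows this calculation.

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{an} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&\propto p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{3Sg}=\mathrm{an}) && \text{[Bayes's theorem]} \\
&= 1.0 \times 0.\overline{3} \\
&= 0.\overline{3}
\end{aligned}
\tag{3.10}
$$

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{en} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&\propto p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{en})\; p(\mathrm{3Sg}=\mathrm{en}) && \text{[Bayes's theorem]} \\
&= 0.0 \times 0.\overline{6} \\
&= 0.0
\end{aligned}
\tag{3.11}
$$

$$
\begin{aligned}
p(\mathrm{3Sg}=\mathrm{an} \mid \mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i})
&= p(\mathrm{1Sg}=\mathrm{a},\, \mathrm{2Sg}=\mathrm{i} \mid \mathrm{3Sg}=\mathrm{an})\; p(\mathrm{3Sg}=\mathrm{an}) / Z && \text{[Bayes's theorem]} \\
&= 0.\overline{3} / (0.\overline{3} + 0.0) \\
&= 1.0
\end{aligned}
\tag{3.12}
$$

We can now consider the distinction between these two scenarios, one with the base independence hypothesis and one without, in the constraint-based terms of MaxEnt harmonic grammars. If constraints are restricted to referring to only one base each, then the relevant constraints would be [1Sg: -a] and [2Sg: -i].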
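As an aside, the arithmetic of equations 3.7 through 3.12 can be checked with a short sketch. It assumes, as the text does, that the three classes of Figure 3.17 are equally frequent and that conditional probabilities simply mirror class membership; all function names are illustrative and are not part of the accompanying Python package.

```python
from fractions import Fraction

# Toy system from Figure 3.17: class -> (1Sg, 2Sg, 3Sg) suffixes.
CLASSES = {
    "Class 1": ("a", "i", "an"),
    "Class 2": ("a", "e", "en"),
    "Class 3": ("o", "i", "en"),
}


def members(suffix_3sg):
    """Classes whose 3Sg suffix is suffix_3sg."""
    return [c for c in CLASSES.values() if c[2] == suffix_3sg]


def prior(suffix_3sg):
    """p(3Sg suffix), assuming equally frequent classes."""
    return Fraction(len(members(suffix_3sg)), len(CLASSES))


def likelihood_joint(one_sg, two_sg, suffix_3sg):
    """p(1Sg, 2Sg | 3Sg), evaluated jointly (no independence assumption)."""
    group = members(suffix_3sg)
    hits = sum(1 for c in group if c[0] == one_sg and c[1] == two_sg)
    return Fraction(hits, len(group))


def likelihood_independent(one_sg, two_sg, suffix_3sg):
    """p(1Sg | 3Sg) * p(2Sg | 3Sg), as the base independence hypothesis allows."""
    group = members(suffix_3sg)
    p_one = Fraction(sum(1 for c in group if c[0] == one_sg), len(group))
    p_two = Fraction(sum(1 for c in group if c[1] == two_sg), len(group))
    return p_one * p_two


def posterior(one_sg, two_sg, likelihood):
    """Normalize likelihood * prior over the 3Sg candidates."""
    candidates = sorted({c[2] for c in CLASSES.values()})
    scores = {s: likelihood(one_sg, two_sg, s) * prior(s) for s in candidates}
    z = sum(scores.values())  # the normalization constant Z
    return {s: score / z for s, score in scores.items()}


with_bih = posterior("a", "i", likelihood_independent)
without_bih = posterior("a", "i", likelihood_joint)
```

Exact fractions are used so that the repeating decimals in the equations appear as 1/3, 1/6, and 2/3 rather than as rounded floats.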
In these constraints, the portion to the left of the colon indicates the base cell whose form is evaluated, and the portion to the right of the colon indicates the material whose presence in the selected base cell’s form results in the constraint evaluating to 1.

With access only to these two constraints (but not their conjunction), the grammar would be limited to evaluating the violation profiles of the 1Sg and 2Sg forms separately, tantamount to evaluating their conditional probabilities independently.[2] Accordingly, the grammar would fail to assign these bases a near-zero joint probability when conditioned on the sublexicon corresponding to the 3Sg taking the -en suffix.

[2] Such a constraint set would make the grammar functionally equivalent to the naïve Bayes formalism used in statistics, information retrieval, and machine learning (Lewis, 1998).

The cross-base constraint conjunction which would be useful for such a system is [1Sg: -a & 2Sg: -i], which evaluates to 1 only if both its requirements are met. With access to this feature, the MaxEnt grammars would be able to assign the corresponding base forms conditional probabilities much closer to those shown in equations 3.10 and 3.11, paralleling the analyst’s intuition of a (nearly) categorical prediction.

The constraint-based interpretation of the base independence hypothesis, in which the hypothesis explicitly forbids cross-base constraint conjunctions, makes the learnability-theoretic appeal of the base independence hypothesis clear. The search space for phonological constraints within a single set of forms is formidably large (Hayes & Wilson, 2008).
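Returning briefly to the MaxEnt comparison above, the effect of the conjoined constraint can be illustrated numerically. The sketch below is not the gatekeeper grammar implementation: the weights are hand-picked rather than learned, the candidate space is reduced to the four suffix pairs of the toy system, and the convention that a form's score is the exponential of its signed weighted constraint sum is an assumption. It models the sublexicon whose 3Sg suffix is -en.

```python
import math
from itertools import product

ONE_SG_SUFFIXES = ("a", "o")
TWO_SG_SUFFIXES = ("i", "e")


def constraint_values(one_sg, two_sg):
    """Each constraint evaluates to 1 when its condition is met, else 0."""
    return {
        "1Sg: -a": int(one_sg == "a"),
        "2Sg: -i": int(two_sg == "i"),
        "1Sg: -a & 2Sg: -i": int(one_sg == "a" and two_sg == "i"),
    }


def maxent_distribution(weights):
    """Probability of each (1Sg, 2Sg) suffix pair: exp(weighted sum) / Z."""
    scores = {}
    for pair in product(ONE_SG_SUFFIXES, TWO_SG_SUFFIXES):
        values = constraint_values(*pair)
        scores[pair] = math.exp(sum(w * values[name] for name, w in weights.items()))
    z = sum(scores.values())
    return {pair: score / z for pair, score in scores.items()}


# Per-cell constraints only. Zero weights reproduce the marginals of 0.5
# assumed in equation 3.8; the joint distribution necessarily factorizes.
per_cell_only = maxent_distribution({"1Sg: -a": 0.0, "2Sg: -i": 0.0})

# Adding the cross-base conjunction with a strongly negative (hand-picked)
# weight lets the grammar drive the unattested pair (-a, -i) toward zero.
with_conjunction = maxent_distribution(
    {"1Sg: -a": 10.0, "2Sg: -i": 10.0, "1Sg: -a & 2Sg: -i": -40.0}
)
```

With per-cell constraints alone, the pair (-a, -i) receives probability 0.25, the product of the two marginals; with the conjunction, its probability becomes negligible while the attested pairs (-a, -e) and (-o, -i) each remain close to 0.5.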
If the space of possible constraints also includes cross-base constraint conjunctions, then the size of this search space is multiplied by the number of base cells whose constraints could conceivably be conjoined; at the very least, assuming that constraint conjunctions maximally conjoin constraints on two bases, this allowance expands the search space from $n$ constraints to $n^2$ constraints. The following subsections describe an experiment intended to test whether cross-base constraint conjunctions are indeed necessary, designed in the hopes of demonstrating that linguists need not worry about this potential explosion of the constraint search space.

3.4.2 Testing the base independence hypothesis

The previous subsection has described a hypothetical inflectional system for which a Bayesian model of morphology adhering to the base independence hypothesis would make substantially different predictions from a similar model without the base independence hypothesis. As with the single surface base hypothesis, the next questions are whether such patterns exist in the inflectional systems of natural languages, and whether speakers’ judgments are consistent with the hypothesis’s predictions.

The first question, at least, I can answer with a confident “yes”: this subsection describes a slice of the Polish nominal inflectional system which qualitatively parallels the toy example in Figure 3.17. I motivate and describe an experiment on native speakers of Polish designed to test the base independence hypothesis on the basis of this pattern, largely following in the methodological footsteps of the Icelandic experiment. Ultimately, the experiment fails to falsify the base independence hypothesis, but more investigation is necessary before concluding that the base independence hypothesis holds true in general, especially given some unexpected behavior patterns among participants.

Polish soft declensions

Nouns in Polish inflect for case and number, with seven cases and two numbers (Schenker, 1955).
Of particular present interest are the nominative and genitive forms, whose abbreviations in the singular and plural are shown in Figure 3.18.

| | Nominative | Genitive |
|---|---|---|
| Singular | NomSg | GenSg |
| Plural | NomPl | GenPl |

Figure 3.18: The two cases of immediate interest and two numbers of Polish nouns, as well as their abbreviations. The other cases are accusative, locative, dative, instrumental, and vocative.

As in Icelandic, nouns in Polish carry information about grammatical gender, and inflectional classes (as defined by constellations of suffixes) and gender interact in complex but predictable ways. Each of the three genders (masculine, feminine, and neuter) corresponds to multiple inflectional classes, but one commonality is that each gender minimally corresponds to a hard class and a soft class of nouns, with this distinction based on whether the final consonant of the lexeme stems of a class is phonologically “hard” (typically non-palatalized) or “soft” (typically palatalized).

| Case | Hard feminine, Singular | Hard feminine, Plural | Soft feminine, Singular | Soft feminine, Plural |
|---|---|---|---|---|
| Nominative | mapa | mapy | granica | granice |
| Accusative | mapę | mapy | granicę | granice |
| Genitive | mapy | map | granicy | granic |
| Locative | mapie | mapach | granicy | granicach |
| Dative | mapie | mapom | granicy | granicom |
| Instrumental | mapą | mapami | granicą | granicami |
| Vocative | mapo | mapy | granico | granice |

Figure 3.19: The full inflectional paradigms of nouns MAP map- and BORDER, LIMIT granic-, representative of the hard feminine and soft feminine inflectional classes, respectively. Paradigms are shown orthographically, with suffixes given in bold. Forms other than the GenSg, GenPl, NomPl, and DatPl are shown here only to give an overall sense of the inflectional system.

For purposes of testing the base independence hypothesis, I focus now on the GenSg, GenPl, and NomPl forms of the three genders’ soft declensions, as these three stand in a relationship equivalent to the hypothetical system sketched previously. Figure 3.20 shows the suffixes associated with each of these cells in the three classes.
Note that the patterns of similarity and difference here perfectly parallel those shown in the toy dataset in Figure 3.17. GenSg, GenPl, and NomPl take the place of 1Sg, 2Sg, and 3Sg, and the NomPl exponents -a and -e take the place of the hypothetical (but coincidentally similar) -an and -en. The two exponents each of the GenSg and GenPl are distributed with respect to NomPl exponents in the same way as those of the 1Sg and 2Sg in Figure 3.17 were distributed with respect to the 3Sg.

| | GenSg | GenPl | NomPl |
|---|---|---|---|
| Soft neut. | -a | ∅ | -a |
| Soft masc. | -a | -y | -e |
| Soft fem. | -y | ∅ | -e |

Figure 3.20: The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.

Because this portion of the Polish nominal inflectional system exhibits the same properties of inter-predictiveness as the toy dataset in Figure 3.17, Polish shows potential as a testing ground for the base independence hypothesis. The task of predicting the NomPl of a lexeme from knowledge that its GenSg takes the suffix -a and its GenPl has a null suffix (i.e. when the lexeme belongs to the soft neuter class) serves as the key test case. The rest of this subsection describes an experiment designed to test whether Polish speakers in this situation prefer the -a suffix for the lexeme’s NomPl more often than would be expected under the base independence hypothesis.

Methodology

This experiment was carried out using a methodology very similar to that used for the Icelandic experiment described in 3.3.1. A total of 219 Polish native speaker participants were recruited by posts on mailing lists and through word of mouth. The experiment itself was carried out entirely online using the Experigen (Becker & Levine, 2012) framework.

On the Experigen web interface, participants were given information about the experiment and were then provided a consent form to electronically sign.
Each participant who consented engaged next in two non-randomized practice trials, after which she or he completed forty-eight test trials, one for each lexeme. Finally, all consenting participants filled out a demographic questionnaire asking for non-identifying personal information. All parts of the experiment were presented in Polish, as translated from English by a native speaker of Polish. The English translation of this questionnaire can be found in the appendix.

Each trial concerned a single novel lexeme designed to resemble existing Polish noun lexemes. A participant would first be exposed to some number of inflected forms of the novel lexeme in carrier sentences. The task was then to select one preferred NomPl form of the lexeme out of a pair of options, one taking an -a suffix and the other taking an -e suffix. The order of these response options was randomized. The participant’s choice of NomPl form was recorded as the key experimental measure.

The information about a novel lexeme presented to a participant before the NomPl choice task varied according to each stimulus frame’s presentation condition. This variable ranged across the four options shown below, which correspond to the inflected forms shown to participants before they were asked to select a NomPl form. The DatPl form, which conveys no information about which NomPl suffix is most appropriate, is always provided to introduce participants to each novel lexeme and encourage them to think of it as an existing Polish noun.

1. DatPl [uninformative]
2. DatPl, then GenSg [somewhat informative]
3. DatPl, then GenPl [somewhat informative]
4. DatPl, then GenSg, then GenPl [maximally informative]

Figure 3.21: The four presentation conditions of the Polish experiment.

Novel lexeme stems, 48 in total, were assigned randomly to the three inflectional classes: soft masculine, soft feminine, and soft neuter. These stems were designed so as to minimally influence judgments about their appropriate inflectional class or gender.
Specifically, all stems were of the shape CVCVC, with their last vowels and final consonants drawn from the sets {i, y, o, a} and {ń, ś, ź, ć}, respectively. This set of final consonants was chosen in collaboration with a natively Polish-speaking linguist so as to include a phonologically diverse (yet small) set of unambiguously soft consonants, and similarly the set of vowels was primarily chosen so as to eliminate concerns about potential stem changes, e.g. yer deletion (Jarosz, 2005; Scheer, 2012). Three stems were generated from each combination of vowel and final consonant, yielding a total of 48 stems. The three stems with the same vowel-consonant combination were always assigned to different inflectional classes. To minimize the risk of stems being similar enough to existing stems that the existing lexemes’ inflection could unduly affect judgments about the novel stems, I implemented a script that rejected any stems with an edit distance of 1 from any existing stem in a publicly available one-million-word subcorpus of the National Corpus of Polish (Przepiórkowski et al., 2010). The aforementioned Polish-speaking linguist also performed a similar form of filtering using her own judgments of similarity. A complete list of stimuli can be found in the appendix.

(Pseudo-)randomization of stimuli was performed in two ways. The order of novel lexemes was first itself randomized. The assignment of lexemes to presentation conditions also varied across participants, but rather than being randomized, Experigen cycled through three possible sets of pairings of lexeme and presentation condition from each participant to the next.
Each of these sets of pairings distributed lexemes of the various inflectional classes evenly across the three presentation conditions, such that there were always exactly twelve lexemes in each inflectional class in each presentation condition, one for each consonant-vowel combination.

The DatPl, GenSg, and GenPl forms, when presented, were always given inside a carrier sentence designed to make it clear which paradigm cell the presented form belongs to. These sentences were carefully designed so as not to provide any additional information about gender. The NomPl was also elicited using a carrier sentence, which provided an underscore in the position of the requested NomPl form and which also provided no extra information about gender. After the presentation of each inflected form, participants pressed a button to continue on to the next inflected form or finally to the NomPl choice task. This button-based procedure was added to encourage participants to consider the information provided by each inflected form. Inflected forms themselves were bolded to make them stand out against the carrier sentences.

Together, all of the carrier sentences formed a short narrative about children’s toys, intended to allow participants to think of novel lexemes as the names of obscure Polish toys. Toys were chosen as the topic because toy words in Polish are not restricted to having specific genders, as are e.g. animal names. The narrative was coherent regardless of which sentences were included. The purpose of this design choice, together with the explicit instruction before the test trials to think of novel words as real but rare Polish nouns, was to encourage participants to use their knowledge of inflectional patterns in existing Polish words when performing the NomPl choice task.
All presentation and elicitation was performed using orthographic representations, and no recordings were played or made at any time. Frame sentences are provided in Polish with English translations in the appendix.

Results

The base independence hypothesis predicts that participants given all three base forms will perform no better than predicted by the summed effects of their gains in accuracy attributable to knowing the GenSg form or the GenPl form separately. Indeed, as the visual summary of responses in figure 3.22 sketches, success rates in trials including all three base forms were no higher than those presenting only two base forms. Unexpectedly, however, there were no measurable differences in accuracy at all across any of the four presentation conditions.

[Plot: proportion correct (y-axis, 0.00 to 1.00) by presentation condition (x-axis: DatPl, DatPl+GenSg, DatPl+GenPl, DatPl+GenSg+GenPl).]

Figure 3.22: Participants’ proportions of “correct” responses in the Polish experiment by presentation condition. Vertical bars show 95% confidence intervals, and horizontal bars show quartile values. Color-coded probability density functions show kernel estimations of the underlying distributions.

These results do not indicate that participants never made use of information contained in presented base forms. Rather, the apparent lack of any differences here is due to considerable discrepancies in behavior when participants were presented with lexemes belonging to the three gender classes (even though participants needed to infer each lexeme’s gender). Recall the organization of the key exponents in Polish, as repeated in figure 3.23.

              GenSg   GenPl   NomPl
Soft neut.    -a      ∅       -a
Soft masc.    -a      -y      -e
Soft fem.     -y      ∅       -e

Figure 3.23: The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.

While the base independence hypothesis, if valid, should hold true for the entirety of the inflectional system, it is the neuter lexemes that serve as the clearest test case for the hypothesis. This is because only among neuters (out of the three classes shown here) does neither the GenSg exponent nor the GenPl exponent uniquely identify a lexeme’s gender: knowing that a lexeme’s GenSg ends in -y suffices to categorize a lexeme as feminine, while knowing that a lexeme’s GenPl ends in -y suffices to categorize a lexeme as masculine. The appropriate question to ask, then, is whether participants are more likely to select an -a NomPl form when provided both base forms than would be expected by simply combining the effects of knowing the GenSg or GenPl form individually. In other words, is the amount by which participants improve from the baseline (only DatPl) when given both the GenSg and GenPl attributable purely to the simple combination of improvements seen when provided each one of these forms separately? Figure 3.24 shows rates of selecting the correct NomPl form (-a) among neuter lexemes, aggregated by presentation condition.

[Plot: proportion correct (y-axis, 0.00 to 1.00) by presentation condition (x-axis: DatPl, DatPl+GenSg, DatPl+GenPl, DatPl+GenSg+GenPl), neuter-class items only.]

Figure 3.24: Participants’ proportions of “correct” (-a NomPl) responses for neuter-class items in the Polish experiment by presentation condition. Vertical bars show 95% confidence intervals, and horizontal bars show quartile values. Color-coded probability density functions show kernel estimations of the underlying distributions.

Even when looking at just the neuter lexemes, participants never demonstrate highly accurate “superadditive” behavior when provided both the GenSg and GenPl, a result consistent with the base independence hypothesis.
Even so, participants’ behavior within gender classes also defied many expectations, just as did their behavior in the aggregate as shown in figure 3.22, and so I conclude that it is premature to take these results as clearly validating the base independence hypothesis.³ The remainder of this section serves to motivate these claims statistically and explore possible explanations of these results.

In order to provide statistical backing to these claims and to further formalize and test them, I analyzed the results of this experiment in R (R Core Team, 2013) using generalized linear mixed effect models (GLMMs) as implemented by the glmer function in the lme4 package (Bates et al., 2015b), just as I did for the results of the Icelandic experiment described in 3.3.2. Again in parallel to the Icelandic experiment, the dependent variable is a binary variable indicating whether the participant’s chosen NomPl form is the same as the lexeme’s intended (“correct”) NomPl form. Consequently, the model used logistic regression by setting the family parameter to binomial. Fixed effects are included for binary variables indicating whether the GenSg form was presented (knew.gensg) and whether the GenPl form was presented (knew.genpl), as well as their interaction (knew.gensg:knew.genpl). By-participant and by-lexeme random intercepts are also included, but random slopes were excluded due to model convergence issues; see 3.3.2 for details of the motivations behind these choices.

GLMMs are a natural fit to the research question addressed here. With knowledge of only the DatPl as the baseline (corresponding to the intercept), the effects of knew.gensg and knew.genpl correspond to the increase in likelihood of selecting the correct NomPl gained by knowing the GenSg form and the GenPl form separately, that is, independent of each other.
To falsify the base independence hypothesis, Polish speakers’ judgments when provided joint knowledge of the GenSg and GenPl forms would need to differ substantially from those predicted by combining effects of knew.gensg and knew.genpl. In other words, to falsify the base independence hypothesis, speakers would need to demonstrate a superadditive effect between knowledge of the GenSg and GenPl, which in a GLMM would correspond to a significantly positive interaction term for the two predictors, the variable knew.gensg:knew.genpl.

³ Of course, because the base independence hypothesis serves as the null hypothesis in this experiment, strictly speaking no result would be able to prove base independence. It may be possible to reframe the research question so as to cast base independence as a falsifiable alternative hypothesis with “base dependence” as a falsifiable null hypothesis, but given that these experimental results lack clear interpretations, I content myself with demonstrating that participant behavior is not incompatible with the base independence hypothesis as formulated.

Figure 3.25 shows the results of fitting such a GLMM to the Polish experimental data. Not only is there no significant interaction between the fixed effects, but the fixed effects themselves do not achieve significance. The p-values of knew.gensg, knew.genpl, and their interaction are 0.6061, 0.151, and 0.8142, respectively. The effect sizes are also small compared, for example, to those from the Icelandic experiment shown in Figure 3.15.

                          Dependent variable: correct
knew.gensg                 0.036     (0.069)
knew.genpl                 0.099     (0.069)
knew.gensg:knew.genpl     −0.023     (0.098)
(Intercept)                0.500**   (0.214)

Observations               10,841
Log Likelihood             −5,326.419
Akaike Inf. Crit.          10,664.840
Bayesian Inf. Crit.        10,708.580
Note: *p<0.1; **p<0.05; ***p<0.01

Figure 3.25: A GLMM with maximum likelihood coefficients predicting whether a participant’s selected NomPl corresponded to the correct NomPl.
Values listed for each predictor are their coefficient estimates, and associated values in parentheses are their standard errors. Table created using Stargazer (Hlavac, 2013).

Figure 3.26 shows similar models fit to data only for each of the three gender classes. Notably, the dependent variable in these models is not whether a participant selected the “correct” NomPl exponent, but rather simply which NomPl exponent was chosen. This variable guessed.suffix evaluates to 1 when a participant chooses -e and to 0 when a participant chooses -a, meaning that more positive coefficients indicate a greater propensity to select e-final NomPl forms rather than a-final forms. The reason for this change is that each gender class has a specific NomPl exponent that is “correct”, and so the two dependent variable schemes are mathematically equivalent to each other (except for signs on coefficients), but using guessed.suffix makes it easier to see which effects were of which sign across the three gender classes. Assuming the associations between NomPl suffix and gender class shown in figure 3.23, more positive coefficients can therefore also be thought of as indicating beliefs that a lexeme is masculine or feminine, as opposed to neuter.

                          Dependent variable: guessed.suffix == -e
                          neuter        masculine     feminine
knew.gensg                −0.127        0.360**       −0.254*
                          (0.148)       (0.145)       (0.134)
knew.genpl                −1.092***     −0.030        −0.672***
                          (0.137)       (0.137)       (0.129)
knew.gensg:knew.genpl     −0.032        −0.402**      0.191
                          (0.190)       (0.199)       (0.180)
Intercept                 2.593***      2.240***      2.276***
                          (0.194)       (0.170)       (0.159)

Observations              3,614         3,613         3,614
Log Likelihood            −1,581.124    −1,396.733    −1,663.658
Akaike Inf. Crit.         3,174.248     2,805.466     3,339.316
Bayesian Inf. Crit.       3,211.403     2,842.620     3,376.471
Note: *p<0.1; **p<0.05; ***p<0.01

Figure 3.26: GLMMs with maximum likelihood coefficients predicting, for each subset of the data of a particular gender class, which NomPl suffix participants selected.
Values listed for each predictor are their coefficient estimates, and associated values in parentheses are their standard errors. Table created using Stargazer (Hlavac, 2013).

Looking for now only at the results for the neuter lexemes in figure 3.26, participant behavior is more in line with lexical patterns than the overall results in figure 3.25. The coefficient for knew.genpl is significant and negative, indicating that knowledge of a neuter lexeme’s GenPl form improves participants’ accuracy in selecting the -a NomPl exponent. However, the main effect for knew.gensg is not significant; it is not clear what could be causing this difference between the two effects. Most importantly, their interaction knew.gensg:knew.genpl is not significant and has a small coefficient estimate. This result is consistent with the base independence hypothesis.

Note that all three gender classes’ models have a significant intercept with a large, positive coefficient. These coefficients indicate that when only a lexeme’s DatPl is presented (in which case participants should have no ability to discriminate among the three gender classes), there is a strong bias toward selecting the NomPl form ending in -e. Moreover, for neuter nouns, even when the coefficient estimates for the two main effects and their interaction are added up, i.e. modeling the DatPl+GenSg+GenPl condition, the sum of the resulting negative value −1.251 and the intercept 2.593 still yields a positive value. Therefore even when presented with bases which should allow participants to always select the “correct” -a suffix for these forms, participants still tend to prefer selecting the e-final forms. Figure 3.24 from earlier in this section can be seen as a visualization of this result: note that even in the DatPl+GenSg+GenPl condition, participants still chose the “correct” -a suffix less than 29% of the time.
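The arithmetic behind this observation can be retraced with a short Python sketch. It uses only the neuter-class fixed-effect estimates reported in figure 3.26; the function and variable names are illustrative and are not part of the analysis scripts, and the random intercepts are assumed to be zero, so the resulting probability describes a hypothetical average participant and lexeme rather than the aggregate proportions plotted in figure 3.24.

```python
import math

# Fixed-effect estimates for the neuter-class model, copied from figure 3.26.
# The outcome is coded 1 for an -e NomPl response and 0 for an -a response.
NEUTER = {
    "intercept": 2.593,
    "knew.gensg": -0.127,
    "knew.genpl": -1.092,
    "knew.gensg:knew.genpl": -0.032,
}


def inv_logit(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def log_odds_e(coefs, knew_gensg, knew_genpl):
    """Log-odds of an -e response from the fixed effects alone
    (assumption: both random intercepts are set to zero)."""
    return (
        coefs["intercept"]
        + coefs["knew.gensg"] * knew_gensg
        + coefs["knew.genpl"] * knew_genpl
        + coefs["knew.gensg:knew.genpl"] * knew_gensg * knew_genpl
    )


baseline = log_odds_e(NEUTER, 0, 0)  # DatPl only
full = log_odds_e(NEUTER, 1, 1)      # DatPl + GenSg + GenPl

print(round(full - baseline, 3))  # combined shift from both bases: -1.251
print(round(full, 3))             # remaining log-odds of -e: 1.342
print(round(inv_logit(full), 2))  # probability of -e, still well above 0.5
```

Because the remaining log-odds are positive, the sketch reproduces the qualitative point in the text: even with both informative bases, the modeled preference is for the e-final form.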
As the next chapter covers in detail, this bias toward e-final forms can be viewed as an effect of lexical frequencies interfering with speakers’ ability to make fully productive use of even categorical implicational relationships in their inflectional systems.

However, not all patterns of behavior shown by these models have such convenient explanations. As shown in the model of feminine lexeme responses, knowing their GenSg or GenPl forms actually decreased participant accuracy in selecting the -e NomPl suffix consistent with this gender class. Among masculine lexemes, knowledge of the GenSg form (but not the GenPl form) improved rates of selecting the appropriate -e NomPl suffix, but knowledge of both GenSg and GenPl resulted in lower accuracy than that achieved with only the GenSg. Recall that the phonological/orthographic shapes of the stems used in the experiment indicated specifically that lexemes used must belong to a soft declension, and that no other soft classes have distributions of suffixes that participants could plausibly be overapplying. These irregularities suggest that unidentified confounds may have marred the design of this experiment.

As in the Icelandic experiment, one of the questions in the demographic questionnaire presented at the end of the experiment asked participants whether they had ever taken Polish language or linguistics classes since secondary school. The two sub-populations did not differ in their patterns of behavior, in general or just for neuter lexemes, either from each other or from the combined population. Even significance levels were identical across the three sets of participants, indicating that advanced prescriptive knowledge of Polish grammar was not a likely source of any of these response patterns.

Overall, the lack of significantly positive interactions both in the response data overall and within responses for only neuter lexemes is consistent with the base independence hypothesis.
Even so, participants’ behavior deviated substantially in other ways from what one would expect based on the relevant lexical patterns. These aberrations suggest that the results should not be interpreted so straightforwardly as validating the base independence hypothesis, but leave open the possibility that future experiments might build on this methodology to address the question in a more conclusive way.

3.5 Summary and discussion

This chapter has presented an investigation of the ways that humans make use of known inflected base forms of a lexeme when predicting unknown derivative forms of the same lexeme. The general procedure I have taken has been to describe a simple, falsifiable baseline model of such inference and then test experimentally, for each increase in model complexity, whether native speakers’ linguistic behavior justifies adding this complexity, that is, whether the behavior is consistent with the more complex model but not with the baseline.

The modeling choices under consideration in this chapter are derived from the relationships between base and derivative forms in Bayesian, surface-oriented approaches to inflectional morphology, specifically sublexical morphology. Equation 3.13 repeats the fundamental equation of sublexical morphology as an illustration of these relationships.

$$p(S \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid S)\,p(S) \tag{3.13}$$

The first baseline tested was a probabilistic interpretation of the single surface base hypothesis (Albright, 2002), which posits that only a single base form’s shape can be referred to when performing inflectional inference, and moreover that this privileged base is the same paradigm cell regardless of the nature of the inference task.
In terms of equation 3.13, this hypothesis can be thought of probabilistically as defining its left-hand side as equal to $p(S \mid B_{\text{privileged}})$, where the cell of $B_{\text{privileged}}$ is defined as invariant across an entire inflectional system.

This chapter presented experimental evidence from an Icelandic wug test that this hypothesis cannot account for the inferential abilities of Icelandic speakers, and therefore that the hypothesis is not valid cross-linguistically. In the experiment, Icelandic speakers exhibited an ability not only to make use of information from both GenSg and NomPl forms when predicting AccPl forms, but also to combine information from both bases in a single inference task. I concluded, then, that the equivalence of the left-hand side of equation 3.13 to $p(S \mid B_{\text{privileged}})$ does not hold.

This result prompted a follow-up question: are there any principled limitations on how speakers can combine information from various available base forms? Making use of the probabilistic, Bayesian setting of sublexical morphology, I proposed one such possible limitation. The base independence hypothesis constitutes a simplifying assumption about the left term of the right-hand side of equation 3.13, i.e. the term corresponding to the base probabilities assigned by sublexicons’ MaxEnt gatekeeper grammars. According to the base independence hypothesis, probabilities of bases given a sublexicon are independent of each other, and so their joint conditional probability is simply the product of their individual conditional probabilities, as shown in equation 3.14.

$$p(B_1, B_2, \ldots, B_n \mid S) = p(B_1 \mid S)\,p(B_2 \mid S) \cdots p(B_n \mid S) \tag{3.14}$$

To test the base independence hypothesis, I proposed an experiment similar to that performed with Icelandic speakers, but instead targeting judgments pertaining to a relevant pattern in Polish.
While the results of this experiment could be generously interpreted as a failure to falsify the hypothesis, unexpected response patterns suggest that the experiment may not have been designed in a way that appropriately tests it.

From these empirical data, I conclude tentatively that while speakers of languages with inflectional morphology are able to combine information from multiple base forms when performing inference of unknown inflected forms, there is no conclusive evidence of hard or soft restrictions on the ways that such information can be combined. The complexity of speakers’ morphological knowledge therefore poses challenges to linguists interested in modeling knowledge of inflectional morphology, both from the standpoint of creating a parsimonious theory and from the standpoint of designing an efficient learning algorithm. Even so, I hope that others will take up the mantle of testing the base independence hypothesis and proposing other testable hypotheses about the limits of inflectional knowledge and morphological inference.

Chapter 4

Empirical priors

What does it mean for a morphological pattern to be productive? In investigating the implications of a Bayesian view of inflectional morphology, the present chapter addresses this fundamental question of linguistic inquiry. On the basis of experimental evidence from Icelandic and Polish, I conclude that speakers’ apparent failure to apply lexically robust patterns can be fruitfully understood not as exhibiting a lack of productivity, but rather as a mathematically predictable interaction between two competing pressures.
The first of these is the familiar pressure to apply the implicational relationships evidenced in the lexicon, and the second is a simpler one: a heuristic that matches lexical frequencies of morphological exponents without regard for their co-occurrence (or lack thereof) with other exponents. Tying this second pressure to the Bayesian concept of prior probabilities, I show that Bayesian approaches to inflectional morphology predict this interplay, and that sublexical morphology in particular provides a set of tools for understanding these phenomena not only qualitatively, but also quantitatively.

Similar to the preceding chapter, this chapter seeks to empirically validate one specific aspect of the instantiation of Bayes’s theorem that forms the core of sublexical morphology. Equation 4.1 shows the key claim of a Bayesian model of inflectional morphology: that the probability of a derivative form $d$ (among the candidates $D$) given a set of known base forms $B_{\text{cell}}$ is proportional to the likelihood of those base forms given the derivative form times the prior probability $p(D)$ of the derivative form.

$$p(D \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid D)\,p(D) \tag{4.1}$$

The major contribution of sublexical morphology as a specific Bayesian theory of inflectional morphology is to make calculation of the likelihood term and prior probability term mathematically and intuitively straightforward. This theory sets up correspondences between derivative forms $D$ and sublexicons $S$, so that equation 4.1 can be rewritten as shown in equation 4.2.

$$p(S \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid S)\,p(S) \tag{4.2}$$

Crucially for this chapter, the sublexical approach entails the intuitive way to define $p(S)$ shown in equation 4.3: that the prior probability of a sublexicon is proportional to the size of that sublexicon, where size is defined as the number of lexemes associated with it.
Specifically, I describe the priors used in sublexical morphology as empirical priors, since they are based on speakers’ observations about their language.

$$p(S) \propto |S| \tag{4.3}$$

In section 4.1, I use a toy inflectional system to illustrate the differences between a traditional understanding of productivity in inflectional morphology and a Bayesian understanding that incorporates prior probabilities. Returning then to the Icelandic and Polish experiments introduced in chapter 3, sections 4.2 and 4.3 present post hoc analyses of the data from these experiments which demonstrate the influence of empirical prior probabilities on experimental participants’ responses. These sections also address questions of how an empirical prior should be defined. A concluding section summarizes the theoretical and empirical importance of empirical priors and discusses implications.

4.1 Priors in inflection

This section describes the theoretical and empirical implications of Bayesian priors in surface-oriented inflectional morphology. First, I describe a traditional (one could say prescriptive or pedagogical) view of morphological inference. I provide a Bayesian interpretation of this conception of how such inference should proceed. I then generalize this model to allow for regularization, by which priors come to play a greater role in inference, and discuss two types of priors: uniform and empirical.

Figure 4.1 shows a small, simple inflectional system for nouns in a hypothetical language. These nouns have only two forms, a singular and a plural, and each noun belongs to one of three inflectional classes.
I assume for now that these three classes are equal in frequency, and that there are no nouns in the language which do not belong to one of these three classes.

           Singular   Plural
Class 1    -a         -e
Class 2    -o         -e
Class 3    -u         -i

Figure 4.1: A toy nominal inflectional system, showing the “singular” and “plural” forms of nouns in three classes.

One can now consider the implicational relationships within this system formally using the notation of probability theory. Because the singular exponents are all distinct from the plural exponents, I will use a shorthand by which, for example, $p(i \mid u)$ is written to mean $p(\text{plural} = i \mid \text{singular} = u)$, that is, the probability that the plural exponent is [i] given that the singular exponent is [u].

Suppose that a speaker of this language hears the singular form of an unfamiliar noun which ends in [u]. According to a prescriptive or pedagogical view of inflectional morphology, which I will call the traditional view, one might expect this speaker to determine conclusively that this lexeme belongs to inflectional class 3. One would then conclude that the speaker should be able to predict with perfect confidence that its plural form should end in [i]. It is also possible to come to such a conclusion without the intermediate abstraction of inflectional classes, by observing that forms whose singulars end in [u] always (in the lexicon) have plurals ending in [i]. Mathematically, this view of inflectional morphology predicts that the speaker’s grammar should set $p(i \mid u)$ equal to 1.0. Similarly, this view would predict, for example, that $p(i \mid o) = 0.0$.

Such predictions can be understood in a Bayesian way. Recall that according to Bayes’s theorem as applied to surface-oriented morphological inference, $p(D \mid B)$ is proportional to $p(B \mid D)\,p(D)$, i.e. proportional to the likelihood of the base form(s) conditioned on the derivative form times the prior probability of the derivative form. It is helpful to think of the likelihood term and the prior term as serving distinct purposes.
Informally, the likelihood term evaluates how much the newly available information (in this case the base form) shifts the balance of probabilities toward or away from each outcome (derivative), while the prior term implements a bias toward pre-existing probabilities which ignores new information (like the shape of a base form). Crucially, then, the balance between how much the overall system depends on new information like base forms rather than defaulting to prior probabilities depends on the perceived quality of the new information, that is, how useful the system considers that information to be. As an example, the ability of the likelihood term to shift judgments away from those based solely on prior probabilities would be greater when a speaker observes that derivative forms are easily predictable from base forms, as opposed to when a speaker considers base forms poor predictors of derivative forms.

For now, it should suffice to think in the abstract of regularization as a pressure which forces a new-information-sensitive likelihood term toward conservatism, in the sense of preventing it from “making strong judgments” on the basis of new information. The more regularization the system exhibits, the less the system will make use of the newly available information and the more it will rely on its prior probabilities. In practice, such regularization is a critical ingredient in creating models which are able to generalize effectively to novel data rather than “over-learning” accidental regularities in their training data. For reviews of the usefulness of regularization in statistics and machine learning, see Bickel et al. (2006) and Friedman et al. (2004), respectively, and see Wilson (2006) and Hayes (2011) for examples of the demonstrated importance of regularization in phonology.

The traditional view of inflectional morphology, by which for example $p(i \mid u) = 1.0$, corresponds to a complete lack of regularization.
Such a model predicts exactly the conditional probabilities that it observes, and because all words with a singular [u] have a plural [i], the model makes an equivalent prediction. By preventing regularization entirely, it becomes possible to quantitatively mimic the predicted probabilities above in a Bayesian setting, as equations 4.4 show. These equations assume for now empirical priors based on lexical frequencies: because each class has the same frequency, and because the plural [i] exponent occurs in only one class whereas the [e] exponent occurs in two, the prior probabilities of [i] and [e] are 0.3 and 0.6, respectively (that is, 1/3 and 2/3, written to one decimal place). These equations show that with no regularization, the prior probabilities are rendered irrelevant by the great difference in the base likelihood terms.

$$
\begin{aligned}
p(i \mid u) &\propto p(u \mid i)\,p(i) = 1.0 \times 0.3 = 0.3 \\
p(e \mid u) &\propto p(u \mid e)\,p(e) = 0.0 \times 0.6 = 0.0 \\
\Rightarrow\; p(i \mid u) &= 0.3 / (0.3 + 0.0) = 1.0
\end{aligned}
\tag{4.4}
$$

The opposite of this scenario is one with maximal regularization, which intuitively corresponds to the model ignoring any information provided by base forms. When regularization is maximized, conditional probability distributions over bases will be maximally entropic, i.e. every possible base form will be given the same conditional probability. The actual value of that probability depends only on the number of possible base forms, because they must sum to 1. In equation 4.5, maximal regularization has set both $p(u \mid i)$ and $p(u \mid e)$ equal to the same value, 0.3, because for either plural exponent ([i] or [e]), all three singular exponents are considered equally likely. As a result, only the prior probabilities rather than the base likelihoods make a difference in the final probabilities.

$$
\begin{aligned}
p(i \mid u) &\propto p(u \mid i)\,p(i) = 0.3 \times 0.3 \approx 0.1 \\
p(e \mid u) &\propto p(u \mid e)\,p(e) = 0.3 \times 0.6 \approx 0.2 \\
\Rightarrow\; p(i \mid u) &= 0.1 / (0.1 + 0.2) \approx 0.3
\end{aligned}
\tag{4.5}
$$

In practice, of course, most useful models will fall somewhere between these two extremes of regularization.
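The two calculations in equations 4.4 and 4.5 can be checked with a minimal Python sketch. This is an illustration written for the toy system in figure 4.1 only, not code from the package accompanying the dissertation; the function name `posterior` and the dictionary layout are assumptions made here. Exact fractions (1/3 and 2/3) are used in place of the one-decimal values in the equations.

```python
from fractions import Fraction


def posterior(likelihoods, priors):
    """Multiply each candidate's likelihood by its prior, then normalize."""
    unnormalized = {d: likelihoods[d] * priors[d] for d in priors}
    total = sum(unnormalized.values())
    return {d: value / total for d, value in unnormalized.items()}


# Empirical priors over plural exponents in figure 4.1: [i] occurs in one of
# three equally frequent classes, [e] in two of them.
empirical_priors = {"i": Fraction(1, 3), "e": Fraction(2, 3)}

# Likelihood of the singular exponent [u] given each plural exponent.
# No regularization: the conditional probabilities observed in the lexicon.
no_regularization = {"i": Fraction(1), "e": Fraction(0)}
# Maximal regularization: all three singular exponents equally likely.
max_regularization = {"i": Fraction(1, 3), "e": Fraction(1, 3)}

# With no regularization the likelihoods decide everything: p(i|u) = 1.
print(posterior(no_regularization, empirical_priors))
# With maximal regularization the posterior equals the prior: p(i|u) = 1/3.
print(posterior(max_regularization, empirical_priors))
```

Passing a different `priors` dictionary to the same function is enough to explore other prior distributions under either likelihood setting.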
As I argue in the remainder of this chapter, experimental evidence from Icelandic and Polish speakers suggests that while prior probabilities play a major role in determining speakers’ predictions, speakers are not doomed to recapitulate lexical frequencies: they can and do make limited but principled use of information in provided base forms.

This section has so far assumed that the prior probabilities at play are empirical priors which define a distribution over forms that mirrors their relative frequencies in the lexicon. At this point I note that this is not by any means the only way to define a prior distribution. One alternative worth discussing is the use of uniform priors, which ignore information about lexical frequencies. Given n candidates, a uniform prior distribution over them would assign each candidate a probability of 1.0/n. In the case of this example, the n = 2 candidates are [i] and [e]. Like empirical priors, uniform priors are compatible with any degree of regularization of the likelihood term.¹ Equations 4.6 show the calculation of p(i|u) under the assumption of maximal regularization and uniform priors. Note that only the number of candidate exponents is relevant here; the fact that [e] is used in two classes as opposed to one does not affect the calculations.

\[
\begin{aligned}
p(i \mid u) &\propto p(u \mid i)\,p(i) = 0.\overline{3} \times 0.5 = 0.1\overline{6}\\
p(e \mid u) &\propto p(u \mid e)\,p(e) = 0.\overline{3} \times 0.5 = 0.1\overline{6}\\
\Rightarrow\; p(i \mid u) &= 0.1\overline{6} / (0.1\overline{6} + 0.1\overline{6}) = 0.5
\end{aligned}
\tag{4.6}
\]

To summarize this section, through choices of regularization parameters and types of priors, it is possible for a Bayesian model of inflectional morphology to describe various types of morphological grammars. When there is no regularization, the model predicts strict adherence to the implicational relationships, i.e. conditional probabilities, found in the lexicon, similar to a prescriptive or pedagogical view of inflectional morphology.
Conversely, when regularization is maximized in the conditional likelihood term, this prevents the model from making any productive use of known base forms of the target lexeme, reducing inference to either an exercise in matching lexical frequencies (with an empirical prior) or assignment of equal probability mass to every possible derivative candidate (with a uniform prior). The following sections demonstrate that none of these extremes adequately explains the Icelandic or Polish experimental data, and that instead the models with the best explanatory power make use of empirical priors with a moderate amount of regularization.

¹ Without any regularization, p(i|u) for this inflectional system would equal 1, as shown in 4.4, regardless of the prior type. However, this result depends crucially on one plural exponent having an unregularized conditional probability of 0. Otherwise, the prior will affect the final probabilities.

4.2 Assessing prior influence in Icelandic

The previous chapter described wug tests (Berko, 1958) on speakers of Icelandic and Polish, which were designed specifically to test the single surface base hypothesis and the base independence hypothesis, respectively. In both experiments, participants did not exhibit anywhere near perfect recapitulation of the strong (in some cases exceptionless) implicational relationships within their lexicons. This section revisits the results from the Icelandic experiment and develops the claim that participants’ response patterns constitute more than just noise. Instead, I propose an explanation of participant behavior based on the concepts of Bayesian priors and regularization introduced earlier in this chapter. The explanatory power of this model of speaker behavior validates the Bayesian view of inflectional morphology with empirical priors, supporting the central equation of sublexical morphology.

This section assumes a familiarity with the experimental methodology described in 3.3.1.
Discussion from this point onward assumes little about the nature of base reference in Icelandic, and so it is not necessary for readers to be familiar with the hypotheses that the experiment was designed to test. It should suffice instead to understand that the GenSg and NomPl forms of Icelandic nouns provide useful information (which speakers do indeed use) in predicting AccPl forms. Finally, note that this section presents strictly post hoc analysis of the experimental data.

4.2.1 Lexical frequencies in Icelandic

According to the hypothesis that Icelandic speakers make use of empirical priors in their inference of inflected forms, speakers’ predicted probabilities of morphological exponents should measurably correspond to the proportions of those exponents in the Icelandic lexicon, especially in the absence of substantial information to the contrary. Determining the lexical frequencies of these exponents therefore constitutes the first step in assessing this hypothesis.

Type frequencies of AccPl exponents among Icelandic nouns were extracted from the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012) using automated searches based on regular expressions, followed by manual checks performed by a native speaker to eliminate false positives. Proper nouns in Icelandic often inflect idiosyncratically, and so since Icelandic marks proper nouns with capitalization, words with capital letters were excluded. Moreover, because participants’ responses in the experiment were limited to the four AccPl exponent choices -ar, -a, -ir, and -i, I limit the rest of this discussion to only forms with one of those endings. According to the regular expression search, the AccPl forms of fifty-five percent of noun lexemes in Icelandic end with one of these four sequences.²

The search procedure using regular expressions only produces counts of AccPl forms which end with particular sequences, regardless of whether or not these sequences constitute a suffix.
Even under the assumption that Icelandic speakers make use of empirical priors based on lexical frequencies, there remains the question of whether these priors are based on frequencies of surface patterns, i.e. the presence of particular sequences of segments/characters in the AccPl, or based on frequencies of morphological exponents like suffixes themselves. If considering only true suffixes, then the procedure matching AccPl forms against regular expressions would overcount forms which take a null suffix but coincidentally end in the specified sequence.³ Because there are no nouns in Icelandic with a null AccPl suffix but a non-null NomSg suffix, it is possible to exclude such cases by removing all lexemes whose AccPl form is identical to their NomSg form. I term the simple, surface-based regular expression search method the surface method, and the more complex method the suffixal method.

² To my knowledge there is no direct way to incorporate into statistical analyses the fact that response options were limited to exponents comprising only 55% of the lexicon. As this section shows, the frequencies within this 55% closely mirror participant responses, but perhaps some of the deviation from a perfect recapitulation of the lexical frequencies of -ar, -a, -ir, and -i is attributable to this issue.

³ This procedure would also overcount forms ending in a suffix which coincidentally ends in the specified sequence, but there are no such nuisance suffixes in Icelandic for any of the four target suffixes.

        Surface           Suffixal
-a      31959 (43.2%)     31495 (52.3%)
-i      19892 (26.9%)      7260 (12.1%)
-ir     14637 (19.8%)     14415 (24.0%)
-ar      7445 (10.1%)      7060 (11.7%)

Figure 4.2: Counts of lexemes whose AccPl forms take each of the four target endings, based on the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012).
Data include surface counts, based on regular expression searches on AccPl forms, and suffixal counts, which include only AccPl forms bearing a non-null suffix as compared with their NomSg forms.

Figure 4.3: Visualized counts of lexemes whose AccPl forms take each of the four target endings, based on the Database of Modern Icelandic Inflection (Bjarnadóttir, 2012). Data include surface counts, based on regular expression searches on AccPl forms, and suffixal counts, which include only AccPl forms with a non-null suffix.

Figure 4.3 visualizes the results of these two counting procedures: surface counts of AccPl regular expression matches in blue on the left of each pair, and counts of true AccPl suffixes in green on the right of each pair. Counts for -i differ substantially between the two procedures, largely due to -i-final neuter AccPls with null suffixes, while counts for the other three suffixes differ minimally. As a result of these differences, not only the proportions but also the by-frequency orderings of the four suffixes vary between the two counting procedures. Specifically, whereas -i is the second-most frequent ending according to the raw (surface) counts, in the purely suffixal counts -i is barely more frequent than the least frequent ending, -ar.

4.2.2 Evidence for empirical priors in Icelandic

The most straightforward way to evaluate participants’ prior distributions over the four available suffixes is to inspect how participants behaved when provided no novel information that might significantly influence their responses.
The presentation condition in which lexemes were introduced using only their DatPl forms (that in which neither the GenSg nor NomPl was provided) meets this criterion. Note that while stem shape may also affect judgments about suffix appropriateness, the stimuli were designed and balanced so as to control for such effects; see 3.3 for further detail.

For comparison, we can inspect participants’ response patterns in the presentation condition in which all base forms are provided, which in principle (see figure 3.8) should provide speakers the information necessary to virtually disqualify all but one suffix candidate. Because stimuli were balanced across the presentation conditions and the four inflectional classes, such “perfect” (traditional, in the sense introduced previously) behavior in the condition providing all base forms would predict that overall response counts should be even across the four possible suffixes.

Figure 4.4: Frequencies of participants in the Icelandic experiment selecting an AccPl with each of the four possible suffixes. Blue bars on the left show response frequencies when only a lexeme’s DatPl was provided, while the green bars on the right show responses when all three base forms were provided.

Figure 4.4 shows response patterns in both presentation conditions. Three conclusions can be drawn from these data. First, these responses are inconsistent with the hypothesis that Icelandic speakers make use of uniform priors. In the only DatPl presentation condition, which should reveal participants’ prior beliefs about suffix distributions, rates of selecting each of the four suffixes differed substantially.
Impressionistically speaking, rather than a uniform distribution, responses constitute something closer to an exponential distribution, in which the frequency of -a responses is roughly twice as great as that of -i responses, which in turn is roughly twice as great as that of -ir responses, which finally is itself roughly twice as great as that of -ar responses. It would be implausible for a uniform prior distribution to generate the observed responses.

To validate this conclusion, and throughout the rest of this chapter, I use chi-squared tests for goodness of fit as implemented in the chisq.test function in R (R Core Team, 2013). In a general sense, this test is ideal for the task of determining the compatibility of a hypothetical distribution (such as a particular prior distribution over AccPl forms) with a set of responses ranging over the same values. This is because the test estimates how probable it is that data at least as discrepant as those observed would arise if the hypothetical distribution had generated them. Chi-squared values for tests with d degrees of freedom are indicated as χ²(d) = value, and I use these values to arrive at p-values which indicate the likelihood of the data given a specified hypothetical distribution. I acknowledge, however, that one assumption of chi-squared tests is not met: responses are not all strictly independent from each other, since experiment participants each provided multiple responses and each lexeme was seen by multiple participants. Because there is to my knowledge no convenient test that is comparable to a chi-squared test while allowing for such groupings of data, I opt to report chi-squared values with this caveat, primarily as a quantitative, heuristic companion to qualitative evaluations.
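To make the goodness-of-fit procedure concrete, here is a minimal pure-Python sketch of the Pearson statistic and its p-value. It is not the R chisq.test call used for the reported results. The expected proportions are computed from the counts in Figure 4.2, but the response counts in `hypothetical_responses` are invented for illustration and are not the experimental data; the function names are likewise assumptions of this example.

```python
import math


def chi_squared_gof(observed, expected_probs):
    """Pearson goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    statistic = sum(
        (obs - n * p) ** 2 / (n * p)
        for obs, p in zip(observed, expected_probs)
    )
    return statistic, len(observed) - 1


def chi_squared_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-squared variable."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        terms = (half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * sum(terms)
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer gamma terms.
    terms = (
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]


# Order of endings: -a, -i, -ir, -ar.
uniform = [0.25, 0.25, 0.25, 0.25]
surface = proportions([31959, 19892, 14637, 7445])   # Figure 4.2, surface
suffixal = proportions([31495, 7260, 14415, 7060])   # Figure 4.2, suffixal

# HYPOTHETICAL response counts, chosen only to illustrate the calculation.
hypothetical_responses = [480, 240, 120, 60]

for name, probs in [("uniform", uniform), ("surface", surface), ("suffixal", suffixal)]:
    stat, df = chi_squared_gof(hypothetical_responses, probs)
    print(f"{name}: chi2({df}) = {stat:.2f}, p = {chi_squared_sf(stat, df):.3g}")
```

A lower statistic indicates a closer fit between the hypothesized distribution and the counts, which is how the chi-squared values are compared across candidate priors in this chapter.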
In the future, it may be useful (and better aligned with the arguments of this dissertation) to take an explicitly Bayesian approach here, using Markov chain Monte Carlo methods to estimate data probabilities and performing Bayesian model comparison (Robert, 2007).

Returning to the specific question of whether speakers show evidence of employing uniform priors, I performed a chi-squared test of the experimental data for given probabilities [0.25, 0.25, 0.25, 0.25]. This test yields a chi-squared value of χ²(3) = 709.7 and a p-value of less than 0.001 for responses in the only DatPl condition. Therefore the response data in this presentation condition are highly unlikely to have been generated from a uniform underlying prior distribution.

Second, these response patterns are more compatible with the hypothesis that participants make use of empirical priors, specifically those based on frequencies of surface segment/character sequences, rather than on frequencies of true suffixes. Qualitatively, the rate of -i responses in particular relative to other responses more closely resembles the distribution of surface counts than it does the distribution of suffixal counts. A chi-squared test for the lexical proportions of the four target sequences taking only true suffixes into account yields a chi-squared value of χ²(3) = 227.42, lower than that for the uniform distribution, but a p-value still less than 0.001. The same test on the proportions of AccPl forms ending in the target surface sequences (including null suffixes) yields a chi-squared value of χ²(3) = 129.69, the lowest of any distribution considered. However, the p-value for this test is still less than 0.001.
From the chi-squared values, one can conclude that even though the p-values are all too small to directly compare, the surface distribution gives the highest likelihood to the experimental data.

The fact that even the best model’s p-value is so low indicates that while the empirical prior based on frequencies of surface patterns rather than suffixes best corresponds to the experimental data, there are unaccounted-for sources of noise affecting participant responses. This may be due to the fact mentioned previously that the assumptions of chi-squared tests are not strictly met. Future research on this topic may benefit from taking a Bayesian approach to testing such questions.

As a sanity check, I also compared the fits of these three hypothetical distributions to the DatPl-only experimental data using Kullback-Leibler divergence (Kullback & Leibler, 1951). Kullback-Leibler divergence is an information theoretic measure that quantifies how well one probability distribution approximates another. Specifically, this measure indicates how much information (in the information theoretic sense) is lost by approximating one distribution using another. Thus this measure is useful as a second way of evaluating how well the various hypothetical prior distributions predict the experimental data.
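As a rough illustration of this measure, the sketch below computes the divergence of several hypothetical priors from a response distribution. Two choices are assumptions of the example rather than statements about the reported analysis: the direction D(observed || hypothetical) and the use of natural logarithms (nats). The lexical proportions come from Figure 4.2, whereas `hypothetical_observed` is an invented distribution that simply follows the "each ending roughly half as frequent as the last" shape described above.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats: information lost when q is used to approximate p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]


# Order of endings: -a, -i, -ir, -ar.
uniform = [0.25, 0.25, 0.25, 0.25]
surface = proportions([31959, 19892, 14637, 7445])   # Figure 4.2, surface
suffixal = proportions([31495, 7260, 14415, 7060])   # Figure 4.2, suffixal

# HYPOTHETICAL response proportions (8:4:2:1), for illustration only.
hypothetical_observed = proportions([8, 4, 2, 1])

for name, prior in [("uniform", uniform), ("surface", surface), ("suffixal", suffixal)]:
    print(name, round(kl_divergence(hypothetical_observed, prior), 3))
```

Smaller values mean the hypothetical prior loses less information about the response distribution; a value of zero would mean the two distributions are identical.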
Since the theoretical foundations of this measure are quite different from those of the chi-squared test, I consider it a useful way to ensure that differences in chi-squared values are not attributable only to the structure of that test itself.

As shown in Figure 4.5, the Kullback-Leibler divergences of the three hypothetical distributions from the observed distribution of responses also support the conclusion that the surface distribution best explains these responses.

Uniform    Empirical
           Surface    Suffixal
0.341      0.069      0.118

Figure 4.5: Kullback-Leibler divergences of hypothetical prior distributions over AccPl endings from the observed response distribution from the Icelandic study in the presentation condition providing only DatPl forms.

Finally, based on the results in the DatPl+GenSg+NomPl presentation condition in which all bases were provided, it is clear that prior probabilities greatly influence speakers’ judgments. A complete lack of regularization would predict evenly distributed responses in the condition in which all bases were provided. Beyond the obvious qualitative mismatch between this prediction and the observed response patterns in the DatPl+GenSg+NomPl condition, a chi-squared test on these response frequencies and the uniform distribution [0.25, 0.25, 0.25, 0.25] yields a chi-squared value of 313.7 and a p-value of less than 0.001. Nevertheless, the response patterns in this condition were more consistent with this uniform distribution than responses in the only DatPl condition, corroborating the finding from 3.3.2 that information in base forms shifts speaker judgments toward consistency with the conditional probabilities of AccPl forms in the lexicon.
In the terms introduced earlier in this chapter, we can conclude that under a sublexical morphology interpretation of these results, Icelandic speakers exhibit neither maximal regularization nor a complete lack of it, but rather some intermediate amount.

Among alternative causes of this response distribution that I have considered, none explains the experimental responses as well as the hypothesis that speakers’ prior beliefs are quantitatively grounded in the surface frequencies of the target AccPl endings. One might consider the experimental context to unfairly promote an even distribution of responses among the four options; but while this may be the case to an extent, such an explanation cannot account for the varying response frequencies in the only DatPl condition. Alternatively, given that of the four AccPl suffixes under consideration, -a is the most frequent in neologisms and loanwords (Gunnar Ó. Hansson, p.c.), one might expect speakers to disproportionately prefer -a responses. While -a was the most common response in the experiment, this explanation cannot account for the non-negligible frequencies of other responses which, again, mirror lexical frequencies. However, although these alternative explanations cannot by themselves supplant the utility of empirical priors in modeling the experimental results, some combination of them may constitute part of the aforementioned noise which resulted in the chi-squared tests’ low p-values.

4.3 Assessing prior influence in Polish

This section follows the methods of the previous one, revisiting the response data from the Polish experiment described in 3.4 and evaluating what information these data provide about Polish speakers’ use of prior probabilities. The Polish patterns under investigation are in some ways simpler and in other ways more complex than the Icelandic patterns, and so the conclusions that can be drawn from the Polish experimental data differ somewhat from those that can be drawn from the Icelandic response data.
Primarily, the findings of this section serve to support the findings in the previous section. Overall, the Polish data exhibit an influence of empirical priors very similar to that observed in the Icelandic data.

4.3.1 Lexical frequencies in Polish

In the Polish experiment, participants were asked to select their preferred NomPl form for each lexeme from just two choices: one form ending in -e and one ending in -a. Unlike in the Icelandic experiment, there is not a one-to-one correspondence between these endings and inflectional classes: the -e suffix is consistent with the soft masculine and soft feminine classes, while the -a suffix is consistent only with neuter classes. Figure 4.6 repeats the table from chapter 3. Note that while these NomPl suffixes may also correspond to non-soft inflectional classes, especially in the case of the neuter -a, the soft consonants ending each of the stems used as experimental stimuli should in principle force participants to consider the lexemes members of a soft declension; whether participants did so is one empirical question that this section addresses.

             GenSg   GenPl   NomPl
Soft neut.   -a      ∅       -a
Soft masc.   -a      -y      -e
Soft fem.    -y      ∅       -e

Figure 4.6: The suffixes associated with the GenSg, GenPl, and NomPl forms of soft neuter, masculine, and feminine nouns in Polish.

As in the case of Icelandic, there are multiple ways that lexical frequencies as usable by an empirical prior could be construed. The simplest way is to count the number of surface occurrences of -e and -a endings on Polish NomPl forms, and a somewhat more nuanced method excludes NomPl forms which actually have a null suffix but have a stem ending in one of these characters. I maintain the conventions of the previous section in calling these methods surface and suffixal, respectively. Additionally, in Polish there is a third count which may be relevant: that of -e and -a NomPl forms only among lexemes ending in a soft consonant, i.e. only within soft declensions.
Since stem shapes make softness/hardness clear, such knowledge could plausibly be used in establishing participants’ prior distributions over their response choices.

To determine these lexical frequencies, I extracted counts of common nouns from PoliMorf (Woliński et al., 2012), the self-described “ultimate morphological resource for Polish” which builds off of the Grammatical Dictionary of Polish (SGJP) (Saloni et al., 2007). Counts of e-final and a-final NomPl forms were produced by performing regular expression searches on the corpus. To arrive at counts excluding null suffixes, I compared NomPl forms to their lexemes’ NomSg forms. As in Icelandic, there are no inflectional classes in Polish whose NomPl forms have null suffixes but whose NomSg forms have non-null suffixes, and so null-suffixed NomPl forms were excluded by removing NomPl forms which are identical to their lexemes’ NomSg forms. Finally, counts of only -e-final and -a-final soft lexemes were extracted using regular expressions that combined the two target suffixes with the soft stem endings used in the experiment: -ni, -si, -zi, and -ci.⁴ Figure 4.7 shows the results of these searches.

      Surface           Suffixal          Soft
-e    28630 (77.4%)     22530 (74.0%)     4152 (61.3%)
-a     8346 (22.6%)      7908 (26.0%)     2626 (38.7%)

Figure 4.7: Counts of lexemes whose NomPl forms end in each of the target characters, based on PoliMorf (Woliński et al., 2012). Data include surface counts based on regular expression searches on NomPl forms, suffixal counts which include only NomPl forms bearing a non-null suffix as compared with their NomSg forms, and soft counts which include only NomPl forms whose stems end in one of the four soft consonants used in the Polish experiment.

⁴ The soft consonants which end these stems are typically represented orthographically as ń, ś, ź, and ć, respectively. However, according to the orthographical conventions of Polish, these sounds are written without a diacritic and with a following i when preceding a vowel.
For example, the lexeme LOVE is spelled miłość in the NomSg and miłości in the NomPl.

Figure 4.8: Visualized counts of lexemes whose NomPl forms end in each of the target characters, based on PoliMorf (Woliński et al., 2012). Data include surface counts based on regular expression searches on NomPl forms, suffixal counts which include only NomPl forms bearing a non-null suffix as compared with their NomSg forms, and soft counts which include only NomPl forms whose stems end in one of the four soft consonants used in the Polish experiment.

Figure 4.8 visualizes the results of these three counting procedures: surface counts of NomPl regular expression matches in blue on the left of each group, counts of true NomPl suffixes in green in the middle of each group, and in orange on the right side of each group, counts of only NomPl endings whose lexemes end in one of the four soft consonants used in the experiment. According to all of these counting procedures, -e endings outnumber -a endings. This difference is most pronounced, however, in the surface counts of -e-final and -a-final NomPl forms, and least pronounced among the counts of soft lexemes. These results also demonstrate that only small minorities of the forms with these endings belong to a soft inflectional class, especially among -e-final forms.

4.3.2 Evidence for empirical priors in Polish

As in the preceding section, comparisons of Polish response patterns with lexical patterns bear on three central questions: whether Polish speakers make use of uniform or empirical priors, what types of lexical patterns empirical priors are based on (in the case that they are used at all), and how much regularization is at play in speakers’ judgments. I proceed through these topics in the above order.

The most straightforward way to evaluate participants’ prior distributions over the two available endings is, again, to inspect how participants behaved when provided no novel information that might significantly influence their responses.
The presentation condition in which lexemes were introduced using only their DatPl forms (that in which neither the GenSg nor GenPl was provided) meets this criterion. Whether prior distributions are also affected by stem shape is discussed further below. For comparison, we can inspect participants’ response patterns in the presentation condition in which all base forms are provided, which in principle (see figure 4.6) should provide speakers the information necessary to decide on a single candidate ending. Because stimuli were balanced across the presentation conditions and the three genders, such “perfect” behavior in the condition providing all base forms would predict response rates of approximately 67% for -e and 33% for -a. Figure 4.9 visualizes response rates in these two presentation conditions.

Figure 4.9: Frequencies of participants in the Polish experiment selecting a NomPl with each of the two possible suffixes. Blue bars on the left show response frequencies when only a lexeme’s DatPl was provided, while the green bars on the right show responses when all three base forms were provided.

These response patterns are inconsistent with the hypothesis that speakers make no use of lexical frequencies when performing morphological inference, i.e. that they make use of uniform rather than empirical priors. A chi-squared test of the likelihood of the only DatPl experimental data given a uniform distribution assigning 0.5 probability to both -e and -a yields a chi-squared value of χ²(1) = 1292.2 and a p-value of less than 0.001. This indicates that it is highly unlikely that participants with a uniform prior distribution over the two options would produce these response patterns.

Among the three possible interpretations of an empirical prior, the one making use only of surface frequencies of the endings -e and -a is most compatible with data from the only DatPl condition.
While all three versions yielded p-values less than 0.001, the chi-squared values for the surface, suffixal, and soft distributions are χ²(1) = 78.80, χ²(1) = 156.21, and χ²(1) = 616.38, respectively. These results indicate that even though the p-values are too small to compare directly, the empirical prior model based on surface frequencies of the two endings gives the highest likelihood of the response patterns. As shown in Figure 4.10, the Kullback-Leibler divergence values of these hypothetical distributions from the response distribution further support these conclusions, indicating that the surface distribution is closest to the experimental response distribution. However, because the p-value is so small even for the best model, there must be some unaccounted-for sources of noise in the results.

Uniform    Empirical
           Surface    Suffixal    Soft
0.262      0.016      0.032       0.130

Figure 4.10: Kullback-Leibler divergences of hypothetical prior distributions over NomPl endings from the observed response distribution in the DatPl-only condition of the Polish study.

In addition, based on the results in the presentation condition in which all bases were provided, it is clear that prior probabilities greatly influence speakers’ judgments. While overall response patterns in this condition do look qualitatively similar to the two-to-one odds predicted by the hypothesis that participants made perfect use of information in the provided base forms, participants’ success rate in selecting the appropriate endings falls far short of the high success rate that this hypothesis predicts. Specifically, participants in this condition (which, based on implicational relationships in the lexicon, should allow perfect accuracy) selected the ending consistent with these relationships only 20.8% of the time (751 out of 3614 responses).
A chi-squared test of response frequencies in this condition given the [0.667, 0.333] distribution predicted if participants make perfect use of these implicational relationships, i.e. the response distribution predicted if there is no regularization at play in these judgments, yields a chi-squared value of χ²(1) = 103.23. However, performing the same test using the [0.774, 0.226] distribution of the best-performing empirical prior yields a much lower chi-squared value of just χ²(1) = 3.64. I conclude accordingly that since the prior distribution is a better fit to these data, there is some regularization bringing participants’ responses closer to the empirical prior distribution of NomPl exponents, even though, as the tests in chapter 3 show, participants also make significant use of the relevant lexical co-occurrence patterns.

4.4 Summary and discussion

This chapter has served two purposes. First, it has offered a Bayesian interpretation of a traditional model of inflectional morphology, and has used this interpretation to set up a typology of inflectional productivity patterns predicted by various parameter values. Second, it has offered experimental evidence that bears on the question of how speakers make use of Bayesian priors in inflectional morphology.

Evidence from the Icelandic experiment introduced in the previous chapter supports three conclusions about the influence of prior distributions in inflectional inference, and evidence from the related Polish experiment corroborates these findings. According to this evidence, speakers make use of empirical priors when performing morphological inference. These priors correspond to the lexical frequencies of the surface-based phonological shapes corresponding to their derivative form options, rather than to frequencies of exponents like suffixes per se or frequencies of inflectional classes.
Moreover, these priors exhibit a strong influence on morphological judgments even when they conflict with novel information provided by other inflected forms; in a model like sublexical morphology, this behavior indicates strong regularization on conditional likelihoods, but not enough so as to prevent any substantial use of information in provided base forms.

These results speak more generally to the notion of productivity in inflectional morphology. Between this chapter and the previous one, I have shown that even when speaker judgments diverge from those predicted by strong implicational relationships in the lexicon, such divergence is largely principled, deriving from simple lexical frequencies, and such divergence does not mean that speakers are making no productive use of those implicational relationships. For example, as Kawahara (2011, 2016) summarizes, in light of experimental findings that Japanese speakers often fail to “correctly” generalize Japanese verbal inflection to novel verbs (Batchelder, 1999; Griner, 2001; Vance, 1991, 1987), some linguists have concluded that the Japanese verbal system lacks productivity. The Bayesian view of productivity that I propose, in which real productivity can be partially hidden by the influence of prior probabilities, would not so hastily lead to this conclusion. I hope that this discussion encourages researchers to revisit questions of morphological productivity with an eye to how the concepts of priors and regularization can impact our assessment of whether patterns are indeed productive.

Lastly, I acknowledge that these post hoc investigations of evidence from my Icelandic and Polish experiments do not themselves constitute incontrovertible evidence for surface-based empirical priors.
From the tests I have performed on these results, it appears that some unknown sources of noise are affecting speakers’ judgments in addition to (or, perhaps, instead of) their empirical priors: the highest p-values I obtained from any of my chi-squared tests for goodness of fit were still lower than $2.2 \times 10^{-16}$. Curiously, the more moderated response patterns in the all-bases conditions actually achieve better chi-squared values and KL-divergences than responses in the only DatPl conditions, suggesting that whatever other noise there may be, it has the anti-moderation effect of distributing responses less evenly. Additionally, one could argue that the Polish results in themselves prove little, since in Polish the expected response distribution given no regularization (the traditional model) is $[0.\overline{6}, 0.\overline{3}]$, close to the observed response distribution. This is why I use the Polish results here primarily as a way of supporting the stronger claims made from the Icelandic study, in which the predicted response distributions vary substantially. In part because of these mitigating factors, I look forward to the possibility of future experiments designed specifically to assess the influence of empirical priors.

Chapter 5

Conclusion

This dissertation has presented and validated a novel approach to modeling the paradigm cell filling problem, that is, the task of inferring unknown forms within an inflectional paradigm. Humans perform this task in ways that evidence the learning of a complex generative morphological system, and yet investigations of this specific topic that combine formal modeling with experimental methods are rare.
The goal of the research presented here has been to contribute to our collective understanding of how humans use their native languages’ inflectional systems. By proposing not only a theoretical framework for understanding the use of inflectional morphology, but also a concrete implementation of the theory with a concomitant learning algorithm, I have laid the groundwork for future research in both theoretical linguistics and natural language processing.

This chapter concludes the dissertation with a summary of the previous chapters and further discussion. Section 5.1 reviews the general Bayesian view of inflectional morphology and my specific proposal of sublexical morphology, adding in-line references to the experiments I performed in order to validate the claims that make up my proposal. Section 5.2 then explores ways in which sublexical morphology may be of use to theoretical linguists beyond its core use case of “solving” the paradigm cell filling problem, including its potential for investigating paradigm leveling, paradigmatic gaps, and hypotheses about paradigm entropy. Finally, section 5.3 discusses ways that follow-up research could address some limitations of the sublexical approach.

5.1 Summary of proposals and evidence

At the most fundamental level, I have proposed that we can conceive of the paradigm cell filling problem probabilistically, within a Bayesian framework whose variables are the surface-level inflected forms of lexemes, i.e. observed forms rather than abstract underlying representations. To “fill” a paradigm cell, a native speaker, knowing only a set of base forms of a lexeme, must infer and produce some theretofore unknown derivative form of the same lexeme.
Under the probabilistic interpretation that I have proposed, filling a paradigm cell means not the selection of an optimal output form for the derivative, but rather the inference of a probability distribution over derivative forms followed by sampling from that distribution.

To formalize these ideas, I define a discrete distribution $D$ which ranges over the possible forms of the derivative being inferred. (The set of possible forms is determined, for example, by the sublexicons of sublexical morphology.) This distribution is conditioned on a set of base form variables $B$, one for each base cell for which the speaker has observed a form of the lexeme, and each variable ranging over the observed forms of that lexeme in that base cell. Equation 5.1 shows the conditional probability distribution that a model of the paradigm cell filling problem generates.

$$p(D \mid B_1, B_2, \ldots, B_n) \tag{5.1}$$

The first hypothesis about morphological inference that I placed under scrutiny was the single surface base hypothesis of Albright (2002) and subsequent papers. In brief, given the probabilistic interpretation that I proposed, this hypothesis constitutes a claim that inference about a derivative form makes use of only information contained in a single privileged base form. Equation 5.2 uses the notation developed so far to show the equality predicted by the single surface base hypothesis.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(D \mid B_{\text{privileged}}) \tag{5.2}$$

To assess this hypothesis, I carried out the experiment on Icelandic speakers described in the first part of chapter 3. Targeting a specific pattern within the Icelandic nominal inflection system, this experiment introduced a novel variation on the wug test paradigm (Berko, 1958) to address the question of whether speakers are able to combine information from multiple bases in a single inference task.
The experimental results are inconsistent with the single surface base hypothesis, suggesting that Icelandic speakers combine information from all available base forms when inferring unknown derivative forms. Despite the theoretical and computational appeal of the single surface base hypothesis, then, I concluded that speakers’ grammars place no such limitation on their inferential capabilities.

Beyond simply proposing a probabilistic interpretation of the paradigm cell filling problem, perhaps the most essential claim I have made in this dissertation is that we can use Bayes’s theorem to better understand and predict how speakers “solve” this “problem”. As equation 5.3 shows, Bayes’s theorem makes it possible to decompose the target conditional probability distribution into a likelihood term conditioned on the derivative and a prior term.

$$p(D \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid D)\, p(D) \tag{5.3}$$

This manipulation itself does not lead to any particular increase in ease of modeling, because there is no clearly useful direct interpretation either of a joint probability distribution over bases given a derivative, $p(B_1, B_2, \ldots, B_n \mid D)$, or of a prior probability of a derivative, $p(D)$; the probabilistic view of these distributions under-determines how they should be calculated. This is why I have gone one step further, introducing the framework of sublexical morphology, which makes these distributions readily interpretable and calculable. According to sublexical morphology, the entire lexicon is partitioned into morphologically homogeneous sub-parts called (paradigm) sublexicons. Given a sublexicon and at least one base form, one can generate a derivative form for any derivative cell. Because of this near equivalence of sublexicons with derivative forms, sublexical morphology claims that distributions over derivative forms in the equations above can be replaced with distributions over sublexicons $S$, with derivative distributions then generable from sublexicon distributions.
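The proportionality in equation 5.3, once sublexicons stand in for derivative forms, can be sketched in a few lines of Python. This is only an illustration: the function name, the sublexicon labels, and every number below are invented for the example and are not taken from the experiments or from the accompanying package.

```python
def posterior_over_sublexicons(likelihoods, priors):
    """Return p(S | bases), proportional to p(bases | S) * p(S).

    Both arguments are dicts keyed by (hypothetical) sublexicon labels.
    """
    unnormalized = {s: likelihoods[s] * priors[s] for s in priors}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("all sublexicons have zero posterior mass")
    return {s: score / total for s, score in unnormalized.items()}


# Invented values: the bases favor S2, but S1 is the larger sublexicon.
likelihoods = {"S1": 0.2, "S2": 0.6}
priors = {"S1": 0.75, "S2": 0.25}
posterior = posterior_over_sublexicons(likelihoods, priors)
```

With these invented values the likelihood and the prior pull in opposite directions and cancel, so the two sublexicons end up about equally probable. Selecting a sublexicon from such a posterior then stands in for selecting the corresponding derivative form.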
Essentially, morphological inference reduces to a straightforward classification problem in which selection of a sublexicon for some lexeme is tantamount to selecting its derivative form. Equation 5.4 summarizes these claims.

$$p(D \mid B_1, B_2, \ldots, B_n) = p(S \mid B_1, B_2, \ldots, B_n) \propto p(B_1, B_2, \ldots, B_n \mid S)\, p(S) \tag{5.4}$$

By substituting sublexicon distributions for derivative distributions, the distributions in this equation become both more intuitive and easier to calculate. As a demonstration of these properties, I will describe the sublexical interpretation of the two terms on the far-right-hand side of equation 5.4, first the likelihood term $p(B_1, B_2, \ldots, B_n \mid S)$ and then the prior term $p(S)$.

The term $p(B_1, B_2, \ldots, B_n \mid S)$ indicates the joint probability distribution over base forms given a sublexicon. Within sublexical morphology, the probability of a set of base forms given a sublexicon is determined by that sublexicon’s Maximum Entropy harmonic grammar (Goldwater & Johnson, 2003; Hayes & Wilson, 2008), also called its gatekeeper grammar. These grammars are parameterized by weighted constraints, whose violation profiles over the provided base forms combine to result in an overall probability of the forms.

However, although the sublexical approach provides this method of calculating base likelihoods for each sublexicon, a system that models the entire joint probability distribution over base forms may pose a problem from the standpoint of learnability. Under the gatekeeper grammar interpretation of base likelihoods, for example, defining the joint space over base forms would require a massive proliferation of cross-base constraint conjunctions, constraints which refer to phonological material in multiple base forms at once.
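To make the role of weighted constraints concrete, here is a minimal Python sketch of the general Maximum Entropy scoring scheme, in which a form's probability is proportional to the exponential of its negated weighted violation sum. The constraint names, weights, and candidate forms are hypothetical, and normalizing over a small explicit candidate set is a simplifying assumption of this sketch rather than a description of how the gatekeeper grammars in this dissertation are normalized.

```python
import math


def harmony(violations, weights):
    """Weighted sum of constraint violations (lower means better-formed)."""
    return sum(weights[c] * n for c, n in violations.items())


def maxent_probabilities(candidates, weights):
    """Map each candidate to exp(-harmony) / Z over the given candidate set."""
    scores = {form: math.exp(-harmony(v, weights))
              for form, v in candidates.items()}
    z = sum(scores.values())
    return {form: score / z for form, score in scores.items()}


# Hypothetical constraints and violation counts, for illustration only.
weights = {"*FinalVowel": 2.0, "*Cluster": 0.5}
candidates = {
    "form_a": {"*FinalVowel": 0, "*Cluster": 1},
    "form_b": {"*FinalVowel": 1, "*Cluster": 0},
}
probs = maxent_probabilities(candidates, weights)
```

Because form_a violates only the lightly weighted constraint, it receives most of the probability mass. A constraint that mentioned material in two base forms at once would simply be one more key in the violation dictionaries, which is why the constraint space grows so quickly when cross-base conjunctions are allowed.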
Since phonological constraint learning already poses substantial challenges even without this addition of orders of magnitude more complexity (Hayes & Wilson, 2008), it would be highly desirable if one could empirically determine that humans make use of only a small portion of this constraint space.

In order to evaluate whether such simplifications are empirically justified, I defined the base independence hypothesis, whereby the probabilities of base forms are conditionally independent of each other given a sublexicon. Equation 5.5 represents this hypothesis mathematically, using the definition of conditional independence. If the base independence hypothesis is valid, then determining the joint distribution over bases given a sublexicon becomes far easier; if calculated using a gatekeeper grammar, for example, the search space (and possible set of constraints to evaluate during inference) would be restricted only to constraints which each evaluate some property of a single base form.

$$p(B_1, B_2, \ldots, B_n \mid S) = p(B_1 \mid S)\, p(B_2 \mid S) \cdots p(B_n \mid S) \tag{5.5}$$

Seeking to test this hypothesis, I performed an experiment with native speakers of Polish, as described in the second part of chapter 3. This experiment used a methodology similar to that of the Icelandic experiment, but targeted a part of the Polish nominal system whose implicational relationships render it a viable testing ground for the base independence hypothesis. Participants’ behavior in this experiment was consistent with the base independence hypothesis, although some unexplained irregularities in their overall response patterns make me wary of considering these results definitive. If the validity of the base independence hypothesis can be confirmed, e.g.
by additional experimentation, then these findings will benefit theoretical and, especially, computational models of inflectional morphology, simplifying analyses for the former and facilitating learning and inference for the latter.

I turn now to the term in equation 5.4 indicating the prior probability distribution over sublexicons, $p(S)$. Chapter 4 contains the bulk of the discussion of this aspect of the model. In sublexical morphology, this distribution forms an empirical prior distribution matching the relative “sizes” of the various sublexicons in an inflectional system. The precise manner in which the size of a sublexicon is quantified, however, was not known a priori, nor was there any particular theoretical reason to define it in some particular way. Moreover, there was no empirical evidence that prior distributions in sublexical morphology should be based on lexical frequencies at all.

Intending to assess whether speakers do indeed make use of empirical priors, and to determine how speakers arrive at them, I revisited the results of the Icelandic and Polish experiments introduced originally in chapter 3. These analyses of participant responses were performed on a strictly post hoc basis, but taken together they suggest that speakers of Icelandic and Polish make use of empirical priors when performing morphological inference, and moreover that the influence of these priors largely (but not completely) overshadows the effect of base forms on their posterior distributions over derivative candidates. More specifically, these analyses suggest that speakers’ empirical priors reflect the surface frequencies of segment/character sequences associated with each derivative candidate, rather than the frequencies of exponents (e.g.
suffixes) themselves or frequencies of inflectional classes.

The theoretical and experimental findings of this dissertation validate my proposal of a probabilistic, Bayesian approach to inference in inflectional morphology (the paradigm cell filling problem), demonstrating in particular the explanatory and predictive power of sublexical morphology. This research also constitutes the foundation of further research on probabilistic models of inflectional morphology and, more generally, on the formal limits on human abilities to perform such inference.

5.2 Other applications of sublexical morphology

Sublexical morphology directly models the paradigm cell filling problem, but its usefulness within linguistic theory extends beyond this scope. This section surveys three specific research topics within theoretical morphology about which sublexical morphology may help yield new and valuable insights. These topics include diachronic phenomena (paradigm leveling and the emergence of paradigmatic gaps) and hypotheses about the nature of predictability in inflectional systems (paradigmatic gaps and paradigm entropy conjectures).

5.2.1 Paradigm leveling

As discussed in chapter 3, Albright (2002), Albright (2008), and Albright (2010) among others have argued that historical patterns of paradigm leveling in Latin and Yiddish can be explained and predicted by the single surface base hypothesis. This hypothesis states that for any inflectional system, speakers can use only the form in a single privileged base cell to generate unfamiliar derivative forms, and that this privileged base cell is the cell whose forms are most informative about the forms in other cells, given the implicational relationships in the lexicon.
The single surface base hypothesis accurately predicts the directionality of paradigm leveling in several cases discussed by Albright: the limitation to use of only a single base form means that phonological distinctions among inflectional classes which are only present in non-privileged cells are the ones at risk of diachronic loss. However, the results of the Icelandic experiment described in this dissertation refute the claim that this limitation to a single base form holds cross-linguistically.

While it may appear then that I have traded an explanation of one phenomenon (paradigm leveling) for an explanation of others (those discussed in this dissertation) by falsifying the single surface base hypothesis, I propose that sublexical morphology may in fact also offer an explanation of at least some observed paradigm leveling patterns. In this subsection, I review the definition of paradigm leveling with reference to a standard instance of the phenomenon, the Latin HONOR “analogy”, and then I demonstrate that even with only the mechanisms introduced in previous chapters, sublexical morphology is able to predict the directionality of this historical change.

         Old Latin                        Golden Age Latin
         Class 1        Class 2           Class 1        Class 2
         (high freq.)   (low freq.)       (high freq.)   (low freq.)
NomSg    soror          honos             soror          honor
GenSg    sororis        honoris           sororis        honoris

Figure 5.1: A schematic of Old Latin and Golden Age Latin NomSg and GenSg forms relevant to the leveling of HONOR-like words. The forms for SISTER and HONOR are used as examples of forms in classes 1 and 2, respectively.

Figure 5.1 illustrates the key facts in Old Latin, which preceded this case of paradigm leveling, and in Golden Age Latin, which was spoken after the leveling (Albright, 2002).
While there was some individual variation among lexical items, in general there are two classes of nouns thought to be relevant to the phenomenon: a group which I label as Class 1, which included many lexical items in Old Latin including the SISTER word soror, sororis, and a group which I label as Class 2, which included fewer lexical items including the HONOR word honos, honoris. Here I use the forms of those two exemplar lexical items to show the patterns of those classes in general. Note that while I have listed GenSg forms, the relevant contrast is more properly NomSg versus the oblique forms, which include GenSg and all other non-NomSg forms.

Crucially, the [-s] suffix that was the exponent of the NomSg in Class 2 in Old Latin became an [-r] suffix in Golden Age Latin, rendering the two classes morphologically equivalent in the more recent language. Because this change occurred among words like HONOR, and because these NomSg forms appear to have been rebuilt analogously to the NomSg forms of Class 1, this phenomenon is called the Latin HONOR analogy. Moreover, because this historical change resulted in a neutralization of a prior contrast between inflectional classes, it is an example of paradigm leveling.

There are two key questions about the Latin facts which any theory of paradigm leveling must address. First, why was it the NomSg form that changed instead of the GenSg form? It is conceivable that instead of the Class 2 NomSg forms changing, the Class 2 GenSg forms could have ended up taking a [-sis] suffix. Second, why did Class 2 forms change instead of Class 1 forms? It is also conceivable that SISTER-like words could have been leveled, taking the [-s] suffix of Class 2 words, rather than the other way around as was observed.

The sublexical morphology view of how paradigm leveling might emerge offers explanations for both of these directionalities of change. For an inflectional class to be leveled, i.e.
undergo neutralization with some other class, its inflected forms which distinguish it from the class to which it levels must at some point be produced with the morphology of the leveled-to class. In other words, speakers faced with the paradigm cell filling problem and needing to infer these forms infer “incorrectly” that these forms take the morphology of a class other than the one to which they originally belonged. Once this process begins, these novel inferred forms are presumably heard by other speakers, are memorized by them, and then propagate with a decreasing need for the (mis-)inferential process. In the Latin case, for example, perhaps some speaker(s) innovated the form [honor] and similar forms for other words in its class, and these novel forms spread throughout the community of speakers.

In response to the first question, sublexical morphology predicts accurately that NomSg forms rather than GenSg forms would be altered by paradigm leveling. Using morphological operations plausibly learned by the algorithm described in section 2.5, the Old Latin sub-paradigm in Figure 5.1 would need only one operation to derive GenSg forms from NomSg forms, but two operations (one for each class) to derive NomSg forms from GenSg forms. Figure 5.2 shows these operations.

                 Sublexicon 1               Sublexicon 2
                 (high freq.)               (low freq.)
NomSg → GenSg    final segment → [ris]      final segment → [ris]
GenSg → NomSg    final [ris] → [r]          final [ris] → [s]

Figure 5.2: The morphological operations deriving NomSg and GenSg forms from each other in the paradigm sublexicons of Old Latin.

Under the assumptions of sublexical morphology, the selection of a sublexicon stands as a proxy for the generation of a derivative. For the “wrong” derivative to be produced, all that is required is for the lexeme to be associated with the “wrong” sublexicon.
As Figure 5.2 makes clear, when deriving a GenSg from a NomSg, it does not matter at all which sublexicon a speaker considers a lexeme like HONOR to belong to; both sublexicons would result in the same derivative form, one ending in [-ris]. When generating a NomSg form from a GenSg form, however, the choice of sublexicon matters a great deal: this choice is equivalent to the choice between an [-r] suffix and an [-s] suffix on the NomSg. Paradigm leveling amounts to the consolidation of two sublexicons into one, and with the sublexicons shown in the figure above, such consolidation would only produce noticeable changes in inflected forms among NomSg forms, not among GenSg forms.

As for the question of why the Class 2 forms changed rather than the Class 1 forms, a sublexical morphology account of paradigm leveling would attribute this fact to the prior probabilities of the two sublexicons. It is no coincidence, in this account, that the forms in the less frequent Class 2 were rebuilt to be more similar to the forms in the more frequent Class 1. If a speaker of Old Latin relied on empirical priors as defined according to the empirical results from chapter 4, and especially if speakers relied on these priors to the extent that participants in the Icelandic and Polish studies appear to have, then we would expect the morphological operation of class/sublexicon 1 to be (mis-)applied to lexemes in Class 2 far more often than the morphological operation of class/sublexicon 2 would be (mis-)applied to lexemes in Class 1. This behavior would result in speakers producing forms like [honor] frequently and [soros] much less so, feeding the canonicalization which, by conjecture, results diachronically in paradigm leveling.

Taken as a whole, sublexical morphology successfully predicts both aspects of the directionality of the Latin HONOR analogy, at least when viewed as narrowly concerning the two noun classes mentioned here.
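The asymmetry between the two directions of derivation can be checked mechanically. The following Python sketch encodes the four operations of Figure 5.2 and applies them to the HONOR and SISTER forms. Treating each orthographic character as one segment is an assumption of the sketch, and the helper names are hypothetical rather than part of the accompanying package.

```python
def replace_final(form, old, new):
    """Replace the word-final substring `old` with `new`; fail if absent."""
    if not form.endswith(old):
        raise ValueError(f"{form!r} does not end in {old!r}")
    return form[: len(form) - len(old)] + new


def nomsg_to_gensg(form):
    """NomSg -> GenSg: final segment -> [ris] (identical in both sublexicons)."""
    return form[:-1] + "ris"


# GenSg -> NomSg: the operation differs by sublexicon (Figure 5.2).
GENSG_TO_NOMSG = {
    "sublexicon_1": lambda form: replace_final(form, "ris", "r"),
    "sublexicon_2": lambda form: replace_final(form, "ris", "s"),
}

# Deriving the NomSg of HONOR: the choice of sublexicon changes the output.
nomsg_candidates = {name: op("honoris") for name, op in GENSG_TO_NOMSG.items()}

# Deriving the GenSg from either NomSg variant: the output is the same.
gensg_candidates = {nomsg_to_gensg(f) for f in ("honos", "honor")}
```

Here nomsg_candidates pairs sublexicon 1 with "honor" and sublexicon 2 with "honos", while gensg_candidates contains the single form "honoris". Applying the sublexicon 2 operation to "sororis" yields the unattested "soros" mentioned above.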
A fuller account of the analogy based on sublexical morphology would need to demonstrate that the theory does not predict other NomSg–GenSg ambiguities in Old Latin yielding leveling changes, although the probabilistic nature of sublexical morphology’s predictions makes it difficult to clearly falsify using historical data. In general, my proposals here predict that paradigm leveling should be more likely in cases where two sublexicons (or inflectional classes) exhibit both a severe difference in lexical frequencies and a lack of robust phonological differences. Additionally, while I do not intend to suggest that sublexical morphology alone can account for all cases of paradigm leveling, it goes without saying that in order to constitute a universal theory of paradigm leveling, the framework would need to be tested on other datasets, especially the Yiddish dataset on which Albright’s (2002 et seq.) model performs so uncannily well. I hope that this brief discussion of Latin can serve as the seed of future research into Bayesian and sublexical interpretations of paradigm leveling.

5.2.2 Paradigmatic gaps

Different strands of research on paradigmatic gaps have converged on the finding that such gaps correspond to parts of an inflectional system exhibiting a lack of predictability (Albright, 2003, 2009; Hansson, 1999; Sims, 2006). By paradigmatic gaps, I refer to logically possible forms of lexemes in an inflectional system which speakers avoid producing; for example, some verbs in Spanish including [asir] GRASP have no (canonical) first person singular present indicative form. Research on paradigmatic gaps typically addresses the question of why such gaps come about in the first place, as well as the question of why gaps occur in some parts of a paradigm but not others.

Directly or indirectly, the papers on paradigmatic gaps that I have cited here focus in part on the hypothesis that paradigmatic gaps tend to occur in less predictable parts of an inflectional system.
The predictability of a derivative cell is defined in terms of how useful the morpho-phonological sub-regularities in other cells of the paradigm would be when used to infer forms in that derivative cell. Perhaps, for example, even with knowledge of other forms of the lexeme GRASP in Spanish, speakers are unable to confidently predict a single most viable candidate for its first person singular present indicative form. Probabilistically, one could describe this situation by defining a probability distribution over the candidates for this derivative form, conditioned on the known base forms of the lexeme. If there is no single obvious “winning” candidate, then probability mass is distributed more evenly among the candidates in this distribution. In the terminology of information theory, this distribution is highly entropic.

Under a Bayesian view of inflectional morphology, the paradigm cell filling problem amounts to the inference of exactly such a conditional probability distribution. Because an entropy value can be calculated from any probability distribution, sublexical morphology can therefore be used to calculate the conditional entropy of any cell for any lexeme in an inflectional system. This property of sublexical morphology, especially using the implementation and learning algorithm described in chapter 2, makes it possible to directly test hypotheses about paradigmatic gaps and the entropy of individual cells in a paradigm. For example, a researcher could continue in the footsteps of Sims (2006) by performing a multi-base wug test like those described in this dissertation, but with a “decline to respond” answer option, and then test whether participants rely more on this extra answer option when the entropy of an inference task is high. Unlike other methods for calculating the predictability of parts of inflectional paradigms, sublexical morphology takes into account the prior probabilities of derivative forms, potentially making its estimates of entropy more reliable.
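The entropy calculation referred to here is the standard Shannon entropy of a discrete distribution. As a minimal Python sketch, with candidate labels and probabilities invented purely for illustration:

```python
import math


def entropy_bits(distribution):
    """Shannon entropy, in bits, of a dict mapping candidates to probabilities."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


# Hypothetical posteriors over two derivative candidates.
confident = {"cand_a": 0.95, "cand_b": 0.05}  # one clear winner
uncertain = {"cand_a": 0.5, "cand_b": 0.5}    # no obvious winner

low = entropy_bits(confident)
high = entropy_bits(uncertain)
```

The evenly split distribution has the maximum entropy available to two candidates (one bit), while the distribution with a clear winner has much lower entropy. On the hypothesis discussed above, cells whose inferred distributions resemble the second case would be the ones most prone to gaps.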
A follow-up study could also test whether or not prior probabilities play a role in experimental participants’ likelihood to decline to respond.

5.2.3 Paradigm entropy

Moving beyond the properties of individual cells in a paradigm, Ackerman & Malouf (2013) have performed information theoretic analyses of inflectional paradigms across a variety of languages, showing that while some information theoretic measures vary widely from language to language, others cluster tightly cross-linguistically, and that these results have interesting and useful theoretical consequences. For example, they define a paradigm’s average conditional entropy as “the average uncertainty in guessing the realization of one randomly selected cell in the paradigm of a lexeme given the realization of one other randomly selected cell.”

Just as for paradigmatic gaps, sublexical morphology could be useful to researchers interested in testing these and other variations of hypotheses about the information theoretic properties of entire inflectional paradigms. Since the measures discussed by Ackerman & Malouf (2013) can all be derived from the probability distributions for lexemes’ individual cells conditioned on some subset of those lexemes’ other forms, a model of the paradigm cell filling problem like sublexical morphology can indirectly calculate these measures. Moreover, the Bayesian character of sublexical morphology also allows researchers to evaluate the impact of prior probabilities on these entropy values.

5.3 Limitations and future directions

The simplicity and utility of sublexical morphology derive mainly from its establishment of a deterministic mapping from sublexicon to derivative form via the morphological operations of each sublexicon. Because of this property, assigning a probability to a sublexicon is equivalent to assigning a probability to its corresponding derivative candidate.
However, the price of this system is that a sublexicon must be completely homogeneous in terms of its morphological operations. This strict homogeneity requirement means that lexemes which differ morphologically in even a single respect (i.e. a single cell) must belong to separate sublexicons.

There are two inter-related negative consequences of this property of sublexical morphology models. The first is that sublexical morphology’s intuitively overzealous partitioning of the lexicon could make it difficult for gatekeeper grammars to accurately assess the characteristic phonological properties of each sublexicon, weakening the gatekeepers’ empirical accuracy. For example, in an extreme case, there might be two large classes of nouns which take the same morphological exponents (same within class, different between classes) except in one particular cell, in which each lexeme in the two classes has its own idiosyncratic exponent. In such a situation, there would need to be a sublexicon of size one (i.e. with only one associated lexeme) for every lexeme. Because a gatekeeper grammar’s constraint weights are calculated by comparing the forms in its sublexicon to all other forms, this proliferation of sublexicons would in all likelihood prevent the general phonological properties distinguishing the two greater classes from being effectively encoded in the constraint weights.

This problem mirrors a common tension in descriptive morphology: whether two lexemes which are mostly morphologically homogeneous ought to be considered members of the same or separate inflectional classes. Ambiguous cases abound, including Latin 3rd conjugation verbs vs. “3rd -io” verbs, and how Spanish verbs exhibiting diphthongization should be distinguished from those without. I take the long-standing difficulty of this problem as evidence that there may be no simple solution (although see e.g. Brown & Hippisley 2012 for work in this direction).
Even so, one could conceive of a version of sublexical morphology in which there is a “soft” requirement of morphological homogeneity in a sublexicon, so that the grammar can include fewer sublexicons (and therefore, perhaps, more useful constraint weights) at the cost of guessing randomly (according, for example, to lexical frequencies) when needing to select an exponent for the heterogeneous cells merged into a single sublexicon. A more complex but potentially more powerful solution might treat sublexicon membership hierarchically, so that lexemes that are morphologically identical in some cells are treated as belonging to the same sublexicon at some level of a hierarchy, but are then split into separate sublexicons at a lower level due to differences in other cells. In such a system, weighted constraint violations could be summed along each path down the hierarchy to evaluate sublexicon probabilities.

Sublexical morphology’s lack of phonology-driven processes to derive different surface realizations of forms in the same sublexicon exacerbates this problem. For example, sublexical morphology has no mechanism for using English phonotactics to derive the regular [-s], [-z], and [-Iz] plurals from the same sublexicon. Although the appeal of sublexical morphology stems in large part from its lack of a need to learn global phonotactics and abstract underlying representations, it is possible that some limited abstraction of stored forms could help unify sublexicons. This approach resembles the “bundling” process described by Moore-Cantwell & Staubs (2014), which could be generalized from modeling pairs of cells to modeling entire inflectional systems much as sublexical morphology has generalized sublexical phonology (Allen & Becker, in review; Gouskova & Newlin-Łukowicz, 2013).

The second, related consequence of the homogeneous sublexicons requirement is that the problem of sublexicon proliferation described above becomes more serious as the number of cells being modeled
increases. As the number of cells increases, the odds that two lexemes will be (perhaps undesirably) split into separate sublexicons increase commensurately. This is one reason that I have limited my discussions so far to relatively small inflectional systems and parts of inflectional systems. The solutions described above may help alleviate this problem as well.

However, truly massive inflectional systems, including “agglutinative” inflectional systems, pose further problems for sublexical morphology. From the standpoint of learnability, because every pair of cells’ base sublexicon divisions must be learned before paradigm sublexicons can be inferred, as the number of cells $n$ increases, the time it takes to learn the inflectional system’s paradigm sublexicons increases at a rate of at least $n^2$. Additionally, sublexical morphology treats each cell as equally distinct from each other cell, and it therefore has no way of recognizing similarities among cells that may be useful. For example, the operation that maps Japanese past tense forms onto past conditional forms (concatenation of [-ra]) is the same regardless of whether a verb’s polarity is affirmative or negative, or whether or not it is a conditional verb, etc.

I can think of no way of surmounting these issues within the general assumptions of sublexical morphology except by learning groupings of cells. For example, the learning algorithm could discover the identity relationship described in the previous paragraph, and could thereafter treat all past tense cells as a single “super-cell” for purposes of deriving past conditional forms. Such consolidation could proceed, for example, by finding pairs of cells with only one base sublexicon between them bidirectionally. However, this approach would still require that base sublexicons be learned for every pair of cells.
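The quadratic growth in learning effort follows from simple counting. The Python sketch below enumerates ordered (base, derivative) cell pairs, on the assumption that each direction needs its own learning pass, as the two rows of Figure 5.2 suggest for the two-cell Latin case. The cell labels are illustrative only.

```python
from itertools import permutations


def ordered_cell_pairs(cells):
    """All ordered (base, derivative) pairs of distinct cells."""
    return list(permutations(cells, 2))


# Hypothetical four-cell sub-paradigm.
cells = ["NomSg", "GenSg", "DatSg", "AccSg"]
pairs = ordered_cell_pairs(cells)
```

A system of $n$ cells has $n(n-1)$ such pairs, so four cells already require twelve passes and ten cells require ninety. Any grouping of cells into "super-cells" helps only to the extent that it shrinks $n$ before the pairs are enumerated.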
As a potential workaround, the grammar might initially treat all cells as belonging to the same “super-cell” and only split off cells when required by data encountered by the learner.

Finally, the procedures for learning sublexicons that I have described in this dissertation have generally assumed that all inflected forms (within the part of an inflectional system being learned) are present in the training data for all lexemes. The need for this unrealistic assumption stems from the fact that when only a subset of inflected forms are available, it may be unclear which paradigm sublexicon(s) a lexeme in the training data belongs to. A more sophisticated learning algorithm would need to deal with this ambiguity, perhaps by allowing lexemes to be associated probabilistically with as many sublexicons as appropriate. Calculation of derivative probabilities would therefore require marginalization across sublexicon membership probabilities.

Bibliography

ACKERMAN, FARRELL, JAMES P. BLEVINS, & ROBERT MALOUF. 2009. Parts and wholes: Patterns of relatedness in complex morphological systems and why they matter. In Analogy in grammar: Form and acquisition, 54–82. Oxford: Oxford University Press. → pages 3
——, & ROBERT MALOUF. 2013. Morphological organization: The low conditional entropy conjecture. Language 89.429–464. → pages 139, 140
ALARCOS LLORACH, EMILIO. 1994. Gramática de la lengua española, volume 61. Madrid: Espasa Calpe. → pages 2, 13
ALBRIGHT, ADAM, 2002. The identification of bases in morphological paradigms. University of California, Los Angeles dissertation. → pages 21, 52, 53, 55, 57, 58, 103, 130, 134, 135, 138
——. 2003. A quantitative study of Spanish paradigm gaps. In West Coast Conference on Formal Linguistics 22 Proceedings, ed. by G. Garding & M. Tsujimura, 1–14, Somerville. Cascadilla. → pages 138
——. 2008. Explaining universal tendencies and language particulars in analogical change. In Language universals and language change, ed. by Jeff Good, 144–181.
Oxford: Oxford University Press. → pages 53, 57, 58, 134
——. 2009. Lexical and morphological conditioning of paradigm gaps. In Modeling Ungrammaticality in Optimality Theory, ed. by Curt Rice & Sylvia Blaho. London: Equinox. → pages 138
——. 2010. Base-driven leveling in Yiddish verb paradigms. Natural Language & Linguistic Theory 28.475–537. → pages 134
——, & BRUCE HAYES. 2002. Modeling English past tense intuitions with minimal generalization. In Proceedings of the ACL-02 workshop on morphological and phonological learning, volume 6, 58–69. Association for Computational Linguistics. → pages 9, 49, 58
——, & BRUCE HAYES. 2003. Rules vs. analogy in English past tenses: A computational/experimental study. Cognition 90.119–161. → pages 58
——, & BRUCE HAYES. 2011. Learning and learnability in phonology. In The handbook of phonological theory, ed. by John Goldsmith, Jason Riggle, & Alan Yu, 661–690. Hoboken: Wiley-Blackwell. → pages 10
ALLEN, BLAKE, & MICHAEL BECKER, in review. Learning alternations from surface forms with sublexical phonology. → pages 9, 17, 24, 30, 41, 42, 43, 49, 141
ANDERSON, STEPHEN R. 1992. A-morphous morphology. Cambridge: Cambridge University Press. → pages 5, 47
ARCHANGELI, DIANA, & DOUG PULLEYBLANK. 2012. Emergent phonology: evidence from English. In Issues in English linguistics, ed. by Ik-Hwan Lee, Young-Se Kang, Kyoung-Ae Kim, Kee-Ho Kim, Il-Kon Kim, Seong-Ha Rhee, Jin-Hyung Kim, Hyo-Young Kim, Ki-Jeong Lee, Kye-Kyung Kang, & Sung-Ho Ahn, 1–26. Seoul: Hankookmunhwasa. → pages 10
ÁRNASON, MÖRÐUR. 2007. Íslensk orðabók. 4th edition. Reykjavík: . → pages 63
ARONOFF, MARK. 1994. Morphology by itself: Stems and inflectional classes. Number 22 in Linguistic Inquiry Monographs. Cambridge: MIT press. → pages 5, 47
BARR, DALE J, ROGER LEVY, CHRISTOPH SCHEEPERS, & HARRY J TILY. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of memory and language 68.255–278. → pages 74
BATCHELDER, ELEANOR OLDS. 1999. Rule or rote?
Native-speaker knowledge of Japanese verb inflection. In Proceedings of the Second International Conference on Cognitive Science, 141–146. → pages 7, 128
BATES, DOUGLAS, REINHOLD KLIEGL, SHRAVAN VASISHTH, & HARALD BAAYEN. 2015a. Parsimonious mixed models. arXiv preprint arXiv:1506.04967. → pages 74
——, MARTIN MÄCHLER, BEN BOLKER, & STEVE WALKER. 2015b. Fitting linear mixed-effects models using lme4. Journal of statistical software 67.1–48. → pages 73, 96
BEARD, ROBERT. 1995. Lexeme-morpheme base morphology: a general theory of inflection and word formation. Albany: SUNY Press. → pages 5, 47
BECKER, M., A. NEVINS, & J. LEVINE. 2012. Asymmetries in generalizing alternations to and from initial syllables. Language 88.231–268. → pages 7
BECKER, MICHAEL, & MARIA GOUSKOVA, 2013. Source-oriented generalizations as grammar inference in Russian vowel deletion. Ms. lingbuzz/001622. → pages 7, 17
——, & JONATHAN LEVINE, 2012. Experigen - an online experiment platform. Available at https://github.com/tlozoot/experigen. → pages 67, 89
BERKO, J., 1958. The child’s learning of English morphology. Radcliffe College dissertation. → pages 8, 51, 111, 131
BICKEL, PETER J, BO LI, ALEXANDRE B TSYBAKOV, SARA A VAN DE GEER, BIN YU, TEÓFILO VALDÉS, CARLOS RIVERO, JIANQING FAN, & AAD VAN DER VAART. 2006. Regularization in statistics. Test 15.271–344. → pages 108
BISHOP, CHRISTOPHER M. 2006. Pattern recognition and machine learning. New York: Springer. → pages 46
BJARNADÓTTIR, KRISTÍN. 2012. The Database of Modern Icelandic Inflection (Beygingarlýsing íslensks nútímamáls). In Proceedings of Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8 / AfLaT 2012), ed. by Guy De Pauw, Gilles-Maurice de Schryver, Mikel L. Forcada, Kepa Sarasola, Francis M. Tyers, & Peter W. Wagacha, 13–18, Istanbul. European Language Resources Association (ELRA). → pages 63, 64, 65, 69, 111, 113, 114
BLEVINS, JAMES. 2006. Word-based morphology. Journal of Linguistics 42.531–573.
→ pages 47
BONAMI, OLIVIER, & GILLES BOYÉ. 2007. French pronominal clitics and the design of Paradigm Function Morphology. In Proceedings of the Fifth Mediterranean Morphology Meeting, 291–322, Bologna. → pages 47
BROWN, DUNSTAN, & ANDREW HIPPISLEY. 2012. Network morphology: A defaults-based theory of word structure. Cambridge: Cambridge University Press. → pages 9, 37, 49, 141
CHILDS, G. TUCKER. 2003. An introduction to African languages. Amsterdam: John Benjamins Publishing. → pages 24
CHOMSKY, N., & M. HALLE. 1968. The sound pattern of English. New York: Harper & Row. → pages 4
CHOMSKY, NOAM. 1956. Three models for the description of language. IRE transactions on information theory 2.113–124. → pages 4
——. 1957. Syntactic structures. The Hague/Paris: Mouton. → pages 4
——. 1995. The minimalist program. Cambridge: MIT Press. → pages 4
COLEMAN, JOHN, & JANET PIERREHUMBERT. 1997. Stochastic phonological grammars and acceptability. arXiv preprint cmp-lg/9707017. → pages 4
DALAND, ROBERT. 2015. Long words in maximum entropy phonotactic grammars. Phonology 32.353–383. → pages 25
DREYER, MARKUS, & JASON EISNER. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 616–627, Edinburgh. Supplementary material (9 pages) also available. → pages 50
EDDINGTON, DAVID, REBECCA TREIMAN, & DIRK ELZINGA. 2013. The syllabification of American English: Evidence from a large-scale experiment part I. Journal of quantitative linguistics 20.75–93. → pages 7
EINARSSON, STEFÁN. 1949. Icelandic: grammar, text and glossary. Baltimore/London: The Johns Hopkins University Press. → pages 61
FRIEDMAN, JEROME, TREVOR HASTIE, SAHARON ROSSET, ROBERT TIBSHIRANI, & JI ZHU. 2004. Discussion of boosting papers. Annals of statistics 32.102–107.
→ pages 108
GALES, MARK JF, KATE M KNILL, ANTON RAGNI, & SHAKTI P RATH, 2014. Speech recognition and keyword spotting for low resource languages: Babel project research at CUED. → pages 5
GOLDWATER, SHARON, & MARK JOHNSON. 2003. Learning OT constraint rankings using a maximum entropy model. In Proceedings of the Stockholm workshop on variation within Optimality Theory, ed. by J. Spenader, A. Eriksson, & Ö. Dahl, 111–120, Stockholm. Department of Linguistics. → pages 10, 24, 30, 45, 132
GOUSKOVA, MARIA, & LUIZA NEWLIN-ŁUKOWICZ, 2013. Phonological selectional restrictions as sublexical phonotactics. Manuscript. → pages 9, 17, 41, 49, 141
GRINER, BARRY DAVID, 2001. Productivity of Japanese Verb Tense Inflection: A Case Study. University of California, Los Angeles dissertation. → pages 128
HANSSON, GUNNAR ÓLAFUR. 1999. ‘When in doubt...’: intraparadigmatic dependencies and gaps in Icelandic. In Proceedings of the 30th Meeting of the North East Linguistic Society, volume 29, 105–120. → pages 138
HANSSON, GUNNAR ÓLAFUR. 2006. Málfræðirannsóknir á öld upplýsingatækninnar – lítil reynslusaga. Lesið í hljóði fyrir Kristján Árnason sextugan 26. desember 2006. → pages 69
HAYES, B., & Z.C. LONDE. 2006. Stochastic phonological knowledge: the case of Hungarian vowel harmony. Phonology 23.59–104. → pages 7
——, & C. WILSON. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39.379–440. → pages 4, 10, 16, 24, 30, 45, 86, 132
HAYES, BRUCE. 2011. Interpreting sonority-projection experiments: the role of phonotactic modeling. In Proceedings of the 17th International Congress of Phonetic Sciences, 835–838, Hong Kong. City University of Hong Kong. → pages 108
——. To appear. Comparative phonotactics. Proceedings of the 50th Meeting of the Chicago Linguistic Society. → pages 28
——, PÉTER SIPTÁR, KIE ZURAW, & ZSUZSA LONDE. 2009. Natural and unnatural constraints in Hungarian vowel harmony. Language 85.822–863. → pages 7
——, & JAMES WHITE. 2013.
Phonological naturalness and phonotactic learning. Linguistic Inquiry 44.45–75. → pages 10
HLAVAC, MAREK. 2013. stargazer: LaTeX code and ASCII text for well-formatted regression and summary statistics tables. URL: http://CRAN.R-project.org/package=stargazer . → pages 76, 98, 100
HONRUBIA, J.L.C., J.L. CIFUENTES, & S.R. ROSIQUE. 2011. Spanish Word Formation and Lexical Creation. IVITRA research in linguistics and literature. Amsterdam: John Benjamins Publishing Company. → pages 13
HUDSON KAM, CARLA, & ELISSA NEWPORT. 2005. Regularizing unpredictable variation: The roles of adult and child learners in language formation and change. Language learning and development 1.151–195. → pages 4, 10
JAROSZ, GAJA. 2005. Polish yers and the finer structure of output-output correspondence. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, volume 31, 181–192, Berkeley. University of California Press. → pages 90
JESNEY, KAREN, & ANNE-MICHELLE TESSIER. 2009. Gradual learning and faithfulness: consequences of ranked vs. weighted constraints. In Proceedings of the North East Linguistic Society 38, ed. by Anisa Schardl, Martin Walkow, & Muhammad Abdurrahman, Amherst. GLSA. → pages 10
JI, HENG, JOEL NOTHMAN, BEN HACHEY, & RADU FLORIAN, 2014. Overview of TAC-KBP2015 tri-lingual entity discovery and linking. Procedural Text Analysis Conference (TAC2015). → pages 5
JÓNSDÓTTIR, MARGRÉT. 1989. Um ir- og ar-fleirtölu einkvæðra kvenkynsorða í íslensku. Íslenskt mál 10–11.57–83. → pages 69
——. 1993. Um ar- og ir-fleirtölu karlkynsnafnorða í nútímaíslensku. Íslenskt mál 15.77–98. → pages 69
KARTTUNEN, LAURI, & KENNETH R. BEESLEY. 2005. Twenty-five years of finite-state morphology. In Inquiries into Words, a Festschrift for Kimmo Koskenniemi on his 60th Birthday, ed. by Antti Arppe, Lauri Carlson, Krister Lindén, Jussi Piitulainen, Mickael Suominen, Martti Vainio, Hanna Westerlund, Anssi Yli-Jyrä, & Juno Tupakka, 71–83. Stanford: CSLI. → pages 24
KAWAHARA, SHIGETO. 2011.
Experimental approaches in theoretical phonology. The Blackwell Companion to Phonology. → pages 8, 128
——, 2016. Psycholinguistic methodology in phonological research. Pre-print version for publication by Oxford Bibliography Online. → pages 8, 128
KNOKE, DAVID, & PETER BURKE. 1980. Log-linear models. Number 20 in Quantitative applications in the social sciences. Thousand Oaks: Sage. → pages 25
KRESS, BRUNO. 1982. Isländische Grammatik. Leipzig: Enzyklopädie Leipzig. → pages 61
KULLBACK, S., & R. A. LEIBLER. 1951. On information and sufficiency. Annals of Mathematical Statistics 22.79–86. → pages 118
LEGENDRE, G., Y. MIYATA, & P. SMOLENSKY. 1990. Harmonic grammar: A formal multi-level connectionist theory of linguistic well-formedness: Theoretical foundations. In Proceedings of the twelfth annual conference of the Cognitive Science Society, 388–395. Cambridge: Lawrence Erlbaum. → pages 10
LEWIS, DAVID D. 1998. Naïve (bayes) at forty: The independence assumption in information retrieval. In Machine learning: ECML-98, 4–15. Springer. → pages 47, 85
MALOUF, ROB, & FARRELL ACKERMAN. 2010. Paradigm entropy as a measure of morphological simplicity. In Proceedings from the Workshop on Morphological Complexity, Harvard. HUP. → pages 3
MATTHEWS, PETER HUGOE. 1972. Inflectional morphology: A theoretical study based on aspects of Latin verb conjugation, volume 6. CUP Archive. → pages 5, 47
MCCARTHY, J.J., & A. PRINCE. 1993. Prosodic morphology I: constraint interaction and satisfaction. Technical report, Rutgers University. → pages 4, 9
MCMULLIN, KEVIN JAMES, 2016. Tier-based locality in long-distance phonotactics: learnability and typology. University of British Columbia dissertation. → pages 10
MOORE-CANTWELL, CLAIRE, & ROBERT STAUBS. 2014. Modeling morphological subgeneralizations. In Proceedings of the Annual Meetings on Phonology, volume 1. → pages 141
MORETON, ELLIOTT, & JOE PATER. 2012. Structure and substance in artificial-phonology learning, part I: Structure.
Language and linguistics compass 6.686–701. → pages 10
MORTENSEN, DAVID, KARTIK GOYAL, SWABHA SWAYAMDIPTA, PATRICK LITTELL, ALEXA LITTLE, LORI LEVIN, & CHRIS DYER, in review. Unorthodox resource use allows rapid development of NER systems for ‘low-resource’ languages. → pages 5
MÜLLER, GEREON. 2005. Syncretism and iconicity in Icelandic noun declensions: A distributed morphology approach. In Yearbook of Morphology 2004, ed. by Geert Booij & Jaap van Maarle, 229–271. → pages 61
PATER, JOE. 2009. Weighted constraints in generative linguistics. Cognitive Science 33.999–1035. → pages 10, 25
——, & ANNE-MICHELLE TESSIER. 2003. Phonotactic knowledge and the acquisition of alternations. In Proceedings of the 15th International Congress on Phonetic Sciences, Barcelona. Universitat Autònoma de Barcelona. → pages 4
PRINCE, ALAN, & PAUL SMOLENSKY. 2008. Optimality Theory: Constraint interaction in generative grammar. Hoboken: Wiley-Blackwell. → pages 4, 9, 25
PRZEPIÓRKOWSKI, ADAM, RAFAL GÓRSKI, MAREK ŁAZIŃSKI, & PIOTR PĘZIK. 2010. Recent developments in the National Corpus of Polish. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), ed. by Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, & Daniel Tapias, Valletta, Malta. European Language Resources Association (ELRA). → pages 91
R CORE TEAM, 2013. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. → pages 73, 78, 96, 117
ROBERT, CHRISTIAN. 2007. The Bayesian choice: from decision-theoretic foundations to computational implementation. New York: Springer Verlag. → pages 117
SALONI, ZYGMUNT, WŁODZIMIERZ GRUSZCZYŃSKI, MARCIN WOLIŃSKI, & ROBERT WOŁOSZ. 2007. Grammatical dictionary of Polish. Studies in Polish linguistics 4.5–25. → pages 121
SCHEER, TOBIAS. 2012. Variation is in the lexicon: yer-based and epenthetic vowel-zero alternations in Polish.
Sound, structure and sense. Studies in memory of Edmund Gussmann. 631–672. → pages 90
SCHENKER, ALEXANDER M. 1955. Gender categories in Polish. Language 31.402–408. → pages 87
SIMS, ANDREA D, 2006. Minding the Gaps: inflectional defectiveness in a paradigmatic theory. Ohio State University dissertation. → pages 138, 139
SPENCER, ANDREW. 1991. Morphological theory: An introduction to word structure in generative grammar. Hoboken: Wiley-Blackwell. → pages 5, 47
STUMP, GREGORY. 2001. Inflectional morphology: a theory of paradigm structure. Cambridge: Cambridge University Press. → pages 47
——, & RAPHAEL A FINKEL. 2013. Morphological typology: From word to paradigm, volume 138. Cambridge: Cambridge University Press. → pages 6, 55
TSUJIMURA, NATSUKO, & STUART DAVIS. 2011. A construction approach to innovative verbs in Japanese. Cognitive Linguistics 22.799–825. → pages 14
VAN ROSSUM, GUIDO, & FRED JR. DRAKE. 1995. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam. → pages 12
VANCE, TIMOTHY. 1991. A new experimental study of Japanese verb morphology. Journal of Japanese linguistics 13.145–156. → pages 128
VANCE, TIMOTHY J. 1987. An introduction to Japanese phonology. Albany, NY: State University of New York Press. → pages 128
WILSON, C. 2006. Learning phonology with substantive bias: an experimental and computational study of velar palatalization. Cognitive science 30.945–982. → pages 24, 30, 45, 108
WOLIŃSKI, MARCIN, MARCIN MIŁKOWSKI, MACIEJ OGRODNICZUK, ADAM PRZEPIÓRKOWSKI, & ŁUKASZ SZAŁKIEWICZ. 2012. PoliMorf: a (not so) new open morphological dictionary for Polish. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), ed. by Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, & Stelios Piperidis, Istanbul, Turkey. European Language Resources Association (ELRA). → pages 121, 122, 123
WYLLYS, RONALD E. 1981.
Empirical and theoretical bases of Zipf’s law. Library trends 30.53–64. → pages 6
ZWICKY, ARNOLD. 1985. How to describe inflection. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, ed. by Mary Niepokuj, Mary VanClay, Vassiliki Nikiforidou, & Deborah Feder, volume 11, Berkeley. University of California Press. → pages 5, 47

Appendix: supplementary materials

Icelandic experiment frame sentences

1. Jón safnar [DatPl].
2. Í gær bað Jón mig að gæta [GenSg] sem hann hafði fundið.
3. [NomSg] eru uppáhaldið hans Jóns.
4. Jón fann sex [AccPl] til viðbótar í dag.

Translations:

1. Jón collects [DatPl].
2. Yesterday, Jón asked me to take care of the [GenSg] he found.
3. [NomSg] are Jón’s favorite.
4. Jón found six more [AccPl] today.

Icelandic experiment stimuli

In the Stem V column, fu indicates that the stem vowel is a front unrounded vowel, and other indicates that the stem vowel is not a front unrounded vowel.

Stem    DatPl Stem  AccPl  DatPl  GenSg  NomPl  Stem V
strep   strep       a      um     s      ar     fu
nep     nep         ir     um     ar     ir     fu
:ret    :ret        ar     um     ar     ar     fu
pet     pet         i      um     s      ir     fu
hrem    hrem        ir     um     ar     ir     fu
blen    blen        ar     um     ar     ar     fu
sken    sken        i      um     s      ir     fu
stem    stem        a      um     s      ar     fu
gleit   gleit       ar     um     ar     ar     fu
speit   speit       i      um     s      ir     fu
freip   freip       a      um     s      ar     fu
heip    heip        ir     um     ar     ir     fu
splein  splein      i      um     s      ir     fu
skeim   skeim       a      um     s      ar     fu
neim    neim        ir     um     ar     ir     fu
:ein    :ein        ar     um     ar     ar     fu
:ap     :öp         a      um     s      ar     other
tvap    tvöp        ir     um     ar     ir     other
sprat   spröt       ar     um     ar     ar     other
vat     vöt         i      um     s      ir     other
flam    flöm        ir     um     ar     ir     other
tjan    tjön        ar     um     ar     ar     other
san     sön         i      um     s      ir     other
gam     göm         a      um     s      ar     other
mjót    mjót        ar     um     ar     ar     other
skrót   skrót       i      um     s      ir     other
klóp    klóp        a      um     s      ar     other
hnóp    hnóp        ir     um     ar     ir     other
:rón    :rón        i      um     s      ir     other
stróm   stróm       a      um     s      ar     other
kvóm    kvóm        ir     um     ar     ir     other
hón     hón         ar     um     ar     ar     other

Icelandic experiment demographic questionnaire (translation)

1. Is Icelandic your native language? [yes/no]
2. Have you taken a course in Icelandic grammar or linguistics at university? [yes/no]
3. What is your gender? [male/female/other/prefer not to respond]
4. When were you born? [ranges from 1937 to 1997/prefer not to respond]
5. If you have any questions or comments, please write them here. [text field]

Polish experiment frame sentences

1. W sklepie z zabawkami, Małgosia przyglądała się kolorowym [DatPl].
2. Naprawiłem ramię [GenSg], które odgryzł mój pies.
3. Na półce w pokoju Jasia stało dużo zakurzonych [GenPl].
4. [NomPl] to ulubione zabawki Jasia.

Translations:

1. At the toy store, Mary was looking at the colorful [DatPl].
2. I fixed the arm of the [GenSg] that my dog bit off.
3. On the shelf of Johnny’s room were a lot of dusty [GenPl].
4. [NomPl] are Johnny’s favorite toys.

Polish experiment stimuli

Stem     Gender  V-Stem    DatPl  GenSg  GenPl  NomPl
gęgiń    neut    gęgini    om     a      ∅      a
kesiń    masc    kesini    om     a      y      e
ząziń    fem     zązini    om     y      ∅      e
muriś    neut    murisi    om     a      ∅      a
jubiś    masc    jubisi    om     a      y      e
ciliś    fem     cilisi    om     y      ∅      e
cągiź    neut    cągizi    om     a      ∅      a
nepiź    masc    nepizi    om     a      y      e
dąriź    fem     dąrizi    om     y      ∅      e
myfić    neut    myfici    om     a      ∅      a
żecić    masc    żecici    om     a      y      e
homić    fem     homici    om     y      ∅      e
zązyń    neut    zązyni    om     a      ∅      a
gezyń    masc    gezyni    om     a      y      e
comyń    fem     comyni    om     y      ∅      e
cołyś    neut    cołysi    om     a      ∅      a
wycyś    masc    wycysi    om     a      y      e
zakyś    fem     zakysi    om     y      ∅      e
notyź    neut    notyzi    om     a      ∅      a
zepyź    masc    zepyzi    om     a      y      e
nusyź    fem     nusyzi    om     y      ∅      e
dubyć    neut    dubyci    om     a      ∅      a
logyć    masc    logyci    om     a      y      e
mymyć    fem     mymyci    om     y      ∅      e
lyzoń    neut    lyzoni    om     a      ∅      a
zażoń    masc    zażoni    om     a      y      e
puchoń   fem     puchoni   om     y      ∅      e
pecoś    neut    pecosi    om     a      ∅      a
zyzoś    masc    zyzosi    om     a      y      e
cimoś    fem     cimosi    om     y      ∅      e
żópoź    neut    żópozi    om     a      ∅      a
wukoź    masc    wukozi    om     a      y      e
lagoź    fem     lagozi    om     y      ∅      e
zęloć    neut    zęloci    om     a      ∅      a
kęcoć    masc    kęcoci    om     a      y      e
ryboć    fem     ryboci    om     y      ∅      e
cybań    neut    cybani    om     a      ∅      a
cumań    masc    cumani    om     a      y      e
nimań    fem     nimani    om     y      ∅      e
hulaś    neut    hulasi    om     a      ∅      a
fątaś    masc    fątasi    om     a      y      e
łenaś    fem     łenasi    om     y      ∅      e
rytaź    neut    rytazi    om     a      ∅      a
gechaź   masc    gechazi   om     a      y      e
nąbaź    fem     nąbazi    om     y      ∅      e
rucać    neut    rucaci    om     a      ∅      a
pofać    masc    pofaci    om     a      y      e
łynać    fem     łynaci    om     y      e      ∅

Polish experiment demographic questionnaire (translation)

1. Is Polish your native language? [yes/no]
2. Have you taken a course in Polish grammar or linguistics at university? [yes/no]
3. What is your gender? [male/female/other/prefer not to respond]
4. When were you born? [ranges from 1937 to 1997/prefer not to respond]
5. If you have any questions or comments, please write them here. [text field]
Item Metadata
Title | Bayesian models of learning and generating inflectional morphology |
Creator | Allen, Blake H. |
Publisher | University of British Columbia |
Date Issued | 2016 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2016-10-13 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International |
DOI | 10.14288/1.0319124 |
URI | http://hdl.handle.net/2429/59429 |
Degree | Doctor of Philosophy - PhD |
Program | Linguistics |
Affiliation | Arts, Faculty of Linguistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
AggregatedSourceRepository | DSpace |