The online linguistic database: software for linguistic fieldwork. Dunham, Joel Robert William. 2014.

The Online Linguistic Database
Software for Linguistic Fieldwork

by

Joel Robert William Dunham

B.A., The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate and Postdoctoral Studies
(Linguistics)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

December 2014

© Joel Robert William Dunham 2014

Abstract

The documentation and analysis of endangered languages is a core component of the linguistic endeavour. Language consultants and linguistic researchers collaborate to generate a variety of data which in turn fuel theoretical discovery and language revitalization. This dissertation describes and evaluates a piece of software designed to facilitate new, and enhance existing, collaboration, documentation, and analysis. But beyond this, it argues for the value of a certain methodological approach to linguistics broadly construed, one in which computation is key and where provisions are made for collaboration, data-sharing and data reuse.

The Online Linguistic Database (OLD) is open source software for creating web applications that facilitate collaborative linguistic fieldwork. The OLD allows fieldworkers to continue doing what they are already doing (eliciting, transcribing, recording, and analyzing forms and creating data sets and papers with them) but collaboratively. This point should not be understated: though practices are changing, linguistic fieldwork currently involves a loose network of relatively isolated practitioners and data sets; simply creating the infrastructure for collaboration and data-sharing is half the battle. The other half is creating features and conveniences that make the software worth using. In this domain, the OLD provides automated feedback on lexical consistency of morphological analyses, sophisticated search, the creation and (structural) searching of arbitrarily many corpora and treebanks, and the specification and computational implementation of models of the lexicon, phonology, and morphology, upon which are built practical morphological parsers.

The dissertation describes the OLD, motivating its design decisions and arguing that it has the potential to contribute positively to the achievement of the three core goals of linguistic fieldwork, namely documentation, research, and language revitalization. Particular attention is paid to the practical and research-related advantages of the morphophonological modelling capability with examples and evaluations of morphological parsers created for the Blackfoot language.

Preface

This dissertation consists of original and independent work by the author, Joel Dunham. The data set described in chapter four contains data gathered during fieldwork with speakers of the Blackfoot language which was carried out by the author under UBC Ethics Certificate number H10-02768 as well as by a number of other researchers whose contributions are acknowledged at relevant locations in the text. The content of this dissertation is summarized in Dunham, J., Cook, G., and Horner, J. (2014), LingSync & the Online Linguistic Database: New models for the collection and management of data for language communities, linguists and language learners, ACL 2014. Apart from the small amount of overlap with that publication, this dissertation consists of wholly unpublished work.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction
  1.1 The argument
    1.1.1 Endangered languages are valuable
    1.1.2 Fieldwork
    1.1.3 Better fieldwork
    1.1.4 The Online Linguistic Database
  1.2 Summary
2 Fieldwork
  2.1 Overview of the OLD
    2.1.1 History
    2.1.2 Strategy
    2.1.3 Description
    2.1.4 Implementation
  2.2 Other fieldwork tools
    2.2.1 SIL International
    2.2.2 Toolbox
    2.2.3 FLEx
    2.2.4 LingSync
    2.2.5 Summary
  2.3 Features
    2.3.1 Sharing data
    2.3.2 Architecture
    2.3.3 Open source & web-based
    2.3.4 Data structure
    2.3.5 Interface
    2.3.6 Search
    2.3.7 Automation
    2.3.8 Consistency
    2.3.9 Dictionary
    2.3.10 Documents
    2.3.11 Access
    2.3.12 Documentation
  2.4 Summary
3 Morphological Parser Creator
  3.1 Architecture
  3.2 Finite-state machines & grammars
  3.3 Phonologies
  3.4 Morphologies
  3.5 Morphophonologies
  3.6 N-gram language models
  3.7 Summary
4 Blackfoot
  4.1 Description
  4.2 The Blackfoot data set
    4.2.1 Morphemes
    4.2.2 Morphologically well analyzed words
  4.3 Issues in the data set
    4.3.1 Morpheme categorization
    4.3.2 Variability in morphological analyses
    4.3.3 Orthographic inconsistencies
  4.4 Summary
5 Morphological Parsers for Blackfoot
  5.1 Parser creation and evaluation procedures
  5.2 Parser 1
    5.2.1 Phonology
    5.2.2 Morphology
    5.2.3 Language Model
    5.2.4 Results
    5.2.5 Discussion
  5.3 Parser 2
    5.3.1 Phonology
    5.3.2 Morphology
    5.3.3 Language Model
    5.3.4 Results
    5.3.5 Discussion
  5.4 Discussion
6 Conclusion
Bibliography
Appendix
A OLD 1.0 Data Structure
  A.1 Relational databases
  A.2 OLD objects
    A.2.1 Application settings
    A.2.2 Collection
    A.2.3 Collection backup
    A.2.4 Elicitation method
    A.2.5 File
    A.2.6 Form
    A.2.7 Form backup
    A.2.8 Form search
    A.2.9 Language
    A.2.10 Orthography
    A.2.11 Page
    A.2.12 Source
    A.2.13 Speaker
    A.2.14 Syntactic category
    A.2.15 Tag
    A.2.16 Translation
    A.2.17 User

List of Tables

2.1 Data in OLD applications (Feb 14, 2014)
2.2 Comparison of fieldwork software.
2.3 Features across OLD versions
3.1 Morphology types.
4.1 Morpheme categories of the Blackfoot dictionary (Frantz and Russell, 1995).
4.2 Sources of the Blackfoot data set.
4.3 Forms in the Blackfoot data set, grouped by category.
4.4 Morphologically well analyzed word sets in the Blackfoot data set.
4.5 20 most common well analyzed words in the Blackfoot data set.
4.6 Ten most common well analyzed verbals in the Blackfoot data set.
4.7 Ten most common well analyzed nominals in the Blackfoot data set.
4.8 20 most common category strings of well analyzed words in the Blackfoot data set.
5.1 20 most frequent trigrams in one of Parser 1’s training language models (LMs).
5.2 Parser 1 results.
5.3 20 most frequent category trigrams in one of Parser 2’s training LMs.
5.4 Parser 2 results.
A.1 BibTex source types and required attributes.

List of Figures

2.1 Dictionary-style representations targeted by FLEx (cf. SIL International, 2014a).
2.2 OLD UML diagram.
2.3 Screen shot of the Blackfoot OLD interface showing an IGT form display.
3.1 Schema of an OLD morphological parser.
3.2 Finite-state automaton (FSA) network diagram for a b* a.
3.3 Network diagram for a -> b || \b _.
3.4 Finite-state transducer (FST) network diagram for "-" -> s || k _ I ;.
3.5 Partial Blackfoot phonology.
3.6 Partial Blackfoot phonology with test declarations.
3.7 Morphology rewrite rule script examples.
3.8 Context-free (CF) grammar for a simple morphology.
3.9 Network diagram for l a | l e.
3.10 Network diagram for l a 0:"|the|D" | l e 0:"|the|D".
3.11 Network diagram for l a %| t h e %| D | l e %| t h e %| D.
3.12 Lexc morphology scripts.
3.13 Toy French phonology.
3.14 Toy French morphophonology.
3.15 Network diagram for the toy French morphophonology.
3.16 Toy French morphophonology with rurl morphology.
3.17 Simple LM corpus file (first four lines, repeated 100 times).
3.18 MITLM-generated ARPA LM file: order = 3, smoothing = ModKN
4.1 Blackfoot phoneme inventory (IPA).
4.2 Blackfoot phoneme inventory (orthographic).
4.3 Blackfoot vowel phones.
5.1 Parser 1 phonology: ordered phonological rules.
5.2 Phoneme class regular expressions.
5.3 Coalescence.
5.4 Semivowel loss.
5.5 Gemination.
5.6 s-connection.
5.7 y-reduction.
5.8 Breaking.
5.9 o-replacement.
5.10 ih-loss.
5.11 Presibilation.
5.12 sss-shortening.
5.13 Semivowel drop.
5.14 Vowel shortening.
5.15 t-affrication.
5.16 Postsibilation.
5.17 i-absorption.
5.18 Desyllabification.
5.19 Glottal metathesis.
5.20 Vowel epenthesis.
5.21 Glottal reduction.
5.22 Glottal loss.
5.23 Glottal assimilation.
5.24 Accent spread.
5.25 Break delete.
5.26 i-loss.
5.27 Parser 1 phonology: lexical phonological rules.
5.28 Irregular breaking.
5.29 Semivowel alternation.
5.30 Nasal-initial verbs.
5.31 Initial vowel elision in imperatives.
5.32 Codafication.
5.33 Stop-stop epenthesis.
5.34 Become /o/.
5.35 Become /a/.
5.36 Nasal loss.
5.37 Variable-length vowels.
5.38 3mm.
5.39 Vowel elision before the DTP suffixes.
5.40 Non-permanent consonants.
5.41 Diphthongization.
5.42 Initial change /-ay-/.
5.43 Initial change /-ii-/.
5.44 /-yi/ loss.
5.45 Inverse clipping.
5.46 Network diagram depicting the 50 most common category sequences in the morphology of Parser 1.
5.47 No accented vowels.
5.48 Shorten.

List of Acronyms

API: Application Programming Interface.
CF: context-free.
CRUD: Create, Read, Update, and Delete.
CS: context-sensitive.
CSV: Comma-Separated Values.
FSA: finite-state automaton.
FST: finite-state transducer.
GUI: Graphical User Interface.
HTML: Hypertext Markup Language.
HTTP: Hypertext Transfer Protocol.
IGT: interlinear glossed text.
IPA: International Phonetic Alphabet.
JSON: JavaScript Object Notation. A standard for converting a number of data structures to and from strings.
LM: language model.
NLTK: Natural Language Toolkit. A library of Python modules for analyzing and processing language data.
NoSQL: Variably unpacked as no SQL or not only SQL, NoSQL refers to a type of database wherein data are not modelled as relations between tables as they are in relational databases.
OCR: optical character recognition.
OLAC: the Open Language Archives Community.
SQL: Structured Query Language. A declarative language for querying relational databases.
URL: Uniform Resource Locator.
UUID: Universally Unique Identifier.
W3C: World Wide Web Consortium. The primary standardization body for web technologies and specifications.
Acknowledgements

I am fortunate to have the opportunity to acknowledge here the many people who have supported, challenged, taught and inspired me in the pursuit of this degree. Foremost amongst these is my partner and staunchest supporter Tracy Merry, who always encouraged me to persevere, who graced me with more patience than I deserve, and who suffered as only a civilian can through the proofreading of linguistic prose.

I am especially grateful to my committee for allowing and encouraging me to pursue a non-standard dissertation topic. Henry Davis, my supervisor, has an infectious passion for linguistic theory and endangered languages fieldwork and his overtly subversive sentiments are a siren call to a malcontent such as I. In our discussions about computational and collaborative approaches to fieldwork, he was ever supportive, insightful, and laden with the knowledge of experience. Lisa Matthewson is a fearfully impressive semanticist and fieldworker, a paragon whose abilities subtly encouraged me to assume an auxiliary role in those enterprises; her thorough and discerning feedback on my work was returned with marvelous speed. I am thankful for Alexandre Bouchard-Côté’s enthusiasm for my work and his guidance in aspects involving statistics, computational linguistics, and software development.

Though I have spent more time at UBC Linguistics than I sometimes care to admit, this extended tenure afforded the opportunity to learn from and be guided by two faculty members in particular. Martina Wiltschko is a dedicated and innovative teacher and prescient mentor. Without her initial encouragement I may never have begun this degree and if it were not for the vitality of the Blackfoot research group that she fostered and hosted I probably never would have been inspired to build collaborative fieldwork software. An audacious researcher and theoretician, her criticisms and challengings of my work have consistently pushed me to go further than I otherwise would have. I am also thankful to Hotze Rullmann for teaching me the basics of the only linguistic subfield that really means anything; in another life I may have trained as a semanticist under his clearheaded tutelage.

With respect to the dissertation proper, my two department-external examiners Fei Xia and Mark Turin deserve much thanks for feedback which resulted in significant improvements to its structure and clarity and to a tempering of otherwise hyperbolic claims. I am also notably appreciative of the efforts of Natalie Weber, a fellow student who went above and beyond in taking the time to read through an early draft and providing valuable feedback.

Gina Cook’s in-depth discussions with me concerning software for linguistic fieldwork have had and will continue to have a profound effect on my work in this area. James Crippen and Masaru Kiyota’s help in setting up web servers was critical. Notable contributors of OLD-related code, documentation, fieldwork data, and feedback are Patrick Littell, Meagan Louie, Rose-Marie Déchaine, Michael McAuliffe, Sara Johansson, Erin Guntly, and Michael Schwan.

An exhaustive list of the many other very fine linguists, fieldworkers, and all-round stellar individuals that I have had the good fortune to know and/or learn from would of necessity include John Lyon, Heather Bliss, Solveiga Armoskaite, Amelia Reis Silva, Gunnar Ólafur Hansson, Donald Derrick, Douglas Pulleyblank, Jeff Mühlbauer, Shujun (Seok Koon) Chin, Giuseppe Carenini, Guy Carden, Jennifer Glougie, Mark Scott, and Donald Frantz.

Last but certainly not least it is a pleasure to express my most sincere gratitude to my language consultants and teachers.
My generous Okanagan speakers and guides in the beautiful lands surrounding Douglas Lake (Lottie Lindley, Sarah McLeod, Nancy Saddleman, and Sharon Lindley) showed me a strength of community that I marvel at to this day. Finally, I am eternally grateful to Beatrice Bullshields for teaching me something of her language and culture, for inviting me to her home, and for introducing me to her family and friends. She taught me much about Blackfoot, self-respect, and resilience. Ever contemplative and patient, her influence on my life is probably greater than she knows.

For Orson and Phoebe.

Chapter 1

Introduction

The primary claim of this work is that the Online Linguistic Database (OLD)[1], open source software written by the author, facilitates and expedites linguistic fieldwork. The OLD supports collaboration and the sharing and reuse of data while providing features that promote consistency and allow for powerful search, the computational implementation and testing of models of the grammar, and the automation of morphological analysis creation. Independent of whether the OLD is the tool that best implements this feature set, the present work argues that the field of linguistics stands to benefit from using such tools, from supporting and contributing to their development, and from fostering critical discussion of the collaborative and interdisciplinary methodology that they entail. The endangered status of many languages at present brings an urgency to this exhortation: the more we can do with small and declining populations of fluent speakers, the better for all parties involved. As the present work seeks to convey, many potentially rewarding opportunities and challenges await the linguists, engineers, computer scientists, educators, and community language activists who venture into this domain. These efforts have the potential to bring about the documentation of aspects of particular languages (and language families), the forming of linguistically interesting generalizations and analyses, and valuable contributions to efforts in endangered language revitalization.

[1] The main page of the OLD is [URL missing]. The source code for the OLD web service (i.e., the OLD v. 1.0) can be found at [URL missing] and the source for the OLD web application (i.e., the OLD v. 0.2) can be found at [URL missing]. See chapter 2 for an explanation of these distinctions.

I do not presume the OLD to be the authority or the final word in how this approach to linguistics should take form. The OLD is a contribution to the evolution of linguistic fieldwork methodology. That said, the software under discussion is a valuable tool for those interested in pursuing said approach. In addition to being practical, it exhibits interesting design decisions and features that are informed by the integrated and distilled expertise of many linguistic fieldworkers. It may be used as an inspiration or a foil for similar projects and, as it is open source and modularized, it, or parts of it, may be incorporated into such projects.

The claim that the OLD facilitates and contributes to linguistic fieldwork is supported by objective scientific evaluation in one instance (i.e., in evaluating the morphological parser creator’s performance on Blackfoot), but for the most part the arguments are based on my own experience with the software (qua developer, fieldworker and theoretical linguist), anecdotal attestations of its value from users, feature comparison with similar tools, and a priori argumentation for the plausible value of its features and design. The primary original contribution to knowledge of my doctoral work, however, is the OLD itself. The main claim of this dissertation could, of course, be better supported with recourse to more objective and scientific evaluations. One evaluative approach would be to analyze user responses to questionnaires about the software.
Another would be to design controlled experiments to measure and compare the speed, accuracy, and thoroughness in the accomplishment of circumscribed fieldwork tasks of a variety of fieldwork methodologies, some including the OLD and others not. A final approach would involve real-world demonstrations of the value of the OLD’s search and modelling features to theoretical linguistic research (e.g., by writing a research paper with claims supported by data extracted via queries on OLD data sets) and the value of its data-sharing interfaces to language revitalization efforts (e.g., by creating prototypes of language learning software that make use of the OLD as a data resource).[2] However, since there is only so much that can be done in the finite interval that is a graduate program, I leave such extended evaluation for future work.

[2] See Bird et al. (2013) for an example of using this evaluative approach on a specific set of language documentation methodologies and technologies.

The dissertation is structured as follows. Chapter 2 argues for the effectiveness of the OLD as a tool for linguistic fieldwork by describing and evaluating its most valuable and innovative features and by comparing it to similar tools. Chapter 3 describes the morphological parser creator, an application added to the core feature set of the OLD. Chapter 4 briefly describes the Blackfoot language and details a data set of that language created using the OLD. Chapter 5 describes and evaluates the performance of two morphological parsers for Blackfoot that were built using the parser creator and data set described in the preceding two chapters. Finally, Chapter 6 concludes with a summary and a brief discussion of potential future improvements to the software. The remainder of this introductory chapter fleshes out the argument for the OLD.

1.1 The argument

Understanding a piece of software requires understanding the problem it is trying to solve.
The problem addressed here is how linguistic fieldwork might be better accomplished.

Many of the world’s languages are in danger of losing their last remaining fluent speakers and thus becoming “sleeping languages” (Leonard, 2008).[3] It is widely predicted that by the end of the century more than half of the world’s 7,000 languages will no longer be spoken (cf. Harrison, 2007). More than twenty years ago, Hale et al. (1992) estimated that 80% of the then-spoken Native North American languages were moribund. More recently, one source (First Peoples’ Heritage, Language and Culture Council, 2010, p. 22) asserts that all 32 of British Columbia’s “First Nations languages are critically endangered, if not sleeping already.”

[3] Leonard (2008) uses the term sleeping (in contrast to extinct) to emphasize the possibility that, through community-led revitalization efforts, currently unspoken languages may one day be spoken again.

If one accepts that endangered languages are valuable, that their documentation, analysis, and perpetuation are goals worth pursuing, then a next logical question is how such goals might be better accomplished. This is a big question with a multifaceted answer.[4] However, one aspect of linguistic fieldwork that, in my opinion, shows ample room for improvement is the development of software that facilitates collaboration, data-sharing, and the automation and acceleration of menial tasks and, to the extent possible, of higher-level tasks such as the testing of hypotheses.

[4] For an argument for the value of a theoretically-informed hypothesis-driven approach to the study of linguistic diversity see Davis et al. (2014).

Assessing exactly which features are most crucial to such software requires an understanding of the workflow of linguistic fieldwork. While the nature of this workflow is explored below, a preview of the requirements is given in the following list of exhortations.
It should be easy to find relevant data quickly; data should not be unnecessarily re-elicited; fieldworkers with a diverse range of goals should be able to access and make use of one another’s data; and features should exist which facilitate the computational implementation of analytical models such that both the empirical testing of and the generation of analytical representations based on such models can be automated. The OLD seeks to meet these requirements and this dissertation argues that it succeeds. In what follows, I motivate and elaborate on the sub-claims of this argument.

1.1.1 Endangered languages are valuable

Languages, even and especially those with very few fluent speakers, are valuable from a variety of perspectives and efforts that seek to document, analyze, and perpetuate them should be supported. This statement has political significance and is not uncontentious. Hale et al. (1992) argue that linguists working on endangered languages have a duty to contribute to their perpetuation and revitalization. Ladefoged (1992) responds that the exhortation of Hale et al. (1992) is too general since some communities would rather pursue linguistic homogeneity over minority language vitality for, say, purposes of national cohesion. Dorian (1993) responds by asserting that the scenarios discussed by Ladefoged (1992) are exceptional and that, even in these rare cases, subsequent generations tend, as a rule, to regret the loss of their minority languages.
In my experience, communities do desire to see their languages better documented and more widely spoken and they are actively engaged in efforts to achieve these ends;⁵ in these cases, academic linguists do have both an ethical obligation and a self-interested motivation to reciprocate by supporting these endeavours.

It should come as no surprise that communities value their endangered and sleeping languages as vessels of culture; what is sometimes overlooked, however, is the scientific and more broadly sociological value of language. From the theoretical linguistic point of view, thorough study of the widest possible range of human languages is crucial to understanding human cognition and social anthropology. Equally if not more important is the value of a language to the social well-being of its community and, as a result, to the larger social groups within which that language community is embedded. In the North American context, this means that aside from any reparations owed for the language- and culture-eroding practices so infamously epitomized by the residential/boarding schools (Churchill, 2004), the dominant cultures should recognize a selfish interest in fostering a healthier society by supporting the flourishing of endangered languages.

⁵ Examples of community-led language revitalization and documentation efforts abound. Two examples are the Breath of Life biennial language restoration workshop (sponsored by the Advocates for Indigenous California Language Survival, the Department of Linguistics at the University of California at Berkeley, and the Townsend Center) and the community-based revitalization certificate program and N’syilxcen language classes offered by the En’owkin Centre of British Columbia, Canada.

Endangered languages are scientifically valuable. Theoretical linguistics is centrally concerned with exploring the possibility space of natural language, i.e., the extent to which languages can vary and the properties that are universal to all.
Unsurprisingly, linguistic analyses since the rise of the generative tradition in the 1950s have been supported primarily by data from majority languages, especially the western European ones spoken by most linguists. However, the past several decades have witnessed a resurgence⁶ of interest in greater empirical coverage. In the generative linguistic tradition, it has long been argued that linguistic competence is too complex to be plausibly learnable via general cognitive mechanisms, given the relatively impoverished data available during acquisition (Chomsky, 1980). It is therefore assumed that a significant portion of human linguistic knowledge must be innate. The dominant research program in this tradition thus consists in rigorously describing and analyzing the properties of the greatest possible subset of natural languages with a view to discovering universal principles and well-defined parameters of variation.⁷

⁶ Of course, prior to the rise of the generative tradition, much work was done on minority languages by structuralist linguists such as Leonard Bloomfield.

⁷ Note that there is an ongoing debate over what is the best approach to studying linguistic diversity. Davis et al. (2014) argue the merits of a theoretically-informed, hypothesis-driven generative/universalist approach. Levinson and Evans (2010) (among others) argue for large-scale typological analyses of descriptive works. However, this debate is orthogonal to an argument for the OLD since this tool should be useful to adherents of either point of view.

Other traditions within theoretical linguistics and other fields related to linguistics are also interested in the analysis of endangered languages. Historical linguistics, for example, seeks to understand the relatedness of extant languages and to reconstruct proto-language ancestors. Anthropological linguistics has long been interested in the analysis of endangered languages and their interrelation with culture.
Clearly, endangered languages are crucial objects of study for those seeking to understand human cognition, culture, and history. The common denominator here is an interest in endangered languages for their value in relation to our knowledge of humanity.

Communities whose ancestral languages are endangered are also very interested in their documentation and revitalization, though for a somewhat different set of reasons. Their languages encode ways of thinking, traditional knowledge, and narratives that are vital to their cultures, identities, and health. With respect to the last point, at least one study has shown a significant positive correlation between ancestral language use and metrics of societal health: Hallett et al. (2007, p. 398) find a correlation between higher levels of Aboriginal language use in British Columbia First Nations bands and lower youth suicide rates and conclude that “indigenous language use, as a marker of cultural persistence, is a strong predictor of health and wellbeing in Canada’s Aboriginal communities.” Within endangered language communities there are many motivated, hard-working, and skilled individuals who are contributing a great deal to these efforts. Unfortunately, communities with endangered languages tend also to be politically marginalized, economically disadvantaged, and small and, as a result, the resources for fully successful documentation and revitalization are often not available.

Because endangered languages are valuable scientifically and to the health and well-being of their communities, they should also be considered valuable to society at large. That is, in purely practical terms, the spiritual/cultural health of a community, directly tied to language, means its individual members are healthier, and therefore use considerably fewer resources of the larger societies in which they are embedded.
The more we can do to document, analyze, and revitalize these languages, the more we can learn about human cognition, culture, and history. Perhaps more importantly, such efforts can help us to begin to redress the legitimate grievances of indigenous communities for the unethical language- and culture-eroding practices inflicted by our governments and religious institutions. Beyond recognition and atonement, success in the documentation and revitalization of endangered languages has the potential to strengthen disadvantaged subgroups within our societies and thereby benefit us all.

Assuming, then, that the study of endangered languages is worthy of support, I would at this point like to discuss the nature and practice of linguistic fieldwork so that the reader may come to an understanding of the process by which speaker knowledge is transformed into linguistic artifact.

1.1.2 Fieldwork

Fieldwork, in one sense, is the act of transforming linguistic knowledge into real-world artifact. Different scenes within this act are labelled according to the complexity and nature of the artifacts generated. First-order documentation is a collaboration between researcher and language consultant which produces primary data types like transcriptions, translations, annotations, basic metadata, and audio recordings as well as morphophonological analyses. From these primary artifacts, second-order documentation constructs higher-level data types such as dictionaries, grammars and representations of narratives. The products of documentation are used by researchers for the creation of academic publications and, ultimately, the accumulation and refinement of scientific knowledge (cf. Woodbury, 2003).
All of these linguistic data types are used in the creation of artifacts relevant to the task of revitalization, such as language textbooks, exercises, and software.

The view of fieldwork as the reification, transportation, and transformation of linguistic data highlights the interrelatedness of its auxiliary endeavours. Under this view, successful revitalization is the production of language data in its richest form, viz. as grammatical knowledge in the minds of future generations of speakers. In the context of endangered and moribund languages, therefore, revitalization is both a primary goal of documentation efforts and a requirement for the perpetuation of research efforts.

Fieldwork is carried out by academics, language communities, and passionate individuals and groups. As a member of a linguistics department in a research-oriented university, I am most knowledgeable about the motivations and workflow of academic linguistic fieldworkers. However, because of the interconnectedness just elaborated, I am also familiar with the documentary and revitalization-related domains. This imbalance in perspective is a natural outcome of the division of labour necessitated by a complex endeavour like linguistic fieldwork. More importantly, it is directly relevant to the aim of this dissertation, that being to demonstrate methodologies and technologies which increase the availability and usefulness of researcher-generated artifacts to fellow fieldworkers.

Theoretical linguists seek to increase scientific knowledge of linguistic competence. The data they elicit (i.e., gather) from speakers are designed to answer research questions relevant to the verification and falsification of linguistic theories. However, getting to a stage where such questions might even begin to be asked requires at least a basic level of proficiency with the language under study.
Linguists interested in complex syntactic, semantic, and pragmatic phenomena are likely to require an even higher level of proficiency. In addition, linguists often have a holistic appreciation for the languages they study, often including a motivation to give back to the speakers and communities that make their work possible. From the opposite perspective, communities do not want to be the objects of academic research unless academics reciprocate by donating their skills and resources to revitalization. For these reasons, linguists end up eliciting foundational linguistic forms and also contributing to the production of materials that are relevant to a broader audience, e.g., dictionaries (Lyon and Greene-Wood, 2007), learning grammars (Davis, 2012), collections of translated and analyzed texts (Matthewson, 2005; Lindley and Lyon, 2012), and expositions of theoretical analyses in terms appropriate to non-linguist learners (Davis, 2012).

Linguistic data are elicited in a variety of ways. The technique with the longest history, and arguably that which is least artificial, involves recording and transcribing the speech of one or more speakers, as when narratives and conversations are gathered to produce texts (Boas, 1917). A common practice is to ask for translations of metalanguage (e.g., English) forms, sometimes supplying relevant contextual information via verbal description, images and/or videos. Some fieldworkers seek to sidestep unwanted intrusions of the metalanguage by constructing stimuli (e.g., images or videos) that elicit a response in the speaker without providing a metalanguage form for translation (Yegerlehner, 1955). Another technique with similar motivations involves describing contexts and asking questions using the object language itself, assuming the researcher is proficient enough to do so.
However, see Matthewson (2004) and AnderBois and Henderson (to appear) for nuanced discussions of the use of the metalanguage in translation-based elicitation and a defence of that practice in certain cases. Finally, a fruitful elicitation method, one that is contentious to some (cf. Mithun, 2001; Dimmendaal, 2001), involves requesting speaker judgments on forms produced by the researcher, thus making it possible to gather data that specify ungrammatical, questionable, or contextually infelicitous forms. For a discussion of elicitation techniques in linguistic fieldwork generally see Bowern (2008) and Newman and Ratliff (2001), and for semantic fieldwork in particular see Matthewson (2004), Krifka (2011), and Bochnak and Matthewson (to appear).

The foundational products of linguistic fieldwork are representations of linguistic forms at a variety of levels, from morphemes to words, phrases, sentences, and multi-sentential objects such as conversations and narratives. The core components of such representations are transcriptions; these may be written in an orthography specific to the language or in a more general-purpose system, e.g., the International Phonetic Alphabet (IPA) (International Phonetic Association, 1999) or the Americanist Phonetic Alphabet (APA) (Goddard, 1996); they may be phonetic, phonemic, or some hybrid. In addition to transcriptions, fieldwork also generates translations, indications of grammaticality, comments and questions of the researcher, comments of the speaker(s), and metadata specifying when, where, how, and with whom the data were elicited. If stimuli are used to build context or otherwise constrain experimental variables, these should also be considered products of linguistic fieldwork. Audio-video recordings are a data type with considerable potential for reuse, especially if transcriptions and recordings of utterances are time-aligned.

A widespread and increasingly standardized (cf.
Bickel et al., 2008) practice is the morphological analysis of linguistic forms and the generation of a particular type of multi-line representation of such analyses known as interlinear glossed text (IGT). This typically involves, minimally, a transcription line indicating morpheme boundaries, followed beneath by a line of glosses corresponding to the morphemes, and terminated by a list of one or more translations. Though there are variations on this pattern, linguists from a broad spectrum of traditions make use of this type of representation.

Other types of analyses generated in the course of fieldwork may include syntactic category tags, phrase structure representations, semantic representations, and annotation data structures, e.g., Praat TextGrids and ELAN files.

Building upon such foundational artifacts, linguists of various stripes go on to generate one or more of the following: descriptive grammars, dictionaries, publication-worthy representations of narratives and stories, pedagogical tools, corpora, treebanks, databases, and research papers.

1.1.3 Better fieldwork

Given an understanding of fieldwork and its importance from a range of perspectives, this dissertation asks how we can do fieldwork better. The answer proposed is to share data and automate tasks.

As we have just seen, linguistic fieldworkers generate a lot of primary data as well as higher-level artifacts that are potentially useful to their fellow fieldworkers. In practice, however, a significant portion of the primary data elicited is stored away in private collections of handwritten notes or digital documents and is never reused or even reviewed. Even higher-level artifacts such as dictionaries may be underexploited because the data they contain are not structured for accessibility.

In order to move from potentially useful to maximally reused, we need to make the sharing of data both easy and desirable.
First of all, this means building up the basic infrastructure for transferring data from researcher to researcher, while structuring and delivering the data in a way that facilitates reuse. A system that encourages the sharing of data will have higher adoption rates if individual fieldworkers can see how it will immediately benefit their own work. In addition, therefore, to the enticement of accessing the data of other researchers, a data-sharing system should provide further conveniences, the automation of fieldwork-related tasks being a primary example of such.

If all fieldworkers could quickly, accurately, and ethically gain access to relevant data in their own collections and those of their peers, then theoretical questions could be answered, documentary artifacts could be generated, and learners could become fluent speakers more quickly. Another important benefit arising from greater transparency and access with respect to primary linguistic data is that theoretical claims based on such data could be more readily and rigorously scrutinized and criticized, thus leading to more robust analyses and greater confidence in theoretical claims grounded in data from minority languages.⁸ Depending on the nature of the analysis (e.g., statistical corpus-based), the ready availability of the data may also allow for verification via experimental reproduction.

1.1.4 The Online Linguistic Database

The OLD is software designed to help fieldworkers do fieldwork better. The OLD is used to create language-specific web applications that facilitate the collaborative creation and curation of databases of fieldwork data. An OLD application is at its core a web interface to a database which has a structure (i.e., schema) designed for linguistic fieldwork and which can be altered and viewed by multiple users concurrently.
This core feature (i.e., the multi-user, web-based, fieldwork-oriented database-ness) is what makes collaboration and data-sharing possible: contributors create representations of linguistic forms and related data types and other users can immediately use these data to inform their own research, documentation, and revitalization projects.

Other salient features of the OLD contribute to achieving the goal of more efficient use of linguistic fieldwork data. Advanced search facilitates fast and targeted data retrieval. Features encouraging consistency and standardization of both analysis and representation make data easier to understand and reuse. Model-implementing features contribute to the testing of the analyses and their underlying frameworks and assumptions, as well as to the automation of analytical tasks, including the generation of further representations (e.g., morphological analyses), the existence of which in turn improves targeted search.

⁸ This is not to say that researchers whose work is grounded in endangered language fieldwork data are less trustworthy. It is simply an acknowledgement that it is very difficult for a researcher to independently re-elicit the endangered language data which is asserted, by a fellow researcher, to provide evidence for a given theoretical claim. If the data were more accessible, critics might be better able to discover cases where an analysis is inadequately supported by them.

Some implementation details are crucial enough to achieving the goal of more efficient fieldwork that they could be considered core features. The OLD is open source software written primarily in Python, a programming language that is designed with readability in mind and which, because of its extensive string manipulation constructs and because of the existence of the Natural Language Toolkit (NLTK) (Bird et al., 2009), a library written in the language, is well-suited to linguistic and language-processing applications.
These properties allow users to contribute to the code, use parts of it in their own projects, and/or read it for feature implementation ideas.

As of the current development version (i.e., 1.0), an OLD application is also a web service, which means that it exposes a standardized interface for programmatic interaction. This makes the data stored easy to access for a variety of purposes. An OLD web service can be used as a data-serving module within a larger application or service. For example, an OLD web service could be used in multi-language applications for cross-linguistic and typological analysis as well as revitalization-relevant applications such as audio dictionaries and learning tools. More mundanely, the data stored in a language-specific OLD web service can be downloaded to a contributor’s local system and processed as necessary. Chapter 3 illustrates this by showing how morphologically analyzed words can be extracted from an OLD web service in order to evaluate the performance of computationally implemented morphophonological models.

The OLD has been in existence (in one form or another) for six years. There are currently nine language-specific OLD web applications targeting Blackfoot (Algonquian), Coeur d’Alene (Salish), Gitksan (Tsimshianic), Ktunaxa (isolate), Kwak’wala (Wakashan), Nata (Bantu), Okanagan (Salish), Plains Cree (Algonquian) and Tlingit (Na-Dené). Data in these applications have been contributed and used by a geographically and motivationally diverse group of fieldworkers, including four field methods classes offered by the UBC Department of Linguistics. The process of adapting to these diverse languages and fieldworker types has helped to improve and mature the software.

1.2 Summary

Linguistic fieldwork, broadly construed, is the generation of language artifacts and is valuable scientifically, culturally, and socially.
Because of their interrelatedness, distinct sub-endeavours contributing to the fieldwork enterprise (i.e., documentation, research, and revitalization) stand to benefit from cooperation. There are, therefore, exciting opportunities for methodological improvement in this domain. The Online Linguistic Database illustrates a particular approach to improving fieldwork methodology which involves the development of software that facilitates collaboration and data-sharing, as well as the provision of conveniences such as task automation and computational analysis implementation.

This dissertation describes and evaluates the OLD as a tool for improving fieldwork methodology. Chapter 2 introduces and describes the software, compares it to similar tools, justifies its design decisions, and argues that it is an effective fieldwork tool. Chapter 3 describes the morphological parser creator, an application added to the core functionality of the OLD, which can be used to build morphological parsers to expedite data entry and test theoretical models. Chapter 4 describes the Blackfoot language and a data set for that language created using the OLD. Chapter 5 describes and evaluates two parsers built using the morphological parser creator and the Blackfoot data set. Finally, chapter 6 summarizes the argument for the OLD and discusses some exciting possible developments to the software.

Chapter 2

Fieldwork

This chapter describes the Online Linguistic Database (OLD) and argues that it is a valuable fieldwork tool. The primary goal of the OLD is to facilitate collaborative language documentation and the sharing and reuse of fieldwork data. The salient secondary features discussed here contribute to the primary data-sharing goal by facilitating the creation of consistently structured, easily re-purposable, effectively searchable, and well presented data.
The descriptions of these features and the demonstrations of their utility to fieldwork constitute the argument for the software.

The OLD is open source software for creating web applications designed to facilitate collaborative linguistic fieldwork. A language-specific OLD application is a web-accessible repository of linguistic artifacts contributed by multiple fieldworkers. The system provides data structures, representations, interfaces, and conveniences that make it easy for diverse researchers to find and use the data they need.

Linguistic fieldwork is the transformation of speaker knowledge into language artifacts. Language documentation, research and revitalization are all endeavours that involve fieldwork. Documentation seeks to describe, record and preserve a language. Research seeks to analyze linguistic data and arrive at insightful generalizations and explanations. Revitalization seeks the perpetuation of waning languages in the minds of future generations of speakers, and involves the creation of artifacts relevant to language acquisition and florescence. All of these endeavours involve elicitation, i.e., the generation by fieldworkers and speakers of primary language artifacts such as transcriptions and recordings.

Moreover, all of these endeavours are interrelated such that successes in any one facilitate successes in all of the others. For instance, a descriptive grammar produced as part of a documentation effort invariably contains data and generalizations that will constitute the groundwork for the formulation of research questions and the production of revitalization-specific artifacts.
Conversely, descriptive grammars are never truly theory-neutral but are informed by the assumptions and theoretical linguistic research of some tradition; in addition, writers of descriptive grammars often cite research papers from outside of their tradition for generalizations or data points.⁹

⁹ I am sympathetic to the arguments of those (cf. Murray, 2014) who argue that theory-driven research questions can uncover data points and insights that would not otherwise be discovered and which are valuable to documentation and revitalization goals.

The OLD is founded upon the premise that the primary language data generated in all of the diverse types of fieldwork are ripe for reuse. A researcher investigating, say, the syntax of nominal expressions will benefit from easy access to peer-generated data. Of course, from a pure research perspective, the value of peer data decreases as differences in analytical framework and research focus increase. However, the reader of a treatise on the methodology of linguistic fieldwork can be assumed to hold to a wider perspective, to be involved in projects for the advancement of documentation and revitalization, and to find value in data elicited by fellow fieldworkers with a diverse range of goals. Even without this proviso, it is clear that, at the very least, research projects in their initial stages will find value in diversely motivated and authored fieldwork data.

This chapter is divided into three main sections. Section 2.1, Overview of the OLD, provides a high-level summary of the software, including a succinct description, a history, and a review of the implementation. Section 2.2, Other fieldwork tools, reviews Toolbox, FLEx, and LingSync and compares these to the OLD. Section 2.3, Features, describes the core features of the OLD and argues for their value to the fieldwork enterprise.
Section 2.4 summarizes.

2.1 Overview of the OLD

This section provides an overview of what the OLD is and how it works. It begins with a short history of the circumstances that gave rise to the software and the events and considerations that influenced its development. From there it moves on to a mid-level description of the software in its entirety and ends with some technical implementation details.

2.1.1 History

The OLD, in one form or another, has been continuously in production and under development for the past six years. It arose from my own experiences doing fieldwork on Blackfoot. The current production¹⁰ version (0.2) has nine language-specific web application instantiations. The current development version (1.0) introduces the morphophonological modelling functionality and the architectural restructuring that makes an OLD application a web service. Once a user interface for the development version is completed, currently active applications will be migrated to it.

¹⁰ Software developers use the jargon of development and production versions of software to refer to versions that are under active construction and in use, respectively.

I began my doctoral studies with the intention of pursuing formal semantic analyses of aspectual phenomena in understudied languages. In particular, I was interested in understanding the form, distribution, and meaning of the temporal functional morphology of Okanagan. The Department of Linguistics at the University of British Columbia is well-known for its strength in theoretical linguistic research grounded in understudied and endangered language data and, as such, is an excellent location to pursue such a course of work.

However, I soon became frustrated with various obstacles that discourage timely access to relevant data. Linguistic fieldwork is difficult and data are hard-won. Often one realizes that evidence pertinent to a particular theoretical claim has already been elicited.
Yet scouring one’s handwritten notes or digital documents is slow going and not guaranteed to pay off. Even more difficult to retrieve are the relevant forms that colleagues mention having elicited. Re-elicitation is often the ultimate course of action, an inefficiency whose undesirability can readily be grasped when one recalls that the languages under study are in danger of extinction.¹¹

¹¹ This is not to say that re-elicitation is necessarily an undesirable inefficiency. In many cases it can be extremely useful to re-elicit data in order to ensure their replicability and the robustness of the generalizations that they evince. That said, in other cases re-elicitation is clearly undesirable, redundant, and inefficient.

While working with groups of fieldworkers in two full-year linguistic field methods courses (on Blackfoot and St’át’imcets) and with an independent group on the Okanagan language, I came to believe that the data elicited by such groups could, if consolidated, consistently structured, and rendered accessible, prove highly useful to documentation, research, and revitalization efforts.

The precursor to the OLD was the Blackfoot Language Database (BLD) (Dunham, 2013b), a web application designed to facilitate collaborative fieldwork for myself and a group of Blackfoot researchers which had emerged from a field methods course on that language. As my fieldwork experience broadened into other languages, I generalized the BLD to the Online Linguistic Database (OLD), language-non-specific collaborative fieldwork software.

The current production version of the OLD is 0.2.
A demo application is currently being served at http://www.onlinelinguisticdatabase.org and the source code is available on GitHub.¹² Among the nine language-specific OLD 0.2 applications currently in use¹³ are applications for two languages that I have research and fieldwork experience with, namely Blackfoot¹⁴ and Okanagan.¹⁵

In using these applications for my dissertation research, I became increasingly interested in potential improvements to the software, of which two (consolidation of the server-side logic into a web service and the provisioning of morphophonological modelling capabilities) necessitated a rewrite. The source code for this rewrite, i.e., the OLD 1.0, can be found on GitHub.¹⁶

¹² GitHub is a “web-based hosting service for software development projects that use the Git revision control system” (GitHub, 2014). Git is software that helps groups of developers to collaboratively write complex software in such a way that modifications from multiple contributors can be tracked and handled logically.

¹³ See § 2.1.3 for an explanation of what is meant by language-specific OLD application.

Documentation detailing the data structure and installation methods has been written for the OLD 1.0 and is available in Hypertext Markup Language (HTML) and PDF formats.¹⁷ The OLD 0.2-based applications have user-oriented documentation embedded within their interfaces.¹⁸

Within the language-specific OLD applications currently in use, there are about 19,000 forms (primarily sentences), 300 texts, and 20 GB worth of audio files. There are 180 registered users across all applications, of whom 98 have entered and 87 have elicited at least one form.
The applications for Blackfoot, Nata, Gitksan, Okanagan, and Tlingit are seeing the most use. The exact figures are given in Table 2.1.

language              forms   texts   audio files   audio (GB)   speakers
Blackfoot (bla)       8,847     171         2,057          3.8      3,350
Nata (ntk)            3,219      32             0            0     36,000
Gitksan (git)         2,174       6            36          3.5        930
Okanagan (oka)        1,798      39            87          0.3        770
Tlingit (tli)         1,521      32           107           12        630
Plains Cree (crk)       686      10             0            0        260
Ktunaxa (kut)           467      33           112          0.2        106
Coeur d’Alene (crd)     377       0           199          0.0          2
Kwak’wala (kwk)          98       1             1          0.0        585
TOTAL                19,187     324         2,599         19.8

Table 2.1: Data in OLD applications (Feb 14, 2014)

Note that the values in the speakers column of Table 2.1 are taken from Ethnologue¹⁹ and are provided only to give a rough indication of the speaker populations of the languages. The parenthesized three-character strings in the first column are the ISO 639-3²⁰ identifiers of the languages. The Uniform Resource Locator (URL) for each language-specific OLD application begins with the ISO 639-3 identifier of its language.

¹⁶ The OLD is also available on the Python Package Index and is installable via both the EasyInstall and Pip Python package managers.

The OLD has evolved in response to usage by these groups and owes much to their feedback and suggestions.

2.1.2 Strategy

The purpose of the OLD is to make linguistic fieldwork easier, more efficient, and more rewarding. I contend that sound strategy in achieving this involves adapting to extant fieldwork practices and supplementing them via a) collaboration facilitation and b) the provision of improvements to low- and mid-level task completion, thereby freeing up more fieldworker time for high-level creative work.

Adapting the software to current practices is more than just an olive branch to methodologically conservative fieldworkers. It falls out from the fact that the OLD is not a manifesto for radical methodological change at the individual level. On the contrary, the system is built on the premise of continued additive and complementary change at the communal level.
I see this as a broadening in the sphere of fieldwork tasks and a consequent division of labour amongst well-rounded fieldworkers alongside software developers, transcribers, parsers, computational linguists, armchair linguistic researchers, pure theoreticians, and all desired permutations thereof.

[19] http://www.ethnologue.com

Of course, I do not presume that the OLD can, or should, transform elicitation, transcription, and translation into easy tasks. The OLD’s primary aspiration is precisely to free up more fieldworker time for this difficult and time-consuming yet rewarding work.

2.1.3 Description

The OLD is open source software for creating web applications that facilitate collaborative linguistic fieldwork. Contributors enter their fieldwork data and are able to search, browse, export, and create documents from the data entered by themselves and their counterparts.

The core data type defined by the OLD, and the one most commonly manipulated by contributors, is the linguistic form. A form is a representation of a component of a language that is at least a morpheme and at most a sentence. Linguists will recognize this seemingly arbitrarily demarcated unit as the ubiquitous datum of most research papers. Distinctions between morphemes, words, phrases, and sentences are made either implicitly (e.g., via the presence of morpheme and word delimiters in transcriptions) or explicitly via user-specified syntactic categories. Abstractly, a form is an object with properties (a.k.a. attributes) whose values are variously optional or mandatory, singular or plural, free-form text or forced choice.

Minimally, a form has values for a transcription property and for at least one translation property. Multiple transcription types may be supplied, viz. orthographic, narrow phonetic, and broad phonetic. Multiple translation values are also possible, as is a morphological analysis specified via the assignment of values to two attributes: morpheme break (a sequence of phonemes segmented into words and morphemes by whitespace and delimiters) and morpheme gloss (a sequence of corresponding glosses segmented just like the morphemes in the morpheme break value). Secondary contentful form attributes include grammaticality (including felicity in a context), appropriateness of translations, general comments, comments from the speaker, syntactic category, and user-defined tags. Information about data provenance can also be specified via values for textual source, speaker (consultant), elicitor, elicitation method, and date elicited.

Certain form properties are assigned values automatically by the system. These include a unique integer identifier, the entry timestamp, and the enterer. If morphemes are recognized in the values supplied for the morpheme break and gloss properties, the system will also auto-generate category string and break-gloss-category values. The former is a string of morpheme categories implicit in the user-supplied morphological analysis. The latter is an interleaving serialization of the tripartite (i.e., break, gloss, category) morphological analysis data. Thus a form for French chiens ‘dogs’ might have N-Num as its category string value and chien|dog|N-s|plural|Num as its break-gloss-category value. These auto-generated analytic values are highly useful in improving search; details of their representation, generation and utility are given in § 2.3.8.

The OLD interface strives to make form entry and update quick and pleasant by providing conveniences such as keyboard shortcuts, tab-based navigation, and strategic auto-retention of previously entered values.

After creation or modification, a form is displayed in highly readable interlinear glossed text (IGT) format, i.e., with transcriptions, break, gloss, and translations in rows and their words aligned into columns.
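
The auto-generated values and the column alignment just described can be illustrated with a small Python sketch. It is not the OLD's own code: the delimiter set, the toy lexicon, the "?" placeholder for unrecognized morphemes, and all function names are assumptions made for illustration. Only the French chiens example and its expected values come from the text above.

```python
import re

# Hypothetical delimiter inventory: "-" and "=" are illustrative choices,
# since morpheme delimiters are configured per application.
DELIMITERS = "-="
SPLITTER = re.compile("([" + re.escape(DELIMITERS) + "])")

# Toy lexicon mapping (break, gloss) pairs to categories; it stands in for
# the mono-morphemic forms already stored in an application.
LEXICON = {("chien", "dog"): "N", ("s", "plural"): "Num"}


def analyze_word(break_word, gloss_word, lexicon=LEXICON, unknown="?"):
    """Return (category string, break-gloss-category) for a single word."""
    breaks = SPLITTER.split(break_word)    # e.g. ['chien', '-', 's']
    glosses = SPLITTER.split(gloss_word)   # e.g. ['dog', '-', 'plural']
    if len(breaks) != len(glosses):
        raise ValueError("break and gloss are segmented differently")
    categories, triples = [], []
    for i, (brk, gloss) in enumerate(zip(breaks, glosses)):
        if i % 2:                          # odd positions hold delimiters
            categories.append(brk)
            triples.append(brk)
        else:                              # even positions hold morphemes
            category = lexicon.get((brk, gloss), unknown)
            categories.append(category)
            triples.append("|".join((brk, gloss, category)))
    return "".join(categories), "".join(triples)


def analyze_form(morpheme_break, morpheme_gloss, lexicon=LEXICON):
    """Apply analyze_word to each whitespace-delimited word of a form."""
    break_words, gloss_words = morpheme_break.split(), morpheme_gloss.split()
    if len(break_words) != len(gloss_words):
        raise ValueError("break and gloss contain different numbers of words")
    pairs = [analyze_word(b, g, lexicon)
             for b, g in zip(break_words, gloss_words)]
    return (" ".join(cats for cats, _ in pairs),
            " ".join(bgc for _, bgc in pairs))


def igt(*rows):
    """Align the whitespace-delimited words of each row into columns."""
    split_rows = [row.split() for row in rows]
    n_columns = max(len(words) for words in split_rows)
    padded = [words + [""] * (n_columns - len(words)) for words in split_rows]
    widths = [max(len(words[i]) for words in padded) for i in range(n_columns)]
    return "\n".join(
        "  ".join(w.ljust(widths[i]) for i, w in enumerate(words)).rstrip()
        for words in padded)


category_string, break_gloss_category = analyze_form("chien-s", "dog-plural")
print(category_string)         # N-Num
print(break_gloss_category)    # chien|dog|N-s|plural|Num
print(igt("les chiens", "le-s chien-s", "the-plural dog-plural"))
```

A morpheme whose break and gloss pair is absent from the lexicon surfaces as "?" in both values, which is one simple way of giving the kind of consistency feedback that the linking of morphemes to lexical entries is meant to provide. Note that the alignment counts code points, so transcriptions containing combining diacritics would need a width-aware measure.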
Within a morphologically complex form, morphemes specified via the break and gloss values are displayed as HTML links to matching lexical entries already in the database. To illustrate, an OLD application for French containing a form for the noun chien ‘dog’ and another for the number agreement suffix s ‘plural’ would, upon creation of a chiens form with chien-s and dog-plural as break and gloss values, respectively, display these component morphemes as links to the corresponding mono-morphemic form objects. This furnishes immediate feedback about whether (and the extent to which) the morphological analyses provided are consistent with the lexical items already in the system. This seemingly minor feature actually turns out to be quite effective at promoting (both inter- and intra-researcher) consistency in morphological analysis and is thus indirectly a boon to search.

OLD forms may also be associated with one or more representations of media files. These file objects are a data type built upon digital files such as audio or video recordings, images, and textual documents (e.g., PDFs). An illustration of the practical application of many-to-many form-file association would be a form representing a sentence associated with a video recording of the entire elicitation session during which it was gathered, an audio file of the speaker uttering just the sentence, an audio file of the elicitor uttering the translation, an image file used to provide context during elicitation, and a PDF of a research paper containing claims to which the sentence is relevant.

The digital file resource that constitutes the only mandatory value of an OLD file may be hosted within the OLD application or on an external site (e.g., YouTube, Vimeo, etc.).
Additional file attributes may be given values, including user-created general-purpose tags, a description string, and provenance metadata.

Forms and files can be used to construct instances of the third primary data type of an OLD application: collections. Collections, at their most basic, are ordered lists of forms and, at their most complex, are multimedia documents such as research papers with embedded media files. The core attribute of a collection is its contents, a string of text which may contain references to forms. These references adhere to a simple syntax that makes use of the unique integer identifier value of a form in order to embed a representation of the form within the collection. The text may, in addition to the form-embedding references, contain prose and this may be given formatting (via a lightweight markup language). Finally, the collection as a whole may be associated with multiple file objects. When compiled, a collection is displayed as HTML and the end result is a document with formatted prose (e.g., boldface, italics, itemized and enumerated lists, section headers, tables, etc.), enumerated linguistic examples (i.e., forms) in IGT format, and embedded media files. Collections may thus be used to create a range of documents, e.g., elicitation records, representations of narratives, lesson plans, and research papers. These can be exported as LaTeX source files and compiled to PDF. Given the technologies used, export to Office Open XML (i.e., .docx) format is also possible, though this has not yet been implemented.

The last core object is the corpus. An OLD corpus is (like a collection) fundamentally an ordered list of form objects, and thus a sequence of morphemes, words, and/or sentences. Multiple corpora may be created within a single OLD application.
A single-file representation of a corpus is created in a process dubbed generation, which involves gathering specified values of the corpus’ forms and writing these to disk according to a specified format. Currently, the only corpus generation format is treebank,[21] which writes to file the corpus’ forms’ non-empty syntax values, by convention phrase structure representations in Penn Treebank-compatible bracket notation. A generated OLD treebank corpus can be compiled to a binary format that can be searched structurally via the TGrep2 utility. The OLD provides an interface to the TGrep2 search function so that treebank corpora can be searched for structural patterns. This means adding to search the capacity to return only forms with, say, an indirect and a direct object.

[21] A simple next corpus generation format would be a string of orthographic transcriptions, in which cross-sentential patterns could be searched. A particularly useful additional generation method would be the construction of an NLTK-compatible corpus file; this would permit an OLD interface to the NLTK corpus interface and the search conveniences that it provides.

In addition to facilitating structural and cross-sentential search, corpora can form the foundation of certain morphological components that are needed for the morphophonological modelling and parser functionalities (see below). Lexica can be extracted and morphotactic rules induced from them. Corpora can also supply the N-gram counts needed to build the morpheme language models that are essential to the disambiguators needed for functioning parsers (cf. chapter 3).[22] Since a corpus can contain (i.e., reference) any number of forms any number of times, language models (and thence disambiguators) can be strategically biased via corpus design.

[22] I use the term disambiguator here to refer to a function that takes a set of possible parses as input and returns them as a list sorted according to their probability. Since the most probable parse is the one that the parser will suggest to the user, the net effect is to disambiguate the parse set.

This segues nicely into the morphophonological (i.e., morphological and phonological) modelling and parser functionality provided by the OLD and discussed in detail in chapter 3. In the OLD, a morphology is a mapping between sequences of phonemes and sequences of morphemes; an OLD phonology is a mapping between phonetic/orthographic representations and sequences of phonemes. The lexicon and morphotactic rules that constitute a morphology are extracted by the system from corpora specified by users. A phonology is specified by the user as an ordered list of rewrite rules. Both morphologies and phonologies are implemented as finite-state transducers. A fully assembled OLD morphophonology is a function that parses transcriptions into sequences of morphemes and, conversely, generates surface transcriptions from morpheme sequences. Multiple morphologies, phonologies, and morphophonologies can be created within a single OLD application. Since these components encode licit mappings, they could be used to provide feedback to users indicating whether their representations are consistent with a given analysis. For example, a user-specified phonetic transcription may not be consistent with a user-specified morphophonemic segmentation according to a certain phonological analysis. The interface could alert the user to this fact,[23] thus facilitating error correction and, more interestingly, analysis modification.

[23] As there is, as yet, no Graphical User Interface (GUI) for the OLD v. 1.0 web service, the morphological parsers and their components are not currently being used to alert users to inconsistencies between their morphological analyses for particular words and the grammatical models encoded in their parsers. Future work on the software will involve creating a GUI which does precisely this.

Beyond providing feedback on representational consistency, morphophonological components can be combined with statistical disambiguators to form parsers that can automate the creation of morphological analyses. That is, given a transcription specified by a user, an OLD application can use a specified parser to automatically assign values to the form’s morpheme break, morpheme gloss, and category string attributes. In principle (though this is not yet implemented), the reverse operation could also be performed using the same components (i.e., morphophonology and disambiguator), thus allowing users to specify morphological analyses and have the system auto-generate the transcription.

In addition to creating and modifying forms, files, collections, corpora, morphologies, phonologies, disambiguators, and morphological parsers, OLD contributors create and modify a host of other (relatively) minor objects, including sources (using the BibTeX data structure), speakers, general-purpose tags, (morpho-syntactic) categories, and elicitation methods. These minor objects can be used to assign values to the attributes of the major objects discussed above; e.g., tags can be used to classify forms, files, and collections while categories are relevant to forms only.
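
To make the notion of a disambiguator from the parser discussion above more concrete, here is a minimal Python sketch. It ranks candidate parses with a morpheme bigram language model estimated from a corpus. The add-one smoothing, the function names, and the toy morphemes (including the spurious s|3SG|Agr analysis) are assumptions chosen for illustration; the OLD's actual language models are described in chapter 3 and may differ.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Return a log-probability function estimated from morpheme sequences.

    `corpus` is a list of parses, each a sequence of morpheme strings.
    Add-one smoothing is an assumption made to keep the sketch short.
    """
    contexts, bigrams = Counter(), Counter()
    for parse in corpus:
        padded = ["<s>"] + list(parse) + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocabulary = {m for parse in corpus for m in parse} | {"</s>"}
    size = len(vocabulary)

    def log_prob(parse):
        padded = ["<s>"] + list(parse) + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + size))
            for a, b in zip(padded, padded[1:]))

    return log_prob


def disambiguate(candidates, log_prob):
    """Sort candidate parses from most to least probable."""
    return sorted(candidates, key=log_prob, reverse=True)


# Toy data: two competing analyses of the same transcription.
plural = ("chien|dog|N", "s|plural|Num")
spurious = ("chien|dog|N", "s|3SG|Agr")

corpus = [plural, ("chat|cat|N", "s|plural|Num"), ("chien|dog|N",)]
ranked = disambiguate([spurious, plural], train_bigram_model(corpus))

# Referencing a form many times in a corpus biases the model toward it.
biased_corpus = corpus + [spurious] * 5
ranked_biased = disambiguate([spurious, plural],
                             train_bigram_model(biased_corpus))
```

With the first corpus the plural analysis is ranked first; with the second, in which the other analysis is referenced five times, the ranking flips. This is the sense in which a disambiguator can be biased through corpus design.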
Administrators (i.e., contributors with additional privileges) can specify possible grammaticality values as well as inventories of graphemes, phones, phonemes, and morpheme delimiters that can be used to configure input validation and orthography conversion. Input validation means restricting (if desired) which characters (and sequences of characters) can be used in specific form transcription values, i.e., orthographic and (narrow and broad) phonetic transcriptions and morphophonemic segmentations. This contributes further to consistency of representation and thus can be seen as a benefit to search. Orthography converter specification entails defining simple mappings between grapheme inventories so that contributors may interact with (i.e., create, modify, browse, search) the data using their desired orthography while maintaining consistency of underlying orthographic representation.

Accessing an OLD application requires authentication via a valid combination of username and password. Authorization to make particular requests is determined by user role, one of viewer, contributor or administrator. Viewers have read-only access to an OLD application’s data, contributors can create, modify and delete objects, and administrators can do anything the system allows. Using this authentication/authorization system, producers of an OLD application can prevent public access to data and can permit certain users read-only access without opening up the possibility of accidental or malicious data corruption.[24] A further data privacy feature is implemented via a special restricted tag on forms and other objects; objects so tagged cannot be accessed by users unless they are administrators or are users who are themselves tagged as unrestricted. Privacy and access to data are discussed further in section 2.3.11.
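
The role and restriction rules just described can be summarized in a short Python sketch. The class names, the action vocabulary (read, create, update, delete, configure), and the treatment of configuration as an administrator-only action are assumptions for illustration; the sketch is not the OLD's own authorization code.

```python
from dataclasses import dataclass, field

# Assumed action vocabulary; "configure" stands for administrator-only
# settings such as inventories and grammaticality values.
PERMISSIONS = {
    "viewer": {"read"},
    "contributor": {"read", "create", "update", "delete"},
    "administrator": {"read", "create", "update", "delete", "configure"},
}


@dataclass
class User:
    username: str
    role: str                      # viewer, contributor or administrator
    unrestricted: bool = False     # may see objects tagged as restricted


@dataclass
class Form:
    id: int
    transcription: str
    tags: set = field(default_factory=set)


def can_access(user, obj):
    """Restricted objects are visible only to administrators and to
    users who are themselves marked as unrestricted."""
    if "restricted" not in obj.tags:
        return True
    return user.role == "administrator" or user.unrestricted


def is_authorized(user, action, obj=None):
    """Combine the object-level restriction with the role-level permission."""
    if obj is not None and not can_access(user, obj):
        return False
    return action in PERMISSIONS.get(user.role, set())


viewer = User("v", "viewer")
trusted_viewer = User("t", "viewer", unrestricted=True)
contributor = User("c", "contributor")
administrator = User("a", "administrator")

public_form = Form(1, "chiens")
private_form = Form(2, "chien", tags={"restricted"})
```

Here `is_authorized(contributor, "read", private_form)` is false while `is_authorized(trusted_viewer, "read", private_form)` is true, showing that the restricted tag is orthogonal to the role hierarchy.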
A more sophisticated access restriction system may be required if an OLD is to act as envisioned here, i.e., as a web service providing data to other specialty applications.

[24] OLD form and collection data are also versioned, i.e., backed up upon each modification. This is a further safeguard against data corruption.

The OLD provides powerful search functionality, the more modest goal therefrom being timely access to relevant data and the more ambitious the discovery of novel generalizations. A search query is a tree whose internal nodes are conjunctions or disjunctions and whose leaves are filter expressions. There is no principled limit on the complexity of this structure. A filter expression is an assertion that the value of a specified attribute matches (in a specified manner) a specified search pattern. Methods of matching between pattern and value include exact match, substring match and regular expression match. Filter expressions may also be negated. The end result is that forms (and other core objects) can be filtered with a high level of precision and relevant data can be quickly accessed. These search queries are themselves OLD objects that can be saved for later reuse or as the basis for modified queries or for defining corpora.

Of notable promise is the potential to create complex queries that use regular expression patterns to match system-supplemented morphological representations and structural patterns to match user-created syntactic representations. Section 2.3.6 discusses the search functionality of the OLD in detail and provides examples of practical queries.

Some remarks on the relation between the OLD software and language-specific OLD applications may be useful here. A language-specific OLD application is simply a web application that was created by installing the OLD software on a server and configuring it for use on a particular language.
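
Before turning to configuration, the search query structure described earlier in this section (a tree of conjunctions and disjunctions over possibly negated filter expressions) can be sketched in Python as follows. The nested-list encoding, the relation names, and the in-memory evaluation over dictionaries are assumptions for illustration only; the OLD's actual query syntax is the subject of § 2.3.6.

```python
import re

# The three matching methods named in the text.
MATCHERS = {
    "exact": lambda value, pattern: value == pattern,
    "substring": lambda value, pattern: pattern in value,
    "regex": lambda value, pattern: re.search(pattern, value) is not None,
}


def matches(form, node):
    """Recursively evaluate a query tree against one form (a dict).

    Assumed node shapes:
      ["and", [node, ...]]   conjunction
      ["or",  [node, ...]]   disjunction
      ["not", node]          negation
      [attribute, relation, pattern]   filter expression (leaf)
    """
    head = node[0]
    if head == "and":
        return all(matches(form, child) for child in node[1])
    if head == "or":
        return any(matches(form, child) for child in node[1])
    if head == "not":
        return not matches(form, node[1])
    attribute, relation, pattern = node
    return MATCHERS[relation](form.get(attribute) or "", pattern)


def search(forms, query):
    """Return the forms that satisfy the query tree."""
    return [form for form in forms if matches(form, query)]


forms = [
    {"id": 1, "transcription": "chiens", "category_string": "N-Num",
     "grammaticality": ""},
    {"id": 2, "transcription": "chien", "category_string": "N",
     "grammaticality": ""},
    {"id": 3, "transcription": "les chiens",
     "category_string": "D-Num N-Num", "grammaticality": ""},
    {"id": 4, "transcription": "chiens les",
     "category_string": "N-Num D-Num", "grammaticality": "*"},
]

# Grammatical forms containing a word analyzed as a noun plus number suffix.
query = ["and", [
    ["category_string", "regex", r"(^| )N-Num( |$)"],
    ["not", ["grammaticality", "exact", "*"]],
]]

result_ids = [form["id"] for form in search(forms, query)]
print(result_ids)   # [1, 3]
```

Because a query is plain nested data, it can be serialized and stored, which is consistent with the point that saved searches can be reused or serve as the basis for corpora.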
Configuring the OLD for use on a particular language does not require modifying the source code since the OLD was designed to be used on any language. Language-specific configuration involves setting up a domain name, identifying the language being documented (along with its ISO 639-3 identifier, if desired), and possibly also specifying language-specific orthographies/inventories, configuring validation and/or orthography conversion, specifying tags, syntactic/morphological categories, morpheme delimiters, possible grammaticality values, etc., and creating HTML pages to guide users in following the conventions of the application. This type of configuration allows the OLD to be useful within a wide range of linguistic conventions and language patterns.[25] That said, field linguists working on, for example, sign languages, languages with right-to-left writing systems, or languages where the space character is not used to separate words may have requirements for a linguistic fieldwork application that the OLD does not, at present, meet. As I cannot anticipate all possible use cases, I hope that fieldworkers will continue to use the OLD and challenge its assumptions so that it can evolve to be a useful tool to an even wider range of languages and linguistic conventions.

This section has provided an overview of the OLD, showing how the system is used to create fieldwork-relevant objects that can be displayed, exported, and searched in ways that are beneficial to fieldworkers.

2.1.4 Implementation

Both versions of the OLD are open source and are written in the Python programming language using the Pylons web framework.

Version 0.2 is a standard Pylons web application whose design adheres closely to the recommendations and example application of The Definitive Guide to Pylons (Gardner, 2008). It is built in accordance with the model-view-controller (MVC) design pattern (cf. Krasner et al., 1988).
The model is an SQLAlchemy-based Python interface to a MySQL relational database. The views are HTML pages generated server-side using the Mako template library. The controllers are Pylons-based Python objects that mediate between user actions on the view interfaces and database queries facilitated by the model. While some of the interface logic is JavaScript-coded and executed client-side and while some requests to the server are made asynchronously, the bulk of the user interface is implemented via requests for server-generated HTML pages. This results in the ungainly page refreshes and generally sub-optimal user experience typical of older generation web applications. Moreover, since OLD 0.2 server responses are HTML pages and client requests are standard Hypertext Transfer Protocol (HTTP) form submissions, disentangling the core application logic from the user interface in order to re-purpose the former for, say, an audio dictionary, language-learning software, or command-line interaction, would be prohibitively difficult.

[25] One hard-coded assumption of the OLD is that the space character will be used to delimit words. Based on this assumption, and the morpheme delimiters specified by administrators, the system creates column-aligned interlinear glossed text (IGT) representations and identifies words and morphemes for internal cross-referencing.

While version 1.0 of the OLD is still written using the Pylons web framework, the user interface has been removed and replaced with a standards-compliant (RESTful; Fielding, 2000) Application Programming Interface (API). All incoming and outgoing communication is formatted as JavaScript Object Notation (JSON)[26] and the semantics of a request is determined by the HTTP method used. The effect of this is that multiple interfaces may be used to interact with a single OLD 1.0 web service.
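
As a rough illustration of what it means for the semantics of a request to be determined by the HTTP method, the following Python sketch builds (but does not send) four JSON requests using only the standard library. The host name, the /forms paths, and the attribute names in the payload are hypothetical and are not taken from the OLD's documentation. A real client would also need to authenticate first.

```python
import json
from urllib.request import Request

BASE_URL = "https://old.example.org"   # hypothetical host


def build_request(method, path, payload=None):
    """Build, without sending, a JSON request to a hypothetical OLD service."""
    body = None if payload is None else json.dumps(payload).encode("utf-8")
    return Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Hypothetical attribute names for a new form.
new_form = {
    "transcription": "chiens",
    "morpheme_break": "chien-s",
    "morpheme_gloss": "dog-plural",
    "translations": ["dogs"],
}

# The same resource path means different things under different methods.
requests_by_intent = {
    "create": build_request("POST", "/forms", new_form),
    "read": build_request("GET", "/forms/1"),
    "update": build_request("PUT", "/forms/1", new_form),
    "delete": build_request("DELETE", "/forms/1"),
}

for intent, request in requests_by_intent.items():
    print(intent, request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would send it. Because the server only ever exchanges JSON, any client able to build such requests (a browser application, a script, a mobile application) can act as an interface.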
Currently under development is a single-page browser-based JavaScript application that replicates the functionality of the OLD 0.2 graphical interface while providing a richer user experience. However, I envision additional interfaces to the OLD 1.0 web service. Already constructed is a simple Python module that uses the Requests library to interface with a live OLD application via HTTP;[27] this has been put to good use in the creation and evaluation of morphological parsers (as detailed below). Other possibilities include mobile platform or traditional desktop interfaces, as well as interfaces with other web services/applications such as talking dictionaries or language learning games.

[26] http://www.json.org
[27] See the module at [...].

The OLD 1.0 is licensed under the Apache License, Version 2.0 and its source code can be found on GitHub. The OLD 0.2 is licensed under the GNU General Public License, Version 3 and its source code can also be found on GitHub.

2.2 Other fieldwork tools

This section reviews a subset of existing fieldwork software tools. Those reviewed were selected because they are widely used and/or are similar to the OLD. The intention here is not to demonstrate that the OLD is superior to these tools in all respects but to explore the strengths, weaknesses, and biases of each tool, to discover innovative and useful features that the OLD may borrow, to explore how these tools can complement one another and be used in conjunction, and, of course, to argue for the benefits of the OLD in certain respects. Reviewed here are the two products of SIL International that are probably most widely used by linguistic fieldworkers, i.e., The Field Linguist’s Toolbox (§ 2.2.2) and FieldWorks Language Explorer (FLEx, § 2.2.3). LingSync, a recently released web application for linguistic fieldwork, is discussed in § 2.2.4.
Prior to these discussions, § 2.2.1 explores the role of SIL International in fieldwork and endangered language documentation and the relation of the OLD to that organization and the software it has put out.

There are, of course, many other software tools not discussed here that are designed for, and used in, linguistic fieldwork. The online journal Language Documentation & Conservation (LD&C) has published reviews on many of them. Web-based, general-purpose fieldwork tools include Emdros (Lowery, 2008), TypeCraft (Farrar, 2010; Beermann and Mihaylov, 2010), and LEXUS (Kotcheva, 2009; Ringersma and Kemps-Snijders, 2010). Tools that focus on annotation of digital audio and/or video files include ELAN (Berez, 2007; Hellwig et al., 2013), CLAN (Meakins, 2007), Audiamus (Brotchie, 2007), InqScribe (Garde, 2012), Anvil (Tan and Martin, 2011), Transcribe! (Barwick, 2009), Transana (Afitska, 2009), and EXMARaLDA (Meißner and Slavcheva, 2013). Geared primarily toward the production of dictionaries are Kirrkirr (McElvenny, 2008), TshwaneLex (Bowern, 2007), and WeSay (Perlin, 2012). Other relevant tools include The Linguist’s Assistant (Beale, 2012) and Phon (Buchan, 2011). Finally, many fieldworkers use spreadsheet applications, general-purpose desktop database applications (e.g., Microsoft Access or FileMaker Pro), or (increasingly) their web-based counterparts (e.g., Google Spreadsheets) which offer similar functionality while facilitating multi-user web-based collaboration.

Table 2.2 compares the salient features of the fieldwork applications discussed in the sections that follow. In general, Toolbox and FLEx place emphasis on the creation of dictionaries with supporting texts; FLEx can be viewed as an upgraded Toolbox with improvements in terms of search, multi-user collaboration, cross-platform support, and an open source software license. The OLD and LingSync are open source web applications that focus on multi-user collaborative database creation and cross-platform operability. See the reviews of these tools below for justification of the values in the cells of Table 2.2.

                         OLD       Toolbox   FLEx      LingSync
dictionary (a)           partial   yes       yes       no
texts (b)                yes       yes       yes       no
IGT (c)                  yes       yes       yes       yes
search                   yes       partial   yes       yes
parser (d)               yes       yes       yes       partial
web collaboration (e)    yes       no        partial   yes
web API                  yes       no        no        yes
platforms (f)            WML       W         WL        WMLA
open source              yes       no        yes       yes

(a) Support for creating dictionaries.
(b) Support for creating texts.
(c) Interlinear glossed text (IGT) data display.
(d) Morphological parser creation function.
(e) Web-based, multi-user collaboration.
(f) Platform support: W=Windows, M=Macintosh, L=Linux, A=Android.

Table 2.2: Comparison of fieldwork software.

2.2.1 SIL International

SIL International (SIL) is a global organization that for the past eight decades[28] has made undeniably substantial contributions to the documentation and description of endangered languages. However, because one of its three primary goals is Bible translation[29] and because of its close association with its partner organization Wycliffe Bible Translators, whose mission is “to see a Bible translation program in progress in every language still needing one by 2025” (Wycliffe, 2013), many academic linguists are uncomfortable with being de facto dependent (Dobrin, 2009) on the organization for a wide range of documentation-related tasks.

It is undeniable that SIL is a world leader in endangered language documentation. To give a sense of the size and output of the organization, consider that SIL International has linguists in “more than ninety countries [. . . ], includes approximately 6,500 members, [.
. . ] recently celebrated the completion of its five hundredth translation of the New Testament, [has ongoing work in] over 1,100 other languages” (Svelmoe, 2009), has a bibliography of “over 13,000 [. . . ] books, journal articles, book chapters, dissertations, and other academic papers” (Olson, 2009), and “reported over $41,000,000 in revenue on its 2005 tax return” (Epps and Ladley, 2009). Not only do SIL linguists produce grammars, dictionaries, and collections of texts that are crucial to academic linguistics, they also produce (or have leading roles in the production of) core documentation-relevant technologies and standards. These include fieldwork-facilitating software like The Linguist’s Shoebox and FieldWorks Language Explorer, the keyboard layout editor Ukelele, fonts (e.g., Doulos, Charis, and Gentium) for representing the rare characters used in endangered language transcriptions (SIL International, 2013a), a comprehensive catalogue of the world’s languages (i.e., Ethnologue), and (derived from the previous) the internationally recognized ISO 639-3[30] standard code set for identifying the world’s natural languages.[31]

[28] The first Summer Institute of Linguistics (SIL) was held in 1934 in Arkansas. For a concise history of the organization see Svelmoe (2009).
[29] According to Olson (2009), the other two are (1) research and training of linguists, and (2) literacy and education.

Many academic linguists are uncomfortable with the level of influence that SIL has in the field of language documentation. Works such as Dobrin and Good (2009) can be considered a call to action for the academy to take on greater responsibility in this area. Some go further and argue that the mere presence of wealthy SIL Bible translators in impoverished indigenous communities exerts a powerful proselytizing effect and inevitably works to corrode culture and language (Epps and Ladley, 2009), even despite protestations (cf.
Olson, 2009) that SIL is not a missionary organization and that its members are explicitly prohibited from preaching, baptizing, and creating churches. Olson (2009) views the academic distrust of SIL as thinly veiled condescension toward indigenous peoples’ ability to determine their own cultural, religious, and linguistic practices.

[30] ISO, the International Organization for Standardization, is “an international standard-setting body composed of representatives from various national standards organizations” (Wikipedia, 2013).
[31] There is debate in the linguistic community over whether and how linguists should use the ISO 639-3 standard. Morey et al. (2013) represents a highly critical view. Good and Cysouw (2013) argues that the ISO 639-3 standard is useful in certain cases and proposes a complementary inventory of “doculects”, entities that are rigorously defined according to sets of sources, i.e., documentary artifacts, and upon which such concepts as language and dialect may be defined. See Glottolog at [...].

Handman (2009) argues that secular linguists must acknowledge a destructive ideology of their own, one which seeks to preserve a perceivedly authentic cultural homogeneity that is at odds with the reality of cultural change and exchange. A response to this charge can be seen in the exhortation of Amery (2009) that linguists interested in the future value of their work for revitalization elicit more natural conversations containing common usage and move away from attempting to gather only “pure”, traditional forms, unaffected by contact.

The strongest argument against SIL vis-à-vis endangered language documentation is, to my mind, the claim that the organization tends to strategically move resources and focus away from the most highly endangered languages, “precisely [those] that academic linguistics now deems most urgently in need of attention” (Dobrin and Good, 2009, p.
624), since they expect that soon there will be no readership for Bibles produced in those languages (see also Epps and Ladley, 2009). While a valid concern, this alone clearly does not preclude collaboration between SIL and the wider community of linguists.

Clearly academic linguists and members of Protestant Bible translation organizations like SIL have different high-level goals yet share many common interests. Even those who have no issue with collaborating with SIL and using the technologies generated by that organization should recognize the value in the independent development of tools like the OLD. The features provisioned by such tools have the potential to have a direct and positive impact on the progress of linguistic research, language documentation, and community-based revitalization. The engineering of such tools requires a creative synthesis of developer, researcher, documentor, and educator expertise. As technologies evolve and linguistic landscapes change, we need to be able to adapt in accordance with our own goals.

On this topic, it is worth quoting Dobrin and Good (2009) at length:

    For linguistics to externalize the development of technological and community resources because the problems they solve are practical rather than scholarly, or because they are others’ rather than ours, is increasingly untenable. The problems that call upon specialized linguistic knowledge for their solution are numerous and growing, and indifferent to the traditional boundaries of the discipline. [. . . ] Linguistics could come to more closely resemble fields like medicine or economics, where interplay between theory and practice is welcomed as adding to their richness, and where ‘applied’ forms of work are not seen as belonging to a separate discipline (Dobrin and Good, 2009, pp. 628–629).

The OLD is, in my estimation, a viable alternative to SIL’s fieldwork software tools The Linguist’s Shoebox, The Field Linguist’s Toolbox, and FieldWorks Language Explorer.
However, issues with the organization itself (ideological or otherwise) were not motivating factors in its construction. Rather, the motivations were primarily technical and methodological in nature: the fieldwork software offerings from SIL are focused on a single platform (viz. Windows),[32] are not open to programmatic alteration by outside developers,[33] and do not support web-based multi-user contribution as a core feature.[34]

[32] Shoebox has a PowerPC Mac implementation, but that architecture is all but obsolete. Toolbox has no native Mac or Linux implementations. FLEx can be used on Linux platforms but not Macintosh ones.
[33] While FLEx is open source, Shoebox and Toolbox are not.
[34] Shoebox and Toolbox do not provide this functionality. FLEx does introduce it, though Butler and van Volkinburg (2007) reports crashes when this feature is employed. I have yet to experiment with FLEx’s support for multi-user concurrent access.

2.2.2 Toolbox

Perhaps the best known and most widely used linguistic fieldwork software application is The Field Linguist’s Toolbox (and its predecessor The Linguist’s Shoebox), developed by SIL International.[35] This freely available desktop application is designed to help linguists and anthropologists organize and process their linguistic fieldwork data. Though it has advantages over the OLD, some disadvantages are that it is closed source, focused on a single platform (Windows), and not primarily concerned with the facilitation of web-based collaboration. This section describes Toolbox with reference to its own documentation and a published review and addresses how it compares to the OLD.

[35] Technically, Shoebox is on SIL’s list of supported products whereas Toolbox is endorsed but “not necessarily support[ed]” (cf. [...]). According to Dobrin and Good (2009, p. 623, fn. 6), “Individuals associated with SIL have continued to develop Shoebox under the new name ‘Toolbox’, but these activities do not represent official efforts on the part of SIL’s computing division.”

The following description of Toolbox is quoted from SIL International (2013d).

    Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. . . . Toolbox also has powerful linguistic functionality. It includes a morphological parser that can handle almost all types of morphophonemic processes. . . . It has a user-definable interlinear text generation system which uses the morphological parser and lexicon to generate annotated text. Interlinear text can be exported in a form suitable for use in linguistic papers. Toolbox has export capabilities that can be used to produce a publishable dictionary from a dictionary database.

The OLD can also be used to maintain lexical data. Using the corpus and search objects, any number of distinct lexica may be defined. OLD applications also facilitate the generation of interlinear texts (in part by allowing for the creation of an unlimited number of morphological parsers) and these interlinear texts can also be exported in forms suitable for use in papers. Version 0.2 of the OLD has a limited dictionary interface to forms deemed lexical by the system according to a simple heuristic. Improvements to the dictionary-creation functionality of the OLD, including building on lexical corpora to create dictionary objects exportable to XeLaTeX/PDF, are discussed in § 2.3.9.

The review of Toolbox in Robinson et al. (2007) extols its flexible data structure and its easily comprehensible and programmatically manipulable storage format.
The major criticisms from that source are the difficulty in the initial setup of a Toolbox project, issues with auto-synchronizing lexica and interlinear glossed text documents, weaknesses in supporting data consistency, and a lack of an integrated scripting language for automating functions.

OLD applications are not, in my estimation, difficult to set up. However, this is probably due to the more rigid nature of the OLD data structure in contrast to the more flexible Toolbox one. Installing and serving an OLD application does require some basic proficiency with the command line and server configuration; however, this is facilitated by mature package management tools (i.e., EasyInstall and pip) and is thoroughly explained in the documentation (Dunham, 2013a). Once an OLD web service is being served online and has a valid domain name, the actual setup is quite minimal. It involves specifying the object language name, creating some user accounts, and specifying some optional configuration settings, e.g., inventories for input validation.

Because the OLD prescribes a particular structure for fieldwork data (cf. § 2.3.4), initial configuration does not require data structure specification, i.e., a potentially complicated setup of a hierarchy of fields with their own data types.
The disadvantage of the OLD approach is, clearly, that the data structure can only be adapted to a limited extent to the desires or existing structures of the fieldworkers planning to use it.36 Existing data must be converted to an OLD-compatible format and imported before it can be used within an OLD application.37 Interested readers should consult § 2.3.4 in order to determine whether the OLD data structure is sensible, general, and flexible enough to handle their fieldwork needs. Of course, since the OLD is open source, it is entirely possible for interested parties to modify the data structure directly (though this will require some working knowledge of the technologies used).

36 In my experience, idiosyncratic aspects of fieldworkers’ data structures can usually be accommodated within the OLD structure via the general purpose tagging mechanism. There is, for example, no need to have dedicated database columns (or the equivalent) for simple attributes like non-future tense semantics or need to re-elicit since this information can be encoded via tags. Even multi-valued attributes such as semantic field can be encoded as OLD tags by establishing conventions within the tag names, e.g., sem fld: mankind or sem fld: animals. Forms can then be searched using regular expressions to retrieve results with a specific set of semantic field values. Other types of information can be stored in the general comments attribute of forms and, if necessary, syntactic conventions can be established to facilitate retrieval.
37 Future versions of the OLD may allow users to extend the data structure via the interface.

The OLD version 0.2 suffered from lexicon-text synchronization issues similar to those attributed to Toolbox in Robinson et al. (2007). In that version, HTML representations of collection objects (used to generate texts containing IGT-formatted language data) were inefficiently generated anew upon each read request.
In the OLD 1.0 revision, HTML representations of collections are now generated upon each successful create/update request and are never generated during a read request. This is more efficient since texts (i.e., collections) are more commonly read than modified. However, now whenever a form is updated, the HTML representations of all collections containing it are automatically updated as well. This means that texts (i.e., collections) are always synchronized with the forms that they reference. In addition, the form-referencing attributes of forms (i.e., syntactic category string, break-gloss-category, morpheme break identifiers, and morpheme gloss identifiers; see Appendix A) are also updated when an implicated form is modified (see the next paragraph and § 2.3.10 for details).

In response to the criticism of Toolbox by Robinson et al. (2007) with respect to supporting data consistency, note that the OLD does this in a number of ways, as detailed in § 2.3.8. One such consistency-enforcing feature is effected via the use of relational objects in the attributes of forms. That is, if, say, the name value of an elicitation method object is changed,38 then all forms associated to that elicitation method reflect the change immediately and consistently. This is one of the advantages of the relational data structure: the attributes of numerous objects may receive their values from a single object, thus allowing modifications on one to percolate to all the others associated to it (see § A.1). Where necessary, non-relational values (e.g., strings) are modified upon update of a relevant related object. For example, if a contributor changes the name of a syntactic category object, the system will automatically update the syntactic category string value of all forms whose morphological analyses contain a morpheme of that category.39

Like Toolbox, the OLD does not include an integrated scripting language for automating tasks (cf. Robinson et al., 2007).
Such a feature might be useful for when users wish to make widespread modifications to the data set based on a series of conditions and context-dependent transformations, e.g., finding all forms glossed with a certain morpheme and changing the gloss to one of a number of possibilities, as determined by its neighbouring glosses. Other use cases might be programmatically retrieving sets of minimal pairs or statistics on morpheme frequency. However, since the OLD is a web service that communicates via JSON-encoded data structures across HTTP, such tasks can be accomplished using whatever programming language the so-inclined contributor wishes to use. For example, one could write a Python (or Perl, C++, Java, etc.) script that issues the appropriate HTTP request to retrieve all forms in an OLD application matching OLD search criteria, then performs the modifications client-side, and then issues a series of HTTP update requests to alter the server-side data. The possibilities are endless. That said, the particular feature of being able to perform complex system-wide updates via a dedicated interface (i.e., “bulk update”) has been frequently requested and will probably be implemented in future versions of the software.

38 In the OLD, elicitation methods are objects (i.e., entities with named attributes) in their own right. They have two primary attributes: name and description.
39 See § 2.3.6 for a detailed exposition of the use and method of generation of syntactic category string values and related attributes.

The use of JSON as the OLD’s communication format touches on the commendation in Robinson et al. (2007) of Toolbox’s data storage format as simple to understand and manipulate. All OLD 1.0 communication is JSON.
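As a minimal sketch of the kind of client-side script described above, the following Python fragment rewrites a target gloss according to its neighbouring glosses and then prepares a JSON-encoded update request. The attribute name morpheme_gloss, the hyphen and space delimiters, the /forms/<id> URL pattern, and the example gloss labels are all assumptions made for illustration and are not taken from the OLD documentation. The request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical attribute name; a real OLD form object may name it differently.
GLOSS_FIELD = "morpheme_gloss"


def regloss(form, target, choose):
    """Return a copy of `form` in which every `target` gloss is replaced by
    choose(previous_gloss, next_gloss); neighbours are None at the edges."""
    words = [word.split("-") for word in form[GLOSS_FIELD].split(" ")]
    flat = [gloss for word in words for gloss in word]
    new_words, i = [], 0
    for word in words:
        new_word = []
        for gloss in word:
            if gloss == target:
                prev_g = flat[i - 1] if i > 0 else None
                next_g = flat[i + 1] if i + 1 < len(flat) else None
                gloss = choose(prev_g, next_g)
            new_word.append(gloss)
            i += 1
        new_words.append("-".join(new_word))
    return {**form, GLOSS_FIELD: " ".join(new_words)}


def build_update_request(base_url, form):
    """Build (but do not send) a JSON-encoded HTTP PUT request for `form`.
    The URL pattern is an assumption, not the documented OLD interface."""
    return urllib.request.Request(
        url=f"{base_url}/forms/{form['id']}",
        data=json.dumps(form).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Example: regloss PL as PL.AN when it directly follows the gloss "dog".
form = {"id": 7, "morpheme_gloss": "dog-PL run-PST"}
updated = regloss(form, "PL", lambda prev_g, next_g: "PL.AN" if prev_g == "dog" else "PL")
request = build_update_request("https://example.org/old", updated)
```

Keeping the transformation separate from the request-building step means the context-dependent logic can be checked on local data before any server-side record is altered.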
JavaScript Object Notation is a widely used standard in the programming world for creating string representations of commonly used data types40 and most major programming languages have at least one library for converting between native data structures and JSON. This makes it very easy for interested fieldworkers to access and programmatically manipulate OLD data.41

40 JSON can encode JavaScript objects (i.e., associative arrays or dictionaries), arrays (i.e., lists), strings, numbers, and Booleans.
41 Those with back-end access to an OLD application can also interact with the data more directly via Python or Structured Query Language (SQL) (through MySQL or SQLite).

Relative to the OLD, Toolbox has better support for dictionary creation and boasts a more flexible data structure that users can tailor to their needs. On the other hand, the OLD has arguably better search capability, has better facilitation of data consistency, allows for the integration of media files, is open source, is web-based, and, as a result, has better cross-platform coverage. The differences between these two tools can be traced back to a fundamental difference in the type of fieldwork and fieldworker targeted. Toolbox helps fieldworkers who are primarily interested in documentation and description to create lists of lexical items, dictionaries, and interlinear glossed texts. The OLD is arguably more general in that it can be used to facilitate both language description and theoretical linguistic research. This same contrast becomes apparent when comparing FLEx with the OLD and is discussed further in § 2.2.3 below.

2.2.3 FLEx

This section discusses FieldWorks Language Explorer (FLEx) and compares it to the OLD.
Since this application is being actively developed and promoted by SIL International, it is reasonable to assume that it has an extensive user base and that many readers will be familiar with the FLEx (and Toolbox/Shoebox) approach to computational language documentation and description. I therefore devote considerable attention to this tool in order to illuminate the ways in which it differs from the OLD and the reasons for those differences. There are several ways in which the OLD could be improved by emulating the functionality of FLEx. On the other hand, the OLD is, as argued below, superior to FLEx in a number of key areas.

FLEx is SIL International’s currently supported, general-purpose fieldwork tool. It is a major component of the suite of tools called FieldWorks.

FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, [and] study morphology (SIL International, 2014b).

It boasts a sophisticated graphical user interface and data structure specifically tailored for the creation and maintenance of lexical entries to be assembled into dictionaries. The system also has impressive support for creating interlinear texts, including features for ensuring that these are consistent with the lexicon, for configuring morphological parsers to automate the generation of morphemic analyses, and for identifying phrasal constituents and labelling them according to system-provided (and user-modifiable) grammatical roles, e.g., subject, verb, clause in object position, etc. (SIL International, 2014a).
In addition, FLEx provides powerful search/filtering and bulk editing functionality, is freely available and open source (MIT License, written in C#), and runs natively on Windows and Linux platforms (though no native Macintosh version is currently available).

SIL’s development of the FieldWorks suite represents a transition from the lexicon creation and text analysis focus of Shoebox toward “translation-related tasks for which there have been fewer computer solutions” (SIL International, 2013c). However, from what I can gather, FLEx itself is still focused primarily on the creation of a lexicon and supporting interlinearly analyzed texts. Consider in this regard the description of the software in Black and Simons (2008, p. 37) as “the lexicon and text component of the SIL FieldWorks suite of tools.” The primary tools for assisting with translation-related tasks are ParaTExt and (the now deprecated) Translation Editor.

I will begin by discussing FLEx’s cross-platform support. I successfully installed the FieldWorks suite version 7.0.6 on Ubuntu 12.04.3 (Precise Pangolin) following the instructions at its Linux developer site,42 which is linked to from the main FieldWorks download page. SIL International’s Language Software Development division’s Linux team (LSDevLinux) is working to port FLEx to Linux and Mac operating systems. At the time of writing, the latest stable release of FieldWorks for Windows is 7.2.7 while version 7.0.6 is the one that is available from LSDevLinux for Linux. Note that my computer runs Mac OS X 10.6.8 (Snow Leopard) and I was able to install the FieldWorks suite on the (open source) Ubuntu Linux operating system that was itself installed on my Mac via the (open source) virtualization software package VirtualBox.43 The installation went relatively smoothly and was accomplished entirely through graphical interface tools, i.e., no command line.
Therefore, through some not insignificant level of indirection (i.e., Mac to VirtualBox to Ubuntu to FLEx), I was able to run FLEx on a Mac using open source and freely available technologies throughout. I cannot provide a comparison of the relative performance of FLEx on top of such a technology stack, but I suspect that it would be degraded using this, or a similar, virtualization approach. That is, a native Mac version of the software is still something to be desired.

43 VirtualBox is described as a “virtualization software package [that] is installed on an existing host operating system as an application; this host application allows additional guest operating systems, each known as a Guest OS, to be loaded and run, each with its own virtual environment.”

The most salient difference between FLEx and the OLD is that the former is geared toward practical lexicography. FLEx’s default data structure (though customizable) and its interface conveniences are centred around the creation of lexical entry objects for dictionaries. The primary field (i.e., attribute) of an entry is the lexeme form and entries may have multiple senses, each with their own gloss, definition, category, and example sentence values (cf. SIL International, 2014a). The system recognizes and appropriately displays (using dictionary formatting conventions) homographic entries as well as distinct senses of a given entry. For each sense of an entry, users can specify the category (e.g., N) as well as inflectional features (e.g., neuter gender). Affixal entries can be specified for morpheme type (suffix, prefix, infix, proclitic, etc.), affix type (inflectional or derivational), the category they attach to, and the categorial change (if any) that they effect. For each entry, multiple allomorphs and their contexts of occurrence can be specified.
This information is used to configure the morphological parser. Users can specify a variety of relations between entries which can affect how they are displayed within the dictionary view; these relations include synonym/antonym of, part-to-whole, and variant of. Morphologically complex entries can be described by decomposing them into existing entries and the system can be configured to display these complex entries under the simplex headwords in various ways. Figure 2.1 repeats the examples used by SIL International (2014a) to illustrate dictionary representations generated from FLEx entry objects. Shown here are homographs (bank1 vs. bank2), morphologically complex entries (blinder), variants (blinker), and semantically (wood) and categorially (free) ambiguous lexical entries.

bank1 N financial institution
bank2 N edge of river
blinder (der. of bind, -er, dial. var. of blinker) N either of two flaps on a horse’s bridle to keep it from seeing objects at its sides
free 1) Adj unrestricted 2) V to relieve from restrictions
wood N 1) A dense growth of trees usually greater in extent than a grove and smaller than a forest. 2) The hard fibrous substance used to make furniture etc.

Figure 2.1: Dictionary-style representations targeted by FLEx (cf. SIL International, 2014a).

The FLEx interface also provides a number of conveniences for constructing lexical entries. These include built-in detailed and hierarchically organized analytic information to help with choosing things like grammatical categories and inflectional features for senses.
The categorized entry interface guides language documentors in creating a broad-coverage dictionary by providing a hierarchy of semantic domains as suggestions for entries to elicit and then providing a simplified input form for quickly creating lexical entries that can later be refined via the standard entry interface.

In contrast to FLEx, the OLD assumes a user base that is engaged primarily in the creation not of dictionaries and texts but of sets of forms that are relevant as evidence for or against particular theoretical linguistic claims. This is particularly apparent when one considers that theoretical linguists are often interested in discovering the grammaticality of specific constructions and, as a result, end up collecting ungrammatical forms which really have no place in lexica, dictionaries, or texts as representations of utterance events. Unlike a FLEx entry which may have multiple allomorphs and multiple senses, each sense with its own gloss and grammatical category, an OLD form has a single shape, a single gloss, and a single category. Therefore, additional efforts would be required in order to achieve the lexicographic structure of a FLEx database within an OLD one. One approach would be to designate certain OLD forms as entries and then create references to the other forms whose phonemic, semantic, and categorial information are to be used in the specification of allomorphs and distinct senses for said entries. Using the current OLD data structure (cf. § 2.3.4), this could be accomplished by creating a lexical entry tag and then establishing syntactic conventions for referencing other form objects within, say, the general comments value of the form-as-entry.
For example, a convention could be established according to which inserting the strings allomorph[37], allosense[899], and antonym[1123] into the general comments value could constitute a reference to other forms whose data could be used to assemble information on allomorphs, distinct senses, and antonyms, respectively, of the lexical entry form.44 Other aspects of the FLEx data structure could be encoded within the current OLD data structure via the creation of tags for inflectional features, morpheme types, affix types, derivational morpheme categorial inputs and outputs, etc. In order to display this information via an appropriate dictionary-type representation, an interface to an OLD web service would, of course, need to be aware of any such conventions.

44 String-based references to other forms within the general comments value are a bit of a hack. A more robust solution would be to alter the data structure to allow for different types of relations between forms and other forms, e.g., endowing form objects with sense attributes that can refer to zero or more other form objects.

However, it may be a better strategy to focus on how the OLD can be made interoperable with dictionary-creation software and/or standards—e.g., by implementing appropriate import and export features—rather than focusing on how the OLD can be modified to replicate the data structures and feature sets of such software. Instead of engaging in a futile wheel-reinventing arms race, I anticipate that future work on the OLD will stick to refining its strengths—viz. data-sharing, collaboration, online accessibility, catering to theoretical linguistic analysis, computational implementation of linguistic models—while adding the import/export functionality that is crucial for complementary co-existence with extant useful tools such as FLEx.

It is interesting to note that the differences between the OLD and FLEx data structures reflect differences in fundamental analytic assumptions about the lexicon and other components of the grammar. The OLD assumes that the lexicon is a set of morphemes that are unique form-meaning-category triples. Allomorphy is to be encoded in the phonological component as transformations on phonemic shapes of morphemes. For example, the surface realization of the English morpheme /in/ ‘not’ as [im] before labials is something that should be encoded within a phonology, i.e., an OLD phonology object (cf. § 3.3).45 Similarly, the categorial and semantic ambiguity of FLEx lexical entries would, under the assumptions underlying the OLD, be better analyzed as distinct morphemes. For example, the English noun dog ‘canine’ and the verb dog ‘follow’ are distinct morphemes. If a researcher wishes to analyze the verb as derived from the noun via a phonologically null derivational affix, that is certainly representable within an OLD application.

45 OLD phonologies can even capture suppletive transformations since the phonological rules can be made to be sensitive to phonemic, semantic, and categorial information. For example, an OLD phonology could map /be-z/ ‘’ ‘V-Agr’ to [was], cf. § 3.3.
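The comment-based cross-referencing convention suggested earlier in this section (allomorph[37], allosense[899], antonym[1123]) would need to be recognized by any interface that wished to display dictionary-style entries. The following is a minimal sketch, assuming that the references appear verbatim in a form's general comments string; the regular expression and the function name are illustrative and do not belong to the OLD itself.

```python
import re

# Matches references such as "allomorph[37]". The three relation names follow
# the convention suggested in the text; the pattern itself is an assumption.
REFERENCE = re.compile(r"\b(allomorph|allosense|antonym)\[(\d+)\]")


def extract_references(comments):
    """Group the form ids referenced in a comments string by relation name."""
    references = {}
    for relation, form_id in REFERENCE.findall(comments):
        references.setdefault(relation, []).append(int(form_id))
    return references


comments = "Re-check with speaker. allomorph[37] allosense[899] antonym[1123] allomorph[38]"
references = extract_references(comments)
```

A client could then retrieve the referenced forms by id and assemble their shapes, glosses, and categories into a single entry view.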
Turning to FLEx’s bulk edit functionality, we find an impressive array of features that could be emulated by software like the OLD to good effect. When bulk editing, users first create a filter to select the subset of entries to which the edit should apply, then configure the transformation on their chosen field, optionally indicate ad hoc exceptions to the transformation, preview the effect of the bulk edit, and then apply, if desired. The transformation may insert a particular value into the column46 of all filtered rows, copy a row’s value for one column into another, replace all column values with a specified value, delete the content of a selected column, delete an entire entry, or specify a predefined process to change the value of a column or copy a changed value to another column. The process illustrated in SIL International (2014a) maps a Devanagari transcription to an IPA one. Users can also specify what the system should do when the target column of a bulk edit already contains a value: do nothing, overwrite it, or append to it.

46 In this discussion I am using the terminology of columns and rows in accordance with the tabular representation of entries provided by FLEx in the bulk edit interface. That is, each row represents an entry and each labelled column represents an attribute (or field) of all entries.

Using FLEx, users can create any number of interlinearly analyzed texts. Texts are created by first inserting a transcription into the so-called baseline view. The system then provides the analyze interface for specifying the morphological analysis.47 Via this interface, users can specify a segmentation into allophones48 (the morphemes row), a segmentation into lexeme citation forms, lexeme glosses, lexeme grammatical categories, word glosses, word categories, and a free translation.

47 Technically there is another gloss interface for specifying morphological analyses. The fields of the more complex analyze interface are a superset of those of the gloss interface.
48 Note that this is different from the OLD where there is no dedicated attribute for allophones. This reflects a fundamental difference in how morphological analyses are represented and in how parsers are configured by the two systems. See the discussion below.

An illustration of how a particular analysis of the French sentence Les chiens courraient might be represented within a FLEx text is provided in (1). Note that the system requires that the segmentation into morphemes in the morphemes field be constructible by inserting one or more morpheme delimiters into the word value; that is, one could not segment courraient into, say, courrir-aient, i.e., the verb meaning ‘run’ in its infinitival form followed by a tense-aspect/person agreement suffix. Also note that the values for lexeme entry, lexeme gloss, and lexeme grammatical info49 for a particular morpheme must correspond to values of a lexical entry that is already listed in the lexicon; if the lexicon contains no matching entry, the interface allows for the specification of one without leaving the analyze interface. Note also that if the morphemes value of a given column matches an allomorph form of an existing lexical entry, FLEx will recognize this and populate the lex. entries, gloss, and gramm. info. fields with that entry’s values, as appropriate.

(1) words:              Les              chiens             courraient.
    morphemes:          le     -s        chien   -s         courr
    lex. entries:       le     -s        chien   -s         courrir
    lex. gloss:         the    pl        dog     pl         run
    lex. gramm. info.:  d      x:{d,n}   n       x:{d,n}    v
    word gloss:         the              dogs               were
    word cat.:          d                n
    translation: ‘The dogs were running.’

49 The lexeme grammatical info field of a FLEx entry can specify the category of the lexeme or the categories of the morphemes to which it can affix and the category of the resulting morphologically complex unit. Thus x:{d,n} in (1) indicates that the plural morpheme can suffix to nouns or determiners ({d,n}) and that the category of the resulting complex is that of the free morpheme (x). The way that I have represented this information, however, may not accurately reflect the syntax accepted by FLEx.

Contrast (1) with an analogous analysis of the same sentence as represented by the OLD in (2).

(2) [interlinear example: rows for words, morpheme break, and morpheme gloss; translation ‘The dogs were running.’]

In the OLD representation in (2), there is no distinction between morphemes and lexemes. The system assumes that the morpheme shapes in the morpheme break value are phonemic transcriptions that can be mapped, via some phonological (or spelling) rules, to phonetic (or orthographic) transcriptions. Thus a researcher may segment courraient into courr-aient, courrir-aient, or anything else, as desired. If the user specifies a phonology that generates courraient from courrir-aient, then all the better; but this is not required. Neither is it required that the morpheme shapes and/or glosses used in the analysis correspond to lexical forms already present in the system. An OLD application provides visual feedback on the lexical consistency of a specified analysis (cf. § 2.3.5), but it will not enforce such consistency.

In the OLD, the categories50 value is automatically generated by the system when the form is entered. If the system can find a form with courr as its morpheme break/shape value and run as its morpheme gloss value, then the category of that matching form (in this case v) will be inserted into the auto-generated categories value of the larger form automatically. If no match is found, the category will be specified as ???. Users cannot, at present, directly specify a value for the categories attribute of a form.51

50 Technically, this is the syntactic category string attribute of form objects. That attribute name is actually a misnomer since the categories are not necessarily syntactic. In future versions of the OLD, this attribute will probably be renamed simply to categories, the syntactic category relational attribute to category, and the syntactic category object to category.
51 It may be desirable to allow users the option of explicitly specifying the categories value and have the system suggest a value using the method just described. This may be implemented in future versions of the OLD.

Note also that, unlike FLEx (cf. (1)), the OLD provides no fields for specifying word gloss and word category values. Of course, this information might be implicit in the system, insofar as there exist form objects corresponding to the words of the multi-word form being entered. That is, if one enters (2), the system could, in principle, be modified so that it would check for word-level forms that match and use this information to generate values for these fields. In this example, the system would look for a form with a morpheme break value of courr-aient and the corresponding morpheme gloss value and use this information to auto-generate word category and word gloss values using the category name and translation values of the matching form, e.g., v and were running. If this feature is implemented, these fields may also be made directly user-specifiable, as was discussed for the categories attribute above.

As discussed elsewhere in this dissertation with respect to morphemes, future updates to the OLD will include functionality such that whenever a user enters a multi-morphemic or multi-word form, the system will provide an interface that offers to create form objects for all of the morphemes and words that are implicit in the larger form being entered and which are not already present in the database. This will significantly increase the rate at which OLD data sets grow and will help researchers to generate lexica as a byproduct of eliciting sentential forms.
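The generation of the categories value described above can be sketched as a lookup over (shape, gloss) pairs. This is an illustration under stated assumptions rather than the OLD's actual implementation: the lexicon is modelled as a plain dictionary, morphemes are assumed to be delimited by hyphens and words by spaces, and the gloss labels PL and IMPF are hypothetical stand-ins.

```python
UNKNOWN = "???"  # placeholder the text describes for unmatched morphemes


def categories_string(morpheme_break, morpheme_gloss, lexicon):
    """Build a category string aligned with the break and gloss lines.

    `lexicon` maps (shape, gloss) pairs to category names; any pair that is
    not found is represented by the UNKNOWN placeholder.
    """
    words = []
    for break_word, gloss_word in zip(morpheme_break.split(), morpheme_gloss.split()):
        shapes = break_word.split("-")
        glosses = gloss_word.split("-")
        categories = [lexicon.get(pair, UNKNOWN) for pair in zip(shapes, glosses)]
        words.append("-".join(categories))
    return " ".join(words)


# A toy lexicon containing the free morphemes but none of the suffixes.
lexicon = {("le", "the"): "d", ("chien", "dog"): "n", ("courr", "run"): "v"}
result = categories_string("le-s chien-s courr-aient", "the-PL dog-PL run-IMPF", lexicon)
```

Because unmatched morphemes are marked rather than rejected, an analysis can be saved before its lexical items exist in the database, which is the permissive behaviour the text attributes to the OLD.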
This approach is also, in my judgment, superior to the FLEx approach, which requires that lexical entries be present before they can be used in analyses of texts, since it encourages consistency but allows the fieldworker to forego it for the sake of entering elicitation data quickly.

Before concluding this discussion of FLEx’s text interlinearization functionality, it should be noted that the word gloss and word category fields are not limited to words as defined by whitespace. That is, words can be grouped together and glosses and categories can be provided for these groups. This is illustrated in (SIL International, 2014a) by grouping “of course” into an idiomatic phrasal entry glossed as ‘obviously’ and categorized as an adverb. This is an interesting feature. At present, I am unsure of whether this is desirable or how it might be implemented within an OLD application.

Both the interface and the data structure of a FLEx project are modifiable. When viewing entries in tabular representation, users can configure the ordering of the columns used to display the entries and they can choose which columns are visible.[52] Idiosyncratic fields can be added to entry objects as suits the user’s needs. Such a dynamic data structure is not available within OLD applications, although implementation thereof is a planned feature.[53] One particular type of data structure modification of FLEx that is notable is the ability to subtype textual fields according to writing systems. Thus, for example, the gloss field of a sense of a lexical entry may be subdivided into English and Spanish subfields, thereby allowing a documentor to provide the same gloss for an entry in two distinct metalanguages.[54] Spell checkers can be integrated into the system and values that are misspelled according to their field’s writing system can be discovered and corrected.

[52] Note that I am here talking about the tabular view of multiple lexical entries and not the table-like IGT representation of sentences in a FLEx text.

[53] The OLD data structure can be modified insofar as any number of named tag objects may be defined within an OLD application and any form, file, or collection may be associated to zero or more tags. Since tags may follow hierarchical naming conventions (e.g., noun:num:pl, noun:num:sg, noun:gen:fem, etc.) they allow for categorization and subcategorization of objects. Future modifications to make the data structure dynamic will allow for users to define string-valued attributes on core objects, such as forms, using the entity-attribute-value approach.

[54] While the OLD allows a form to have any number of translations, each form can only have one morpheme gloss value. Therefore, a lexical entry in an OLD application cannot be simultaneously glossed in two distinct metalanguages. This is perhaps not a major failing of the OLD, but it does mean that a morphologically complex form cannot be analyzed using glosses in an alternate language without creating duplicate lexical entries, one for each metalanguage.

FLEx allows multiple contributors to collaborate on a single project across the Internet or within a local network. Contributors have a local copy of the data that they interact with via their FLEx desktop application and they can sync their local data with a master repository. Details of how this is implemented and how conflicts are handled when merging data sets are provided in SIL International (2013b). Butler and van Volkinburg (2007) attests to issues with the program crashing during multi-user concurrent access. However, that review is considerably dated and presumably many of the bugs with this feature have been worked out. Assuming that FLEx’s collaborative functionality works as described, it must be admitted that the application allows for online/offline access to a communally created repository of fieldwork data.

Both FLEx and the OLD allow users to configure morphological parsers to facilitate the automatic morphemic analysis of complex words. However, the two applications differ in their conceptual modelling, computational implementation, and user interface for configuring morphological parsers. OLD parsers consist of three components: a phonology that maps phonetic or orthographic representations to phonemic ones, a morphology that maps phonemic representations to licit sequences of morphemes, and a disambiguator which ranks morphemic analyses according to probability. OLD phonologies and morphologies are implemented as finite-state transducers and the disambiguators are built upon N-gram language models (cf. Jurafsky and Martin, 2008). OLD users can create any number of phonologies, morphologies, and disambiguators and combine them to create any number of parsers. Users specify phonologies as ordered context-sensitive (CS) rewrite rules using a notation and conceptual model with a long tradition of use in phonological research (cf. Chomsky and Halle, 1968). OLD morphologies are simply sets of morpheme category sequences that correspond to the category sequences of licit words; these can be specified manually or extracted automatically from a corpus created and/or specified by the user. Disambiguators are built upon language models extracted from a corpus specified by the user. OLD morphologies and disambiguators can be created without much dedicated effort on the part of the user, e.g., simply by creating a corpus of forms that, in their estimation, contain well analyzed word forms. Creating a phonology, however, requires a dedicated effort in order to formulate and order the requisite rewrite rules.
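The three-component design just described can be made concrete with a toy sketch. This is not the OLD's implementation: the OLD compiles phonologies and morphologies to finite-state transducers, whereas here the ordered rewrite rules are approximated by regular-expression substitutions applied from the underlying to the surface form, parsing is done by naive generate-and-test, the morphology is a plain set of category sequences, and the disambiguator is an add-one-smoothed bigram model over categories. All names (RULES, LEXICON, MORPHOLOGY, CORPUS, parse) and the tiny English-like data set are hypothetical and chosen only for illustration.

```python
import itertools
import math
import re
from collections import Counter

# Phonology: ordered rewrite rules mapping underlying forms to surface forms.
# Each pair is (regex pattern, replacement); "-" marks a morpheme boundary.
RULES = [
    (r"n(?=-[pb])", "m"),  # n -> m before a boundary followed by p or b
    (r"-", ""),            # delete morpheme boundaries last
]

# Lexicon entries: (shape, gloss, category).
LEXICON = [
    ("in", "NEG", "Pfx"),
    ("possible", "possible", "Adj"),
    ("walk", "walk", "V"),
    ("walk", "walk", "N"),
    ("s", "3SG", "Agr"),
    ("s", "PL", "Num"),
]

# Morphology: the category sequences of licit words.
MORPHOLOGY = {("Adj",), ("Pfx", "Adj"), ("V",), ("N",), ("V", "Agr"), ("N", "Num")}

# Category sequences of already analyzed words, used to train the disambiguator.
CORPUS = [("V", "Agr"), ("V", "Agr"), ("N", "Num"), ("Pfx", "Adj"), ("V",)]


def apply_phonology(underlying):
    """Apply the rewrite rules in order, each to the output of the previous one."""
    for pattern, replacement in RULES:
        underlying = re.sub(pattern, replacement, underlying)
    return underlying


def train_bigrams(corpus):
    """Count category bigrams, padding each word with start and end symbols."""
    counts, context = Counter(), Counter()
    for cats in corpus:
        padded = ("<s>",) + tuple(cats) + ("</s>",)
        for left, right in zip(padded, padded[1:]):
            counts[(left, right)] += 1
            context[left] += 1
    vocabulary = {c for cats in corpus for c in cats} | {"</s>"}
    return counts, context, len(vocabulary)


def log_prob(cats, model):
    """Add-one-smoothed bigram log probability of a category sequence."""
    counts, context, vocab_size = model
    padded = ("<s>",) + tuple(cats) + ("</s>",)
    return sum(
        math.log((counts[(left, right)] + 1) / (context[left] + vocab_size))
        for left, right in zip(padded, padded[1:])
    )


def parse(surface, max_len=2):
    """Return (break, gloss, category) analyses of a word, most probable first."""
    model = train_bigrams(CORPUS)
    candidates = []
    for n in range(1, max_len + 1):
        for seq in itertools.product(LEXICON, repeat=n):
            cats = tuple(m[2] for m in seq)
            if cats not in MORPHOLOGY:
                continue  # not a licit word according to the morphology
            if apply_phonology("-".join(m[0] for m in seq)) == surface:
                candidates.append(seq)
    candidates.sort(key=lambda seq: log_prob([m[2] for m in seq], model), reverse=True)
    return [
        ("-".join(m[0] for m in seq), "-".join(m[1] for m in seq), "-".join(m[2] for m in seq))
        for seq in candidates
    ]
```

With this data, parse("impossible") recovers the single analysis in-possible, while parse("walks") yields two analyses that the bigram model ranks by how often their category sequences occur in the training corpus. The sketch also shows why the phonology is the costly component: the lexicon, morphology, and corpus are just collected data, whereas the rules must be formulated and ordered by hand.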
The ability to create multiple distinct parsers within a single OLD application is useful in that it allows different users to experiment with different parsers that accord with their own analytical approaches; it also allows for the creation of parsers that take orthographic transcriptions as input as well as those that take phonetic transcriptions as input. In addition, OLD parsers can be exported as stand-alone command-line utilities that users can use locally and incorporate into their own projects, if desired. Also, since OLD 1.0 applications are web services, parse functionality may be requested by a variety of applications. The OLD parser creation functionality is described in detail in chapter 3.

The FLEx approach to morphological parsers is described in Black and Simons (2008) and the details of the computational implementation can be found in that work and its technical references. In contrast to the OLD approach where users create independent morphology and phonology components, the FLEx approach to parser creation is more tightly integrated into the workflow of creating lexical entries. FLEx users specify allomorphs for lexical entries, category inputs and outputs for derivational morpheme entries, and inflectional templates for the categories used to categorize entries. FLEx assembles all of this information in order to generate the morphological parser for an application. In addition to the functional parser tool, the morphological information specified by FLEx users can be used to automatically generate a human-readable sketch of the morphological component of the grammar.

Clearly there are benefits and drawbacks to each of these approaches. The OLD approach emphasizes the phonological mapping and permits only a relatively simplistic modelling of the morphology, i.e., with no explicit use of the concepts of derivation, inflection, or allomorphy.
In contrast, the FLEx approach encodes phonological mappings within allomorph specifications and allows for a more nuanced modelling of the morphology that allows for the automatic generation of a morphological grammar sketch. In response to this, it should be pointed out that an OLD phonology consists of linguist-readable rewrite rules and may also contain comments that elaborate on the rules as well as tests representing mappings that the phonology should account for; OLD applications could, in the future, be made to transform this information (as well as the morphology information) into a more human-readable format that could constitute a grammar sketch. The ability to create multiple parsers (and parser components) and use them locally or via requests to an OLD web service is, in my estimation, a point in favour of the OLD parser implementation. As to performance comparisons, I cannot provide any; however, as Black and Simons (2008) point out, morphological parsers within the context of a fieldwork/documentation database application do not have high performance requirements since the objective is to parse words during user input and not large sets of existing words.

Butler and van Volkinburg (2007) reviews a version of FLEx that comes with the FieldWorks suite version 4.0.1.[55] This review asserts that the system’s networking functionality, which allows for simultaneous multi-contributor access, is “fairly simple” to configure and use. It also praises FLEx’s bulk editing capability and its features for creating dictionary representations. However, the reviewers have a lengthy list of complaints including crashes during concurrent access, inability to create reverse (i.e., gloss-to-vernacular[56]) dictionaries, poor performance when editing interlinear texts, inability to search lexical entries according to the translation and notes fields, a very long wait for parser loading,[57] and difficulties in downloading the source code.[58] The authors also make the excellent observation that functionality for creating a gloss-to-vernacular dictionary would be very useful, especially to community-based revitalization projects.

[55] Butler and van Volkinburg (2007) lists the version reviewed as 4.0.1 yet Rogers (2010) claims to be reviewing version 3.0 and cites Butler and van Volkinburg (2007) as a review of an earlier version of the software. I do not know which source is inaccurate in this regard, but I assume that “4.0.1” refers to the version of the larger FieldWorks suite and not FLEx itself.

[56] I would call this a dictionary from the metalanguage to the object language.

[57] It is unclear to me whether this long wait time is for compiling the parser, i.e., modifying it, or if it is just for loading it for use. The former would make more sense and would be more understandable. The latter would be a more serious issue.

[58] “Our developer gave up downloading the source code after it had run for 2 days!” (Butler and van Volkinburg, 2007)

Rogers (2010) is a review of FLEx version 3.0. The author voices his appreciation for the following feature additions and improvements in response to the criticisms of Butler and van Volkinburg (2007) (some of which are mentioned above): functionality for creating dictionaries from the gloss language to the vernacular language, ability to specify that distinct lexical items are variants of one another, regular expression search across a variety of fields, and functionality allowing for syntactic labelling of the components of complex forms, i.e., the syntactic tagger. The reviewer complains of “sometimes . . . unbearably slow” interface response time (Rogers, 2010, p. 80), inability to access multiple views simultaneously, issues with importing IGT texts,[59] inadequate export formats (viz. no plain text export format), lack of features for annotating digital recordings, no capability for creating multilingual dictionaries, and no Mac version.

[59] The author was unable to import analyzed texts from Toolbox or any other program.

Clearly SIL International is a large organization with impressive resources and a large user base for developing effective fieldwork software (see § 2.2.1). FLEx improves upon Toolbox in a number of ways that parallel the features of the OLD as touted here (i.e., multi-user, network-based collaboration; multi-field, regular expression-included search functionality; and open source licensing) as well as in a number of ways not matched by the OLD (i.e., IGT text import, bulk editing capability, offline capability, and features facilitating dictionary creation, including the ability to create reverse dictionaries).

A primary advantage of the OLD over FLEx is the more general, form-focused data structure of the former as contrasted with the more specific, lexical entry-focused data structure of the latter. FLEx targets descriptive linguists whose primary goals are the creation of a dictionary, supporting texts, a descriptive grammar, and, ultimately (for many), a Bible translation into the vernacular. In FLEx, lexical entries and sentences are completely distinct entities. However, in the OLD, morphemes, words, and sentences are all represented by the same type of object: the form. These types of form can be distinguished when necessary (by their category, tags, or by patterns in their morphological analyses) but they can also be treated as the same for the purposes of searching, embedding into documents, and building corpora. The OLD, with its support for grammaticality judgments and elicitation method categorizations, is currently tailored more towards theoretical linguists. However, support for dictionary creation can be seen in the (admittedly incipient) dictionary interface and the fact that the database can be used, as is, to amass and curate lexical items.
Clearly there are opportunities to improve the OLD in this sphere, e.g., by updating the dictionary interface, allowing users to define custom lexicographic orderings, and providing better support for form-to-form cross-referencing, e.g., for referencing example sentences, synonyms, related forms, etc. However, while such modification to the OLD would be additive, modifying FLEx to be more OLD-like (e.g., so that morphemes, words, and sentences could all be queried simultaneously or referenced in research papers) would seem to require more foundational changes.

Arguably another advantage of the OLD is the fact that it is a web application and not, like FLEx, a desktop application. While experience with Web 1.0 web sites and poorly constructed Web 2.0 “applications” may lead some readers to view this as a disadvantage, there are a number of reasons to think otherwise. In brief, these are a) a long history of support for multi-user concurrency, b) the benefits of the service-oriented architecture, c) platform agnosticism, and d) the existence of high quality and rapidly evolving frameworks and libraries that allow for the creation of browser-based applications that are constructed according to road-tested design patterns, are thoroughly testable, and have graphical user interface features comparable to those of desktop applications. The first point means that building an application which allows multiple fieldworkers to collaborate on creating a single repository of language data is arguably easier using web technologies which have, since their inception, been required to adapt to multi-agent alteration of centralized resources. The second point alludes to the possibilities for the creation of software that creates new value by building upon pre-existing and independent web services, an example of which might be an application that aids in language learning by drawing on data provided by a number of web services that expose fieldwork-generated resources.
The third point refers to the fact that by housing application logic in web servers and client-side browsers, one can side-step the fractured platform/operating system environment that has long plagued software developers seeking to reach the broadest possible user base. The last point counters the conventional wisdom that web-based applications cannot match desktop-based ones in terms of being reliable, maintainable, testable, and usable. In addition, increasing interest in web application development means that developers’ skills are, as a whole, moving in the direction of greater familiarity with web technologies and, therefore, the chances are greater that developers may be found when needed. A final advantage of the web-based approach results from the fact that web hosting services standardly have redundancies and backups that decrease the chances that valuable fieldwork data may be lost.

Though I have not used FLEx in my own linguistic fieldwork, my research indicates that it is an excellent tool in a number of respects, as described above. Indeed, future development of the OLD will involve both borrowing certain features and ideas from FLEx (e.g., aspects of its approach to bulk editing, its user-modifiable data structure, and its inclusion of word category and word gloss lines in interlinear analyses) while also improving import/export capabilities so as to enhance interoperability and complementarity with FLEx. However, the OLD does, in its present state, respond to a real and present need for a fieldwork tool that is simultaneously general while catering to certain requirements of researcher linguists, focuses on web-based collaboration, and makes use of technologies and design patterns that facilitate the cooperative evolution of tools that advance fieldwork-related goals.
2.2.4 LingSync

LingSync[60] is “a free tool for creating and maintaining a shared database for communities, linguists and language learners” (LingSync, 2014a). It responds to the same needs as the OLD (cf. LingSync, 2013) and, as a result, is similar in many respects. Both tools seek to foster collaboration, data sharing, and data re-purposing among fieldworkers and other individuals and organizations with an interest in endangered language data. Like the OLD, LingSync is open source,[61] web-based software that consists of a number of independent web services and client-side user interfaces. Also like the OLD, it is designed not specifically with lexicographic goals in mind (like Shoebox/Toolbox and FLEx) but with the broader goal of allowing fieldworkers to create general-purpose repositories of linguistic forms (cf. LingSync, 2013, p. 7).

[60] […]

[61] The source code for LingSync can be found at […]. It is released under the Apache License, Version 2.0.

While the OLD and LingSync have a lot in common, there are some salient differences. As is argued below, a foundational yet subtle conceptual difference consists in opposite rankings of the principle of collaboration relative to data privacy, rankings which help to explain some of the differences between the two systems at the feature level. Another major grouping of differences is largely technical and has to do with architecture, technologies used, and approaches to ensuring data privacy and ethical access. In brief, LingSync uses a NoSQL storage solution (as opposed to the SQL-interfaced, relational database back ends of OLD applications), is coded almost entirely in JavaScript (in contrast to the Python/JavaScript logic of the OLD), and crucially employs encryption as part of its data access strategy.

This section begins with these conceptual and technical differences, discussing and evaluating them with reference to the OLD approach.
It then moves on to a comparison of the two tools in terms of features, covering the advantages that each tool has over the other and discussing a few areas wherein both either excel or need work. The advantages of LingSync are, in brief, its flexible and user-customizable data structure, its use of encryption to provide improved data security, its functionality for making data public and discoverable and for automated transmission to institutionally-backed archives, its deployment approach which allows potential contributors to easily and immediately begin using the system, its activity feed, its import functionality, its offline capability, and its glosser module. The advantages of the OLD are its prescribed data structure which adheres to de facto standards and which is integrated into the application logic and interfaces, its columnar display of IGT data with visual feedback on lexical consistency of morphemic analyses, its text (i.e., collection) creation feature, its support for structurally searchable treebank corpora, its feature for creating bibliographies for source attributions of data, its orthography conversion and inventory-based input validation conveniences, and its functionality for creating morphological parsers and attendant implementation of morphological and phonological models. Both tools have partially overlapping yet distinct strengths in the following domains: software documentation, audio/video integration, data versioning, and search. Finally, both tools need to provide better support for bulk editing and dictionary creation.

Both LingSync and the OLD could stand to benefit from borrowing and emulating certain features and approaches of the other.
In fact, in line with the collaborative nature of the tools themselves, and given the extent to which we share common goals and approaches, I am currently engaged in a dialog with the group behind LingSync[62] concerning collaborative development efforts,[63] in particular the creation of web services and GUIs that can interface with components of both systems, including morphological parsers, automatic annotation-audio aligners, and language learning applications.

[62] LingSync development has been, and continues to be, a collaborative effort between “students, professors, and software developers in the Montréal area, including: Alan Bale (Concordia, McGill), Gina Cook (iLanguage Lab, Concordia), Jessica Coon (McGill), Elise McClay (McGill), Gretchen McCulloch (McGill), Hisako Noguchi (Concordia), Tobin Skinner (iLanguage Lab, McGill)” (LingSync, 2014a), and others. The software currently has about 300 registered users and has been (or is being) used within a number of linguistic field methods courses offered by institutions including McGill University (Inuktitut), the University of Ottawa (Teenek), the University of Connecticut (Nepali), Yale University (Quechua), and Pomona College (Igikuria) (Gina Cook, p.c.; LingSync, 2014b).

[63] I have already made some small contributions to one of the LingSync client-side applications and am currently assisting with the development of the LingSync Spreadsheet GUI.

Note that, like the OLD, LingSync is under active development and is adapting in response to the requirements of its users and ever-changing web technologies and web-based resources. Indeed, both tools are presently undergoing transitions such that the features discussed here may be spread across versions or components. The OLD is moving from a Web 1.0 application to a more modern Web 2.0[64] collection of tools consisting of a core web service and a single-page JavaScript application. LingSync is in the process of re-implementing the features of its original online/offline Chrome app (hereafter dubbed “LingSync Prototype”[65]) into “LingSync Spreadsheet,” a single-page JavaScript application written using the AngularJS framework that currently works online only. In addition, improvements to the core LingSync web service modules are ongoing.

[64] Note that “Web 2.0” refers not to the version number of the OLD but is web jargon that refers to a qualitative shift in the nature of web sites and applications over the past decade or so. According to Wikipedia, “Web 2.0 describes World Wide Web sites that use technology beyond the static pages of earlier Web sites”, cf. […].

Ranking privacy and collaboration

Both LingSync and the OLD seek to facilitate collaborative linguistic fieldwork while allowing contributors to keep their data private, as needed. Though both pieces of software share these primary goals and implement features towards their attainment, it is fair to say, in my judgment, that LingSync places relatively more emphasis on privacy (as opposed to collaboration) while the OLD places relatively more on collaboration (as opposed to privacy). This difference helps to contextualize and make understandable certain aspects of the feature sets of the two systems, as elaborated below.
LingSync assumes that users will, individually or in highly coordinated groups, build private corpora and thereafter (if desired) grant access to other contributors or viewers, assuming that the new contributors will adhere to the conventions of the host corpus. The OLD, in contrast, assumes from the outset a state of affairs where multiple users contribute to a single, heterogeneously analyzed data set and rely on basic authentication and authorization plus the honour system to ensure ethical access and curation of data.

[65] A Chrome App is a piece of software written using web technologies (i.e., HTML5, CSS, and JavaScript) but which runs in the Chrome browser and, as a result, has additional capabilities, such as being able to run without an Internet connection and having access to a local file system, cf. […].

In addition to the obviously related care that LingSync takes toward ensuring data privacy via encryption, this basic difference in priority ranking helps to explain the differences in the ways that the two tools approach the lexicon and the conveniences surrounding it. The OLD anticipates the possibility that different users will (initially, at least) enter distinct and incompatible morphological analyses. The system therefore encourages the creation of lexical entries and supplies an interface that provides IGT-embedded feedback on the consistency of the morphological analyses with the extant lexical items. LingSync, in contrast, provides a lexicon module that automatically extracts the lexical items and morphotactic patterns implicit in users’ analyses, assumes that these are relatively consistent, and uses these to provide the auto-glossing feature.[66]

Another, admittedly relatively minor, area where this basic privacy-vs.-collaboration difference has an effect is in the rationalization of the duplicate datum/form feature implemented by both applications.
In the LingSync documentation,[67] this is discussed as a feature for easily creating minimal pairs whereas in the OLD it is discussed as a feature that can allow a contributor to easily create a re-analyzed version of another user’s form without the complications inherent in modifying the original.

[66] LingSync Prototype also provides a graphical visualization of the lexicon of a corpus as a network of nodes. Since nodes with few connections can be indicative of lexical outliers, one could argue that this feature is the functional equivalent of the OLD’s morpho-lexical consistency feedback.

[67] […]

While this is a subtle distinction, understanding how the two pieces of software prioritize privacy and collaboration relative to one another can help to explain the differences in design decisions and features (or lack thereof) discussed below.

Technical differences

This section discusses some technical differences between LingSync and the OLD. While it is relevant to LingSync’s flexible data structure and encryption-based data protection feature, some readers may wish to skip ahead to the less acronym-filled sub-sections that follow.

LingSync data are stored as JSON objects within NoSQL databases: Apache CouchDB on the server and PouchDB[68] on the clients (cf. LingSync, 2013, 2014b). Since the data within these databases are stored as JSON-serialized JavaScript objects, LingSync avoids the performance costs inherent in converting relationally modelled OLD entities to Python instances and then to JSON objects. Moreover, since CouchDB and PouchDB are schema-less, users can add attributes to, and remove them from, the objects that encode their data points as they see fit; that is, these NoSQL storage technologies are designed with structural flexibility as a core feature.
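The structural flexibility of schema-less JSON documents can be illustrated without any database at all. The following Python sketch uses only the standard json module and makes no attempt to model the CouchDB or PouchDB APIs; it simply shows two datum-like documents with different attribute sets coexisting in one collection and surviving a JSON round trip. The example values and the attribute names consultant, elicitation_note, and dialect are invented for illustration.

```python
import json

# Two hypothetical documents in the same collection. The second carries
# attributes that the first lacks; no schema has to be declared or migrated.
datums = [
    {"utterance": "chiens", "morphemes": "chien-s",
     "glosses": "dog-PL", "translation": "dogs"},
    {"utterance": "chats", "morphemes": "chat-s",
     "glosses": "cat-PL", "translation": "cats",
     "consultant": "AB", "elicitation_note": "volunteered"},
]

# Serialization and deserialization preserve each document's own structure.
serialized = json.dumps(datums, ensure_ascii=False)
restored = json.loads(serialized)

# Attributes can be added to or removed from individual documents at any time.
restored[0]["dialect"] = "hypothetical"
del restored[1]["elicitation_note"]


def attribute_inventory(docs):
    """Return the sorted list of attribute names used anywhere in the collection."""
    return sorted({key for doc in docs for key in doc})
```

The price of this freedom is that nothing guarantees two contributors will name or fill an attribute in the same way, so a function like attribute_inventory is about as much as the software can say about the structure of the data without further conventions.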
Perhaps the strongest argument for the use of CouchDB/PouchDB arises from the fact that these tools were expressly designed to facilitate synchronization between a central server-side database and client-side replicas; this means that offline access to data can be implemented atop the technology stack of LingSync more easily than atop that of the OLD. In fact, there are currently no tools that facilitate SQL-based access to relationally structured client-side data across all browsers (cf. Lawson and Sharp, 2011). A final advantage of LingSync’s choice of database is that the data are stored in a human-readable format (i.e., JSON), as opposed to the binary data files of the RDBMSs[69] that manage OLD application data. This is desirable from an archival point of view. Of course, this criticism can be addressed by a) the proposed automatic publishing of OLD data sets to established archives and b) ensuring that XML database dumps are part of an OLD application’s regular backup procedure.

[68] See […] and […].

[69] Note that the OLD has been tested with both MySQL and SQLite, though the latter RDBMS is not built for high levels of concurrency and thus should not be used in production. The database ORM abstraction layer, i.e., SQLAlchemy, allows for a range of other RDBMSs, including PostgreSQL and Firebird, both of which are open source. If needed, these could be used in OLD applications, though some minimal modification/parameterization would be required.

Note that browser support for size-unconstrained[70] persistent storage is currently fractured and appears to be at a standstill. The older Web SQL Database standard is supported by current versions of Chrome, Safari, and Opera but the World Wide Web Consortium (W3C) has ceased to maintain the specification[71] and it is therefore likely that browsers will stop supporting it in the near future. The Indexed Database API (a.k.a. IndexedDB[72]) standard is supported by current versions of Internet Explorer, Firefox and Chrome. However, there is no indication that it will be adopted by Safari or Opera any time soon. Tools like PouchDB abstract away from the underlying storage mechanism (in this case using Web SQL for Safari and Opera, and IndexedDB elsewhere) in order to provide a uniform interface. However, there exist no comparable SQL-based abstractions.[73] Therefore, cross-browser storage of relational fieldwork data (e.g., OLD data) would require both re-designing the query logic (esp. search, cf. § 2.3.6) for the client and re-structuring the data as non-relational objects for client-side storage,[74] neither of which is a very attractive proposition.

[70] There is widespread browser support for the Web Storage (i.e., key/value pairs) standard and its cross-session persistence feature dubbed localStorage. However, the excessive size limitations of localStorage make it unsuitable for effective client-side persistence of a database of linguistic fieldwork.

[71] […]

[72] […]

[73] Note that NoSQL databases do not expressly preclude relationally structured data, i.e., tables/objects referencing other tables/objects in order to represent one-to-many and many-to-many relationships, etc. However, they do forego implementation of the declarative SQL-based interface that is essentially a necessity for querying relational data.

[74] Until, of course, some enterprising individual writes a JavaScript-based SQL engine atop the IndexedDB standard. However, this seems unlikely given that NoSQL solutions are very much in vogue at the present moment. In brief, the primary arguments for NoSQL databases point to their flexibility (i.e., schema-less-ness) and their ease of (horizontal) scalability, i.e., their ability to effectively handle extremely large data sets by distributing tasks across low-cost hardware. The advantages of relational databases include their maturity and familiarity, the conceptual elegance and lack of redundancy of normalized relational data, and the declarative language (i.e., SQL) for specifying complex queries, functionality leveraged to good effect in OLD search (cf. § 2.3.6). The article at […] provides a well-balanced, and ultimately pro-NoSQL, overview of the debate.

LingSync’s application logic, both on the server and on the client, is written entirely in JavaScript.[75] This approach has the benefit of freeing developers from switching between different programming language syntaxes and idioms and, in certain domains such as input validation, code can be reused on both server and client without the wasteful duplication that is sometimes necessitated by the Python/JavaScript server/client technology stack of the OLD. While JavaScript has been deservedly maligned, as witnessed by the implicature in the title JavaScript: the good parts (Crockford, 2008), one of the most widely referenced texts on the language, the explosion in browser-based application development over the past decade or so has resulted in a plethora of tools for creating (e.g., structuring, testing, etc.) sophisticated applications in the language.[76] The obvious response from the pro-Python developer would be to point to that language’s relative maturity (and consequently larger catalogue of special-purpose libraries, including NLTK) and its arguably more readable (and hence maintainable) syntax. However, these considerations are, to a certain extent, subjective and, more importantly, largely irrelevant in the context of an ecosystem of fieldwork-facilitating web services that communicate via standard methods (HTTP and JSON) regardless of their underlying implementation.

[75] The one exception to this is the LingSync Android application which is written in Java (cf. LingSync, 2013).
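The claim that implementation language matters little once services speak HTTP and JSON can be illustrated with a small, self-contained sketch. The service below is a stand-in written with Python's standard library; its URL path (/parse), its payload shape, and its "parser" (which merely splits words on hyphens) are hypothetical and correspond to neither the OLD nor the LingSync API. The point is only that the client function depends on nothing but the HTTP and JSON conventions, so the server could equally be written in JavaScript or any other language.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ParseHandler(BaseHTTPRequestHandler):
    """Hypothetical parse service: accepts JSON, returns JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        # Stand-in for a real morphological parser: split on hyphens.
        result = {word: word.split("-") for word in request["words"]}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def request_parse(url, words):
    """Client side: knows only the URL and the JSON payload convention."""
    payload = json.dumps({"words": words}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    # An empty ProxyHandler keeps the local request from being routed via a proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Start the service on a free local port, query it once, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), ParseHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/parse" % server.server_address[1]
        return request_parse(url, ["chien-s", "chat"])
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns a dictionary mapping each requested word to its list of parts. Replacing ParseHandler with a service written in another language would require no change to request_parse.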
Rigid versus flexible data structures

Both the OLD’s prescriptive and system-integrated data structure and LingSync’s flexible and user-customizable one have their benefits and drawbacks. These are reviewed here.

LingSync emphasizes its flexible model, or data structure, as a design feature. This dissertation, in contrast, argues for the utility of the OLD way of structuring linguistic fieldwork (cf. § 2.3.4), while recognizing that the system should be modified to allow users more freedom in modifying the data structure to their needs.[77] Clearly, fieldworkers will be more likely to use a tool that can be tailored to their pre-existing methods or ideas about how linguistic data should be structured. Since these tools seek to attract contributions from a broad spectrum of fieldworkers, such flexibility is undeniably desirable.

[76] Note that the reason that JavaScript has received such attention from developers is that it is, for all practical purposes, the only programming language that can be run in a web browser.

[77] As discussed elsewhere, future versions of the OLD will allow for a more user-configurable data structure.

On the other hand, there is a long tradition of scholarship in language documentation, description, and analysis from which a number of de facto standards have emerged. The various types of interlinearly glossed text formats are one set of examples, as are the lexical structures built into tools like FLEx and Toolbox, and the time-aligned annotation data structure of ELAN and similar tools. In order to make use of these conventions (e.g., to provide feedback on morpho-lexical consistency, to automate dictionary creation, to implement sophisticated input validation, to automate parsing and glossing, and to generate re-usable web-accessible data stores), the software needs to be aware of their forms and meanings. In short, creating a flexible and customizable model is one thing, but making features flexible to match is quite another.
For practical reasons, therefore, some structural and semantic assumptions about the data will inevitably be embedded in the software.

In fact, LingSync does impose and assume particular structures for linguistic data. At the highest level there is the tree structure of corpora containing sessions containing datums.78 This is, in fact, somewhat restrictive since not all data aggregated by fieldworkers come from elicitation sessions; some may come from published texts such as research papers, grammars, or dictionaries. Contrast this with the OLD, where corpora and elicitation sessions (a subtype of OLD collections) are both modelled as ordered lists of references (with repeats possible) to forms in the master repository.79

Another in-built LingSync data structure is present at the level of individual datums, where the four fields utterance, morphemes, glosses, and translation are present by default and where their usage is necessary in order to provide automatic glossing functionality.

78 As an aside, observe that since any registered LingSync user can create their own corpus and specify privacy settings, each user is effectively the administrator of their own corpus and, therefore, the system is arguably more democratic than the OLD, where access settings for an entire language-specific database are controlled by administrators.

Also note that the flexibility of the LingSync data structure is limited by its two extant GUI applications, i.e., LingSync Prototype and LingSync Spreadsheet; in particular, hierarchical structures are not possible in either application. LingSync Spreadsheet offers a full template and a compact template, the first allowing six user-specifiable fields and the latter four. Here users may choose from a predefined set of input fields, i.e., column labels or object attribute names.
LingSync Prototype (i.e., the Chrome app) offers more flexibility: users may add any number of fields to the interface, label them as they please, specify whether the data they contain are to be made confidential via encryption, and even provide metadata describing what the field should contain. However, neither interface allows for the creation of a hierarchical data structure, despite the fact that the underlying models, i.e., databases, would permit it. That is, a LingSync user could not create a source attribute on datums comparable to OLD sources, i.e., one whose value is itself an object (a set of labelled fields) with attributes like author and year. Related to this, note that the value of the tags attribute of LingSync applications is a simple string, whereas in OLD applications a tag is an object in its own right, i.e., an entity with attributes such as name and description. This allows for more complex categorization of data and greater searchability. For example, OLD users can accurately search for forms associated to tags that contain a certain string in their description value, or for forms drawn from a source that was published between 2005 and 2010, etc. To do the equivalent in LingSync would require setting up conventions for encoding structured information as strings within field values and using regular expression searches to get the equivalent results, e.g., creating a source field in LingSync Prototype, populating it with values such as Chomsky 1957 and Sapir 1915,80 and finally performing the search by formulating a regular expression pattern like / 20(05|06|07|08|09|10)$/.81

79 Actually, it is only the Spreadsheet interface of LingSync that enforces the division of data points into disjoint session objects. The Prototype interface enforces no such requirement (Gina Cook, p.c.).

Note also that the fact that OLD tags, sources, categories, etc. are bona fide objects encourages consistency.
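The contrast drawn above between object-valued and string-valued attributes can be made concrete with a minimal sketch. It is not code from either system: the `Source` class, the dictionary layouts, and the "Doe 2007" entry are invented for illustration, and only the regular expression is taken from the text.

```python
import re
from dataclasses import dataclass


@dataclass
class Source:
    """A source as an object with its own attributes (hypothetical schema)."""
    author: str
    year: int


# Object-valued attribute: each form references a structured source.
FORMS = [
    {"transcription": "form a", "source": Source("Chomsky", 1957)},
    {"transcription": "form b", "source": Source("Sapir", 1915)},
    {"transcription": "form c", "source": Source("Doe", 2007)},  # invented entry
]

# String-valued attribute: the same information encoded by convention.
DATUMS = [
    {"utterance": "form a", "source": "Chomsky 1957"},
    {"utterance": "form b", "source": "Sapir 1915"},
    {"utterance": "form c", "source": "Doe 2007"},  # invented entry
]

# The pattern from the text: a space, a year from 2005 to 2010, end of string.
YEAR_2005_TO_2010 = re.compile(r" 20(05|06|07|08|09|10)$")


def forms_published_between(forms, first_year, last_year):
    """Structured search: the year is compared as a number."""
    return [f for f in forms if first_year <= f["source"].year <= last_year]


def datums_matching_year_pattern(datums):
    """String search: works only while every value follows the convention."""
    return [d for d in datums if YEAR_2005_TO_2010.search(d["source"])]


structured_hits = forms_published_between(FORMS, 2005, 2010)
string_hits = datums_matching_year_pattern(DATUMS)
```

Both searches find the same record here, but the string-based one silently misses any value that departs from the convention (e.g., a year written in parentheses), which is the fragility at issue.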
In an OLD application, a user cannot tag something as, say, generic aspect without first creating a tag with that name. While some may view this as an unnecessary impediment to rapid tagging, the delay is, in my view, justified by the consistency it encourages. That is, a user is less likely to create a new, semantically redundant but formally distinct, tag like aspect: generic if they have to go through the process of intentionally creating that tag, a process which, as a matter of convention, should include checking whether a relevant tag for their purposes exists already.82

80 These values are oversimplified for exposition. To get the full equivalent of the OLD functionality would involve requiring users to compose JSON or some simpler attribute-value serialization.

81 Note that both FLEx and Toolbox offer greater flexibility than LingSync. In fact, Toolbox is arguably the most flexible since its users can create hierarchical data structures via the interface and can configure validation for the fields. FLEx allows users to specify custom fields/attributes for the hierarchically organized locations/objects that are built into the system (e.g., entries, senses, etc.); the OLD’s move towards greater flexibility will probably follow the FLEx approach.

Of course, such considerations will not be novel to the clever developers behind LingSync. However, they should temper the temptation to appeal to an over-simplified rhetoric of “good” flexibility versus “bad” structure. The motivation behind the LingSync approach is clearly to allow users to configure their data structures to suit their needs and then later modify the application to make use of those structures once trends begin to emerge. In fact, this is already implemented to some extent in LingSync Prototype insofar as user-entered fields become visible by default in the interface once they pass a certain threshold of use.
The point remains, however, that there is clearly a spectrum between unusably unstructured and smotheringly prescriptive, with several positions along that spectrum, including the OLD’s, being defensible and having their own set of advantages, as described above.

LingSync advantages

Having discussed LingSync’s flexible and user-customizable data structure, this section reviews what are, in my opinion, the remaining primary advantages of that system. These are encryption-based data access control; functionality for making data public and discoverable and for automatically transmitting them to archives; a low-barrier-to-entry deployment approach; the activity feed; import functionality; offline capability; and the automatic glosser.

Encryption-based privacy. LingSync takes special measures to ensure that access to data can be effectively controlled. Users create corpora, which contain sessions, which contain datums, i.e., words, phrases, sentences, etc. The creator of a corpus specifies which (if any) other users are to be granted administrator, writer, or reader privileges to that corpus and encryption is used to control access. That is, data not specified as public are encrypted and can be decrypted only by authorized users. This ensures that even if confidential data are somehow leaked, they will not be decipherable by unauthorized individuals. This is an added security measure with clear advantages, and one that is not implemented by the OLD.

82 This is really an argument for the relational model as much as it is for hierarchical structuring of data. That is, LingSync JSON objects with source attributes whose values are objects will contain a lot of redundancy when multiple datums reference the “same” source object. In order for LingSync to encourage consistency in the way being described, the application logic would need to aggregate previously used source or tag objects and present them to users as suggestions.

Publicization, discoverability, and archiving.
One particularly valuable set of features implemented by LingSync allows users to tag subsets of their data as public, to make these data discoverable (e.g., via search engines), and to automate the communication of these data to language archives, e.g., the Open Language Archives Community (OLAC)83 (cf. LingSync, 2013), thus further enhancing discoverability while also working towards long-term preservation of data. This is functionality that the OLD plans to implement but has not yet. Making data publicizable, discoverable, and archivable is, in fact, crucial to the overarching goal of making endangered and understudied language resources easily available to a wider audience (i.e., wider than the set of contributors to a corpus/database) so that research and revitalization efforts may profit.

83 http://www.language-archives.org

Easy to start. An appealing feature of LingSync’s deployment approach is its publicly available LingSync server, which allows users to get started quickly creating usable, shareable corpora. By way of contrast, in order to use the OLD to document a language for which there is currently no OLD application set up, one must download the software, install it on a server, purchase a domain, and have it resolve to the server’s IP address. Since this requires some level of technical proficiency and since it involves a delay in actually beginning to use the software for fieldwork, it is a considerable barrier to adoption. While LingSync’s public server is not (at least according to my understanding) meant to be a permanent solution for large-scale data collection projects (i.e., such projects will need to set up hosting for their own LingSync web services), it is nevertheless desirable as a means of allowing contributors to immediately begin experimenting with the tool.

Activity feed.
LingSync Prototype’s activity feed, a widget that informs users of recent updates and other activity on the corpora that they are interested in, is a useful feature that the OLD could benefit from borrowing.84 It is helpful for reminding a user of what they were working on previously or what other users have been doing in a shared corpus.

84 Note that the activity feed is a feature present in the LingSync Prototype Chrome app (offline/online) but not currently implemented in the LingSync Spreadsheet web application.

Import. LingSync (2013, p. 5) proposes the implementation of import from a number of formats, including “ELAN XML, CSV, and text file formats.” As far as I have been able to discern, LingSync Prototype implements import from Comma-Separated Values (CSV) files while the LingSync Spreadsheet web application provides no import functionality as yet. Since many software applications (e.g., FileMaker Pro, Microsoft Excel, LibreOffice Calc, etc.) allow users to export/save structured data in CSV format, this is a useful feature. LingSync Prototype provides an interactive import interface wherein users can view the data that they are about to import as a table and specify how to label the columns; that is, they can interactively specify how to map their structured data to the structure of the LingSync corpus that they are using and add additional fields/column labels as needed.85 While LingSync’s import features are clearly still under development, the OLD, at present, unfortunately implements none whatsoever.86

Offline. Offline functioning of full-featured fieldwork database applications is desirable. In my own experience, fieldwork involves either a speaker working with researchers in an institution with ubiquitous Internet access or a fieldworker travelling to a remote location with sporadic Internet access for a couple of weeks at most.
Under these circumstances, offline capability is a desirable convenience but not really a necessity. Even in the latter scenario, well-functioning export and import capabilities are sufficient. That is, relevant data can be exported and consulted for in-field elicitation planning, data can be recorded in consistently structured formats, and then uploaded and imported to the system when web access is possible (usually in the evening at a motel). However, for many fieldworkers, fieldwork really is “in the field,” i.e., in remote locations where Internet connectivity may not be possible for several weeks. In such circumstances, inability to make use of the data and features of one’s web-based fieldwork software (esp. search) may significantly delay progress. Thus LingSync Prototype’s offline functionality87 (as facilitated by its use of CouchDB and PouchDB as mentioned above) is a valuable feature that has no parallel (as yet) in the OLD.

85 See for a tutorial on how this works.

86 Though, again, since both tools expose an HTTP/JSON API, any competent programmer should have little trouble in writing a script to convert their structured data to JSON and then issue the appropriate POST requests to import their data to the central server-side database of either.

Auto-glosser. LingSync implements an automatic glosser, i.e., a feature whereby the system suggests morpheme breaks and glosses to contributors during data entry. These suggestions are based on past analyses of the same words previously entered into the system. Soon to be implemented are variations on this feature, i.e., where the entering of a morpheme-segmented word will result in the auto-suggestion of glosses and transcriptions, or where user-entered morpheme glosses will serve as the trigger.
This is a useful feature which has the potential to expedite data entry.88

87 The LingSync Spreadsheet web app does not currently function offline.

88 It is interesting to compare FLEx’s morphological parsing features here. FLEx suggests morphological analyses both a) simply by looking for exact matches of a word in its database and suggesting past user-supplied analyses of it and b) by using a true, user-configured morphological parser. These two types of suggestion are distinguished in the interface via colour-coding and the user can choose which, if any, to use (SIL International, 2014a).

2.2.4.5 OLD advantages

The advantage of a prescribed data structure integrated into application logic and interfaces has been argued for above and is also discussed extensively elsewhere in this dissertation, notably in § 2.3.4. Additional advantages of the OLD relative to LingSync include IGT-embedded feedback on lexical consistency of morphemic analyses, functionality for creating formatted texts with linguistic examples, structurally searchable treebanks, bibliography creation for source attribution, orthography conversion, inventory-based input validation, and functionality for creating morphological parsers and attendant implementation of morphological and phonological models.

Morpho-lexical consistency feedback. The graphical user interface of the OLD 0.2 displays forms in columnar interlinearly glossed text format by default and this presentation includes visual feedback on lexical consistency of morphological analyses. That is, users can immediately see a) which glosses correspond to which morpheme shapes (because words are aligned in columns) and b) the extent to which morphological analyses are consistent with the lexical entries already in the system.
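The kind of check that underlies such feedback can be sketched as follows. This is an illustrative approximation only, not the OLD’s implementation: it assumes that words are separated by spaces and morphemes by hyphens, that the lexicon is a set of (shape, gloss) pairs, and the example morphemes are invented.

```python
def split_words(line, delimiter="-"):
    """Split an analysis line into words, and each word into morphemes."""
    return [word.split(delimiter) for word in line.split()]


def consistency_feedback(morpheme_break, morpheme_gloss, lexicon):
    """Report, per morpheme, whether shape, gloss, and pairing are in the lexicon.

    `lexicon` is a set of (shape, gloss) pairs (an assumed representation).
    """
    break_words = split_words(morpheme_break)
    gloss_words = split_words(morpheme_gloss)
    if len(break_words) != len(gloss_words):
        raise ValueError("morpheme break and gloss lines have different word counts")
    known_shapes = {shape for shape, _ in lexicon}
    known_glosses = {gloss for _, gloss in lexicon}
    feedback = []
    for shapes, glosses in zip(break_words, gloss_words):
        if len(shapes) != len(glosses):
            raise ValueError("a word has different numbers of shapes and glosses")
        for shape, gloss in zip(shapes, glosses):
            feedback.append({
                "shape": shape,
                "gloss": gloss,
                "shape_known": shape in known_shapes,
                "gloss_known": gloss in known_glosses,
                "pair_known": (shape, gloss) in lexicon,
            })
    return feedback


# Invented toy lexicon and analysis.
LEXICON = {("ka", "1SG"), ("tu", "run")}
report = consistency_feedback("ka-tu mi", "1SG-walk dog", LEXICON)
```

An interface could then colour each morpheme according to these three flags, e.g., distinguishing a known shape paired with an unexpected gloss from a morpheme that is absent from the lexicon altogether.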
Experience has shown the consistency feedback in b) to be much appreciated by groups of fieldworkers looking to ensure consistency of analysis.89 The LingSync Spreadsheet interface does not at present align words and their analyses into columns (though LingSync Prototype does) and neither of the two extant LingSync interfaces implements a morpho-lexical consistency feedback feature like that of the OLD.

89 One issue with the OLD’s morpho-lexical consistency feedback feature (pointed out to me by Lisa Matthewson, a reviewer and an administrator of an OLD application for Tlingit) is that since users can create their own incorrect lexical entries, the system can end up reinforcing multiple internally consistent but mutually inconsistent analysis patterns. To a certain extent, curtailing such behaviour is the responsibility of the administrators of a given OLD application. Nevertheless, a useful potential improvement to the OLD in response to this would be to have the system interactively alert users during data entry about possible lexical matches for the morphemes they are entering, perhaps even including usage statistics on said morphemes.

That said, LingSync’s auto-glosser feature plays a similar role to the OLD’s morpho-lexical consistency feedback feature in that morphemes used in previously entered analyses will be suggested to LingSync contributors during data entry, thereby promoting consistency. In fact, since the LingSync auto-glosser presumably ranks its suggestions according to usage counts, it will be better at promoting overall consistency than the OLD approach, which can actually promote inconsistency in the case where a rogue contributor continues to enter idiosyncratic analyses (i.e., analyses that are not consistent with those of the other contributors) and is reinforced in doing so by the visual feedback, which indicates consistency with the lexical entries that they themselves have created.
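A lookup-based glosser that ranks its suggestions by usage count, as the LingSync auto-glosser is presumed to do above, might look like the following minimal sketch. The class and method names are hypothetical and the data are invented; nothing here is taken from LingSync’s code.

```python
from collections import Counter, defaultdict


class AnalysisMemory:
    """Remember past analyses of whole words and suggest the most frequent first."""

    def __init__(self):
        # word -> Counter of (morpheme_break, gloss) analyses
        self._analyses = defaultdict(Counter)

    def record(self, word, morpheme_break, gloss):
        """Register one use of an analysis for `word`."""
        self._analyses[word][(morpheme_break, gloss)] += 1

    def suggest(self, word):
        """Return known analyses of `word`, most frequently used first."""
        counts = self._analyses.get(word, Counter())
        return [analysis for analysis, _ in counts.most_common()]


memory = AnalysisMemory()
memory.record("katu", "ka-tu", "1SG-run")   # majority analysis
memory.record("katu", "ka-tu", "1SG-run")
memory.record("katu", "kat-u", "run-3")     # minority analysis
suggestions = memory.suggest("katu")
```

Because the ranking pools every contributor’s analyses, a minority analysis is always listed below the majority one, which is the sense in which such a mechanism promotes overall consistency. It also returns nothing for a word that has never been analyzed.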
One response to the critique that the OLD’s feedback can reinforce idiosyncratic analyses is to point to the OLD’s morphological parser functionality (cf. chapter 3), which is superior to the LingSync auto-glosser (since it can propose analyses for unattested words) and which will, when it is incorporated into the data entry GUI of the OLD, similarly promote overall consistency by suggesting a single most probable parse for a word. The other response is to assert that in some cases a system which allows for a plurality of analyses is in fact superior since at certain stages of research it is not clear which analysis should be preferred; a system that is constantly pestering the minority analyst to conform to the majority view may result in the mistaken abandonment of a superior analysis.

Formatted texts with linguistic examples. A feature not available in LingSync applications, and one which I argue to be highly useful, is the ability to create documents consisting of markup90 and representations of data. These are the OLD collection objects, which allow users to create custom-formatted and exportable (to LaTeX) research papers and annotated representations of narratives, elicitation sessions, etc. out of the data present in the system. LingSync Prototype does allow for cross-session search results to be saved as exportable data lists (cf. LingSync, 2013); however, these do not allow for the interleaving of prose and markup with the data points. OLD collections are useful since they allow users to annotate sequences of forms and to create relatively sophisticated91 drafts of documents (e.g., research papers, annotated elicitation records, representations of narratives) containing their data.

Treebanks. Similar to the collections-as-texts discussed above are OLD corpora, which are also ordered lists of references to forms. Corpora, however, contain no additional text (i.e., prose and markup) but are instead used a) to generate treebanks that can be searched for structural patterns using TGrep2 and b), in future versions, to generate other single-file representations that can be searched for cross-sentential patterns using, for example, NLTK’s corpus utilities. The ability to search a fieldwork database for structural patterns greatly increases the power of search. LingSync does not currently allow for structural search of phrase structure treebanks92 nor does it (to my knowledge) permit cross-sentential (i.e., cross-datum) search.

90 Currently, the lightweight markup languages Markdown and reStructuredText are supported. Both of these languages can be used to generate HTML; the latter can also generate LaTeX documents.

91 Note that OLD collections can be associated to multiple files, i.e., images, videos, audio recordings, and/or text documents. I worked with a group on eliciting a traditional narrative with a Blackfoot consultant and we used this functionality as follows. The collection representing the narrative was itself associated to a) a long audio recording of the speaker delivering the narrative in a practised and uninterrupted manner and b) a video file consisting of still images depicting the content of the narrative with bilingual subtitles and the just-mentioned audio recording. The form objects that constitute the transcription and analysis of the narrative were then also each associated to a) smaller audio files of the speaker uttering the relevant form and b) image files that were drawn based on the story and used as a cue for the speaker during the elicitation of the uninterrupted recording. All of these media files are embedded within the HTML representation and can be viewed/played while reading the textual content of the collection. A further innovation on collections in the OLD 1.0 allows them to reference other collections, thus permitting users to create large, exportable texts from smaller, independent collection objects.

Bibliographies.
The OLD allows contributors to create source objects for citing the sources of data from published materials and for creating bibliographies. As of the OLD version 1.0, the data structure for sources is taken from the BibTeX specification and source modification requests are validated server-side according to that specification. Since an OLD form can reference a source, one can accurately and consistently indicate the origin of data that come from texts such as dictionaries or grammars. In future versions of the OLD, users will be given the option of auto-generating a bibliography for collections that contain forms with sources, and LaTeX exports will include the appropriately generated BibTeX bibliography file. Though relatively minor, this convenience further facilitates the creation of professionally formatted research papers and other texts from within an OLD application.

92 LingSync Prototype does currently provide a syntacticTreeLatex field by default, which the system assumes will contain valid phrase structures in the bracket notation conventions of the LaTeX tree-drawing package Qtree. The LaTeX export function then lists these tree representations under the (gb4e) IGT data representations. However, these trees can only be searched for structural patterns insofar as one can concoct elaborate regular expression patterns over them, i.e., not very far.

Orthography conversion. The OLD provides support for orthography conversion (a feature not implemented by LingSync). That is, administrators can specify simple mappings between orthographies so that users can create, read, update, and search transcriptions using a chosen orthography while the system transparently converts input to and from an administrator-specified, system-wide storage orthography. This allows a group of fieldworkers to create a consistently transcribed data set despite the fact that different users may be using different orthographies to transcribe the data.
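A simple mapping-based conversion of the kind described above can be sketched as follows. The approach shown (longest-match-first substitution with a regular expression) and the toy orthographies are assumptions made for illustration; the OLD’s actual conversion logic may differ.

```python
import re


def make_converter(mapping):
    """Build a function that rewrites text according to `mapping`."""
    if not mapping:
        return lambda text: text
    # Try longer graphemes first so that digraphs are not split up.
    graphemes = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(g) for g in graphemes))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


# Invented user orthography (digraphs) and storage orthography (single symbols).
USER_TO_STORAGE = {"sh": "š", "ch": "č", "ts": "c"}
STORAGE_TO_USER = {stored: typed for typed, stored in USER_TO_STORAGE.items()}

to_storage = make_converter(USER_TO_STORAGE)
to_user = make_converter(STORAGE_TO_USER)

stored = to_storage("shatsa")
displayed = to_user(stored)
```

Round-tripping of this kind is lossless only when the mapping is unambiguous in both directions; if, say, the user orthography also used a bare "c", converting back from storage would no longer be well defined.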
Since many endangered languages have only recently come to be written down, orthographic conventions are often still in the process of being established and it is not uncommon to have several competing writing systems, each with its own staunch adherents. Since OLD applications seek contributions from as wide a user base as possible, the orthography conversion feature can prove quite useful.

Inventory-based validation of transcription input. Administrators of OLD applications can also specify inventories of graphemes for the various built-in transcription attributes93 of forms and configure input validation so that warnings are issued or errors returned when invalid transcriptions are entered. This inventory information is also used to generate field-specific keyboard widgets to ease data input when contributors do not have access to appropriate keyboard layouts for the language they are documenting.94

93 The “transcription” attributes are orthographic transcription, phonetic transcription, narrow phonetic transcription, and morpheme break.

94 LingSync Prototype allows users to specify Unicode keyboards as well, though these are not based on field-specific character/grapheme inventories.

Parsers and model implementations. Two major features of the OLD that have no equivalents in LingSync are the morphological parser creator and the computational implementation of linguistic models that it entails. The OLD uses finite-state transducers (FSTs) and N-gram language models (LMs) to allow users to create computational implementations of an unlimited number of phonologies, morphologies, disambiguators, and morphological parsers (cf. chapter 3). These can be used to a) provide feedback on the consistency of user-entered representations with user-specified grammatical models and b) provide automatic generation of morphological analyses and/or transcriptions, a boon to rapid data entry.
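The disambiguation step, i.e., choosing among candidate parses with an N-gram language model, can be illustrated with a deliberately small sketch. It assumes that candidate parses have already been produced by some other component (such as an FST) and uses a bigram model with add-one smoothing over morphemes; both choices, along with the toy data, are assumptions made here and are not a description of the OLD’s implementation (for which see chapter 3).

```python
import math
from collections import Counter


def train(analysed_words):
    """Count morpheme bigrams over a list of words, each a list of morphemes."""
    contexts, bigrams, vocabulary = Counter(), Counter(), set()
    for morphemes in analysed_words:
        padded = ["<s>", *morphemes, "</s>"]
        vocabulary.update(padded)
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent pairs
    return {"contexts": contexts, "bigrams": bigrams, "vocab_size": len(vocabulary)}


def log_probability(morphemes, model):
    """Add-one smoothed bigram log probability of one candidate parse."""
    padded = ["<s>", *morphemes, "</s>"]
    total = 0.0
    for previous, current in zip(padded, padded[1:]):
        numerator = model["bigrams"][(previous, current)] + 1
        denominator = model["contexts"][previous] + model["vocab_size"]
        total += math.log(numerator / denominator)
    return total


def most_probable_parse(candidates, model):
    """Pick the candidate parse that the language model scores highest."""
    return max(candidates, key=lambda parse: log_probability(parse, model))


# Invented training analyses and candidate parses for one surface word.
model = train([["ka", "tu"], ["ka", "tu"], ["ka", "mi"]])
best = most_probable_parse([["kat", "u"], ["ka", "tu"]], model)
```

Unlike a lookup of past analyses, a ranking of this kind can score a parse of a word that has never been entered, provided its morphemes (or at least some of its bigrams) have been seen in other words.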
LingSync does provide an automatic glossing feature which uses previously entered transcriptions/analyses to automatically suggest analyses and/or transcriptions during data entry/update. While useful, this approach is clearly not as effective as a full-blown parser which can analyze previously unseen forms and which can be used to empirically test phonological and morphological models.

Features present in both

Both LingSync and the OLD provide software documentation, support for integrating audio and video into their data sets, versioning of data points, and search. The following sections discuss the differences between the two systems as regards these features.

Documentation. LingSync and the OLD are fairly well documented given that both are open source projects under ongoing development. They have somewhat complementary strengths and weaknesses in terms of software documentation. In general, LingSync has a lot of (primarily user-oriented) documentation resources which could stand to be integrated into a more coherent whole; on the other hand, the OLD has (arguably) better technical and conceptual documentation but needs to update its user guides. Note that both tools have contextual help information built into their graphical interfaces.

LingSync (2013) is a good reference for understanding the goals, projected high-level architecture, and raison d’être of the system.
The best ways to discover what has actually been implemented (aside from experimenting with the software itself) and what is currently under development are to consult the developer’s blog (LingSync, 2014b) and the milestones tracker on LingSync’s GitHub repository.95 The slideshow tutorial for the LingSync Spreadsheet app96 linked to from the main LingSync page (LingSync, 2014a) provides a quick overview of that tool and helps new users to get started. Finally, uploaded to YouTube are a large number of screencasts that can help new users to understand the system at a high level and to begin using both current interfaces, i.e., LingSync Spreadsheet and LingSync Prototype. The YouTube playlist entitled LingSync Tutorials97 lists most of them.

The OLD documentation consists of this dissertation, which provides a detailed description of and argument for the system; the documentation for the OLD version 1.0 (Dunham, 2013a), which explains how to download and install the OLD web service and includes a detailed specification of the data structure and RESTful HTTP/JSON API; and the OLD (0.2) User Guide,98 which is included in each language-specific OLD 0.2 web application but which has not been updated for some time.

Audio & video support. Both LingSync and the OLD allow users to associate audio and video files to their datums/forms; once associated, these media files are embedded within the representations of the data and can be played while viewing the textual data. However, there are certain implementation differences between the two systems as regards this functionality. Highly relevant here is the mutual recognition of the value in facilitating the time-alignment of media files with transcriptions.
Time-aligned transcriptions of audio/video recordings have the potential to be highly useful in the creation of tools that support language learning and revitalization.

The OLD currently allows for any form to be associated to any number of files, including audio and video files, and an audio file can be categorized as a recording of an object language, metalanguage, or mixed utterance. In addition, contributors can configure audio file objects to reference other such objects for their digital content while supplying start and end values to indicate sub-intervals of the parent audio file. Thus the OLD has limited support for time-aligned audio annotation insofar as a contributor may, using the mechanism just described, associate a form representing a transcription of an utterance to an audio recording of a speaker producing that utterance. Being able to create the sub-interval-referencing file objects just described saves some time in that it obviates the need to actually splice and export subsections of larger audio files.

98 See

LingSync (2013) discusses both a “dream” module that would assist with automating the alignment process and functionality for importing ELAN XML documents. However, it is unclear whether or to what extent either of these components has been implemented. Whether or not it has yet been implemented, functionality for automating the alignment of transcriptions with media files will allow fieldworkers to greatly expedite the tedious task of manual alignment (and the slightly less tedious task of sub-interval start and end time specification facilitated by the OLD, as described above). A relevant, audio-related convenience of the LingSync Spreadsheet application that is implemented is that which allows users to record audio directly in the browser. This could potentially save time.

Versioning. Both the OLD and LingSync implement versioning of data. That is, previous versions of forms/datums are retained.
This allows users to restore data after deletions and modifications (if necessary), view the history of individual data points, and guard against malicious or accidental data corruption. The OLD accomplishes this by timestamping all forms and storing deleted and modified forms in a dedicated backup table.

Search. LingSync provides three distinct search interfaces: one integrated into LingSync Spreadsheet, a second in LingSync Prototype, and a third, web-based interface for searching one’s public corpora via ElasticSearch.99 All three search interfaces allow users to search across all sessions within a corpus.100 The ElasticSearch interface allows users to search across all corpora that are public and to which they have access. The LingSync Spreadsheet search functionality is quite basic and currently only allows for conjoined substring matches over all datum fields, i.e., no field-specific filters and no disjunction. The LingSync Prototype search functionality is more sophisticated in that it allows for regular expression patterns and for any number of field-specific filters conjoined, disjoined, or negated (via !), and indicates matched patterns via highlighting; however, the search structure is flat, i.e., it does not allow for a hierarchy of filters with conjuncts/disjuncts at the non-terminal nodes.101

As described in § 2.3.6, the OLD exposes a RESTful interface for specifying (as JSON arrays) arbitrarily deep tree structure queries where non-terminal nodes are conjunctions, disjunctions, or negations. This interface automatically handles SQL joins, meaning that it is easy to search across the hierarchical structure of the OLD form objects themselves; e.g., one can search for forms associated to tags whose description values match the regex /(Chomsky|Bloomfield)/. In addition, the OLD auto-generates a serialized representation of the tripartite (i.e., shape/gloss/category) morphological analyses of forms, thus allowing searches to unambiguously target morphemes, a feat that cannot be accomplished via independent filters on morpheme, gloss, and/or category fields.102

99 ElasticSearch, cf., is a web service that provides search functionality for distributed data stores via a RESTful interface.

100 The LingSync Spreadsheet app also allows one to restrict a search to a single (elicitation) session.

101 This is concluded after experimenting with the system. Note that LingSync Prototype provides a multi-field form interface for searching as well as a textual search field where users can specify complex queries using the syntax fieldName:regex to specify filters and where these filters can be conjoined via AND and/or OR. This latter, textual interface is more powerful since it allows for multiple filters targeting the same field/attribute. However, attempting to bracket sequences of coordinated filters crashes the query parser, i.e., the query utterance:y OR ( utterance:e AND utterance:x ) returns nothing despite the fact that, logically, it should return results. Removing the brackets results in a default left-branching structure, i.e., utterance:y OR utterance:e AND utterance:x is implicitly equivalent to ((utterance:y OR utterance:e) AND utterance:x).

2.2.4.7 Features lacking in both

Two major areas where both LingSync and the OLD could stand to be improved are bulk editing and dictionary creation.

Bulk edit. LingSync and the OLD both need to provide better support for bulk editing, i.e., allowing users to perform multi-form/datum updates. In my experience working with fieldworkers, this is a feature that is continually requested. This is because groups of linguists doing research on a particular language are continually developing and changing conventions at the foundations of their analyses and often want to keep their databases in accordance with those conventions.
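A bulk edit of the kind fieldworkers request can be thought of as three steps: select forms with a filter, preview the proposed changes, and apply them once confirmed. The sketch below combines this with a nested and/or/not filter expressed as JSON arrays, in the spirit of the tree-structured queries described above. The filter syntax, attribute names, and data are simplified inventions for illustration; this is neither the OLD’s actual query format nor an existing feature of either system, and it operates on in-memory dictionaries rather than issuing HTTP requests.

```python
import json
import re


def matches(form, node):
    """Evaluate a nested filter against one form (a dict)."""
    operator = node[0]
    if operator == "and":
        return all(matches(form, child) for child in node[1])
    if operator == "or":
        return any(matches(form, child) for child in node[1])
    if operator == "not":
        return not matches(form, node[1])
    attribute, relation, value = node  # leaf: [attribute, relation, value]
    actual = form.get(attribute, "")
    if relation == "=":
        return actual == value
    if relation == "regex":
        return re.search(value, actual) is not None
    raise ValueError(f"unknown relation: {relation!r}")


def preview_bulk_edit(forms, query, attribute, pattern, replacement):
    """List (id, old value, new value) for each form the edit would change."""
    changes = []
    for form in forms:
        if not matches(form, query):
            continue
        old_value = form[attribute]
        new_value = re.sub(pattern, replacement, old_value)
        if new_value != old_value:
            changes.append((form["id"], old_value, new_value))
    return changes


def apply_bulk_edit(forms, changes, attribute):
    """Apply previously previewed changes in place."""
    forms_by_id = {form["id"]: form for form in forms}
    for form_id, _old_value, new_value in changes:
        forms_by_id[form_id][attribute] = new_value


# Verbs whose gloss uses an old perfective label and is not already updated.
QUERY = json.loads("""
["and", [
    ["category", "=", "V"],
    ["or", [["gloss", "regex", "PERF"], ["gloss", "regex", "PRF"]]],
    ["not", ["gloss", "regex", "^PFV"]]
]]
""")

FORMS = [
    {"id": 1, "gloss": "PERF-run", "category": "V"},
    {"id": 2, "gloss": "PERF-house", "category": "N"},
    {"id": 3, "gloss": "3-PRF-sing", "category": "V"},
]

changes = preview_bulk_edit(FORMS, QUERY, "gloss", r"\b(PERF|PRF)\b", "PFV")
```

Separating the preview from the application means that the list of proposed changes can be inspected (or shown to a corpus owner for confirmation) before `apply_bulk_edit(FORMS, changes, "gloss")` is called.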
As a result, manual curation of the data can be a time-consuming process, one which could be greatly expedited via an effective bulk editing interface. As mentioned above (i.e., in § 2.2.3), FLEx provides an eminently imitable implementation of this feature.

LingSync Prototype provides a simple command-line interface to the application’s sand-boxed file system (the “Power Users Backend”) which allows users to manipulate the data using JavaScript. Recently, the LingSync development team also announced the creation of a prototype “cleaning bot” that was programmed to automatically propose corrections to the datums within a test corpus and then prompt the owner of the corpus to confirm/execute the proposed changes. Similarly, the HTTP/JSON API exposed by the OLD 1.0 web service could be used to perform bulk updates by users who are capable of writing programs (in their language of choice) that can issue HTTP requests to retrieve the relevant data points, modify them accordingly, and then issue the appropriate update requests. In fact, the OLD source includes a simple module that hides the HTTP and JSON conversion details and would allow users to do this via the Python interactive prompt. However, despite these low-level solutions, the fact remains that a user-friendly interface that allows for configuring, previewing, and then applying bulk edits is still needed by both systems.

[102] The search interface just described is exposed by the OLD 1.0 web service and is not integrated into the GUI since there is no GUI. The OLD 0.2 GUI has a decent search interface, but does not possess the expressive power of the OLD 1.0 search interface.

Dictionary creation. Both LingSync and the OLD provide only minimal support for creating dictionaries. LingSync’s automatically extracted lexica could be used as the foundation for generating dictionary-like resources (cf. LingSync, 2013, p. 22).
The OLD 0.2 provides a dictionary-like interface (including customizable sorting) to the forms that it assumes to be morphemes, an assumption based on a problematically simplistic heuristic. However, neither system currently supports a data structure with the kinds of hierarchies and relations necessary for a rich dictionary representation, as is provided in software such as Toolbox and FLEx, e.g., specification of synonymy and variant relations, specification of distinct senses for a lexical entry, etc.[104]

[104] Note that OLD forms can have any number of translations. While this does allow for the creation of lexical entry forms with multiple translations-as-senses, it does not allow for forms with multiple distinct senses where such senses are differentiated by distinct morpheme break, morpheme gloss, and/or category values, something that is made possible by the data structures of Toolbox and FLEx.

2.2.5 Summary

This section has reviewed three fieldwork applications (Toolbox, FLEx, and LingSync) and compared them to the OLD. Toolbox has a long tradition of use in linguistic fieldwork and is well equipped to help with creating dictionaries and analyzing texts. FLEx has essentially the same feature set as Toolbox with various incremental improvements as well as novel features, notable among which is incipient support for multi-contributor network-based data creation. As a web application with a feature set that is general enough to be useful to both theoretical and descriptive linguists, LingSync is quite similar to the OLD; the discussion above explored the comparative benefits of each tool.

The reviews of Toolbox, FLEx, and LingSync provided above included direct comparisons between the feature sets of those tools and the feature set of the OLD. In those reviews, efforts were made to argue for the superiority, or at least the equality, of the design and feature set of the OLD relative to those tools. However, the comparison also revealed a number of respects in which these other tools are superior to the OLD. This, in turn, suggests ways in which the OLD could be improved by borrowing features from these other tools or by increasing the potential for interoperability and complementarity between the OLD and these other systems. In the remainder of this summary, I itemize the relative demerits of the OLD and explain my strategy for addressing these in future work, either by implementing new features or by increasing the capacity for interoperability and complementarity between the OLD and these other tools.

Perhaps the primary demerit of the OLD is that there is no GUI for version 1.0. That is, there is no GUI that would allow one, for example, to exploit the powerful search functionality described in § 2.3.6, to build parsers using the morphological parser creator described in chapter 3, or to expedite data entry using the parsers so created. Creating such a GUI is the next major step in the planned development of the OLD.
Additional improvements to the OLD as inspired by the reviews given above are as follows. The OLD would benefit from having a more flexible data structure, bulk update functionality, support for additional IGT rows (e.g., word glosses, word categories, and allomorphic morpheme break lines), a lower barrier to entry (i.e., by providing an OLD service that new users can simply sign up for, along the lines of LingSync), greater customizability in terms of how data are displayed (e.g., IGT display as well as a tabular display with movable and hidable columns), more morphology-relevant fields along the lines of FLEx (e.g., morpheme type and categorial combinatoric specification), more fine-grained data access control functionality (e.g., encryption), support for making subsets of OLD data sets public and easily archivable, real-time rendering of OLD texts (i.e., collections), and functionality whereby the system would offer to automatically create forms for morphemes and/or words that are implicit in analyses as they are entered.

As discussed above, the OLD has no import functionality and quite limited export functionality. Future development of the software will improve import/export capability so that users can more easily move data in and out of an OLD application in order to then, say, import it into FLEx (or some other similar application) as a contribution to a dictionary creation project.

2.3 Features

This section details the features of the OLD that most contribute to its efficacy as a fieldwork tool. That it helps fieldworkers to share (and thereby create a useful repository of) language data is one of the most significant points in its favour (§ 2.3.1). Its high-level design approach, i.e., its architecture (§ 2.3.2), and its release under an open source license (§ 2.3.3) are crucial to the re-purposing of both the language data and the software’s own functionality.
Its data structure (§ 2.3.4), so it is argued, strikes an effective balance between shackling prescriptivism and laissez-faire structurelessness. While the graphical user interface of the older OLD web application has some valuable conveniences, the programmatic interface of the new OLD web service boasts a conceptual simplicity and accessibility that are crucial to achieving reusability of data and functionality (§ 2.3.5). Being able to quickly and accurately retrieve relevant data is crucial; as such, significant thought has gone into developing an effective search interface and additional functionalities that improve search (§ 2.3.6). Minor automations are presented in § 2.3.7, though discussion of the computational morphological parsing and the requisite model-implementing functionality is left for chapter 3. Several features contribute to achieving consistency (§ 2.3.8) in the data, thereby rendering them more searchable and more accessible generally. The dictionary-like interface to lexical items is presented in § 2.3.9, while § 2.3.10 discusses facilities for creating formatted documents containing rich representations of fieldwork data. Functionality for customizing restrictions on access (§ 2.3.11) to data is also discussed. Finally, the happy state of the software’s documentation (§ 2.3.12) is revealed.

As mentioned above, there are two versions of the OLD: version 0.2, which is at present being actively used for collaborative linguistic fieldwork, and version 1.0, which is a web service that is awaiting development of a GUI before it can see active use. The features described in the sections that follow have been implemented[105] in either or both versions of the OLD. Table 2.3 summarizes which features are present in which application.
The currently in-development GUI for the OLD 1.0 web service will implement the OLD 0.2 features that are currently missing (e.g., orthography conversion and a dictionary interface) from the OLD 1.0 web service.

Feature                                      OLD 0.2            OLD 1.0
architecture                                 web application    web service
GUI                                          yes                no
API                                          no                 yes
morphological parser creator                 no                 yes
corpus creation                              no                 yes
structural search                            no                 yes
inventory-based input validation             yes                yes
orthography conversion                       yes                no
dictionary interface                         yes                no
text (i.e., collection) creation             yes                yes
authentication & role-based authorization    yes                yes

Table 2.3: Features across OLD versions

[105] Planned features that have not yet been implemented are explicitly described as such.

2.3.1 Sharing data

The core feature of the OLD is its facilitation of data-sharing between fieldworkers. I contend that easy access to peer data is, especially in the endangered language context, preferable (both individually and communally) to data sequestration. This section reviews the properties and sub-features of the OLD that conspire to produce the emergent macro-feature of data-sharing facilitation.

Linguistics is a science. Its object of study is the set of cognitive abilities that permit an individual to produce and understand a specified language. The raw data of linguistics are communicative signals (typically acoustic, but also visual), contexts wherein these signals may be used, and speaker judgments concerning the well-formedness of signals and their compatibility with specified contexts. On the basis of such data, linguists generate hypotheses about the knowledge that is encoded in the brain of an individual. While much of empirical linguistics can be characterized as the recording of observed behaviour, it is possible (in certain domains more than others) to control variables and therefore to speak meaningfully of linguistic experimentation.
Clearly this is so in a scenario where a phonetician formalizes a procedure for eliciting the production of a controlled set of tongue-twisters while making ultrasound image recordings of speakers’ tongues. Though the variables are more difficult to control, the label of experimental linguistics could also arguably be applied to the semanticist who creates a representation of a contextualized communication event and collects speaker judgments on the acceptability of a particular utterance therein.

Given that linguistics is a scientific endeavour, an obvious argument for empirical transparency is that it facilitates peer replication of experiments and observations, whence improved assessment of generalizations and claims. However, it is most often impossible for interested peers to replicate data collection procedures on endangered languages precisely because speakers of these languages are so rare. Moreover, the rapport and implicit set of understandings between researcher and language consultant often cannot be conveyed by representations of data. Therefore, while the detailed endangered language data provisioned by a system like the OLD might form the basis for questioning a researcher’s conclusions and forming alternative accounts, convincing and publication-worthy counter-claims should generally be grounded in original empirical work.

The primary considerations supporting linguistic data-sharing are, in my estimation, the potential for fast-tracking incipient research and re-purposing data toward other fieldwork-related endeavours, especially revitalization.

Having the opportunity to browse and effectively search peer data is a great boon to research, especially in its beginning stages. Given such an opportunity, imagine how much more quickly one could uncover, say, minimal pairs, phonotactic patterns, possible permutations of verbal inflectional morphology, or the distribution and semantic contribution of a particular morpheme.
During the discussions of research groups or those that arise after presentations, it very often happens that a line of thought will become stalled on an empirical point that could quickly and easily be settled by access to a communally generated repository of language data such as an OLD application.

The other primary positive result of widespread sharing of endangered language data is the potential for reuse beyond research. I am thinking specifically here of the potential for using such data in the teaching and learning of the languages under consideration. In the course of a two-semester linguistic field methods course at UBC, I estimate that hundreds of hours of audio recordings and thousands of lines of transcriptions, translations and morphological analyses are generated. If these data are consolidated, structured, and easy to access, then they can be used to generate lesson plans, study exercises, and grammars and fed into dynamic language-learning software.

An OLD web application facilitates the sharing of language data among fieldworkers by providing an interface to a centralized database that can be modified by multiple contributors simultaneously. The degree to which data are effectively shared depends on the quality of the database schema and the application interface, i.e., how well the data are structured and presented, how consistent and detailed they are, how quickly and accurately they can be accessed (e.g., via search), and, generally, how enticing the user experience is. In short, the effectiveness of the core feature of data-sharing depends upon the effectiveness of the various features described and evaluated in the sections below.

Since some data-sharing-relevant features are too minor to deserve their own sections, I summarize the argument here. The database schema (i.e., structure) of the OLD is grounded in my own fieldwork experience with modifications resulting from the usage and feedback of dozens of fellow fieldworkers and OLD contributors. Data are presented cleanly (i.e., without clutter and with secondary[107] values hidden by default) and in the familiar interlinear glossed text format that is the de facto standard for foreign language data in research papers. Powerful search functionality allows for highly nuanced queries that may make use of regular expressions and may reference multiple attribute values, including (system-generated) morphological representations. Values for enterer and for date and time of entry are auto-generated to facilitate proper attribution. For the prevention of data loss and/or corruption, there are features that prohibit deletion of another contributor’s forms and that preserve, behind the scenes, past data states upon each successful modification. Data entry, modification and search are facilitated by conveniences of the user interface which include auto-fill of likely-to-be-repeated past values, keyboard shortcuts, and tab-based navigation of fields. Contributors can also collaboratively create exportable documents containing system data using the collection objects.

[107] I informally divide attributes of linguistic forms into primary and secondary categories, with the former including the transcriptions, the morpheme break, the morpheme gloss, and the translations and the latter including everything else.

Before closing this section on the data-sharing feature of the OLD, I would like to address two related and valid concerns about the very desirability of such a feature. These are the reluctance of researchers to share their data and the requirements of speakers and communities that certain data not be made public. The OLD’s features for controlling access to data, as discussed in § 2.3.11, are designed specifically to address concerns of this nature. However, in an environment where there is competition for publications, grants, and jobs and where research is based upon a rare and hard-won resource, researchers may be understandably hesitant to expose a fully candid data set. As it stands, it is possible for a registered OLD user to contribute data for a short time (or not at all) and then have access to all of the data in the system that are contributed by others. While the OLD does not prescribe particular terms of use for language-specific OLD applications, the administrators of such applications are encouraged to adopt a set of terms and require their acceptance prior to user registration. These terms should minimally specify whether and which data can be used in publications (or commercial products) and how the authors of such data are to be cited. If a user takes data from the system and uses it to further their own goals (without proper attribution or offers of co-authorship) and is conspicuously lacking in their own contributions to the system, then administrators may decide to close their account. That is to say, the OLD currently has no formal mechanisms in place for dealing with disputes of this nature. At this point, I am undecided about whether addressing this type of issue is a requirement of the OLD per se.[108] In my experience, fieldworkers can be trusted to behave ethically with respect to using the data of their peers. In addition, the set of research topics is so vast compared to the available researcher time that directly competing with another fieldworker for research results using that fieldworker’s own data is an unlikely circumstance. For those still unconvinced, it hardly needs stating that a fully candid data set is not required: culturally sensitive and competitively vital data points may be withheld from the system according to the best judgment of the researcher and/or speaker/community. That said, further improvements to the authorization system of the OLD may be required in order to address lingering concerns of this nature.

[108] The approach taken by LingSync is arguably superior in this regard. In that system users control access to their own corpora and may allow (or revoke) different levels of access to specific users. The disadvantage of this, of course, is that there is, as a result, something of a barrier to the ideal of maximizing the sharing and the reuse of data.

However, the real challenge in getting fieldworkers to share their data is, in my estimation, enticing them to do so by creating a tool that complements their preferred elicitation and organizational practises and simultaneously adds real value to the process, e.g., by facilitating more rapid data creation via tools that, say, automate morphological parsing. My contention, as argued below, is that the OLD does just this.

2.3.2 Architecture

The architecture of the OLD (i.e., its high-level design; cf. McConnell, 2004) is a feature in its own right, and one that is significant enough to the accomplishment of fieldwork goals that it deserves some discussion here.

The OLD 1.0 is a web service. This means that it exposes a simple yet powerful web-based interface that allows it to be used by other software. Any program capable of generating and sending HTTP[109] requests and receiving and parsing HTTP responses can interact with an OLD application (assuming, of course, that said program has a valid username and password authorizing the request issued). Requests are typically one of the five denoted by the SCRUD acronym, i.e., requests to search (S), create (C), read (R), update (U), or destroy (D) a resource (cf. Martin, 1983), i.e., an OLD form, file, collection, user, category, parser, etc.

[109] HTTP is the standardized communication protocol of the World Wide Web (Fielding et al., 1999). It is, in essence, simply a specification for how clients should format requests and servers should format responses.

[Figure 2.2: OLD UML diagram. Classes shown: Form (transcription, morpheme_break, morpheme_gloss, comments, syntax, grammaticality, ...), Collection (form_order, prose, ...), Translation, Corpus (form_order, ...), Category, File (MIME_type, ...), Tag, LM (order, smoothing, ...), Phonology (script, ...), Source, Parser, Morphology (lexicon, morphotactics).]

Figure 2.2 is a Unified Modelling Language (UML) diagram showing the main components of the OLD. The boxes represent classes and the lines connecting them represent “has a” relationships between the classes, with multiplicity indicated at the ends of the lines. Thus Figure 2.2 shows that an OLD form has zero or one sources (0..1) and a source has zero or more forms (0..*). The figure also shows that a form has one or more translations (1..*) and that each translation has exactly one form (1).

To get all forms within a hosted OLD application, one would issue an HTTP GET request on the URL of that application’s forms resource. This request would return an HTTP response whose body would contain a JSON array of objects, each representing one of the forms in the database. In order to create a new form on this OLD application, one would issue an HTTP POST request on the corresponding URL with a JSON object in the request body that contained the attributes and values of the form to be created. The updating, deleting, and searching of forms are operations that are accomplished using similar HTTP request patterns. Files, collections, corpora, categories, phonologies, morphologies, language models, parsers, etc. are all created in an analogous fashion. For a comprehensive account of the data structure and API of the OLD 1.0, see the documentation.

An especially obvious type of software that could interact with an OLD 1.0 web service is one designed for user interaction.
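
The GET and POST request patterns described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base URL is a placeholder (the address of the example application is not given here), the `/forms` path and the shape of the JSON payload are assumed rather than taken from the OLD documentation, and the requests are built but never sent, since a live application would also require authentication.

```python
import json
from urllib.request import Request

# Hypothetical address: the URL of the example application is not given
# in the text, so this placeholder stands in for it.
BASE_URL = "https://old.example.org"
# Assumed resource path for forms; consult the OLD documentation for the
# actual routes before relying on it.
FORMS_URL = f"{BASE_URL}/forms"


def list_forms_request() -> Request:
    """Build (without sending) a GET request for all forms."""
    return Request(FORMS_URL, method="GET")


def create_form_request(form: dict) -> Request:
    """Build (without sending) a POST request whose body is the form as JSON."""
    body = json.dumps(form).encode("utf-8")
    return Request(
        FORMS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative payload; the attribute names and the shape of the
# translations value are assumptions made for this sketch.
new_form = {
    "transcription": "Les chiens aboient.",
    "morpheme_break": "le-s chien-s aboi-ent",
    "morpheme_gloss": "DET-PL dog-PL bark-3PL",
    "translations": [{"transcription": "The dogs are barking."}],
}

get_request = list_forms_request()
post_request = create_form_request(new_form)
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```

Passing either request object to `urllib.request.urlopen` would actually send it; the response body of the GET request would then be parsed with `json.loads` to obtain the array of form objects.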
Such an application is currently under development. It runs client-side in the browser and is written in JavaScript, HTML, and CSS. The past decade has seen a dramatic increase in the capabilities of browser-based applications. In particular, now possible are interfaces completely lacking in page refreshes (cf. Galli et al., 2003) and possessing functionality such as drag-and-drop. It is also possible to create applications that run in the browser even when there is no Internet connection and which can store data indefinitely in browser-internal databases (cf. Lawson and Sharp, 2011). A plethora of libraries and frameworks have arisen to facilitate employment of these capabilities. The browser-based interface to the OLD is written in CoffeeScript (a language that compiles to JavaScript) and uses the Backbone framework (to structure the code) and the jQuery library (to facilitate the interface construction). Once completed, it will make use of these new capabilities in its re-implementation of the existing OLD 0.2 functionality while adding interfaces for the newly introduced resources (e.g., corpora and parsers), some level of offline capability, and a generally improved user experience.

Currently included with the source code of the OLD 1.0 is a simple Python module that uses the Requests library to interact via HTTP with a live OLD application. This module could serve as the basis for a more sophisticated Python desktop application, either one with a graphical user interface or one that resides in the command-line.

An interesting possibility opened up by the modularized web service architecture of the OLD 1.0 is an interface to multiple OLD applications.
This would allow researchers to perform searches across a number of languages and compose drafts of research papers with arguments grounded in the cross-linguistic examples found.[111] I hope to implement this type of application at some future time.

[111] Thanks to Martina Wiltschko for this interesting idea.

Other enticing possibilities opened up by the OLD web service include independent applications such as audio dictionaries and general-purpose language learning tools which harvest structured multimedia data from OLD applications.

The architectural decision to design the OLD as a re-purposable web service allows for all of these different types of application to make use of it. It is thus a feature with significant potential for contributing to the accomplishment of fieldwork tasks.

2.3.3 Open source & web-based

Since the OLD is open source, it is freely available and its source code is accessible for inspection, contribution, and/or derivation. Being web-based, it works across operating system boundaries and has a lower barrier to developer entry when compared to software written in more esoteric or low-level programming languages. These properties make it easy for fieldworkers to use the software for a variety of purposes.

The OLD 1.0 is licensed under the Apache License, Version 2.0. This means that its source code is freely available for reuse. The source code may be used in whole or in part in the creation of other software, so long as the notice of the Apache License, Version 2.0 and notice of copyright holder are maintained unaltered within the source. Interested parties can browse and download the source at the OLD’s GitHub repository. Using GitHub and Git, developers can contribute improvements to the software or clone it and begin developing a derivative work of their own.

As open source software, the OLD is more than just free of charge; it is freely available for detailed inspection and modification.
This means that developers may study the source in detail in order to understand how it works, improve it, or even create new pieces of software that make use of it or parts of it. Because of this openness, the OLD is both a ready-to-use fieldwork tool and a resource of information and reusable components that may prove beneficial in the larger arena of what may be termed computational fieldwork. The present dissertation and the extensive documentation included in the source (see § 2.3.12) enhance this potential for reuse.

Since the OLD is web-based software, it is not subject to the cross-platform compatibility issues that can plague desktop-based applications. All modern computing platforms provide access to the World Wide Web via a web browser application and OLD applications can, therefore, be used on any platform.

Being web-based further increases the potential for contributing to and making use of the OLD code base. This is because the technologies employed (i.e., Python, HTML, CSS, and JavaScript) are widely used and relatively easy to learn. In my experience, most programmers (and many non-programmers) have some level of proficiency with the core web technologies, i.e., HTML, CSS, and JavaScript.[112] Python, the language in which the OLD server-side logic is written, is a mature, high-level programming language that is easy to learn and designed to be highly readable (see Lutz, 2013).

[112] HTML is used to create web pages and is essentially a set of tags for organizing text into nested structures such as divisions, headers, paragraphs, and bullet lists. CSS, i.e., Cascading Style Sheets, is the language for defining how these structural elements are to be presented, e.g., the font size of headers, the background colour of containers, etc. JavaScript, the programming language of the web browser, allows for the creation of client-side application logic, e.g., changing a web interface in response to user actions without requesting new HTML pages from the server.

2.3.4 Data structure

The data structure of the OLD is the way that the system organizes the data it stores. The data in an OLD application are stored in tables in a relational database. However, the structure is here described using the more transparent language of objects and their attributes. Appendix A explains relational databases (what they are, how they work, and the way complex objects are encoded within them) and details the OLD’s data structure. The present section considers some of the more foundational and otherwise notable features of the OLD data structure and argues for their utility in accomplishing linguistic fieldwork goals.

The OLD data structure is an implicit prescription for how fieldwork data ought to be organized. This structure is based on my own fieldwork experience and has been refined and modified based on feedback from users of the nine language-specific OLD web applications currently in use.[113] My contention is that the data structure described here is one of several possible such structures that would be effective for organizing fieldwork data. As I am continually refining the structure, it would be illogical to assert that it is ideal.[114] However, it embodies some interesting design decisions that are crucial to other features and which, on the whole, contribute to the effective accomplishment of fieldwork tasks.

[113] The conventions for IGT representations in the OLD conform, for the most part, to the Leipzig Glossing Rules (cf. Bickel et al., 2008). However, note that, at present, the OLD is not set up to recognize reduplication via the Leipzig standards.

[114] I fully expect that the structures used by individual fieldworkers in their desktop database applications will be distinct from that described here. However, experience suggests that migrating data from a particular fieldworker’s idiosyncratic data structure to that of the OLD is usually possible and often relatively painless. The more difficult cases necessitate writing scripts to pre-process data prior to migration. In some cases the OLD data structure may lack appropriate objects or attributes to accommodate some data type employed by a particular data set. As the data import functionality of the system evolves, it will probably be endowed with a data structure that can, to an extent, be modified in order to handle such cases.

The first thing to consider is the value of giving any structure at all to language data. After all, many fieldworkers get along just fine by keeping handwritten journals or creating a series of digital text documents. The case for structured data rests on the fact that structure facilitates consistency and more efficient access to data. In particular, structured data can be searched with greater accuracy and they can be manipulated programmatically more easily.

Setting aside handwritten notes, imagine the difficulty in searching through digital text documents for data containing a particular string that indicates the presence of a certain morpheme. Even assuming that searching across multiple documents is not a problem, it will be exceedingly difficult to restrict results to data where the target string is present only in the morphophonemic portions of the data and not in, say, the orthographic transcriptions, translations or free-form notes.
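
The difficulty just described can be made concrete with a small sketch. The records below are invented English examples, and the field names are illustrative stand-ins for the transcription, morpheme break, morpheme gloss, and translation values discussed in this chapter. The first function treats each record as an undifferentiated block of text; the second restricts the match to whole morphemes in the morpheme break field, assuming hyphens and spaces as morpheme delimiters.

```python
import re

# Invented English records, for illustration only.
forms = [
    {"transcription": "walked", "morpheme_break": "walk-ed",
     "morpheme_gloss": "walk-PST", "translation": "walked"},
    {"transcription": "red", "morpheme_break": "red",
     "morpheme_gloss": "red", "translation": "red"},
    {"transcription": "beds", "morpheme_break": "bed-s",
     "morpheme_gloss": "bed-PL", "translation": "beds"},
]


def search_plain_text(records, target):
    """Search each record as one undifferentiated block of text."""
    return [r["transcription"] for r in records
            if target in "\n".join(r.values())]


def search_morpheme_break(records, morpheme):
    """Match the target only as a whole morpheme in the morpheme break field."""
    pattern = re.compile(
        r"(?:^|[-\s])" + re.escape(morpheme) + r"(?:$|[-\s])"
    )
    return [r["transcription"] for r in records
            if pattern.search(r["morpheme_break"])]


print(search_plain_text(forms, "ed"))      # ['walked', 'red', 'beds']
print(search_morpheme_break(forms, "ed"))  # ['walked']
```

The plain-text search returns every record in which the string happens to occur, whereas the field-restricted search returns only the record in which it is analyzed as a morpheme.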
Depending on the consistency with which the fieldworker has informally structured their data-containing documents, it may be possible to compose regular expressions that are up to the task; however, this will be very time-consuming and already presupposes a rather sophisticated technical expertise.

When linguistic fieldwork data are structured, i.e., when they are divided into logical sub-parts and labelled, it then becomes much easier to search them quickly and accurately. Consider a database table full of sentences with orthographic transcriptions, morphologically segmented phonemic transcriptions, glosses, translations, comments, and categories, all in their own labelled columns. Given such a table of structured data, a database management system can be used to perform searches based on multiple conditions, each requiring that different parts (i.e., columns) of the data have certain properties.

Structured data can also be manipulated by computer programs far more easily than unstructured. This opens up possibilities such as programmatically extracting morpheme and word lists and other targeted sub-parts of the data for use in research or in the creation of pedagogical materials.

The OLD proposes four core objects for structuring linguistic fieldwork data: form, file, collection, and corpus. I contend that this represents a sound method of organization which accords with existing practises while also laying the groundwork for the fieldwork-enhancing features provisioned by the OLD.

Forms are textual representations of morphemes, words, phrases or sentences. The cap at sentences is a matter of convention; the hard restriction is actually a rather arbitrary 255 character limit on transcription values (which could be increased if necessary). The vast majority of linguistic examples in research papers, descriptive grammars, dictionaries, and learning materials can be represented by OLD forms.
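
As an illustration of the programmatic extraction of morpheme lists mentioned above, the following sketch pairs each morpheme shape with its gloss and counts the resulting pairs across a set of forms. It is a minimal sketch rather than the OLD implementation: the records are invented English examples, and it assumes that the morpheme break and morpheme gloss lines are aligned and use hyphens within words and spaces between them.

```python
import re
from collections import Counter

# Assumed delimiters: hyphens within words, whitespace between words.
DELIMITERS = re.compile(r"[-\s]+")


def morphemes(form):
    """Pair each morpheme shape with its gloss for one form."""
    shapes = DELIMITERS.split(form["morpheme_break"].strip())
    glosses = DELIMITERS.split(form["morpheme_gloss"].strip())
    if len(shapes) != len(glosses):
        raise ValueError("morpheme break and gloss lines are not aligned")
    return list(zip(shapes, glosses))


def morpheme_counts(forms):
    """Count (shape, gloss) pairs across all forms."""
    counts = Counter()
    for form in forms:
        counts.update(morphemes(form))
    return counts


# Invented English records, for illustration only.
forms = [
    {"morpheme_break": "the dog-s bark-ed",
     "morpheme_gloss": "DET dog-PL bark-PST"},
    {"morpheme_break": "the cat-s sleep",
     "morpheme_gloss": "DET cat-PL sleep"},
]

counts = morpheme_counts(forms)
for (shape, gloss), n in sorted(counts.items()):
    print(shape, gloss, n)
```

A list of this kind could serve as the starting point for a word list or a set of teaching materials; misaligned lines raise an error rather than being silently mispaired.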
An obvious potential demerit of this approach is the fact that it is not possible to represent a multi-sentential datum as a single form. This matters because in some instances it would be useful to be able to search for patterns that cross sentence boundaries. To a certain extent, this is made possible by the corpus construct, which allows for the searching of representations of sequences of forms. If users need simply to create presentational representations of multi-sentential data, then collection objects can be used.

Another potential issue with form objects is that they conflate abstract lexical items with representations of records of speech events. That is, a bound morpheme (never utterable in isolation) and a sentence spoken by a particular speaker at a particular time are represented by the same data type. As a result, assigning values to certain attributes does not make sense for certain form types. For example, a bound morpheme should not have values specified for elicitor, speaker, and date of elicitation. This information should be encoded in the forms representing records of speech events in which the abstract morpheme is used. Although it does not currently do so, the interface to the data could easily be made to dynamically retrieve such substantive examples evincing the abstract forms, i.e., a user action on a lexical item would bring up a list of forms containing that lexical item. Forms representing bound lexical items should probably also not have phonetic transcriptions for the very reason that they are not uttered in isolation and therefore phonetic representations do not make sense for them.

Contributors need to be mindful of the abstract/substantive distinction when making decisions as just mentioned as well as when deciding whether to create duplicate entries. Abstract forms should not be duplicated unless the duplication can be motivated by significant novelty in analysis.
On the other hand, the duplication of substantive forms may be more easily justified since differences in speaker, dialect, and context of use may be judged sufficient.

In practice, I have found that the benefits of being able to search across abstract and substantive forms simultaneously outweigh any costs arising from the conceptual vagueness of this conflation. When distinctions must be made, they can be made by filtering forms according to the relevant attribute values, e.g., by restricting search results to only those forms that have the categories corresponding to abstract lexical items, whatever those may be, given the analyses of the contributors to the application.

Other notable design decisions of the form object include the grammaticality attribute, the possibility of multiple translations (with their own appropriateness specifications), morphological representations (and the conventions assumed therein), the distinction between general and speaker comments, the multipurpose tagging system, the elicitation method attribute, and the category attribute.

Linguists find value in data that indicate constructions that are not considered well-formed by speakers. For example, post-positions are not present in English so speakers will judge (3) to be malformed.

(3) *He went the store to.

There are conventions for representing various types and degrees of malformedness, with the asterisk as an indicator of ungrammaticality (cf. 3) being the most recognizable. The OLD data structure encodes grammaticality as a distinct forced-choice attribute. Administrators of the application can specify the available grammaticality options. This approach ensures that forms can easily be sorted by grammaticality without relying on searches of potentially inconsistent grammaticality representations in the values of various transcription attributes.

There is no limit on the number of translations that a form may have.
The decision to use distinct translation objects as opposed to a single translation field means that users of the system do not need to deal with inconsistencies in how distinct translations are delimited, e.g., via informal conventions such as commas or semicolons. In addition, each translation object has its own appropriateness attribute which allows contributors to indicate, in a consistent manner, whether a particular translation is compatible with a given form.[115]

As is discussed more thoroughly in the chapter on morphological parsing, the OLD assumes for certain functionality that morphological analyses are represented as sequences of morphemes with words separated by spaces and morphemes separated by delimiter characters (typically “-” to represent affix boundaries and “=” for clitic boundaries; cf. Bickel et al., 2008) that are specified by administrators in the application-wide system settings. Based on this assumption, an OLD application will attempt to identify morpheme shapes and glosses in the morpheme break and morpheme gloss fields and then attempt to match these against lexical items present in the database. Using any matches found, the system will generate values for additional attributes (e.g., category string) and will create representations of forms which make it clear to what extent the user-supplied morphological analysis is compatible with the lexical entries present in the system. In order for this to work, contributors must follow certain conventions, such as ensuring that there are neither space characters within individual glosses nor unintended delimiters within morpheme shape representations. By and large, this approach accords with the representations of morphological analyses used by linguists.
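The delimiter conventions just described can be illustrated with a minimal Python sketch. It is not the OLD's own implementation: the function names `split_word` and `align` and the hard-coded delimiter set are assumptions made for illustration (in an OLD application the delimiters come from the system settings). The sketch pairs each morpheme shape with its gloss and fails when the two fields do not line up, which is what happens when a gloss contains a space or a shape contains a stray delimiter.

```python
import re

# Assumed delimiters for this sketch; in the OLD these are set by administrators.
DELIMITERS = "-="


def split_word(word, delimiters=DELIMITERS):
    """Split one word into its morphemes, discarding the delimiters."""
    return [m for m in re.split("[" + re.escape(delimiters) + "]", word) if m]


def align(morpheme_break, morpheme_gloss, delimiters=DELIMITERS):
    """Pair every morpheme shape with its gloss, word by word.

    Raises ValueError when the two fields do not line up.
    """
    break_words = morpheme_break.split()
    gloss_words = morpheme_gloss.split()
    if len(break_words) != len(gloss_words):
        raise ValueError("word counts differ")
    analysis = []
    for b_word, g_word in zip(break_words, gloss_words):
        shapes = split_word(b_word, delimiters)
        glosses = split_word(g_word, delimiters)
        if len(shapes) != len(glosses):
            raise ValueError("morpheme counts differ in %r / %r" % (b_word, g_word))
        analysis.append(list(zip(shapes, glosses)))
    return analysis


# One list of (shape, gloss) pairs per word.
pairs = align("le-s chien-s mange-aient", "the-PL dog-PL eat-3PL.IMPF")
```

A gloss written with an internal space, e.g. `align("chien-s", "the dog-PL")`, raises `ValueError` here because the word counts no longer agree, which is why the no-spaces-in-glosses convention matters.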
This representation of morphological analyses also reflects the standard analytical assumption of a hierarchy of representations wherein sequences of phonemically transcribed morphemes are mapped via phonological transformation rules to phonetic representations.

[115] Note that this appropriateness attribute of translation objects is given the label grammaticality in the underlying data structure. However, since this does not really (typically) indicate the grammaticality of a translation, the label is not entirely accurate. Hence the use of “appropriateness” here.

However, in many traditions an unsegmented transcription is not supplied since this would be highly redundant given the highly transparent morphophonology of the languages in question. An example is furnished by Salishan linguistics, where an unsegmented transcription is not generally provided since the phonological transformations are simple enough that they can be supplied by the reader.[116]

The general comments attribute of forms is meant as a catch-all for fieldworker notes on the form in question. Contributors may adopt certain easily searchable conventions in the textual value of the comments field in order to categorize forms according to their purposes. However, the use of tag objects is the recommended approach for such categorization. Tags are user-created objects with names and descriptions; any number of tags may be associated with a given form. Reifying tags in this way avoids inadvertent notational inconsistencies and facilitates the sorting of forms according to the categorizations implied by the tags. Quotations or paraphrases of the speaker on a particular form should be placed in the value of the speaker comments attribute. This type of comment is conceptually distinct enough from general comments that it deserves its own attribute.
It is often very insightful to approach linguistic data by focusing on the speaker’s own intuitions about them, and a dedicated speaker comments attribute facilitates this.

[116] The OLD should be able to accommodate this approach by allowing the morpheme break value to count as the required minimal transcription value. Currently the system requires an orthographic transcription for all forms but this is a restriction that should be lifted in favour of a more flexible one that allows any of the various transcription types to meet the minimal one-transcription requirement. Since orthographic transcriptions are desirable to the broader community of fieldworkers and language communities, the system could be configured to auto-generate delimiter-less transcriptions (either upon data entry or upon export).

A dedicated forced-choice field for morpho-syntactic categories is also well-founded. This information helps with sorting forms into abstract lexical items vs. records of utterance events. Of course, the system must still be told, in some sense, that forms associated to a tag with a name value of Agr, for example, are abstractions while those categorized via an S-labelled tag are not. The category information supplied by this attribute is also crucial to the system-generated syntactic category string and break-gloss-category values, as described above.

The elicitation method object used to classify forms is also notable. This information can be useful when assessing whether a datum is solid evidence for a particular claim. In semantic research, for instance, data elicited by requesting translations of metalanguage forms may not, on their own, be sufficient as evidence for certain claims. Data elicited by describing a carefully controlled context and, for example, requesting that the language consultant judge whether a provided form is felicitous within that context are another valuable type of evidence.
In order that the relevance of data points can be assessed by researchers on their own terms, these different methods of elicitation should be recorded via the elicitation method attribute[117] and any information about the context provided can be included in the value of the general comments attribute.[118]

[117] It may be useful to specify defaults for the elicitation method values. I invite suggestions for best practices in categorizing elicitation methods.

[118] The OLD data structure now includes a context field on form objects for the specification of such contextual semantic information.

Text is not, of course, the only medium of language data. Audio and video recordings of speech are valuable and oft-created artifacts and these are encoded in OLD applications as file objects. The OLD allows for multiple files to be associated with a single form, reflecting the fact that multiple digital files may be relevant to a given form. Searches across forms can also contain conditions on the file or files with which the forms are associated.

The OLD currently supports some degree of indication of the nature of the relation between a form and an associated file. At present, one can specify, via the utterance type attribute, whether an audio file is a recording of an utterance in the object language, one in the metalanguage,[119] or one that is mixed (i.e., both object and meta). Programs that make use of OLD data may use these specifications and assume that, when a form is associated with an audio file classified as an object language utterance, the file is a recording of an utterance of that very form.[120] Such programs could then use this information in the creation of, say, learning games that get players to recognize spoken data.

The data structure for OLD files allows for three types of audio file: those whose digital content is stored on the OLD application server, those whose content is served elsewhere, and those whose content is a sub-portion of another audio file.
The last of these is constructed by specifying a parent audio file and start and end times within the recording; these are useful for expediting the identification of smaller audio clips that correspond directly to forms, without performing the time-consuming task of manually editing audio files.

[119] A recording of an utterance in the metalanguage could be useful for non-visual language learning tools. A simple example would be a series of ordered utterance-translation audio file pairs that a learner could listen to on an MP3 player in order to improve their vocabulary.

[120] Of course, this assumption is problematic since a form may be associated to an object language audio file that contains, but does not consist in, an utterance of the form. For this reason, the OLD data structure will soon be updated so that the very associations between forms and files are what are categorized.

The data structure also encodes corpus, collection, phonology, morphology, morphological parser, and morpheme language model objects. These are discussed in detail in various sections below. There are also backup objects created for collections, corpora, and forms whenever one of these is updated or deleted. This helps to mitigate the danger of inadvertent data loss.

This concludes the discussion of the data structure of the OLD. While it will continue to evolve in response to user needs, it currently embodies a number of design principles that are original and which contribute to the effective accomplishment of fieldwork goals.

2.3.5 Interface

The various interfaces of the OLD contribute to its effectiveness as a fieldwork tool. There are in fact three interfaces: the OLD 0.2 GUI, the in-development OLD 1.0 GUI, and the OLD 1.0 application programming interface (API).

OLD 0.2 GUI

The OLD 0.2 GUI boasts an uncluttered design, keyboard shortcuts to expedite navigation, and integrated documentation.
Keyboard shortcuts are provided for accessing the search, browse, and create interfaces for forms, files, and collections. Thorough (though now somewhat outdated) user-oriented documentation (i.e., the OLD User Guide) is present in every OLD 0.2 application. Additionally, key parts of the interface have help buttons which bring up in-page instructions for accomplishing common tasks.

Figure 2.3: Screen shot of the Blackfoot OLD interface showing an IGT form display.

Figure 2.3 is a screen shot that shows a sentential form from the Blackfoot OLD displayed in interlinear glossed text format. The top line is an orthographic transcription. The bottom line is the sole translation. Lines two and three are the values of the morpheme break and morpheme gloss attributes. The colour-coded links indicate the degree of consistency between the morphological analysis and the lexical items in the database. Blue links indicate perfect matches, green links indicate partial matches, and the absence of a link indicates that there are no matches. That is, this representation shows that the database contains a lexical entry for the morpheme imitáá ‘dog’ (blue links mean perfect match) but none for iksimatsi’ts ‘appreciative’ (no links mean no match). The morpheme with the phonemic shape a’pii and the gloss ‘be’ is partially matched, as indicated by the fact that ‘be’ is a green link and a’pii is not a link. This means that the database contains no forms with a’pii as their morpheme break value. However, it does contain at least one form with ‘be’ as its gloss value. If a user clicks on the green link, they will be presented with the nine lexical items glossed as ‘be’, one of which has the shape a’p. This suggests that the user may have mistranscribed the phonemic shape of this particular morpheme; or perhaps the lexical entries in the system are incomplete or inaccurate.
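The link colouring described for Figure 2.3 can be approximated in a few lines of Python. This is a sketch of the behaviour as the text describes it, not the OLD 0.2 code: the two-entry `LEXICON`, the function name `link_status`, and the status labels are all assumptions for illustration, and a real OLD lexicon stores forms with many more attributes.

```python
# Hypothetical miniature lexicon of (shape, gloss) pairs, written with plain
# apostrophes; it holds just the two entries the discussion mentions.
LEXICON = [("imitáá", "dog"), ("a'p", "be")]


def link_status(shape, gloss, lexicon=LEXICON):
    """Return (shape_status, gloss_status) for one morpheme.

    "perfect": some lexical entry has this shape and this gloss together
               (a blue link in the description above).
    "partial": the element occurs in some entry, but not paired with the
               other element (a green link).
    None:      the element occurs nowhere in the lexicon (no link).
    """
    if (shape, gloss) in lexicon:
        return ("perfect", "perfect")
    shape_status = "partial" if any(s == shape for s, _ in lexicon) else None
    gloss_status = "partial" if any(g == gloss for _, g in lexicon) else None
    return (shape_status, gloss_status)


dog = link_status("imitáá", "dog")                     # both elements fully matched
be = link_status("a'pii", "be")                        # only the gloss is known
appreciative = link_status("iksimatsi'ts", "appreciative")  # nothing is known
```

Under these assumptions the a’pii/‘be’ case comes out as an unlinked shape with a partially matched gloss, mirroring the green ‘be’ link and the unlinked a’pii in the figure.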
Whether the fault lies in the transcription or in the lexicon, the visual feedback provided by the GUI helps fieldworkers to see gaps and/or inconsistencies in their analyses.

To further aid in the entry of lexically consistent morphological analyses, an in-page quick-search interface is provided. This performs searches asynchronously, i.e., without page refreshes and without locking the interface, so that lexical items, for example, can be looked up without leaving the form creation page. Thus, when creating the form in figure 2.3 the user could have used quick-search to look for either a’pii or iksimatsi’ts and found, prior to creation, that neither was present in the system.[121]

[121] The OLD 1.0 GUI will, during form creation, single out morphemes in the form that are not recognized and will prompt the user to have the system automatically create (all or a chosen subset of) them. This will allow contributors to focus on entering their phrasal and sentential data while automating the entry of the implicit lexical items and will greatly increase the rate at which these valuable lexical entries are created.

Though the OLD 0.2 GUI has many merits (such as those just discussed), it also reveals several opportunities for improvement and these have spurred the in-development GUI rewrite for the 1.0 version.

OLD 1.0 GUI

The OLD 1.0 GUI is currently under development and a prototype of the form add and browse interfaces has been implemented. The system is an independent client-side single-page application written in CoffeeScript, using the Backbone MV* framework, and jQuery tools for the UI widgets and DOM manipulation. A more desktop-style experience is targeted, including a more consistent visual experience, expanded keyboard shortcuts (striving for pointer-less controllability), and conveniences such as auto-expanding input fields.
Form browsing is much improved, with clickable buttons and keyboard shortcuts to scroll through, highlight, reveal hidden data of, and perform actions on form objects.

Numerous other improvements over the 0.2 interface are planned. Notable among these are a tabular data view and some level of offline capability and client-side data storage. Of course, a GUI for creating and using morphophonological models and parsers will need to be implemented, as will a GUI for corpora manipulation and treebank search, including tree representations of bracket-notation phrase structures. Also, certain (previously) server-side functionalities, e.g., orthography conversion and import/export, are being moved client-side. The technologies available for client-side browser-embedded applications have advanced rapidly in the past ten years and there are many exciting opportunities here.

OLD 1.0 API

The strength of the OLD 1.0 application programming interface (API) is its conceptual simplicity. Following the REST paradigm (cf. Fielding, 2000), OLD objects are viewed as resources that have a standard set of operations, and a corresponding set of patterns of HTTP methods and URIs for requesting them. The operations are search, create, read, update, and destroy (SCRUD). The method-URI pairs are POST /object for create, GET /object (and GET /object/id) for read, PUT /object/id for update, DELETE /object/id for destroy, and SEARCH /object (also POST /object/search) for search. Data in and data out are, throughout, UTF-8-encoded JSON-serialized associative arrays.[122]

Aside from some idiosyncratic requests of particular objects, the description of the API in the above paragraph is sufficient to allow a developer to create programs that interact with any OLD application, effectively making an OLD application a sub-component of any program using it. This conceptual simplicity makes building tools on top of an OLD 1.0 web service an attractive proposition.
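The SCRUD method-URI pairs listed above can be written down as a small lookup table. The following Python sketch only builds the pieces of a request and performs no network access. The names `SCRUD` and `build_request`, the resource name `forms`, and the payload used in the example are assumptions for illustration, not part of any documented OLD client.

```python
import json

# HTTP method and URI pattern for each operation, following the pairs in the text.
SCRUD = {
    "create": ("POST", "/{resource}"),
    "read_all": ("GET", "/{resource}"),
    "read": ("GET", "/{resource}/{id}"),
    "update": ("PUT", "/{resource}/{id}"),
    "destroy": ("DELETE", "/{resource}/{id}"),
    "search": ("SEARCH", "/{resource}"),
    "search_post": ("POST", "/{resource}/search"),
}


def build_request(operation, resource, id=None, payload=None):
    """Return (method, path, body); body is UTF-8 encoded JSON or None."""
    method, pattern = SCRUD[operation]
    if "{id}" in pattern and id is None:
        raise ValueError("%s requires an id" % operation)
    path = pattern.format(resource=resource, id=id)
    body = None
    if payload is not None:
        body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return method, path, body


# Hypothetical update of form 7; the payload keys are illustrative only.
method, path, body = build_request(
    "update", "forms", id=7, payload={"transcription": "chien"}
)
```

Because every resource follows the same table, a client written this way needs no per-object code beyond the resource name, which is the sense in which the API is conceptually simple.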
The ease of building tools on top of an OLD web service, in and of itself, constitutes a potentially wide-ranging benefit to linguistic fieldwork.

2.3.6 Search

A core requirement of a successful linguistic fieldwork application is that relevant data be retrievable quickly and accurately. OLD applications fulfill this requirement by facilitating powerful searches over the data they contain. The OLD search functionality is, at its core, a simplified interface to the querying power of the underlying RDBMS, i.e., the relational database management system.[123] The searchability of the data set is enhanced by the OLD data structure, the auto-generation of values for particular attributes of forms, and the implementation of phrase structure search over treebank corpora.

[122] An associative array is an abstract data type consisting of a set of key-value pairs. In Python these are called dictionaries and in JavaScript/JSON they are called objects.

[123] Formulating searches based on conditions on relational attributes via SQL queries (i.e., performing joins) is prohibitively difficult for the majority of us who are not database wizards. The OLD simplifies relational search by hiding this particular complication from the user. While this results in some conceptual simplicity, it obviously comes at the cost of some expressivity in query formulation.

Just like a where clause in a structured query language (SQL) query, an OLD search is a hierarchy of filter expressions, i.e., it is a list of filters that are coordinated (conjoined or disjoined) and bracketed into a structure of unbounded complexity. A filter expression is a requirement that the value of a particular attribute possess a specified property, e.g., that the orthographic transcription begin with the character t or that the elicitor be among a specified set of users.

The OLD 1.0 search API expects search queries to be formatted as JSON arrays[124] (i.e., lists).
The simplest such list is a 4-tuple where the first element is the name of an OLD object, the second the name of an attribute of that object, the third a relation (e.g., ‘=’), and the fourth a pattern such that the relation holds between the pattern and the value of the object’s attribute. These quaternary arrays are here dubbed simplex filter expressions. The JSON array in (4) (when sent in the body of a search request to the forms resource of an OLD web service) will return all forms whose (orthographic) transcription attribute has exactly the value chien. The OLD search array in (4) is equivalent to the SQL select query in (5).

(4) ["Form", "transcription", "=", "chien"]

(5) select * from form where transcription='chien';

[124] OLD search arrays are here presented using the JavaScript Object Notation (JSON) syntax. An array is a sequence of comma-delimited elements enclosed in square brackets. Strings are enclosed in double quotation marks and numbers are represented without special formatting. Following standard conventions for Python class and attribute names, the names of OLD objects are written in camel case (e.g., a syntactic category object is referred to as SyntacticCategory) and the names of OLD object attributes are written in snake case (e.g., a syntactic category attribute is referred to as syntactic_category).

A more complex query is illustrated by the plausible Blackfoot search in (6).
This retrieves all grammatical (actually, not ungrammatical) sentences containing a morpheme that is phonemically transcribed as wa or one that is glossed as one of five possible shorthands for proximate.[125] This is a reasonably good approximation of a query that returns sentences containing the proximate suffix -wa.[126]

(6) ["and", [["Form", "syntactic_category", "name", "=", "S"],
             ["not", ["Form", "grammaticality", "=", "*"]],
             ["or", [["Form", "morpheme_break", "regex", "-wa( |-|$)"],
                     ["Form", "morpheme_gloss", "regex", "-(PROX|prox|Prox|PRX|prx)"]]]]]

Search (6) illustrates several important concepts relevant to OLD search construction: the syntax for constructing complex queries, conditions on relational attributes, and regular expressions.

The query in (6) is a complex filter expression constructed from simplex filter expressions that are composed with and coordinated via logical operators, i.e., conjunction "and", disjunction "or", or negation "not". As the query illustrates, a logical operator is always the first element of a binary array, which is itself a (complex) filter expression. Conjunction and disjunction are followed by a list of filter expressions that constitute the conjuncts and disjuncts, respectively. Negation, on the other hand, is followed simply by the negated filter expression.

[125] In a collaboratively created database, it is sometimes necessary to create searches based on such disjunctive conditions. Of course, in an ideal situation, all contributors would gloss the same morpheme in the same way.

[126] Of course, the is not ungrammatical condition might also be expressed using the inequality relation != instead of the negation operator. The example search is constructed this way in order to illustrate the syntax for negating filter expressions.

The first filter expression in (6) is a five-element (i.e., quinary) array that expresses a condition on a relational attribute, i.e., the syntactic category attribute of forms.
The attribute is relational because its value is a reference to another object, viz., a syntactic category object. These objects have their own attributes, including the name attribute that is relevant to the condition currently under inspection. In contrast to the [object, attribute, relation, pattern] form of the quaternary OLD filter expression array, the quinary relational one is of the form [object, attribute, attribute, relation, pattern]. That is, the second element of the array is the name of the relational attribute (e.g., syntactic_category) and the third element is the name of an attribute of the object referenced by the relational attribute (e.g., name). Thus this filter expresses the condition that a form must have a syntactic category whose name is S.

The third and fourth simplex filter expressions of the query in (6) make use of regular expressions, hence regex as the name of the relations here. Regular expressions are powerful tools for succinctly expressing complex patterns over sequences. The string -wa( |-|$) is a regular expression pattern that matches -wa followed by a space, another morpheme delimiter (-), or the end of the string ($). The string -(PROX|prox|Prox|PRX|prx) matches the hyphen delimiter followed by any of five variously spelled shorthands for the word proximate. These regular expressions show how to use parentheses to group disjunctive options delimited by the vertical bar character | and how to match the end of a string using the dollar sign character $. This just scratches the surface of what is made possible by regular expressions.

Of course, users of a graphical interface to an OLD web service should not be prompted to construct queries by composing JSON arrays as in (6). A sensible GUI will provide forced-choice fields where these are appropriate as well as visually intuitive interfaces for constructing the desired bracket/tree structures. Also highly useful will be in-page help with composing regular expressions.
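The behaviour of the two regular expressions from (6) can be checked directly. The sketch below uses Python's `re` module as a stand-in for whatever regular expression engine the underlying RDBMS provides, so minor dialect differences are possible. The test strings are made up solely to exercise the patterns and are not Blackfoot data.

```python
import re

break_pattern = re.compile(r"-wa( |-|$)")
gloss_pattern = re.compile(r"-(PROX|prox|Prox|PRX|prx)")

# Invented morpheme break strings: -wa at the end of the string, before a
# space, before another delimiter, inside a longer morpheme, and as a prefix.
break_examples = ["stem-wa", "stem-wa stem", "stem-wa-x", "stem-wan", "wa-stem"]
break_matches = [bool(break_pattern.search(s)) for s in break_examples]

# Invented morpheme gloss strings.
gloss_examples = ["dog-PROX", "dog-prx run-3SG", "dog-PL"]
gloss_matches = [bool(gloss_pattern.search(s)) for s in gloss_examples]
```

The first three break strings match and the last two do not: the trailing group `( |-|$)` keeps -wa from matching inside a longer suffix, and the leading hyphen keeps it from matching a word-initial wa.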
When completed, the GUI for the OLD 1.0 web service will allow users to define shorthands (i.e., macros) for oft-used regular expressions that can be used within search patterns. A practical example of this would be defining shorthands for regular expressions that match consonants, vowels, and syllables (relative to the inventory and phonotactics of the target language) and then using these macros to restrict searches to forms containing n-syllable words or even particular patterns within syllable i of a word.

The OLD objects that can be searched are forms, files, collections, corpora, sources, the set of forms remembered by a particular user,[127] languages, and searches of forms.[128]

[127] Users can remember forms, meaning they can store references to specified forms in their OLD memory. The memorizers of a form are the set of users who have remembered the form, i.e., saved it to their memory within the system. The memory is simply a place where users can store a set of forms to be used for some other purpose, e.g., creating a text or for export.

[128] Recall that searches over forms can be saved as objects in their own right. These form search objects can themselves be searched.

Since form objects are the most often searched, here follows a listing of the attributes that can be referenced by filter expressions over forms. For details on these attributes, see Appendix A. Searchable form attributes whose values are strings are transcription, phonetic transcription, narrow phonetic transcription, morpheme break, morpheme gloss, comments, speaker comments, grammaticality, Universally Unique Identifier (UUID), syntactic category string, morpheme break identifiers, morpheme gloss identifiers,[129] break-gloss-category, syntax, and semantics. The value of the date elicited attribute is a date, and those of date-time entered and date-time modified are date-times. The value of the identifier attribute is an integer.
Each of the following attributes has as its value another OLD object (the name of the object is indicated in parentheses, unless this is the same as the name of the attribute itself): elicitor (user), enterer (user), verifier (user), speaker, elicitation method, syntactic category, and source. Each of the following attributes has as its value a collection of other objects: translations, tags, files, collections, memorizers (users), and corpora.

[129] The morpheme break identifiers and morpheme gloss identifiers values are system-generated JSON arrays storing references to other OLD forms that match the user-supplied morphemes and glosses in the morpheme break and morpheme gloss fields.

The relations that can be specified in simplex filter expressions are equality (=), inequality (!=), regular expression (regex or regexp), like (like), is an element of (in), less than (<), greater than (>), less than or equal to (<=), and greater than or equal to (>=). The like relation matches strings against patterns that may contain two special wildcards: the underscore character _, which matches any single character, and the percent sign character %, which matches any sequence of zero or more characters. In practice, the like relation is often used for substring match, e.g., querying the database for forms with a translation transcription like %hamburger% to get all data that have hamburger in one of their translations.

The in relation tests whether the value of the specified attribute is identical to any of the elements of the specified pattern array. For example, the single-filter in-based query in (7) returns all forms orthographically transcribed as chien or chat.
The queries in (8) and (9) use the regular expression relation[130] and the disjunction operator, respectively, to return the same result set as that returned by (7).

(7) ["Form", "transcription", "in", ["chien", "chat"]]

(8) ["Form", "transcription", "regex", "^(chien|chat)$"]

(9) ["or", [["Form", "transcription", "=", "chien"],
            ["Form", "transcription", "=", "chat"]]]

The <, <=, >, and >= relations have obvious semantics when numbers are being compared; however, they also work with strings, dates, and date-time values. The query in (10) returns all forms with an identifier value less than 100, a date of elicitation before Christmas 2012, and an orthographic transcription that would be ordered before abacus in an alphabetic sort.

(10) ["and", [["Form", "id", "<", 100],
              ["Form", "date_elicited", "<", "2012-12-25"],
              ["Form", "transcription", "<", "abacus"]]]

[130] In regular expressions, the caret character ^ matches the beginning of the string and the dollar sign $ the end of the string.

With only the search tools described so far, complex queries can be constructed and relevant forms (and other objects) can be accessed quickly.[131] However, in addition to this the OLD enhances search based on morphological criteria via the system-generated values for the syntactic category string and break-gloss-category attributes of forms. Consider the hypothetical form (11) in an OLD application for French. Assuming that the database contains sensibly categorized lexical entries corresponding to those in its morphological analysis, (11)’s syntactic category string value will be (12).

(11) Les      chiens    mangeaient
     le-s     chien-s   mange-aient
     the-PL   dog-PL    eat-3PL.IMPF
     ‘The dogs were eating.’

(12) D-Num N-Num V-Agr

The auto-valued syntactic category string attribute allows forms to be searched according to morphological patterns.
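The claim that (7), (8), and (9) return the same result set can be checked with a toy evaluator for OLD-style filter expressions. This is a sketch over plain Python dictionaries, not the OLD's query builder (which translates filters into RDBMS queries). The names `like_to_regex`, `RELATIONS`, and `matches` are assumptions, and Python's own string ordering and `re` dialect stand in for the database's collation and regular expression engine.

```python
import re


def like_to_regex(pattern):
    """Translate a like pattern: % is any run of characters, _ exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"


RELATIONS = {
    "=": lambda v, p: v == p,
    "!=": lambda v, p: v != p,
    "in": lambda v, p: v in p,
    "<": lambda v, p: v < p,
    "<=": lambda v, p: v <= p,
    ">": lambda v, p: v > p,
    ">=": lambda v, p: v >= p,
    "regex": lambda v, p: re.search(p, v) is not None,
    "like": lambda v, p: re.search(like_to_regex(p), v, re.DOTALL) is not None,
}
RELATIONS["regexp"] = RELATIONS["regex"]


def matches(record, expr):
    """Evaluate a filter expression against a dict standing in for a form."""
    head = expr[0]
    if head == "and":
        return all(matches(record, e) for e in expr[1])
    if head == "or":
        return any(matches(record, e) for e in expr[1])
    if head == "not":
        return not matches(record, expr[1])
    if len(expr) == 5:  # relational: [object, attribute, attribute, relation, pattern]
        _, attribute, sub_attribute, relation, pattern = expr
        value = record[attribute][sub_attribute]
    else:               # simplex: [object, attribute, relation, pattern]
        _, attribute, relation, pattern = expr
        value = record[attribute]
    return RELATIONS[relation](value, pattern)


forms = [{"transcription": t} for t in ["chien", "chat", "chiens", "cheval"]]
q7 = ["Form", "transcription", "in", ["chien", "chat"]]
q8 = ["Form", "transcription", "regex", "^(chien|chat)$"]
q9 = ["or", [["Form", "transcription", "=", "chien"],
             ["Form", "transcription", "=", "chat"]]]
results = [[f["transcription"] for f in forms if matches(f, q)] for q in (q7, q8, q9)]
```

All three queries select the same two of the four invented forms. Dates written in ISO format, as in (10), compare correctly as strings under this scheme, although a real database compares them as dates.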
For example, a regular expression filter on syntactic category string values using the pattern in (13) will return forms consisting of a determiner (with zero or more suffixes), followed by a noun (with zero or more suffixes), followed by a verb (with zero or more suffixes). Thus, (11), or rather its syntactic category string value, will be matched by the regular expression in (13). A form such as le chien mange with a syntactic category string value of D N V-Agr will also be matched. This is a reasonable start for a search that returns sentences consisting of an intransitive verb with a simple DP subject.

(13) ^D(-[^ ]+)? N(-[^ ]+)? V(-[^ ]+)?$

[131] OLD search requests can also specify the ordering of the results returned as well as a limit on the number of results returned (i.e., SQL order by and limit). The details of this are not discussed here.

The key to understanding the regular expression in (13) is the component (-[^ ]+)? which matches zero or more suffix category names. The sub-expression [^ ] matches any character except a space. The plus sign character is a quantifier that means one or more times, so -[^ ]+ matches a hyphen followed by one or more non-space characters. This is a sufficient characterization of one or more suffixes. Since we also want to allow for free morphemes without suffixes, we use the question mark character quantifier, meaning zero or one time, over the entire sub-expression (with scope indicated by the parentheses): (-[^ ]+)?.

More fine-grained morphological patterns can be targeted by creating conditions on break-gloss-category values. This is because the phonemic shape, gloss, and categorial information that constitutes the morphological analysis of a form is all present here.
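The pattern in (13) can be tried against a handful of syntactic category strings. In this short sketch Python's `re` module again stands in for the database's regular expression engine, and apart from the two strings discussed in the text the category strings (including the category name Dim) are invented for illustration.

```python
import re

pattern_13 = re.compile(r"^D(-[^ ]+)? N(-[^ ]+)? V(-[^ ]+)?$")

category_strings = [
    "D-Num N-Num V-Agr",  # the value in (12)
    "D N V-Agr",          # le chien mange
    "D N-Num-Dim V",      # several suffixes on the noun, none on the verb
    "D Adj N V-Agr",      # an intervening adjective
    "N V-Agr",            # no determiner
]
matched = [s for s in category_strings if pattern_13.search(s)]
```

The first three strings are matched and the last two are not. The third case shows that a single `(-[^ ]+)?` group covers a whole run of suffixes, since the hyphen between suffix categories is itself a non-space character.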
Within the paradigm of concatenative morphology, a morphological analysis can be conceptualized as a list of words, each of which is a list of morphemes (interleaved by morpheme delimiters[132]), each of which is a list of three elements, viz., shape, gloss, and category (14). The break-gloss-category value is simply a serialization of such a list of nested lists. The three components identifying a morpheme are delimited by an arbitrary (rare) delimiter, here represented as the vertical bar, and the serialized words are delimited by spaces (15).

(14) [[["le", "the", "D"], "-", ["s", "PL", "Num"]],
      [["chien", "dog", "N"], "-", ["s", "PL", "Num"]],
      [["mange", "eat", "V"], "-", ["aient", "3PL.IMPF", "Agr"]]]

(15) le|the|D-s|PL|Num chien|dog|N-s|PL|Num mange|eat|V-aient|3PL.IMPF|Agr

[132] One could represent the morphological analysis of a word as simply a list of morpheme triples. However, since the choice of delimiter can have significance, this is, in general, an undesirable approach. One convention where choice of delimiter is significant is where the hyphen is used to indicate affixation and the equals sign for cliticization (cf. Bickel et al., 2008).

By performing regular expression searches on break-gloss-category values, conditions based on different aspects of the morphological analysis can be expressed simultaneously. For example, one could search for all forms containing a bi-morphemic word whose first morpheme is categorially N and whose second morpheme is glossed as PL (16).

(16) ( |^)[^ -]+\|[^ -]+\|N-[^ -]+\|PL\|[^ -]+( |$)

In (16), ( |^) matches a space or the beginning of the string and, similarly, ( |$) matches a space or the end of the string. This ensures that we are matching a word. The sub-expression [^ -]+\|[^ -]+\|N matches a morpheme categorized as N, while [^ -]+\|PL\|[^ -]+ matches a morpheme glossed as PL.
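The relationship between (14), (15), and (16) can be sketched in Python. This is an illustration only: the function name `serialize` is hypothetical and is not taken from the OLD source code, and Python's `re` module is assumed to behave like the regular expression engine used by the search interface.

```python
import re

# The nested-list analysis in (14): words > morphemes (with delimiters) > triples.
ANALYSIS_14 = [
    [["le", "the", "D"], "-", ["s", "PL", "Num"]],
    [["chien", "dog", "N"], "-", ["s", "PL", "Num"]],
    [["mange", "eat", "V"], "-", ["aient", "3PL.IMPF", "Agr"]],
]


def serialize(analysis, separator="|"):
    """Flatten a nested analysis into a break-gloss-category string as in (15)."""
    words = []
    for word in analysis:
        # Morpheme triples are joined with the separator; delimiters are kept as-is.
        parts = [item if isinstance(item, str) else separator.join(item)
                 for item in word]
        words.append("".join(parts))
    return " ".join(words)


# The pattern in (16): a two-morpheme word, first morpheme N, second glossed PL.
PATTERN_16 = re.compile(r"( |^)[^ -]+\|[^ -]+\|N-[^ -]+\|PL\|[^ -]+( |$)")

serialized = serialize(ANALYSIS_14)
print(serialized)
print(PATTERN_16.search(serialized) is not None)
```

Keeping the morpheme delimiters in the nested list, rather than assuming a hyphen, is what lets the serialization preserve a distinction such as affixation versus cliticization.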
In these sub-expressions, [^ -]+ matches a string of one or more characters none of which are spaces or hyphens; this effectively matches a phonemic shape, a gloss, or a category name. The sub-expression \| matches the vertical bar, which must be escaped with a backslash so that the regular expression interpreter does not read it as signifying disjunction. The regular expression in (16) should return forms containing words analyzed as chien|dog|N-s|PL|Num as well as words such as animaux that might be analyzed as anim|animal|N-o|PL|Num.

The break-gloss-category values are especially valuable in that they allow for fine-grained targeting of specific morphemes by permitting simultaneous reference to a morpheme’s shape, gloss, and category. This is useful for cases where two distinct morphemes have the same gloss (i.e., are synonymous) or the same shape (i.e., are homophonous) but the researcher wants to find forms containing only one of them. For example, depending on assumptions, Blackfoot may be analyzed as containing two homophonous -wa suffixes, one verbal and glossed as 3SG and the other nominal and glossed as PROX.SG. By creating a search condition on break-gloss-category values using the regular expression in (17), one can retrieve with certainty only forms containing the verbal suffix.

(17) ["Form", "break_gloss_category", "regex", "-wa\|3SG\|Agr( |-|$)"]

Note that attempting the same search by conjoining conditions on the morpheme break and morpheme gloss values, as in (18), will not suffice. This is because there is no guarantee that the phonemic shape match and the morpheme gloss match will correspond to the same morpheme. That is, the result set of the query in (17) will correctly exclude a form like (19) while the result set of (18) will incorrectly include (19).

(18) ["and", [["Form", "morpheme_break", "regex", "-wa( |-|$)"],
              ["Form", "morpheme_gloss", "regex", "-3SG( |-|$)"]]]
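The difference between (17) and (18) can be made concrete with a small Python sketch that evaluates both against the morpheme break and gloss values of the Blackfoot form in (19). Several things here are assumptions made for illustration: the category labels in the break-gloss-category string (V, Conj, Agr), the function names, and the second record, which is a schematic placeholder and not real Blackfoot data.

```python
import re

# (17): shape, gloss, and category are required to belong to one morpheme.
SAME_MORPHEME = re.compile(r"-wa\|3SG\|Agr( |-|$)")
# (18): two independent conditions, one on each attribute.
BREAK_HAS_WA = re.compile(r"-wa( |-|$)")
GLOSS_HAS_3SG = re.compile(r"-3SG( |-|$)")

# Form (19); the category labels are assumed, not taken from the source.
FORM_19 = {
    "morpheme_break": "ipiim-a ki inihki-wa",
    "morpheme_gloss": "enter-3SG and sing-3",
    "break_gloss_category":
        "ipiim|enter|V-a|3SG|Agr ki|and|Conj inihki|sing|V-wa|3|Agr",
}

# Schematic placeholder in which -wa really is glossed 3SG (invented).
PLACEHOLDER = {
    "morpheme_break": "STEM-wa",
    "morpheme_gloss": "stem-3SG",
    "break_gloss_category": "STEM|stem|V-wa|3SG|Agr",
}


def query_17(form):
    """Single condition on the break-gloss-category value."""
    return SAME_MORPHEME.search(form["break_gloss_category"]) is not None


def query_18(form):
    """Conjunction of separate conditions on break and gloss values."""
    return (BREAK_HAS_WA.search(form["morpheme_break"]) is not None
            and GLOSS_HAS_3SG.search(form["morpheme_gloss"]) is not None)


for name, form in [("(19)", FORM_19), ("placeholder", PLACEHOLDER)]:
    print(name, query_17(form), query_18(form))
```

In (19) the shape -wa and the gloss 3SG occur on different morphemes, so only the conjoined query is satisfied, which is the false positive the text describes.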
(19) Ipiima     ki   inihkiwa
     ipiim-a    ki   inihki-wa
     enter-3SG  and  sing-3
     ‘He came in and sang.’

A further improvement to search functionality is made possible by the corpus object introduced in the OLD 1.0. A corpus is, at its core, a list of forms. This list of forms can be written to a server-side file in accordance with a predefined set of corpus generation formats. At present, the only implemented corpus generation format writes to file the syntax value of each of the corpus’s forms, in their specified order and separated by newline characters. Assuming that these syntax values are phrase structure representations written in Penn Treebank-compatible bracket notation (cf. Marcus et al., 1993; Taylor et al., 2003), this generation format results in a treebank corpus that can be searched according to structural patterns using the TGrep2 utility (cf. Rohde, 2005). The effect is that OLD data sets can be searched according to complex syntactic criteria.

TGrep2 allows treebanks to be searched with reference to structural relationships between nodes, including whether one node dominates, immediately dominates, precedes, follows, or is a sister of another node. Regular expressions can also be used to match node names in a general manner. The example TGrep2 search expression in (20) returns all trees where a TP or an IP node (cf. the regular expression ^[TI]P$) immediately dominates (<) a DP which itself dominates (<<) an AP. Thus an OLD form for Le chien noir boit ‘The black dog is drinking’ with syntax value (21) will be matched by the TGrep2 search pattern in (20). For details on what is possible with TGrep2 structural searches, see Rohde (2005).

(20) /^[TI]P$/ < (DP << AP)

(21) (TP
       (DP (D le)
           (NP (NP (N chien))
               (AP (A noir))))
       (T'
         (T 0)
         (VP (V boit))))

Of course, structure-based search of OLD data sets is only possible if a significant subset of forms have valid syntax values.
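Since structural search depends on well-formed syntax values, a corpus writer might filter them before producing the newline-separated file described above. The following Python sketch is an assumption-laden illustration, not the OLD's actual corpus generation code: the function names are hypothetical, and "valid" is approximated here as "non-empty with balanced parentheses", which is weaker than full Penn Treebank well-formedness.

```python
def is_balanced(tree):
    """Return True if the bracketed string is non-empty and its parentheses balance."""
    stripped = tree.strip()
    if not stripped.startswith("("):
        return False
    depth = 0
    for character in stripped:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with no matching opener
                return False
    return depth == 0


def corpus_text(syntax_values):
    """Join the usable syntax values, one tree per line, in the order given."""
    lines = []
    for value in syntax_values:
        if value and is_balanced(value):
            # Collapse internal whitespace so that each tree occupies one line.
            lines.append(" ".join(value.split()))
    return "\n".join(lines)


# A tree for "Le chien noir boit" in the style of (21), plus two unusable values.
TREE = "(TP (DP (D le) (NP (NP (N chien)) (AP (A noir)))) (T' (T 0) (VP (V boit))))"
print(corpus_text([TREE, "", "(TP (DP (D le)"]))
```

The resulting text could then be written to a file and handed to TGrep2; the exact preparation steps that TGrep2 requires are described in Rohde (2005) and are not reproduced here.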
While some contributors may be motivated to create these manually, having the system generate them automatically is highly desirable. The OLD 1.0 already allows contributors to specify phonological and morphological models that can be used to create working generators and parsers. I hope to implement similar functionality with respect to syntactic modelling, with the ultimate goal of allowing for the creation of syntactic parsers that can help with the generation of structurally searchable syntactic representations. Analytic models and the parsers based on them, when integrated into a collaboratively created database like the OLD, constitute a unique opportunity for applying natural language processing techniques and formalisms to endangered language data sets for the benefit of fieldwork generally.

This section has shown how OLD applications facilitate the speedy and accurate retrieval of relevant data via a powerful search interface to a structured data set. Auto-generated supplementary morphological representations and structural search of treebanks further enhance the searchability of the data.

2.3.7 Automation

The OLD automates certain tasks in order to expedite the creation and retrieval of language data. The automatic generation of morphological representations via contributor-created parsers is discussed thoroughly in chapter 3. The automatic assignment of values to the syntactic category string and break-gloss-category attributes of forms is detailed in § 2.3.6.
In comparison with the aforementioned, the following automations, though still useful, are minor: auto-insertion of previously used values during create and update, the automatic saving of state when deleting or updating certain objects, and the creation of reduced-size copies of large digital files.

The OLD 0.2 allows users to configure the system to automatically enter into the form create/update interface previously entered values for category, speaker, elicitor, verifier, source, and date elicited. This speeds up data entry. Also, as a contributor types words into a transcription field, the system attempts to guess values for the morpheme break and morpheme gloss fields based on previously entered morphological analyses. Once the graphical user interface for the OLD 1.0 is completed, users will be given the option to build and use particular morphological parsers to automate the creation of morpheme break and gloss values based on their transcriptions. However, to improve performance (i.e., to avoid time-costly parse requests to the server), the OLD 1.0 GUI will store and (at least as a first approximation) look up previously entered morphological analyses client-side.

Whenever a form or a collection is updated or deleted, all of its values prior to modification are saved to a backup table. This allows users to view previous versions of these data types and restore state, if desired. This is especially useful in a multi-contributor system where one user’s data may be altered undesirably by another.[133] The OLD 0.2 currently allows users to view the history of a form, i.e., its previous states.

The OLD 1.0 can be configured to automatically create a lossy[134] copy of the digital file that constitutes the core of a file object. That is, large audio files in WAV format will be copied to the significantly smaller MP3 or OGG formats. Similarly, scaled-down copies will be created for large images.
These smaller copies can be used when higher performance across the network is needed. The larger, lossless copies are still retained for when they are needed, i.e., for archiving and detailed (e.g., acoustic) analysis.

[133] It may be desirable to implement a feature whereby a contributor can specify that certain or all of the forms that they enter not be modifiable by other users. At this point in time, the honour system has proved sufficient in this regard.

[134] Lossy file formats are those which employ compression to reduce the size of the file yet result in information loss. File formats that do not result in data loss are termed lossless.

2.3.8 Consistency

The greater the consistency of a language database, the more effectively can relevant data be retrieved from it. The OLD seeks to promote consistency in the formatting of data while still allowing for variations that represent real differences in analysis. To this end, it provides features that encourage consistency of transcription, of morphological analysis, and of overall structure. The data structure of the OLD enforces a certain level of consistency in how language data are carved up and labelled; the description of and argument for that structure is provided in § 2.3.4 and is not repeated here. Unicode normalization, input validation, orthography conversion, and morpho-lexical consistency feedback are features that are covered here.

2.3.8.1 Unicode normalization

The Unicode standard (Unicode Consortium, 2009) specifies a very large set of characters and assigns a unique code point, i.e., integer, to each one. There are a number of encodings that can be used to store strings of Unicode characters as binary data. OLD applications store string data using the UTF-8 encoding. Crippen (2010) is an excellent overview of Unicode with emphasis on its relevance to linguistics.

Unicode data can sometimes result in troublesome inconsistencies where equivalence classes are concerned.
These are sets of characters and/or character sequences that are considered equivalent. For example, the Unicode standard defines single code points for both combining accent characters and for certain precomposed character-accent combinations. To illustrate, there is an equivalence between the precomposed character á (i.e., LATIN SMALL LETTER A WITH ACUTE, U+00E1[135]) and the two-character sequence consisting of a (LATIN SMALL LETTER A, U+0061) followed by the COMBINING ACUTE ACCENT character U+0301. That is, á (U+00E1) is considered equivalent to á (U+0061, U+0301).

[135] Note the convention of upper-casing names of characters, and that of specifying a Unicode code point as a 4-digit hexadecimal number prefixed by U+. In fact, the standard defines more characters than 16^4 = 65536 and it is not uncommon to see 5-digit hex code points.

Issues arise because of inconsistencies in how different programs and operating systems handle these equivalence classes, with some opting for silent composition and others for decomposition. The result is that string data that may appear identical to the user are actually represented differently underlyingly and this can affect search. For example, searching a database for values containing á may (contrary to the searcher’s desires) return results containing U+00E1 but not those containing U+0061, U+0301, or vice versa. An additional complication is that there is no precomposed counterpart to many sequences of base character followed by combining character, as is the case for acutely accented schwa ə́, which can only be represented as the schwa character (U+0259) followed by the combining acute accent character.

The OLD handles this issue by normalizing all user input via the Normalization Form Canonical Decomposition algorithm, better known by its acronym NFD.
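The effect of NFD normalization on search can be sketched with Python's standard `unicodedata` module. The helper names `nfd` and `contains` are hypothetical and the strings are made up for illustration; the sketch shows the general technique and makes no claim about how the OLD itself invokes normalization.

```python
import unicodedata


def nfd(text):
    """Return the canonical decomposition (NFD) of a string."""
    return unicodedata.normalize("NFD", text)


PRECOMPOSED = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
DECOMPOSED = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT


def contains(stored_value, search_term):
    """Substring search in which both sides are NFD-normalized first."""
    return nfd(search_term) in nfd(stored_value)


# The two representations look alike but compare unequal as raw strings.
print(PRECOMPOSED == DECOMPOSED)
# After normalization they are identical.
print(nfd(PRECOMPOSED) == DECOMPOSED)
# A naive search misses the match; a normalized search finds it.
print(DECOMPOSED in "x\u00e1y", contains("x\u00e1y", DECOMPOSED))
# Accented schwa has no precomposed form, so even composition leaves two code points.
print(len(unicodedata.normalize("NFC", "\u0259\u0301")))
```

Normalizing to the decomposed form rather than the composed one has the advantage that sequences with and without a precomposed counterpart are all stored in the same way.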
To return to the previous example, this means that an accented a character will always be stored within an OLD application as a sequence of base character a followed by the combining acute accent character, i.e., as U+0061, U+0301. Both input that results in the creation of data and input that constitutes search formulations are NFD-normalized in this way. This ensures that users will be able to search for data accurately without worrying that Unicode equivalence issues might be masking potential matches.[136]

[136] Note that Unicode NFD normalization has consequences for regular expression search. For example, the regular expression patterns [ae] and (a|e) are equivalent and will both match strings that contain either a or e. However, the regular expression patterns [áe] and (á|e) are not equivalent. The former will match a, the combining acute accent character U+0301, or e. The latter will match á or e, and is probably what is desired. Subtleties such as this need to be kept in mind when performing searches involving Unicode characters.

2.3.8.2 Input validation

An OLD application can be configured to restrict (or discourage) the use of certain characters and character sequences as values for certain form attributes, viz. the orthographic, narrow phonetic, phonetic, and morphophonemic transcriptions. This functionality thus enforces (or encourages) a certain degree of consistency in how linguistic forms are transcribed at all levels. In the OLD 0.2 graphical interface, the specified inventories of graphemes are used to generate clickable keyboard[137] widgets for entering all of the specified character sequences.[138]

[137] Such GUI “keyboards”, where non-standard characters are entered by clicking instead of typing, result in slow data entry and are therefore not a permanent solution. If a suitable OS-based keyboard layout is not available, then the Unicode keyboard layout editor Ukelele (created by SIL International) can be used to create one.

[138] The OLD v. 0.2 also provides an interface which allows users to attempt to type the necessary graphemes and then provides feedback comparing the name and code point of the characters entered with those prescribed.

Administrators can specify inventories of graphemes for each of the four transcription-type attributes. A grapheme is a character or sequence of characters (i.e., a string). The system determines whether user input can be constructed by concatenating the graphemes (plus whitespace characters, punctuation, and delimiters, as appropriate). If it cannot, either the attempted change is disallowed and an error message returned or a warning message is issued.

Sometimes speakers will include words from other languages in their utterances and these words will be pronounced according to the phonology of the language they are borrowed from. Often it makes sense for fieldworkers to transcribe these foreign words using the foreign language’s orthography or using phonetic characters not available in the validation inventories specified for the object language. Clearly this will be problematic if transcription input validation is set to restrict invalid input. The OLD handles this issue by allowing users to create forms tagged with the special foreign word tag. The system uses these foreign word forms to intelligently build exceptions into the input validation rules. For example, if John is a foreign word form in an OLD application for Blackfoot, then the system will allow a transcription like Nitsínoaa anna John ‘I saw John’ despite the fact that neither j nor J are present in the orthographic inventory specified for Blackfoot.
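The validation logic described above, including the foreign word exception, can be sketched as follows. This is a minimal illustration under stated assumptions, not the OLD's implementation: the grapheme inventory is a made-up subset chosen so that the example works and is not the actual Blackfoot orthography, the names `make_validator` and `is_valid` are hypothetical, and the choices to ignore case and to strip a fixed set of punctuation marks are simplifications.

```python
import re
import unicodedata


def nfd(text):
    return unicodedata.normalize("NFD", text)


# Illustrative inventory only (an assumption, not the real Blackfoot inventory).
INVENTORY = ["a", "h", "i", "k", "m", "n", "o", "p", "s", "t", "w", "y", "'",
             "\u00e1", "\u00ed", "\u00f3", "ts", "ks"]


def make_validator(graphemes, foreign_words=()):
    """Build a function that checks transcriptions against a grapheme inventory."""
    # Longer graphemes come first so that multi-character graphemes are preferred.
    normalized = sorted((nfd(g) for g in graphemes), key=len, reverse=True)
    word_re = re.compile("(?:" + "|".join(re.escape(g) for g in normalized) + ")+")
    foreign = {nfd(word) for word in foreign_words}

    def is_valid(transcription):
        for word in nfd(transcription).split():
            word = word.strip(".,;:!?")
            if not word or word in foreign:
                continue  # foreign word forms are licensed as whole words
            if word_re.fullmatch(word.lower()) is None:
                return False
        return True

    return is_valid


is_valid = make_validator(INVENTORY, foreign_words=["John"])
print(is_valid("Nits\u00ednoaa anna John"))
print(is_valid("Nits\u00ednoaa anna Joan"))
```

Because the exception is keyed on the whole word, registering John does nothing for Joan, whose initial j is still outside the inventory.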
Note that the presence of the John foreign word will not license a transcription like Nitsínoaa anna Joan; that will require the entry of a new foreign word form for Joan.[139]

[139] If the foreign word participates in morphological or phonological processes of the object language, it should probably be transcribed in accordance with the object language. E.g., the fabricated Blackfoot pseudo-example nitsigoogleatooma ‘I Googled it’ should probably be transcribed as nitsikooklatooma or some such thing.

Transcription input validation is notably useful in cases where there are several similar characters that might be used for a single purpose. A case where this comes up in fieldwork on languages of the Pacific Northwest is where there are several similar diacritic combining characters that contributors might use in transcribing ejective consonants. For example, the COMBINING COMMA ABOVE RIGHT (U+0315) character and the COMBINING COMMA ABOVE (U+0313) character look very similar and both can be used to signify an ejective consonant. Input validation can help to ensure that ejectives are transcribed consistently.

2.3.8.3 Orthography conversion

The OLD allows administrators to specify multiple grapheme inventories for the orthographic transcription attribute of forms and designate one as the system-wide storage orthography for such transcriptions. Users can then specify their own input/output orthography and use it to interact with the system wherever orthographic transcription values are concerned. The system transparently converts user input from the user-specific input orthography to the system-wide storage orthography and returns it in the user-specific output orthography. This facilitates consistency in orthographic representations while avoiding having to force contributors to adopt an unfamiliar orthography.
As it is not uncommon for understudied languages to have multiple orthographies, orthography conversion can be a handy convenience.

The OLD 0.2 implements orthography conversion in the server-side logic. Orthographic transcriptions are first converted and then, if necessary, the converted transcription is validated as described in the above section. The OLD 1.0 web service still allows for the specification of multiple orthographies, but it leaves the actual conversion to client-side logic. That is, orthography conversion for the OLD 1.0 is not yet implemented and will be implemented in the in-development GUI for that system.

2.3.8.4 Morpho-lexical consistency feedback

OLD applications display the morphemes within the morpheme break and morpheme gloss values of forms as colour-coded links that indicate the degree to which the morphemes specified are already present as forms in the database. If there is a mono-morphemic (i.e., lexical) entry in the database that matches (both in terms of shape and gloss) a morpheme used in the analysis of a poly-morphemic form, then the embedded morpheme is displayed as a blue link to the matching lexical form. If the match is only partial, i.e., if only the shape or gloss matches, then the embedded morpheme is displayed as a green link. Hovering the mouse over the green link brings up a display of the partial matches, thus allowing the user to alter the analysis (if appropriate) without performing a separate search task.

Unicode normalization of input, input validation, orthography conversion, and visual feedback on the lexical consistency of morphological analyses all work together to promote a consistent data set. This consistency helps fieldworkers to quickly and accurately search for and otherwise retrieve relevant data from an OLD application.

2.3.9 Dictionary

The OLD 0.2 provides a very basic dictionary-like interface to the lexical items within an OLD application.
The system assumes that a form is lexical if its morphological analysis lacks morpheme and word delimiters. These lexical forms are sorted and displayed according to the ordering of graphemes in the administrator-specified orthographic inventories. This convenience provides a familiar representation of (one characterization of) “the” lexicon[140] that is implicit in an OLD application.

[140] “The” is in scare quotes because there may be several distinct lexica in an OLD application. For example, a single OLD application may be used by two distinct fieldworker groups. Group A may analyze certain verbal forms as morphologically simplex and may define an OLD-internal lexicon (using the OLD’s corpus construct) as containing those putatively simplex verbal forms. Group B, however, may assume morphological complexity within those same verb forms and may therefore choose to define lexica which do not include such putatively complex verbs but instead contain the morphemes (e.g., verb roots and transitivity affixes) within them.

The OLD 0.2 dictionary implementation suffers, however, from the fact that, upon each new request, the dictionary is re-retrieved by means of a costly database query and its display representation is regenerated. In order to improve responsiveness, the OLD web service or the GUI should be amended so that the representations of dictionaries are cached.[141] Other issues with the OLD’s support for dictionary creation are discussed above in the sections on LingSync (§ 2.2.4) and FLEx (§ 2.2.3).

[141] An alternative approach would be to create a separate web application dedicated to providing a dictionary interface to a lexicon defined as a corpus (or search) object of an OLD 1.0 web service. I leave this for potential future work.

2.3.10 Documents

An OLD application allows users to create documents consisting of formatted prose and IGT-formatted representations of the form objects that are present in the database. Such documents are composed as collection objects. These objects have a number of attributes, but the most important among them is the contents attribute whose value is a string of text containing formatting commands in one of two lightweight markup languages (reStructuredText and Markdown) and references to form objects. The value of the contents attribute is converted to HTML when displayed and can be converted to XeLaTeX (whence PDF) when exported. The form objects referenced are displayed in IGT format in both the HTML and XeLaTeX/PDF outputs.

The forms referenced within the contents of a collection are also relationally associated to their collections, i.e., they constitute the value of a collection’s forms attribute. In the OLD 1.0, one consequence of this is that searches over forms can restrict results to forms that are associated to one or more specified collections. That is, one can search within a collection. OLD collections can also have zero or more file objects associated with them via their files attribute. Finally, as of the OLD 1.0, collections can reference other collections within their contents values; this means that a series of small collections can be combined to create a large one.

Collections are useful tools for creating documents from the data present in an OLD application. Example use cases are representations of elicitation records, narratives, conversations, problem sets, lesson plans, chapters of grammars, and research papers.
A collection representing a record of an elicitation session may, for example, contain the forms elicited during the session, notes by the researcher concerning the relevance of certain forms and ideas for future elicitations, and audio recordings of the entire session. Since collections can be exported as XeLaTeX[142] (see below) they can easily be used as rough drafts for professional documents.

[142] For collections using reStructuredText markup, it is also possible to export into a format that is compatible with the Microsoft Word and word processors. Implementing this type of export is a planned feature.

The OLD 1.0 includes a revised data structure for source objects which is effectively identical to the structure of BibTeX, the bibliographic database format commonly used with (Xe)LaTeX. This means that future versions of the OLD may allow users to cite sources within the contents of collection objects and so generate (XeLaTeX and HTML) documents that contain in-line citations and formatted bibliographies. This would further improve the usefulness of OLD collections in facilitating the speedy generation of professional (draft) documents using the data therein.

Currently, the OLD 0.2 allows users to export collections using a number of XeLaTeX-based formats. XeLaTeX is a typesetting program that allows users to write documents that contain commands which control how the document is to be structured and formatted. These XeLaTeX source files can be used to generate professionally formatted documents in PDF (and DVI) format. The benefit of using XeLaTeX as opposed to a what-you-see-is-what-you-get word processing program like Microsoft Word is that authors can focus on content and, by specifying a few parameters, let the typesetting program handle the formatting in a professional and consistent manner. XeLaTeX is very similar to the more familiar (and older) LaTeX, the primary difference being that XeLaTeX allows for Unicode characters in the source files.
This is convenient for export from OLD applications since developers do not have to worry about writing general algorithms for translating Unicode characters to LaTeX commands.[143]

[143] Since XeLaTeX source files can be typeset as PDF documents via command line utilities, it would be relatively trivial to implement functionality that would directly generate exportable PDF versions of collection objects. This is a planned feature for the OLD 1.0.

The differences between the various XeLaTeX-based collection export formats have to do with which packages are used to format the interlinear glossed text representations of forms. The OLD 0.2 currently allows forms within collections to be IGT-formatted using either the Covington or ExPex packages. In addition to collection export, individual forms as well as search result sets of forms can be exported in a number of formats, including XeLaTeX-based, several plain text-based, and a comma-separated values format that is importable into spreadsheet software like Microsoft Excel.

The OLD 1.0 does not currently implement any export functionality, either for collections or for (lists of) forms. Since all data returned by an OLD 1.0 web service are already structured as JSON objects, the task of generating export documents can easily be delegated to the user-facing applications, such as the browser-based GUI that is currently under development.[144]

Functionality allowing contributors to import structured data into an OLD application is highly desirable but does not yet exist. Of course, since the OLD 1.0 is a web service that exposes a standards-compliant JSON/HTTP-based API, it would be relatively trivial to create command-line programs that take documents containing structured language data as input and use these to perform form create requests on live OLD applications. That said, the in-development GUI for the OLD 1.0 will include functionality that facilitates data import.
This is especially useful for encouraging contributions from users who do not want to use an OLD application exclusively for all of their fieldwork tasks and would instead prefer an easy way to import the structured data produced by their preferred software. Such software might include FLEx, Shoebox, Toolbox, ELAN, LingSync, FileMaker Pro, database, RDBMSs (e.g., MySQL, SQLite, PostgreSQL, Oracle, etc.), spreadsheet programs, etc. The LingSync fieldwork application currently provides a graphical import interface that allows users to interactively map the structure of the file they are attempting to import to the application’s data structure. Such functionality might be emulated by the OLD. Another option is to use comparisons of character sequence frequencies to intelligently guess the structural position appropriate to the components of the import data.[145]

[144] Typesetting XeLaTeX source files is not currently possible using browser-based (i.e., JavaScript-based) technologies. Such functionality would need to be implemented server-side within the OLD 1.0 web service.

2.3.11 Access

Restrictions on access to data in a collaborative linguistic fieldwork application are important from two distinct vantage points. First and foremost are the requirements of the speakers and communities whose knowledge is encoded in the linguistic artifacts. Also important are the requirements of the fieldworkers who help to create the artifacts. This section describes and motivates the features of the OLD that enable administrators and contributors to restrict who has access to OLD language data and what kind of access is granted.

There are a range of reasons that speakers of endangered languages and their communities may have for wanting to restrict access to the data generated by fieldworkers and language consultants.
It often happens that a speaker will speak quite candidly during a recorded elicitation session and may want to restrict access to all or parts of that recording for personal reasons. It also happens that particular stories or descriptions of rituals and cultural practices need to be restricted to just the language community or even to sub-groups within the community. Given the post-colonial context[146] of many fieldwork situations, there is an entirely understandable distrust of the dominant society that may result in speakers or communities adopting a generally cautious (or even zero tolerance) approach when it comes to sharing language data via a web-based application like the OLD. Whatever the particulars of the motivations, fieldworkers must communicate thoroughly and honestly about their intentions for sharing fieldwork data and come to an ethical agreement with speakers and communities concerning policies of data access.

[145] This is an interesting approach which was suggested to me by fellow UBC Linguistics graduate student and fieldwork software developer Patrick Littell. He made effective use of it in a wiki-based application that was used in a UBC field methods course.

[146] It is not unheard of that academic work related to indigenous culture and history can have unintended yet significant political and economic impact on the community concerned. I am thinking here of a case where a friend’s master’s thesis on a First Nation’s traditional fishery management practices became a factor in a court decision concerning modern-day rights to harvesting that resource.

Fieldworkers may also hold various positions on whether and the extent to which they want to share their fieldwork data. In my experience, the primary concern of fieldworkers in the context of a collaborative fieldwork application is that their data not be modified by other contributors without their knowledge.

The OLD provides a number of features which facilitate the creation of restrictions on access to data. The most basic is authentication. That is, only registered users with valid usernames and passwords can gain access to an OLD application. Passwords are encrypted using the Python module PassLib’s implementation of Password-Based Key Derivation Function 2 (PBKDF2) using a salt and the default 20,000 iterations. The passwords are stored encrypted on the server, meaning that nobody (except somebody with the expertise to crack PBKDF2-encrypted passwords) can view the passwords in their unencrypted form. While I make no claim to cryptographic expertise, PBKDF2-based encryption is widely used in modern technologies[147] and it can be safely assumed that the effort required to crack an OLD user’s password would probably exceed the payoff for a would-be hacker. In short, breaking into an OLD application should be prohibitively difficult.

In addition to basic authentication, authorization to perform specified actions is determined by the value of the role attribute of the password-authenticated user. The roles defined by the OLD are administrator, contributor, and viewer. Administrators can do anything. Contributors can do everything except alter system-wide settings, create users, and modify restricted forms (see below). Viewers can only read data and can never create, update, or delete OLD objects. A form can be deleted only by the contributor who entered it or by an administrator. A form may be updated by any contributor or administrator. However, since the system creates a backup of a form whenever it is updated or deleted, previous state is never lost and can be restored.[148]

In addition to the role-based authorization system there is a per-object authorization system. Objects tagged with the special restricted tag are considered restricted. These objects can only be read or modified by administrators and unrestricted users.
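The interaction of the role-based and tag-based rules described above can be sketched as a handful of Python predicates. Everything here is a hedged illustration: the class and function names are hypothetical, the set of unrestricted usernames is passed in as a plain argument, and the treatment of cases the text leaves open (for example, whether a contributor must be unrestricted in order to delete a restricted form that they entered) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    username: str
    role: str  # "administrator", "contributor", or "viewer"


@dataclass
class Form:
    enterer: str  # username of the user who entered the form
    tags: set = field(default_factory=set)


def can_read(user, form, unrestricted_users):
    """Restricted forms are readable only by administrators and unrestricted users."""
    if "restricted" not in form.tags:
        return True
    return user.role == "administrator" or user.username in unrestricted_users


def can_update(user, form, unrestricted_users):
    """Viewers never update; otherwise the restriction rule decides."""
    if user.role == "viewer":
        return False
    return can_read(user, form, unrestricted_users)


def can_delete(user, form, unrestricted_users):
    """Only administrators and the form's own enterer may delete."""
    if user.role == "administrator":
        return True
    return (user.role == "contributor"
            and user.username == form.enterer
            and can_read(user, form, unrestricted_users))


admin = User("admin", "administrator")
alice = User("alice", "contributor")
bob = User("bob", "contributor")
vera = User("vera", "viewer")
unrestricted = {"alice", "vera"}
secret = Form(enterer="alice", tags={"restricted"})

for user in (admin, alice, bob, vera):
    print(user.username,
          can_read(user, secret, unrestricted),
          can_update(user, secret, unrestricted),
          can_delete(user, secret, unrestricted))
```

Checking the restriction first and the role second keeps the two systems independent, so an unrestricted viewer gains read access to restricted objects without gaining any ability to modify them.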
Administrators define the set of unrestricted users (a subset of the contributors and viewers) as an attribute of the active application settings object.

[147] According to Wikipedia, PBKDF2 is used by Mac OS X Mountain Lion, Apple’s iOS, Wi-Fi Protected Access (WPA and WPA2), and Microsoft Windows Data Protection API, among others.

[148] The new GUI for the OLD 1.0 may provide a feature which would inform contributors whenever one of the forms they have entered or elicited has been changed, i.e., something similar to the activity feed of LingSync corpora. Such a feature could prove useful in field methods courses for allowing junior fieldworkers to be notified of their supervisors’ comments on their work.

Taken together, the role-based and restriction-based authorization systems allow some level of nuanced control over who has what level of access to which data. The net effect is an extension in the hierarchy of user classification from administrators with the highest level on through unrestricted contributors, contributors, unrestricted viewers, and finally viewers, who have the lowest level of access.

There are some notable ways in which the authentication and authorization system of the OLD could be improved. Perhaps it should be possible for contributors to tag a subset of the objects that they enter as private and these would be viewable only to their enterers and to administrators. On the other end of the spectrum, it might be desirable to have the ability to tag certain objects as unrestricted and have these accessible even without password-based authentication.

The access restriction-facilitating features of the OLD can only contribute so much to practical solutions to this complex issue. Administrators and contributors need to communicate openly with their language consultants and the relevant communities and have clear agreements and protocols laid out for controlling access.
As a result, it may be that certain data should simply not be entered into an OLD application. Other data may require restricted status. Having protocols in place for handling requests for access will save a lot of headaches.

In certain situations, and when appropriate conventions have been established, it may make sense for contributors to modify the information entered by others. For example, in order to keep analytical information accurate and consistent, another user’s morpheme break, morpheme gloss, tag, or category information may be modified. If such modification would result in changes that the original elicitor does not agree with, then the form can be duplicated prior to modification. The modified duplicate will retain the enterer and elicitor values of the original and the modifier value will indicate which user has made the changes.[149]

Note finally that the frameworks upon which the OLD is built (i.e., Pylons and SQLAlchemy) have built-in support for security against malicious attempts to access the system, e.g., via SQL injection attacks (Gardner, 2008).

2.3.12 Documentation

Good documentation is crucial. Good software that lacks it will, unfortunately but probably, languish. This dissertation is a high-level description of the OLD and an argument for its usefulness as a tool for linguistic fieldwork. However, it is not, in itself, sufficient as documentation.
This section describes the current state of OLD documentation.

The OLD 0.2 applications that are currently being used have in-built user-oriented documentation in the form of an HTML document entitled the OLD User Guide.[150] This document provides an overview of the software, explaining the data structure and the interface for interacting with the system. While still useful for users of that version, the OLD User Guide is out-of-date even for the OLD 0.2.

[149] The OLD 0.2 user interface has a feature which allows users to easily duplicate a form, a feature that will be re-implemented in the OLD 1.0 GUI.

[150] The OLD User Guide can currently be found at However, once the GUI for the OLD 1.0 has been completed, the URL will resolve to a demo OLD 1.0 application that will contain a copy of the new documentation.

Michael McAuliffe, a fellow UBC Linguistics graduate student, took the initiative to create a quick-start manual for the OLD 0.2 entitled the OLD Quick Reference Guide. This succinct guide for new users of OLD 0.2 applications can be found within certain language-specific OLD applications.

Detailed documentation for the OLD 1.0 has been written (Dunham, 2013a).[151] This document, simply entitled OLD Documentation: Release 1.0a1, catalogues and explains the data structure, interface, and application logic of the OLD 1.0. It also provides detailed instructions on how to download and install the software. The code of the OLD is itself extensively documented using standard Python conventions. That is, modules, classes, methods, and functions all contain formatted documentation strings (a.k.a. docstrings). These docstrings are auto-assembled and incorporated into the OLD Documentation document as the final chapter entitled API Documentation. This chapter is useful as a reference for developers wishing to quickly understand implementation details of the system.
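To illustrate the docstring convention just described, here is a small hypothetical function (it is not taken from the OLD source) whose docstring uses reStructuredText field lists, a format that documentation generators can assemble into an API reference. The function name and parameters are assumptions made purely for the example.

```python
import inspect


def count_morphemes(morpheme_break, delimiter="-"):
    """Return the number of morphemes in a morpheme break string.

    :param str morpheme_break: a word segmented into morphemes, e.g. ``"a-b-c"``.
    :param str delimiter: the symbol separating morphemes.
    :returns: the number of delimiter-separated morphemes (0 for an empty string).
    :rtype: int

    """
    if not morpheme_break:
        return 0
    return len(morpheme_break.split(delimiter))


# The docstring is available at run time, which is what lets tools collect it.
summary_line = inspect.getdoc(count_morphemes).splitlines()[0]
```

Because the docstring lives on the function object itself, the same text serves both interactive help and generated documentation.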
Since all of this documentation is written in reStructuredText, it can be (and has been) converted to both HTML and PDF formats.[152] The entire documentation (i.e., both the source reStructuredText documents and the generated HTML and PDF versions) comes packaged with the source, which is available on GitHub and the Python Package Index.

[151] HTML and PDF versions of the OLD 1.0 documentation can be accessed at by clicking on the icon in the bottom right corner labelled “v: latest” and then clicking on “HTML” or “PDF”, respectively.

2.4 Summary

This chapter has described the OLD and argued that it is a useful and innovative tool for accomplishing linguistic fieldwork goals. The case for the OLD can be summarized as follows: it helps diverse groups of fieldworkers to build centralized and web-accessible stores of structured, consistent, exportable, highly searchable, and reusable data. In addition, it is well-documented and has been tested by dozens of fieldworkers against the idiosyncrasies of nine under-documented and/or endangered languages.

The next chapter (3) focuses on the OLD functionality which allows users to implement morphological and phonological models and to build automated analyzers (i.e., parsers) in order to expedite the creation of morphological representations and test analytical models against OLD data sets. This functionality can be viewed both as an additional point in favour of the OLD as a valuable tool for linguistic fieldwork as well as an example of how the insights, formalisms, tools, and methods of computational linguistics can contribute to linguistic fieldwork.

Chapter 3
Morphological Parser Creator

This chapter describes the OLD morphological parser creator, an application added to the core OLD application (as described in chapter 2), which allows users to create morphological parsers using OLD data. The morphological parser creator takes rewrite rules and OLD corpora as input and creates a morphological parser as output.
The parser creator uses finite-state transducers (FSTs) to model the morphophonology and N-gram language models (LMs) to rank candidate parses.

The incorporation of morphological parsers into the OLD contributes to the software’s central goal of facilitating fieldwork by i) expediting morphological analysis creation and ii) automating the testing of linguistic analyses against data sets that, for understudied languages, are relatively large. Chapter 5 presents the results of using the OLD morphological parser creator to build parsers for the Blackfoot language.

This chapter is structured as follows. § 3.1 provides a high-level overview of the architecture of the morphological parsers created by the application. § 3.2 provides the background on finite-state machines and formal language theory for the subsequent sections which explain how to use the application to build phonologies (§ 3.3), morphologies (§ 3.4), and morphophonologies (§ 3.5). Finally, § 3.6 reviews N-gram LMs and explains how to use the OLD morphological parser creator to build parse candidate rankers on top of these.

3.1 Architecture

The term morphological parser here refers to a computer program that, when given a transcription (phonetic or orthographic) of a word, returns a representation of its morphological analysis. This section provides a high-level overview of the components of parsers created by the OLD morphological parser creator.

Figure 3.1: Schema of an OLD morphological parser. (Diagram labels: surface transcription; Morphophonology FST, comprising Phonology FST and Morphology FST (lexicon & rules); set of candidate parses; Parse Disambiguator, N-gram language model; most probable parse.)

Figure 3.1 schematizes the architecture of an OLD parser.
The two highest-level components are the FST that implements the morphophonology and the N-gram LM that serves as the parse disambiguator.[153]

The morphophonology takes a transcription of a word as input and returns a set of sequences of morphemes and delimiters, i.e., candidate morphological analyses. A morphophonology is the composition of a phonology FST and a morphology FST. A phonology analyzes a surface (i.e., phonetic or orthographic) representation and returns a set of sequences of morphemes. A morphology filters out those morpheme sequences that are incompatible with a given lexicon and a given set of morphotactic rules. The parse disambiguator returns the most probable parse from the set of candidates returned by the morphophonology.

[153] I use ranker and disambiguator interchangeably for this component. The N-gram LM ranks a candidate set by assigning a probability to each parse candidate. By selecting the most probable parse we disambiguate the candidate set.

An FST is a formal model that encodes a regular relation, i.e., a set of ordered pairs of strings where the first element of each pair is a member of a given regular language and the second element is a member of another regular language. Phonological transformations and morphotactically valid sequences of morphemes can both be expressed as regular relations and, as a result, as FSTs (Johnson, 1972). In addition, any sequence of FSTs can be composed to form a single FST that performs the same mapping as serial application based on the original sequence (Schützenberger, 1961).
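The architecture described above can be sketched with finite toy relations standing in for real FSTs. Everything in the following example is an assumption made for illustration: the three-pair "phonology", the two-item "morphology", and the bigram probabilities are invented, and relations are plain Python sets of (upper, lower) pairs rather than compiled foma networks. The point is only to show that analysis with the composed relation matches serial application, and that the disambiguator picks the highest-probability candidate.

```python
def compose(upper_rel, lower_rel):
    """Relation composition: (x, z) whenever (x, y) is in upper_rel and (y, z) in lower_rel."""
    return {(x, z) for (x, y) in upper_rel for (y2, z) in lower_rel if y == y2}


def apply_up(relation, surface):
    """Analysis: every upper-side string paired with the given lower-side string."""
    return {upper for (upper, lower) in relation if lower == surface}


# Toy phonology (underlying, surface): the only "rule" deletes the delimiter "-".
PHONOLOGY = {("ab-c", "abc"), ("a-bc", "abc"), ("abc", "abc")}
# Toy morphology: identity relation over the sequences licensed by lexicon and morphotactics.
MORPHOLOGY = {(m, m) for m in {"ab-c", "a-bc"}}
MORPHOPHONOLOGY = compose(MORPHOLOGY, PHONOLOGY)

# Invented bigram probabilities over morphemes, with word-boundary symbols.
BIGRAMS = {
    ("<s>", "ab"): 0.6, ("ab", "c"): 0.5, ("c", "</s>"): 0.9,
    ("<s>", "a"): 0.4, ("a", "bc"): 0.1, ("bc", "</s>"): 0.9,
}


def probability(parse, floor=1e-6):
    """Bigram probability of a morpheme sequence; unseen bigrams get a small floor value."""
    morphemes = ["<s>"] + parse.split("-") + ["</s>"]
    result = 1.0
    for pair in zip(morphemes, morphemes[1:]):
        result *= BIGRAMS.get(pair, floor)
    return result


def parse_word(surface):
    """Return the most probable candidate analysis, or None if there are no candidates."""
    candidates = apply_up(MORPHOPHONOLOGY, surface)
    return max(candidates, key=probability) if candidates else None
```

With these toy values, `parse_word("abc")` considers the candidates `ab-c` and `a-bc`; the unanalysed `abc` is filtered out because the toy morphology does not license it.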
Since FST-based analysis and generation are computationally tractable transformations on strings, and since there are efficient algorithms and implementations for compiling CS phonological rewrite rules to FSTs (Karttunen et al., 1992), it is easy for linguists with some minimal training in rule-based phonology to specify phonological and morphological components that can be used to create computer programs that efficiently parse surface transcriptions to sequences of morphemes.

N-gram LMs, i.e., the tools used to disambiguate the outputs of morphophonological analysis (i.e., select a best candidate from multiple), are data structures that can be used to assign probabilities to sequences of morphemes, i.e., morphologically analyzed words (cf. Manning and Schütze, 1999).

By creating interfaces to the rule-to-FST compiler foma (Hulden, 2012) and the N-gram LM estimator MITLM (MITLM, 2013), and by integrating fieldwork data sets into these components, the OLD morphological parser creator facilitates the task of constructing fully functional morphological parsers. This involves, minimally, writing a phonology script as a sequence of CS rewrite rules and specifying three corpora: one for extracting a lexicon, another for extracting morphotactic rules, and a final one for calculating N-gram probabilities.

This approach to creating morphological parsers is justified within the context of the OLD because it leaves the fieldworker in control of the phonology and the morphology.
Since fieldworkers tend to be experts (or at least sufficiently knowledgeable) in these domains, such control allows for the creation of parsers that are both practical fieldwork tools (insofar as they expedite the creation of morphological analyses) and effective research tools (insofar as they allow for specific phonological and morphological models to be implemented computationally and quickly tested against rare data sets).

The OLD morphological parser creator illustrates one particular type of approach to creating morphological parsers. While it does, as just mentioned, have the benefit of giving linguists fine-grained control over the underlying models, other approaches (e.g., Gildea and Jurafsky (1996); Goldsmith (2001)) could result in more robust and better performing parsers. Some deficiencies of the current approach include the fact that the morphology component (§ 3.4) cannot handle unseen category sequences or unseen morphemes.[154] The parser creator application described here should be viewed as a module which has been added to the core functionality of the OLD. Additional morphological parser creation modules may, in the future, be developed and added to the OLD in order to complement and/or improve on the functionality described here.[155]

[154] It should also be made clear that since the morphological parser creator is part of the OLD version 1.0 and since that version currently lacks a production-ready GUI, there is no GUI for the parser creator or for the parsers created by means of it. This means that parsers created by this application cannot currently be used to suggest parses (or ordered lists of candidate parses) to users of an OLD application during data entry. Such interface-integrated parse suggestions are a crucial ingredient in making these parsers useful to fieldworkers and this functionality is a major planned feature for the GUI for the OLD 1.0 that is currently under development.

[155] For the sake of brevity, I sometimes refer to components of the parsers created using the OLD morphological parser creator as simply OLD phonologies, OLD morphologies, etc. These should be understood as shorthands, e.g., for phonology created using the OLD morphological parser creator, etc. and should not be taken to imply the preclusion of other types of parsers or parser components within the OLD.

3.2 Finite-state machines & grammars

This section provides an overview of finite-state machines (i.e., automata and transducers), formal grammars, and the languages and relations that they describe. It may prove a useful reference for the sections on OLD phonologies (§ 3.3), morphologies (§ 3.4), and morphophonologies (§ 3.5) that follow.

A regular grammar is a set of rules for generating a regular language, i.e., a set of strings. Formally, a regular grammar is a four-tuple (N, Σ, P, S), where N is a finite set of non-terminal symbols, Σ a finite set of terminal symbols (including the empty string ε), P a finite set of production rules, and S ∈ N the unique start symbol. The production rules of a regular grammar are of the form N → Σ*N, i.e., rules where a non-terminal symbol expands to a string of zero or more terminal symbols followed by a non-terminal (cf. Roche and Schabes, 1997).[156]

The regular grammar in (22) generates the (regular) language consisting of the set of strings that begin and end with a and have zero or more bs in the middle, as represented in (23).

(22) S → aBa
     B → bB
     B → ε

(23) {aa, aba, abba, abbba, abbbba, . . .}

The regular language generated by the regular grammar (22) can also be expressed by the regular expression[157] in (24).

(24) a b* a

All and only the strings generated by (22, 24) will be recognized by the FSA depicted in the network diagram in Figure 3.2.
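The claimed equivalence between the grammar, the regular expression, and the FSA can be checked mechanically for short strings. The sketch below is illustrative only and rests on several assumptions: the rules of (22) are taken to be S → aBa, B → bB, and B → ε; the foma expression `a b* a` is rendered as the Python regular expression `ab*a`; and the FSA is assumed to have transitions 0 –a→ 1, 1 –b→ 1, and 1 –a→ 2, with state 2 final.

```python
import re
from itertools import product

# Grammar (22) as assumed here; the empty string plays the role of ε.
RULES = {"S": ["aBa"], "B": ["bB", ""]}


def derive(max_len):
    """All terminal strings of at most max_len symbols derivable from S."""
    results, agenda = set(), ["S"]
    while agenda:
        form = agenda.pop()
        positions = [i for i, sym in enumerate(form) if sym in RULES]
        if not positions:
            results.add(form)  # no non-terminals left: a generated string
            continue
        i = positions[0]
        for rhs in RULES[form[i]]:
            new_form = form[:i] + rhs + form[i + 1:]
            # Terminals are never deleted, so over-long forms can be pruned.
            if sum(sym not in RULES for sym in new_form) <= max_len:
                agenda.append(new_form)
    return results


# Assumed transition table for the FSA of Figure 3.2.
DELTA = {(0, "a"): 1, (1, "b"): 1, (1, "a"): 2}
FINAL_STATES = {2}


def fsa_accepts(string):
    """Follow one arc per symbol; accept if the string ends in a final state."""
    state = 0
    for symbol in string:
        state = DELTA.get((state, symbol))
        if state is None:
            return False
    return state in FINAL_STATES


ALL_STRINGS = ["".join(p) for n in range(6) for p in product("ab", repeat=n)]
BY_REGEX = {s for s in ALL_STRINGS if re.fullmatch(r"ab*a", s)}
BY_FSA = {s for s in ALL_STRINGS if fsa_accepts(s)}
```

For strings over {a, b} of up to five symbols, `derive(5)`, `BY_REGEX`, and `BY_FSA` all come out as the same four-member set, the first four strings listed in (23).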
In these network diagrams, states are represented by circles and transitions by directed arcs. State 0 is the unique start state and all states with double borders are final states.[158]

[156] Technically, the restriction that production rules be of the form N → Σ*N results in a right regular grammar. Grammars where all production rules are of the form N → NΣ* are left regular grammars.

[157] The regular expression syntax in (24) is that accepted by foma (and XFST). It is similar but not identical to that accepted by the regular expression parsers included in many programming languages. The primary difference has to do with the treatment of whitespace. In the foma syntax, a sequence of adjacent characters is parsed as a single symbol while in other regular expression parsers this is not the case. As a result, the foma regular expression a b* a is equivalent to the Python regular expression ab*a while foma ab*a is equivalent to Python (ab)*a.

[158] These network diagrams are created using foma.

Figure 3.2: FSA network diagram for a b* a.

An FSA can be thought of as a recognizer or characteristic function on strings. In order to use an FSA network diagram to recognize a string, begin in q0 (i.e., the initial state) and at the left edge of the string to be recognized. Move from state to state by following arcs and consuming from the string one of the symbols labelled on the arc. Continue this process of transitioning between states until the string has been completely consumed or no further transitions are possible. If the string has been consumed and a final state has been reached, then the FSA recognizes the string. In order to generate strings, begin at q0 with an initially empty string. Then follow an arbitrarily chosen arc and suffix one of its labelled symbols to the string.
Continue this process an arbitrary number of times and stop on a final state.

Formally, a (deterministic[159]) FSA is a 5-tuple (Q, Σ, δ, q0, F), where Q is a finite set of states, Σ is a finite set of symbols, δ is the transition function (from ordered state-symbol pairs to states, i.e., Q × Σ → Q), q0 is the start state, and F is the set of accept states (cf. Hulden, 2009; Beesley and Karttunen, 2003). The FSA represented by Figure 3.2 is given in (25).

[159] A deterministic FSA is one wherein “no state has more than one outgoing arc with the same label” (Beesley and Karttunen, 2003, p. 75).

(25) ({q0, q1, q2}, {a, b}, {(q0, a) → q1, (q1, a) → q2, (q1, b) → q1}, q0, {q2})

Now consider the regular relation illustrated in (26), i.e., the infinite set of ordered pairs where the first element is a string from the universal language[160] and the second is that same string with every a replaced by a b, except when the a is preceded by a b.

(26) {(pineapple, pinebpple), (banana, banbnb), (plum, plum), . . .}

If R is the regular relation expressed by an FST, then let dom be the domain of R, i.e., dom = {x | ∃y : (x, y) ∈ R}, and let ran be its range, i.e., ran = {y | ∃x : (x, y) ∈ R}. Generation (a.k.a. downward application) is the process of providing a string s ∈ dom as input and receiving as output {y | (s, y) ∈ R}. The converse operation is analysis (a.k.a. upward application) which involves providing as input s ∈ ran and receiving as output {x | (x, s) ∈ R} (cf. Beesley and Karttunen, 2003).[161]

Using the foma syntax,[162] the regular relation in (26) can be expressed by the rewrite rule in (27).

[160] The universal language is the set of all finite strings and is expressible in foma regular expression syntax as ?* and in conventional regular expression syntax as .*.

[161] In general when working with FST jargon, it is useful to simply memorize the association of left, upper, and underlying on the one hand and right, lower, and surface on the other. That is, the left-hand side of an ordered pair (a, b) is the upper side and analysis (a.k.a. parsing, upward application, or lookup) is the process of mapping a string from the right-hand/lower side to one from the left-hand/upper side.

[162] What is here termed the foma syntax was actually first developed for the Xerox Finite-State Transducer program, i.e., xfst. Kaplan and Kay (1994) provides an algorithm for compiling CS phonological rewrite rules to FSTs.

(27) a -> b || \b _

The notation of (27) is very similar to the CS rewrite rules of Chomsky and Halle (1968) and subsequent work in rule-based phonology. Such rules are of the form α → β | δ __ γ, i.e., α is rewritten as β whenever it follows δ and precedes γ, where α, β, δ, and γ are arbitrarily complex strings that may be empty. In the foma rewrite rule syntax, the functional symbols are slightly different: → is ->, | is ||, and __ is _. In addition, α, β, δ, and γ are regular expressions in the foma syntax. For example, the left-hand side context of (27) is the regular expression \b which denotes the “term complement language” (Beesley and Karttunen, 2003, ch. 2) of b, i.e., the set of all single-character strings minus b.

That the transformations expressed by phonological CS rewrite rules can also be expressed as regular relations is an important finding. Since any regular relation can be expressed as an FST and since an FST can be used to efficiently map strings from the relation’s domain to its range (i.e., generation) or vice versa (i.e., analysis), the finding means that computationally tractable parsers and generators can, in principle, be created on the basis of CS phonological rewrite rules. This is a surprising result since unrestricted CS grammars are more expressive (i.e., can be used to generate a wider range of languages) than the regular grammars that generate the domains and ranges of regular relations.
However, Johnson (1972) showed that phonological rewrite rules always tacitly assume that the site of application moves left or right after each application and that this results in the equivalence with regular relations. For example, the rule i → ii | s __ will generate sii from si but it will not generate siii or siiii, etc. from that same input (cf. Beesley and Karttunen, 2003, ch. 5).

An FST that implements the regular relation (26) is depicted in the network diagram in Figure 3.3. The labels on the arcs are ordered pairs of strings: a and b are shorthand for (a, a) and (b, b), respectively; <a:b> is a notational variant of (a, b); and @ is shorthand for {(e, e) : e ∉ {a, b}}.

Figure 3.3: Network diagram for a -> b || \b _.

Thus, to perform downward application (i.e., generation) on bana using Figure 3.3, begin in q0, read b, write b, and return to q0. Then read a, write a, and move to q1. Then read n, write n, and stay in q1. Then read a, write b, and stay in q1. The result is banb.

In the interfaces of foma and XFST, the command apply down (or just down) is used to generate and the command apply up (or just up) is used to analyze (or parse). Using the example FST diagrammed in Figure 3.3, apply down banana will yield {banbnb} while apply up banbnb will yield {banbnb, banbna, bananb, banana}. The apply up and apply down terminology is used in the interfaces to the parser-related resources that are exposed by the OLD.

The formal definition of an FST is a 5-tuple (Q, Σ, q0, δ, F), where Q is a finite set of states, Σ is a finite alphabet, q0 ∈ Q is the initial state, δ is a partial mapping from Q × (Σ ∪ {ε}) to Q × (Σ ∪ {ε}), and F ⊆ Q is the set of final states (cf. Hulden, 2009; Beesley and Karttunen, 2003).
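The downward and upward application procedures walked through above can be reproduced with a small hand-written simulation of the two-state FST for a -> b || \b _. This is a sketch for illustration, not foma's implementation: the arc list below is an assumed encoding of Figure 3.3 as (source, upper, lower, target) tuples, with `@` standing for any symbol other than a or b mapped to itself, and the helper names are hypothetical.

```python
OTHER = "@"  # any symbol other than a or b, mapped to itself
ALPHABET = {"a", "b"}

# Assumed encoding of the arcs of Figure 3.3: (source, upper, lower, target).
ARCS = [
    ("q0", "b", "b", "q0"),
    ("q0", "a", "a", "q1"),
    ("q0", OTHER, OTHER, "q1"),
    ("q1", OTHER, OTHER, "q1"),
    ("q1", "a", "b", "q1"),
    ("q1", "b", "b", "q0"),
]
START, FINAL_STATES = "q0", {"q0", "q1"}


def _label(symbol):
    """Map a concrete symbol to the arc label that can match it."""
    return symbol if symbol in ALPHABET else OTHER


def _apply(string, down):
    """Track every (state, output so far) path while consuming the string."""
    paths = {(START, "")}
    for symbol in string:
        next_paths = set()
        for state, output in paths:
            for source, upper, lower, target in ARCS:
                read, write = (upper, lower) if down else (lower, upper)
                if source == state and read == _label(symbol):
                    written = symbol if write == OTHER else write
                    next_paths.add((target, output + written))
        paths = next_paths
    return {output for state, output in paths if state in FINAL_STATES}


def apply_down(string):
    """Generation: read the upper side, write the lower side."""
    return _apply(string, down=True)


def apply_up(string):
    """Analysis: read the lower side, write the upper side."""
    return _apply(string, down=False)
```

Under this encoding, `apply_down("banana")` returns `{"banbnb"}` and `apply_up("banbnb")` returns the four strings listed in the text, since each surface b that follows n could correspond to either an underlying a or an underlying b.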
The FST depicted in Figure 3.3 is defined in (28).[163]

(28) ({q0, q1}, {a, b}, q0, {(q0, b) → (q0, b), (q0, a) → (q1, a), (q0, @) → (q1, @), (q1, @) → (q1, @), (q1, a) → (q1, b), (q1, b) → (q0, b)}, {q0, q1})

[163] Here @ represents any symbol not present in Σ, i.e., @ = {x | x ∉ Σ}. In mappings, it represents the same symbol on both sides of the arrow. That is, @ = {(x, x) | x ∉ Σ}.

More sophisticated and linguistically relevant foma rewrite rules (and their equivalent FSTs) are discussed in the sections that follow and, in particular, in the exposition of a phonology FST for Blackfoot given in § 5.2.1. For further reference on FSTs, their specification via CS rewrite rules as in (27), and practical examples related to natural language morphology and phonology, see Beesley and Karttunen (2003). While that text assumes the proprietary X(erox) FST software, the interface exposed by foma is nearly identical and most of the examples can be tested and explored using this open source and freely available alternative.

3.3 Phonologies

An OLD phonology is a mapping between underlying and surface representations. The practical purpose of a phonology within the context of a parser built using the OLD morphological parser creator is to analyze phonetic or orthographic transcription values to morpheme break values.[164] That is, a phonology may encode either canonical phonological transformations (i.e., from phonemes to phones) or spelling rules.

An OLD phonology is an ordered set of rewrite rules that is specified via the foma regular expression/rewrite rule syntax introduced in § 3.2 (cf. Hulden, 2012; Beesley and Karttunen, 2003).[165] This syntax accepts statements of the form provided in (29), i.e., the define keyword, followed by the name of the FST to be defined, followed by the FST definition in CS rewrite rule format, and terminated by a semicolon.

(29) define <name> <FST-definition> ;

The only requirement imposed by the OLD is that the phonology script define an FST named phonology.
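As a rough illustration of this requirement, the following sketch holds a two-statement script in the form of (29) as a Python string and checks, with a deliberately naive regular expression, that it defines an FST named phonology. The script reuses the rule in (27); the name `aToB` and both helper functions are hypothetical, and the check is an assumption about how such a requirement might be tested rather than a description of what the OLD actually does.

```python
import re

# A toy phonology script following the form in (29). The rule name aToB is
# invented; the rule itself is the one given in (27).
SCRIPT = r"""
define aToB a -> b || \b _ ;
define phonology aToB ;
"""


def defined_names(script):
    """Names introduced by define statements, in order of appearance (naive scan)."""
    return re.findall(r"^\s*define\s+(\S+)", script, flags=re.MULTILINE)


def defines_phonology(script):
    """True if the script contains a define statement for an FST named phonology."""
    return "phonology" in defined_names(script)
```

A line-based scan like this ignores comments and multi-line statements, so it is a sanity check on the script text at most; only compiling the script with foma shows whether the definitions are well formed.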
Typically, the definition of the phonology FST will be the composition of an ordered list of previously defined FSTs representing phonological rules.

Consider breaking, a Blackfoot phonological rule defined in Frantz (1991) and repeated in (30). Since the /kI/ sequence never, to my knowledge, occurs morpheme-internally,[166] we can rewrite this rule as a transformation on the morpheme delimiter “-”, as in (31).

[164] Actually, as discussed below, the underlying representation that is the input to phonological generation or the output of phonological analysis can also be a string representation of sequences of richly represented morphemes. This means that certain OLD phonologies can be viewed as analyzing transcription values to break-gloss-category values.

[165] A reference document for foma regular expressions is available at (1971, 1978) and, to a lesser extent, Frantz (1991) use the capital “I” to denote a phoneme with identical features to /i/, i.e., a high front unrounded vowel, but which causes breaking in the immediately p