UBC Theses and Dissertations

Detecting dementia from written and spoken language (Masrani, Vaden, 2018)
Detecting Dementia from Written and Spoken Language

by

Vaden Masrani

BSc., The University of British Columbia, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

December 2017

© Vaden Masrani, 2017

Abstract

This thesis makes three main contributions to existing work on the automatic detection of dementia from language. First, we introduce a new set of biologically motivated spatial neglect features, and show that their inclusion achieves a new state of the art in classifying Alzheimer's disease (AD) from recordings of patients undergoing the Boston Diagnostic Aphasia Examination. Second, we demonstrate how a simple domain adaptation algorithm can be used to leverage AD data to improve classification of mild cognitive impairment (MCI), a condition characterized by a slight-but-noticeable decline in cognition that does not meet the criteria for dementia, and one for which reliable data is scarce. Third, we investigate whether dementia can be detected from written rather than spoken language, and show that a range of classifiers achieve performance far above baseline. Additionally, we create a new corpus of blog posts written by authors with and without dementia and make it publicly available for future researchers.

Lay Summary

Difficulty producing language is a well-known sign of early-onset dementia. This has led to recent attempts to create non-invasive diagnostic tools that detect dementia from samples of a patient's language. Our work makes three main contributions to this effort. First, we suggest a new set of biologically motivated "spatial neglect" features that improve our ability to detect Alzheimer's disease from recordings of patients undergoing standard diagnostic exams. Second, we demonstrate how to use Alzheimer's data to detect mild cognitive impairment, a condition for which reliable data is scarce.
Last, we investigate whether dementia can be detected from written language, a more difficult task than using spoken language because writers are able to make revisions to the text. We develop a new blog post data set and show our system is able to correctly classify posts at a rate far above baseline.

Preface

All of the work presented henceforth was conducted in the Laboratory for Computational Intelligence in the Department of Computer Science at the University of British Columbia (Point Grey campus), in collaboration with Dr. Thalia Field at the UBC Faculty of Medicine and Dr. Gabriel Murray at the University of the Fraser Valley. I was the lead researcher, responsible for the coding, data preprocessing and analysis, plots, concept formation, and first drafts of the manuscripts. Dr. Giuseppe Carenini, Dr. Thalia Field and Dr. Gabriel Murray were responsible for concept formation, draft edits, interpreting the results, and suggestions for improvement. This work originally began as a class project in collaboration with Halldor Thorhallsson and Jacob Chen, both of whom contributed to the feature extraction code in Chapter 3.

Three publications came from this work. The results from Improving Diagnostic Accuracy of Alzheimer's Disease from Speech Analysis Using Markers of Hemispatial Neglect [25] appear in Chapter 4, the results from Domain Adaptation for Detecting Mild Cognitive Impairment [51] in Chapters 4 and 5, and the results from Detecting Dementia through Retrospective Analysis of Routine Blog Posts by Bloggers with Dementia in Chapter 6. The central findings from each publication appear here, while the plots have all been expanded with more models and metrics for consistency across chapters. The "we" I use throughout this work refers to myself, Giuseppe Carenini, Thalia Field, and Gabriel Murray, the authors of the above publications.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 Contributions
    1.1.1 A Novel Feature Set: Spatial Neglect
    1.1.2 Domain Adaptation: Using Alzheimer's Data To Diagnose Mild Cognitive Impairment
    1.1.3 Written Language: A New Corpus And Demonstration Of Viability
  1.2 Reproducibility
  1.3 Thesis Overview

2 Background
  2.1 Medical Overview
    2.1.1 Alzheimer's Disease And Other Dementias
    2.1.2 Mild Cognitive Impairment
  2.2 Automatic Detection Of Dementia
    2.2.1 Spoken Language
    2.2.2 Written Language

3 Methodology
  3.1 Data Set
    3.1.1 DementiaBank
  3.2 Features
    3.2.1 Parts-Of-Speech (15)
    3.2.2 Context-Free-Grammar Rules (44)
    3.2.3 Syntactic Complexity (27)
    3.2.4 Vocabulary Richness (4)
    3.2.5 Psycholinguistic (5)
    3.2.6 Repetitiveness (5)
    3.2.7 Information Units (info-units) (40)
    3.2.8 Acoustic (172)
  3.3 Feature Selection
  3.4 Models
  3.5 Evaluation

4 Evaluating Novel Feature Sets
  4.1 Spatial Neglect
    4.1.1 Spatial Partitions
    4.1.2 Spatial Neglect Features
  4.2 Discourse Features
    4.2.1 Discourse Parser: CODRA
    4.2.2 Discourse Features
  4.3 Experimental Design
  4.4 Results
    4.4.1 Baseline Classification Performance
    4.4.2 Classification Performance With Novel Feature Sets
  4.5 Discussion

5 Detecting Mild Cognitive Impairment with Domain Adaptation
  5.1 Domain Adaptation
    5.1.1 AUGMENT
    5.1.2 CORAL
  5.2 Data Set
  5.3 Baseline, Experiments, Results
  5.4 Discussion

6 Detecting Dementia From Written Text
  6.1 Data Set
  6.2 Experimental Design
  6.3 Results
  6.4 Discussion

7 Conclusion
  7.1 Future Work
    7.1.1 Spoken
    7.1.2 Written

Bibliography

A Supporting Materials

List of Tables

Table 2.1: Speech and language impairments in the individual types of dementia. Table replicated from Klimova and Kuca [41].
Table 2.2: Major types of dementia and their characteristics. Table replicated from Kumar et al. [43].
Table 3.1: Demographics of the DementiaBank data set.
Table 3.2: A list of info-units and their synonyms.
Table 3.3: Models and their hyperparameters.
Table 4.1: List of info-units within each division.
Table 6.1: Blog information as of April 4th, 2017.
Table A.1: List of all features.

List of Figures

Figure 3.1: Processing pipeline from clinical interview to evaluation. We perform a 10-fold cross validation for the evaluation stage.
Experiments use either the blog data set (left) or the DementiaBank data set (right), but not both.

Figure 3.2: Cookie Theft picture from the Boston Diagnostic Aphasia Examination.

Figure 3.3: Manually transcribed sample response from a patient undergoing the Cookie Theft picture test.

Figure 4.1: Left: a clock drawn by a patient with left-side spatial neglect. Right: eye movements of a patient with left-side spatial neglect. The patient was asked to search for the letter T among Ls. Red dots are fixations and yellow lines are saccadic movements between fixations. Images from Husain [34].

Figure 4.2: We divide the Cookie Theft image into halves (red), strips (blue), and quadrants (green), and create sets of info-units within each division. For example, the "girl" info-unit is in the left half, far-left strip, and SW and NW quadrants.

Figure 4.3: Discourse tree for the two sentences "But he added: 'Some people use the purchasers' index as a leading indicator, and some use it as a coincident indicator. But the thing it's supposed to measure - manufacturing strength - is missed altogether last month.'" Each sentence contains three Elementary Discourse Units (EDUs). EDUs correspond to leaves of the tree and discourse relations correspond to edges. (Figure adapted from [38].)

Figure 4.4: F-measure for different models as we vary the number of features included. The dark line shows the mean F-measure across each of the 10 folds and 90% CIs are shown in the shaded regions. Features are added in decreasing order of their absolute correlation with the labels in the training fold.
Most models reach their maximum performance between 35-50 features and then decline as more features are included. This shows the need to include a feature selection step before training each model.

Figure 4.5: We show mean F-measure, accuracy, and Area Under the Curve (AUC) for each model at their optimum number of features (i.e. the peak performance in Figure 4.4). Error bars for each model show 90% CIs across all 10 folds. Logistic regression performs best (ACC: 0.822, 90% CI = 0.795-0.848; AUC: 0.894, 90% CI = 0.867-0.921; FMS: 0.824, 90% CI = 0.798-0.850) and has the tightest error bars across all models.

Figure 4.6: The mean change in performance across models when a feature group is removed and the model is retrained. A greater decrease in performance indicates a more significant feature group. The number of features within each group is listed in parentheses after each group name. The Acoustic, Demographic, Parts of Speech and Information Content groups are important, while Syntactic Complexity, Psycholinguistic and Vocabulary Richness are not. Large error bars indicate that the change in performance varies quite significantly between folds.

Figure 4.7: Feature importance score is calculated by Equation 4.5. A score of 1.0 indicates the feature was selected first in all 10 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Feature ranking does not depend on any particular model and is based only on the correlation between the feature and the binary labels. Mean word length, age, and noun phrase to personal pronoun are the highest scoring features on the DementiaBank data set.

Figure 4.8: For each of the new feature sets we show the mean F-measure across five models.
We compare against 'none', which is the performance of the existing system without the new feature set. Halves improves the best model, logistic regression, from 0.824 (90% CI = 0.798-0.850) to 0.846 (90% CI = 0.813-0.878). Strips improves logistic regression as well, to 0.833 (90% CI = 0.801-0.866), although not as much as halves. Quarters and discourse have a negligible effect on the performance of the best classifier.

Figure 4.9: For each of the new feature sets we show the change in mean F-measure across five models when the new feature set is added. While halves improves the performance of the best classifier (logistic regression), it has mixed results on the suboptimal classifiers. Large error bars indicate that the change in performance varies quite drastically between folds. Discourse features have no effect.

Figure 4.10: Feature importance score is calculated as shown in Equation 4.5 with the addition of the halves features. A score of 1.0 indicates the feature was selected first in all 10 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Perception: Rightside receives an almost perfect score, scoring more highly than mean word length, age, and noun phrase to personal pronoun from Figure 4.7. Three other halves features, Concentration: Rightside, Attention: Rightside and Perception: Leftside, also score highly.

Figure 4.11: Box plots of the top four features from Figure 4.10. Top left shows right-side perceptivity, top right shows age, bottom left shows noun phrase to personal pronoun (a measure of how often the patient uses personal pronouns), and bottom right shows mean word length. Those with dementia are less perceptive on the right side of their visual field than controls, as well as being older and more likely to use personal pronouns and shorter words.

Figure 5.1: The CORAL algorithm is shown in three steps.
The target and source data sets consist of three features: x, y, z. In (a) the source data and target data are normalized to unit variance and zero mean, but have different covariance structures. In (b) the source data is whitened to remove the correlations between features. In (c) the source data is recoloured with the target domain's correlations and the two data sets are aligned. A classifier is then trained on the re-aligned source data. (Figure adapted from [72].)

Figure 5.2: Comparison of two domain adaptation methods, AUGMENT and CORAL, against three domain adaptation baselines and one model baseline (a dummy classifier which predicts the majority class in the training fold). Mean F-measure and 90% CIs are shown across 10 folds. Only target data appears in the test fold. AUGMENT with logistic regression outperforms all baselines. CORAL doesn't improve either model above the majority class baseline.

Figure 5.3: Performance of two domain adaptation methods, AUGMENT and CORAL, on classifiers that do not learn a weight vector. AUGMENT does poorly in this setting because the models are unable to choose between the "target only", "source only" or "both" versions of each feature.

Figure 6.1: We show Area Under the Curve (AUC) for each model as we vary the number of features. Error bars for each model show 90% CIs across all 9 folds. We use two plots so error bars are distinguishable. All models beat the dummy classifier (majority class), with K-Nearest Neighbours (KNN) achieving the best performance (Accuracy (ACC): 0.728, 90% CI = 0.687-0.769; AUC: 0.761, 90% CI = 0.714-0.807; F-measure (FMS): 0.785, 90% CI = 0.746-0.823).

Figure 6.2: We show mean Accuracy (ACC), F-measure (FMS) and Area Under the Curve (AUC) for each model at their optimum number of features (i.e. the peak performance in Figure 6.1).
Error bars for each model show 90% CIs across all 9 folds. All models beat the dummy classifier (majority class), with KNN achieving the best performance (ACC: 0.728, 90% CI = 0.687-0.769; AUC: 0.761, 90% CI = 0.714-0.807; FMS: 0.785, 90% CI = 0.746-0.823).

Figure 6.3: As with Figure 4.6, we show the mean change in performance across models when a feature group is removed and the model is retrained. A greater decrease in performance indicates a more significant feature group. The number of features within each group is listed in parentheses after each group name. Unlike with the DementiaBank data set, all feature groups are important to the prediction accuracy, with the removal of the psycholinguistic group having the greatest deleterious effect across all models.

Figure 6.4: Feature importance score for the blog data set, as calculated by Equation 4.5. A score of 1.0 indicates the feature was selected first in all 9 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Feature ranking does not depend on any particular model and is based only on the correlation between the feature and the binary labels. SUBTL word score, number of sentences, mean word length, and noun phrase to personal pronoun are the highest scoring features on the data set.

Figure 6.5: Box plots of the four highest scoring features in Figure 6.4: SUBTL word score top left, mean word length top right, noun phrase to personal pronoun bottom left, number of sentences bottom right. Blogs written by persons with dementia are red and controls are blue. As in the spoken case, persons with dementia use personal pronouns more often and use smaller words on average. Bloggers with dementia also have a higher SUBTL score (indicating an impoverished vocabulary) and write shorter posts.
Figure A.1: Plot showing the performance of the halves feature set without the quadratic terms. The performance of Random Forest and Gaussian Naive Bayes is not hurt in this case as it is in Figure 4.9. The performance of logistic regression also decreases without the quadratic terms.

Figure A.2: Accuracy of models with new feature sets.

Figure A.3: Change in accuracy of models with new feature sets.

Figure A.4: AUC of models with new feature sets.

Figure A.5: Change in AUC of models with new feature sets.

Glossary

ACC: Accuracy
ADOD: Alzheimer's disease and related dementias
AD: Alzheimer's disease
ASR: automatic speech recognition
AUC: Area Under the Curve
BDAE: Boston Diagnostic Aphasia Examination
CT: Computed Tomography
DSM-5: Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition
EDU: Elementary Discourse Unit
FMS: F-measure
KNN: K-Nearest Neighbours
MCI: mild cognitive impairment
MFCC: Mel-frequency Cepstral Coefficient
ML: machine learning
MMSE: Mini Mental State Examination
MNCD: Mild Neurocognitive Disorder
MRI: Magnetic Resonance Imaging
NLP: Natural Language Processing
OPTIMA: Oxford Project to Investigate Memory and Aging
SVM: Support Vector Machine

Acknowledgments

Only convention prevents me from listing multiple names on the title page, as this thesis is not the product of one person. Having Giuseppe Carenini as a supervisor felt like cheating; he was generous with his time and with his experience, and was sincerely interested in seeing his students succeed. It is rare to find such kindness coupled with technical expertise, and I can only hope to emulate those qualities in the future.

I must also express my gratitude to Thalia Field and Gabriel Murray, who both played pivotal roles in this research. Thalia, for your vast medical knowledge and for your careful comments on the multiple drafts I sent your way.
Gabe, for your technical help and for making time to be a second reader of this work. It's been a pleasure and a privilege publishing with you both.

To Halldor Thorhallsson, Jacob Chen, Kimberly Dextras, Robbie Rolin, Louie Dinh, Meghana Venkatswamy, Giovanni Viviani, Dilan Ustek, Antoine Ponsard, Kuba Karpierz, Neil Newman, Daniel Almeida, Jordon Johnson, and the rest of my fellow graduates, thank you. Thank you for being beer-callers, brainstormers, sounding boards, project members, debate partners, mountain hikers, and friends. May you all have happy careers spent solving interesting problems.

To my dear friends in Vancouver, Calgary, Tokyo, Perth, Seattle, and Portland, you all mean the world to me. These last two years would have been gray and bland without you. We've made enough memories to last several lifetimes, and I hope we make enough to last several more.

Second to last but not second to least, my wonderful family, whose emotional (and occasionally financial) support was unfailing over these last few years. I wouldn't be writing these words had my parents not kindled in me a love of exploration. Curiosity is the fuel of science and they made sure to fill up the tank.

And finally, Halina, my life's longest love. We grew up together and will grow old together. You make everything better.

Dedication

To Halina, without whom I would have neither the skill nor the will to do much of anything.

Chapter 1

Introduction

Every year, Canadians spend $10.4 billion caring for persons with dementia. Alzheimer's disease (AD), which accounts for 60% - 80% of all dementia diagnoses, is projected to become a trillion-dollar disease worldwide by 2018, which places it among the most financially costly diseases in developed countries [11, 59]. Although there is not yet a cure for AD, researchers believe early detection will be key to preventing, slowing, and stopping the disease [3].

Of the 47 million people who live with dementia today, only approximately 25% receive a formal diagnosis [35].
A diagnosis of dementia may involve repeated medical follow-up, interviews with patients and caregivers by trained health care professionals, and detailed cognitive assessment [3]. Blood tests and neuroimaging, which can be distressing to elderly patients, are also often used to rule out other causes of dementia-like symptoms. In developing countries, access to some or all of these resources may not be available, and this is reflected in the higher than average rates of undiagnosed dementia in those regions [35]. What is needed is a diagnostic tool which is non-invasive, inexpensive and easy to administer, so that patients in developed and developing countries can receive care, plan for their future, and make lifestyle choices that can slow the progression of the disease [3].

One promising avenue, and the focus of this thesis, is to develop an automated tool that makes a diagnosis by detecting changes in language. While AD is characterized by a decline in many cognitive functions, including "impairments in attention/concentration, orientation, judgment, visuospatial abilities, executive function, and language" [24], dysphasia (the loss of the ability to produce or understand language) has been suggested as being more significant than other symptoms due to its correlation with a decline in noncognitive skills such as hygiene, dressing and eating [66]. Those with AD often have a number of linguistic deficits, including:

- Difficulty finding words
- Diminished vocabularies
- Difficulty recalling the names of everyday objects ("anomia")
- A tendency to speak with repetitions ("echolalia")
- Difficulty producing sounds, syllables and words ("verbal apraxia")

Given the importance of dysphasia in detecting early signs of dementia, researchers are applying advances in machine learning (ML) and Natural Language Processing (NLP) to develop a tool that identifies dementia based only on a sample of a patient's speech.
Such a tool would assist clinicians in making a diagnosis and would hopefully obviate the need for more invasive screening techniques. Further, it could be easily distributed to developing countries via a mobile phone application. Previous work in this area has shown positive preliminary results using language to distinguish between patients with and without dementia, as well as between subtypes of dementia (e.g. AD with and without additional vascular pathology), but most previous work has been limited by small data sets and has focused on only one form of language production, namely spoken language [27, 61]. This work builds upon previous research on speech analysis and makes three main contributions, which are detailed below.

1.1 Contributions

1.1.1 A Novel Feature Set: Spatial Neglect

DementiaBank is a well-studied data set of patients undergoing the "Cookie Theft Picture Description Task" component of the Boston Diagnostic Aphasia Examination (BDAE). Participants are asked to describe the cartoon image seen in Figure 3.2, and their responses are recorded and manually transcribed. We introduce a new set of features that measure whether a respondent is more perceptive on one side of their visual field than the other, a condition known as spatial neglect. These "spatial neglect" features are effective and simple to extract, and we show that their inclusion achieves a new state of the art in detecting AD from speech. This study also considered two variations of the spatial neglect features, as well as discourse features, and shows that they have little or no effect across a range of models trained on the DementiaBank data set.

1.1.2 Domain Adaptation: Using Alzheimer's Data To Diagnose Mild Cognitive Impairment

Training a ML classifier to identify mild cognitive impairment (MCI) is difficult both because patients with MCI are less symptomatic than those with AD, and because there is less training data available.
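One simple way to exploit a larger, related source domain in this situation is Daumé-style feature augmentation, which matches the "target only", "source only" and "both" feature copies described in the caption of Figure 5.3. A minimal sketch, with illustrative function and variable names not taken from the thesis code:

```python
def augment(features, is_source):
    """Daume-style feature augmentation: map a d-dimensional feature
    vector to 3d dimensions, [general copy, source-only copy,
    target-only copy]. Source-domain rows get zeros in the target
    slot and vice versa, so a linear classifier can learn shared
    weights in the general slot and domain-specific corrections
    in the other two."""
    zeros = [0.0] * len(features)
    if is_source:
        return features + features + zeros   # [general, source, 0]
    return features + zeros + features       # [general, 0, target]

# Toy usage: one source-domain (AD) row and one target-domain (MCI) row.
ad_row = augment([1.0, 2.0], is_source=True)    # [1.0, 2.0, 1.0, 2.0, 0.0, 0.0]
mci_row = augment([1.0, 2.0], is_source=False)  # [1.0, 2.0, 0.0, 0.0, 1.0, 2.0]
```

A classifier trained on the concatenation of augmented source and target rows can then exploit the abundant source data without forcing the two domains to share every weight.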
We show how the available AD data can be leveraged to improve classification accuracy for MCI using domain adaptation techniques. We compare two simple domain adaptation algorithms, AUGMENT and CORAL, and show that only AUGMENT is an effective way to improve the performance of the best classifier in detecting MCI.

1.1.3 Written Language: A New Corpus And Demonstration Of Viability

Most research on automatically detecting dementia from language has focused on spoken language; little work has been done on written language. Analyzing written language is difficult because the author has the opportunity to delete mistakes and make revisions to the text, as well as to receive third-party assistance. Depending on the source (e.g. blogs, email, Twitter), the author may not be constrained to a single topic, as are the participants of the BDAE. Additionally, acoustic and test-specific features cannot be extracted from blog posts as they can from data collected from a standardized test.

Despite these difficulties, a great deal of written data will become available in the future as a greater number of seniors begin using the internet. We show that a range of models can determine whether the author of a blog post has dementia at a rate far above baseline. We create a new corpus of blog posts written by either persons diagnosed with dementia or caregivers of persons with dementia, and make it publicly available for further research.

1.2 Reproducibility

The code to reproduce all results and the corresponding plots is available at https://github.com/vadmas/dementia_classifier, and the blog corpus is available at https://github.com/vadmas/blog_corpus.

1.3 Thesis Overview

To improve readability, the three contributions are each given their own chapters (4, 5, and 6, respectively), with results concluding each. Chapter 2 goes into the background knowledge required to understand the remainder of the thesis.
Specifically, we focus on Alzheimer's disease (AD), the most common form of dementia, and mild cognitive impairment (MCI), a condition that often precedes dementia and is the focus of Chapter 5. We also discuss previous work that has been done on automatic detection of dementia from speech. Chapter 3 discusses the methodology common to each of the experiments; any deviations from the general methodology are discussed within each chapter. Most of the results, and some of the text and figures, have appeared in three publications which were published over the duration of this thesis: Domain Adaptation for Detecting Mild Cognitive Impairment [51], Detecting Dementia through Retrospective Analysis of Routine Blog Posts by Bloggers with Dementia [50], and Improving Diagnostic Accuracy of Alzheimer's Disease from Speech Analysis Using Markers of Hemispatial Neglect [25]. We conclude with Chapter 7, which summarizes the work and outlines areas for future research.

Chapter 2

Background

In this section, we provide an overview of dementia, focusing on AD and MCI, the two subtypes of dementia discussed in this thesis. We do not aim to provide a comprehensive review of the current state of medical knowledge of dementia (cf. Association [3]), but rather to provide relevant background information necessary to understand the remainder of this work. We then discuss previous work that has been done using machine learning models to classify dementia from written and spoken language. Discussions of how our research builds upon previous research are contained within Chapters 4, 5, and 6. The reader is presumed to have some familiarity with basic ML models (logistic regression, random forests, Support Vector Machines (SVMs), etc.) and therefore the details of each model will not be discussed. (For those unfamiliar with common ML models, an overview of the various models used in this thesis is available at https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/.)
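As a concrete reference point for the evaluation protocol used throughout this work (k-fold cross-validation over a feature matrix, cf. Figure 3.1), the fold logic can be sketched with the standard library alone; the helper names and the toy majority-class model below are illustrative stand-ins, not code from the thesis repository:

```python
from statistics import mean

def k_fold_indices(n_samples, k=10):
    """Split sample indices into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(model, X, y, k=10):
    """Mean accuracy over k train/test splits: train on k-1 folds,
    score on the held-out fold, repeat for every fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in test_idx])
        scores.append(mean(int(p == y[i]) for p, i in zip(preds, test_idx)))
    return mean(scores)

class _Majority:
    """Toy stand-in for a classifier with scikit-learn-style fit/predict."""
    def fit(self, X, y):
        self.label = max(set(y), key=list(y).count)
    def predict(self, X):
        return [self.label] * len(X)

# Toy usage: a majority-class baseline, 5 folds over 10 samples.
acc = cross_validate(_Majority(), [[0.0]] * 10, [1] * 8 + [0] * 2, k=5)  # 0.8
```

In the experiments themselves, the models of Table 3.3 play the role of `model`, and a feature selection step is run inside each training fold.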
The background for domain adaptation and discourse parsing iscovered in sections 5.1 and 4.2, respectively.2.1 Medical Overview2.1.1 Alzheimer’s Disease And Other DementiasDementia is an umbrella term for a variety of diseases that cause a decline in cog-nitive ability beyond that which is expected from normal aging. Symptoms are1For those unfamiliar with common ML models, an overview of the various models usedin this thesis is here: https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/5often gradual, irreversible, and significant enough to affect daily functioning. Tobe classified as a major neurocognitive disorder2 by DSM-5, a patient must show a“significant cognitive decline from a previous level of performance in one or morecognitive domains”:• Learning and memory• Language• Executive function• Complex attention• Perceptual-motor• Social cognitionand:a) The cognitive deficits interfere with independence in everyday activitiesb) The cognitive deficits do not occur exclusively in the context of a deliriumc) The cognitive deficits are not better explained by another mental disorder(eg. depression, schizophrenia) [4]Language impairment is common to all dementias, although the dysphasia maymanifest itself differently depending on the underlying pathology [41, 74]. Ta-ble 2.1 (reproduced from [41]) lists the key speech and language impairments thatcharacterize each subtype of dementia.Symptoms intensify as the disease progresses. In the early stages, a personmay have difficulty performing chores around the house and may become notice-ably more forgetful, needing prompting to take pills or do other routine daily activ-ities. The individual may also demonstrate personality and mood changes, as wellas have difficulty finding words to express themselves. 
These symptoms worsen through the middle and late stages of the disease until the person is unable to live without assistance. Late stage dementia is characterized by almost complete aphasia, severe memory loss, and a total reliance on care providers.

2 Dementia was renamed "major neurocognitive disorder" in the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5).

Types of Dementia — The key speech and language impairments in the early stages of dementia

Alzheimer's disease:
- Finding the right word for objects
- Naming the objects
- Word comprehension
- Loud voice

Vascular dementia:
- Finding the right word for objects
- Naming the objects
- Word comprehension
- Incomprehensible speech
- Decreased complexity

Dementia with Lewy bodies:
- Language disorders include both the symptoms of AD and PDD

Parkinson's disease dementia:
- Non-articulated speech
- Loss of verbal fluency
- Non-grammatical sentences
- Slow speech
- Soft voice

Frontotemporal dementia, progressive non-fluent aphasia:
- Slow and hesitant speech
- Grammatical mistakes
- Worsened understanding of complex sentences
- Finding the right word for objects
- Loss of literacy skills such as reading and writing

Semantic dementia:
- Finding the right word for objects
- Naming the objects
- Word comprehension
- A lack of vocabulary
- Surface loss of literacy skills

Mixed dementia:
- Language disorders include the symptoms of AD, vascular dementia and DLB, or a combination of two of them

Table 2.1: Speech and language impairments in the individual types of dementia. Table replicated from Klimova and Kuca [41].

Alzheimer's disease is the most common cause of dementia, accounting for 60% to 80% of cases, while vascular dementia, which is caused by disease or injury to the brain that impedes blood flow, is the second most common, accounting for 10% of known cases. One in nine people aged 65 and older have AD, while about one third of people aged 85 and older have AD [3].
Less common forms of dementia include dementia with Lewy bodies, Parkinson's disease dementia, frontotemporal dementia, and Creutzfeldt-Jakob disease. Characteristics of these dementias are summarized in Table 2.2 (reproduced from Kumar et al. [43]).

Types of Dementia — Characteristics

Alzheimer's disease:
- Most common type of dementia; accounts for 60 to 80 percent of cases.
- Difficulty remembering names and recent events is often an early clinical symptom; apathy and depression are also often early symptoms.
- Later symptoms include impaired judgment, disorientation, confusion, behaviour changes, and trouble in speaking, swallowing and walking.
- Hallmark abnormalities are deposits of the protein fragment beta-amyloid (plaques) and twisted strands of the protein tau (tangles).

Vascular dementia (also known as multi-infarct dementia or vascular cognitive impairment):
- Considered the second most common type of dementia.
- Impairment is caused by decreased blood flow to parts of the brain, often due to a series of small strokes that block arteries.
- Symptoms often overlap with those of Alzheimer's, although memory may not be as seriously affected.

Mixed type:
- Characterized by the presence of the hallmark abnormalities of Alzheimer's and another type of dementia, most commonly vascular dementia, but also other types, such as dementia with Lewy bodies.

Dementia with Lewy bodies:
- Pattern of decline may be similar to Alzheimer's, including problems with memory and judgment and behaviour changes.
- Alertness and severity of cognitive symptoms may fluctuate daily.
- Visual hallucinations, muscle rigidity and tremors are common.
- Hallmarks include Lewy bodies (abnormal deposits of the protein alpha-synuclein) that form inside nerve cells in the brain.

Parkinson's disease:
- Many people who have Parkinson's disease develop dementia in the later stages of the disease.
- The hallmark abnormality is Lewy bodies (abnormal deposits of the protein alpha-synuclein) that form inside nerve cells in the brain.

Frontotemporal dementia:
- Involves damage to brain cells, especially in the front and side regions of the brain.
- Typical symptoms include changes in personality and behaviour and difficulty with language.
- No distinguishing microscopic abnormality is linked to all cases.
- Pick's disease, characterized by Pick's bodies, is one type of frontotemporal dementia.

Creutzfeldt-Jakob disease:
- Rapidly fatal disorder that impairs memory and coordination and causes behaviour changes.
- Variant Creutzfeldt-Jakob disease is believed to be caused by consumption of products from cattle affected by mad cow disease.
- Caused by the misfolding of prion protein throughout the brain.

Table 2.2: Major types of dementia and their characteristics. Table replicated from Kumar et al. [43].

At present, there is no single test to diagnose dementia. Physicians rely on clinical interviews with family members and the patient to see if they meet the criteria enumerated in the DSM-5. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) are often used to rule out treatable or non-dementia pathologies (e.g. brain tumours or cerebrovascular disease) that can cause cognitive decline. Neuroimaging is also used to search for biological markers (e.g. cerebral atrophy or reduced glucose metabolism in the fronto-temporo-parietal and cingulate cortices) which can add evidence towards a diagnosis [23, 33, 60].

Clinicians also rely on the Mini Mental State Examination (MMSE), a 30-point questionnaire that measures impairment across five cognitive functions: orientation, registration, attention and calculation, recall, and language [67]. Of a total 30 possible points, a score between 20-24 indicates mild dementia, 13-20 moderate, and ≤ 12 severe dementia. Language impairment is assessed by asking the participant to recall the names of a watch and pencil, read and repeat a phrase (separate tasks), and follow a three-stage command [45].
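As a concrete illustration of the scoring bands just described, the mapping from a total MMSE score to a severity band can be written as a small function. This is an illustrative sketch only, not part of the clinical instrument; note that the quoted bands overlap at a score of 20, which this sketch resolves in favour of "mild", and scores above 24 are labelled with an assumed placeholder string:

```python
def mmse_severity(score: int) -> str:
    """Map a 30-point MMSE total to a severity band.

    Band edges follow the ranges quoted in the text (20-24 mild,
    13-20 moderate, <= 12 severe). The bands overlap at 20; this
    sketch assigns 20 to "mild". The label for scores above 24 is
    an assumption, not part of the MMSE itself.
    """
    if not 0 <= score <= 30:
        raise ValueError("MMSE scores range from 0 to 30")
    if score >= 25:
        return "no dementia indicated"
    if score >= 20:
        return "mild"
    if score >= 13:
        return "moderate"
    return "severe"
```

For example, `mmse_severity(22)` falls in the mild band, while `mmse_severity(10)` falls in the severe band.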
Fine-grained assessment of impairment across different language domains is not included, despite evidence that AD causes semantic, pragmatic, syntactic, and phonological language deficits.3 Automated analysis of language has been suggested as a promising approach to detecting linguistic impairment across multiple linguistic domains [73].

There are no known cures for dementia, although medication can be used to treat symptoms or slow its progress. Commonly used medication for AD includes donepezil, an acetylcholinesterase inhibitor which increases the concentration of acetylcholine4 in the brain, and memantine, which targets the glutamatergic system by binding with NMDA receptors to reduce toxicity associated with excessive glutamate5 [69]. Both donepezil and memantine are approved for moderate and late stage AD and provide modest improvement in cognitive function [5, 56, 75]. Other drugs are available6 for various stages of AD but all are palliative in nature.

2.1.2 Mild Cognitive Impairment

MCI is defined as a noticeable decline in cognitive function that may - but crucially, may not - lead to an eventual dementia diagnosis. Individuals with MCI (reclassified as Mild Neurocognitive Disorder (MNCD)7 in the DSM-5 [71]) show cognitive impairment beyond that which is expected for their age, but which is less severe than dementia and does not significantly interfere with daily activities [57, 71]. Unlike with AD and related dementias, the individual retains the ability to perform functional tasks (e.g. hygiene, eating) but may become slower or less efficient at performing everyday tasks. A patient story from Langa and Levine [47] reads:

    Mrs J, age 81 years, with hypertension and hyperlipidemia, requested a referral to a neurologist, stating: "I am forgetting things I just heard."

    Mrs J and her husband began noticing mild memory problems 1.5 years earlier, and report slow progression since. Her husband noticed changes in problem solving and time management. Mrs J was easily distracted and had difficulty remembering recent conversations. She misplaced objects and spent time looking for them; she read and wrote less than before. She repeatedly asked how to do things on her computer and cell phone. Her husband reported that she exhibited no initiative, and that their home seemed more disorganized. She had difficulty planning dinner and her cooking was simpler. Both denied changes in language or speech. She continued to drive locally without accidents but had difficulty remembering directions to familiar places. Mrs J had no hallucinations or delusions. She slept well, her mood was fine, and she exhibited no behavioral problems or personality changes.

    Functionally, she remained independent in all activities of daily living (ADLs). She had urinary frequency and over the past couple of months she had a few incidents of incontinence, especially when awakening from a nap. In instrumental activities of daily living (IADLs), Mr J had recently taken over paying bills. Finally, even with a compartmentalized pill-box, she occasionally forgot to take her medications (amlodipine 5 mg daily; losartan 50 mg twice daily; and ergocalciferol 1,000 units daily).

3 This is likely due to its age. The MMSE was developed in 1975, before much of the research into the effect of AD on language impairment was conducted [73].
4 Acetylcholine is a neurotransmitter associated with attention and memory.
5 Glutamate is an important neurotransmitter in the brain involved in learning and memory. AD causes an excessive buildup of glutamate in the brain which kills glutamate receptors (or "NMDA receptors") by overexposure. Memantine and other NMDA receptor antagonists reduce toxicity by binding with NMDA receptors to reduce their exposure to excess glutamate.
6 See http://www.alzheimer.ca/en/Home/About-dementia/Treatment-options/Drugs-approved-for-Alzheimers-disease
7 The central difference between MNCD and MCI is that research into MCI mainly involved a cohort of elderly patients while MNCD included all age groups [71].

Population-based studies estimate MCI to be prevalent in between 12-18% of people over the age of 60 [58]. Annually, 8 to 15% of those with MCI will progress to dementia while the rest will either revert to normal cognition or stay mildly impaired [58]. MCI can be due to neurodegenerative diseases (most commonly AD) or reversible causes, including psychiatric illness or metabolic disturbances such as thyroid disease or vitamin B12 deficiency [32].

MCI has no single cause and no single treatment. Early diagnosis is important because it allows patients to test for potentially treatable causes (e.g. major depression or vascular risk factors) and make modifications to their lifestyle to slow the onset of dementia [58]. As well, early diagnosis can lead to confirmatory diagnostic testing for conditions such as AD, and can better allow for planning for social supports and closer medical follow-up. Unsurprisingly, MCI is more difficult to detect than dementia since the symptoms are less severe; the MMSE has a sensitivity of 88.3% (95% CI, 81.3 to 92.9) and a specificity of 86.2% (95% CI, 81.8 to 89.7) for detecting dementia, but only a sensitivity of 45 to 60% and specificity of 65 to 90% for detecting MCI [47, 49]. In addition, MCI has not been studied as extensively as AD and therefore there is less clinical data available with which to train a machine learning model.

2.2 Automatic Detection Of Dementia

With advances in ML models and NLP, an interest has emerged in training machine learning models to automatically detect Alzheimer's disease and related dementias (ADOD) from language.
We discuss previous work done on both spoken and written text below, and, in Chapters 4, 5, and 6, comment on how this research differs from previous work.

2.2.1 Spoken Language

There has been success in using lexical and acoustic features derived from speech to diagnose ADOD. Ahmed et al. [2] determined features that could be used to identify dementia from speech, using data collected in the Oxford Project to Investigate Memory and Aging (OPTIMA) study. These researchers used a British cohort of 30 participants: 15 with AD at either MCI or mild stages, and 15 age- and education-matched healthy controls. They found that language progressively deteriorates as AD progresses and suggested using semantic, lexical content, and syntactic complexity features to identify cases.

Rentoumi et al. [61] then used a Gaussian naive Bayes classifier with lexical and syntactic features to distinguish between two subtypes of dementia: AD with and without additional vascular pathology. They achieved a classification accuracy of 75% on 36 transcripts from the OPTIMA data set.

Orimaye et al. [54] expanded on this work by using a similar feature set to distinguish between ADOD and healthy patients. They performed a comparison of five machine learning classifiers - SVMs with an RBF kernel, Naive Bayes, decision trees, neural networks, and Bayesian networks - on the larger DementiaBank data set (sample size = 484) and found SVMs had the best performance, with an F-measure score of 74% [54].

In 2014, Fraser et al. [26] compared different feature sets that could be used in discriminating between three different types of primary progressive aphasia (a subtype of frontotemporal dementia). They concluded that a smaller relevant subset of features achieves better classification accuracy than using all features and highlighted the importance of a feature selection step.
They also showed how psycholinguistic features, such as frequency and familiarity, were useful in detecting primary progressive aphasia. In later work, Fraser et al. [27] used logistic regression to achieve a state-of-the-art accuracy of 81.92% in distinguishing individuals with AD from those without. Their experiments were run on the DementiaBank data set described in Section 3.1.1, and they found optimal performance when 35-50 features are used, consistent with their previous work [26].

Researchers have also looked at automatically detecting MCI, a harder task than detecting AD - both because limited data is available and because MCI is less symptomatic than AD. Roark et al. [63] demonstrated the viability of this task using transcripts and audio recordings of patients undergoing the Wechsler Logical Memory I/II test. This test involves a patient twice retelling a short story, once immediately after the story was told and again after a 30-minute delay. Roark et al. [63] extracted two broad sets of features: "linguistic complexity" features that measure the complexity of a narrative, and "speech duration" features that include number of pauses, pause length, and pause-to-speech ratio. Using SVMs, they achieved a maximum area under the ROC curve (AUC) of 0.74 and concluded that NLP techniques could be used to automatically derive measures to discriminate between healthy and MCI subjects. Tóth et al. [76] made a step towards fully automated detection of MCI by determining the effect of automatic speech recognition (ASR) compared to manual transcriptions. They showed classification results worsen slightly when using ASR, although they still achieved an F-measure of 85.3 on Hungarian patients. König et al. [42] found similar results using their end-to-end system on a cohort of French-speaking seniors, and positive results have also been found with Greek speakers [65].

2.2.2 Written Language

Early signs of dementia can be detected through analysis of writing samples as well [40, 48, 62].
In the "Nun Study," researchers analyzed autobiographies written in the US by members of the School Sisters of Notre Dame between 1931-1996 [68]. Those nuns who met the criteria for dementia had lower grammatical complexity scores and lower "idea density" in their autobiographies. Surprisingly, the measure of idea density in autobiographies written by nuns in their 20's was predictive of dementia in late life [40].

Le et al. [48] performed a longitudinal analysis of the writing styles of three novelists: Iris Murdoch, who died with AD; Agatha Christie, suspected of having AD; and P.D. James (normal brain aging). Measurements of syntactic and lexical complexity were made from 51 novels that collectively spanned the authors' careers. Murdoch and Christie exhibited evidence of linguistic decline in later works, such as vocabulary loss, increased repetition, and a deficit of noun tokens [48].

Hirst and Wei Feng [31] studied the question of whether a model trained to recognize authorship would recognize text written by an author late in their life if that author had dementia. They used "authorship attribution" and "authorship verification"8 methods on the works of Iris Murdoch and Agatha Christie (AD) and P.D. James (control).

8 Authorship attribution is a multi-label classification problem: given a set of authors and an unknown text, determine the author. Authorship verification is a binary classification: given an author and a text, determine if the text was written by the author.

They hypothesized that in the case of the AD authors, an SVM classifier would not be able to attribute (or verify) text written in the late stage of the author's career to the author because of changes in writing style due to dementia. Their results were inconclusive, as they found changes in the control author (P.D.
James) as well, indicating that authors' styles change naturally (or perhaps intentionally) as a result of age.

Despite evidence that linguistic markers found in writing samples can predict dementia, it appears that no attempts have been made to train models to classify dementia based on writing alone. To date, it is not clear whether systems that can detect dementia from spoken language will work for written language, given that audio and test-specific feature groups are not available from unstructured written text as they are from data collected from patients undergoing standard diagnostic exams.

Chapter 3

Methodology

In this section we detail the experimental methodology common to all subsequent experiments. Any deviations from the general procedure described below (such as the blog data set used in Chapter 6) are discussed within each chapter. The full pipeline is seen in Figure 3.1.

3.1 Data Set

We use two data sets for this work: one consists of samples of spoken language and the other consists of samples of written language. We detail the DementiaBank data set, which we use in Chapters 4 and 5, below. The data set of written samples is discussed in Chapter 6.

3.1.1 DementiaBank

For spoken samples we used the DementiaBank data set, a publicly available data set that consists of transcripts and recordings of English-speaking participants describing the "Cookie Theft Picture," a component of the Boston Diagnostic Aphasia Examination [29]. A patient is asked to describe the cartoon image in Figure 3.2 and their answer is recorded and manually transcribed - including false starts, pauses, and paraphasia1 - and then segmented into utterances. An utterance

1 "Paraphasia" is a type of language error associated with aphasia. It is characterized by unintended syllables, words, or phrases that result from an effort to speak. Examples include "lelephone" for "telephone" or "ragon" for "wagon".

Figure 3.1: Processing pipeline from clinical interview to evaluation.
We perform a 10-fold cross validation for the evaluation stage. Experiments use either the blog data set (left) or the DementiaBank data set (right) but not both.

is defined as a unit of speech bounded by silence. An example response is seen in Figure 3.3.

DementiaBank consists of 309 samples from 208 persons with dementia and 242 samples from 102 healthy elderly controls. A patient can give multiple interviews. Ages ranged from 45 to 90, with interviews conducted between 1983 and 1988. Medical re-review was done up to five years after the end of the study. Of the 208 persons included in the study, 181 patients were diagnosed with probable/definite Alzheimer's disease (AD) and seven with vascular dementia. Some patients were discarded due to misdiagnoses or on other clinical grounds. Of the 309 interviews with dementia patients, 43 samples were classified as mild cognitive impairment (MCI) and 256 samples as possible/probable AD. The remaining interviews were not used in this study. Demographic information about the DementiaBank samples used in this study is listed in Table 3.1.

Diagnosis   Patients   Samples   Mean Words        Mean Age         Gender (F/M)
AD          169        257       104.98 (s=59.8)   71.72 (s=8.47)   87/170
MCI         19         43        111.09 (s=55.8)   69.39 (s=8.09)   27/16
Control     99         242       113.56 (s=58.5)   63.95 (s=9.16)   88/154

Table 3.1: Demographics of the DementiaBank data set.

3.2 Features

In addition to the age of the patient, which is a known predictor of dementia not leveraged in previous work [28], a total of 314 lexical and acoustic features were extracted and divided into eight groups. All features listed below have appeared in previous work, most notably from Fraser et al. [27]. A full list of all features appears in Table A.1 in the appendix.

3.2.1 Parts-Of-Speech (15)

We use the Stanford Tagger2 to capture the frequency of various parts of speech tags (nouns, verbs, adjectives, adverbs, pronouns, determiners, and so forth).
Frequency counts are normalized by the number of words in the utterance and we record the mean across utterances. Disfluencies ("um", "er", "ah"), not-in-dictionary words of three or more letters, and word-type ratios (noun to verb, pronoun to noun, etc.) were also counted.

2 Available at: http://nlp.stanford.edu/software/tagger.shtml

Figure 3.2: Cookie Theft picture from the Boston Diagnostic Aphasia Examination.

um. mhm. alright. there's um a young boy that's getting a cookie jar. and it he's uh in bad shape because uh the thing is falling over. and in the picture the mother is washing dishes and doesn't see it. and so is the the water is overflowing in the sink. and the dishes might get falled over if you don't fell fall over there there if you don't get it. and it there it's a picture of a kitchen window. and the curtains are very uh distinct. but the water is flow still flowing.

Figure 3.3: Manually transcribed sample response from a patient undergoing the Cookie Theft Picture Test.

3.2.2 Context-Free-Grammar Rules (44)

These features count how often a phrase structure rule occurs in an utterance, including NP→VP PP, NP→DT NP, etc. We use Penn Treebank tags, and parse trees come from the Stanford parser.

3.2.3 Syntactic Complexity (27)

These features measure the complexity of an utterance through metrics such as the depth of the parse tree, mean length of words, mean length of sentences, mean length of T-units, mean length of clauses, and clauses per sentence. We use the L2 Syntactic Complexity Analyzer.3

3.2.4 Vocabulary Richness (4)

We calculated four metrics that capture the range of vocabulary in a text: the standard type-to-token ratio (the ratio of vocabulary size to text length, |V|/N), and the moving-average type-token ratio (MATTR), which is a length-independent metric of lexical diversity [15].
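As an illustration of the two metrics just mentioned, MATTR averages the plain type-to-token ratio over a sliding window so that the measure no longer shrinks with text length. This is a minimal sketch; the window length is a free parameter chosen here only for illustration:

```python
def type_token_ratio(tokens):
    """Standard TTR: vocabulary size |V| over text length N."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=10):
    """Moving-average type-token ratio: the mean TTR over a sliding
    window of fixed length, which removes TTR's dependence on overall
    text length. Falls back to plain TTR for very short texts."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    n_windows = len(tokens) - window + 1
    return sum(type_token_ratio(tokens[i:i + window])
               for i in range(n_windows)) / n_windows
```

For example, a five-token utterance with one repeated word has a TTR of 0.8, while its MATTR with a window of 2 is 1.0 because no window contains the repetition.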
We also record Brunet's index, an alternative length-independent metric of lexical diversity that has appeared in previous work, and Honoré's statistic, a metric of lexical diversity based on counting the number of singleton words appearing in a person's speech [9].

3 Available at: http://www.personal.psu.edu/xxl13/downloads/l2sca.html

3.2.5 Psycholinguistic (5)

Psycholinguistic features are linguistic properties of words that affect word processing and learnability [64]. We used five psycholinguistic features (numbers in parentheses indicate the number of words with scores in the database):

• Familiarity (3626): A measure associated with how familiar a word is ("monad" has a low score while "breakfast" has a high score).
• Concreteness (1372): A measure of how concrete or abstract a word is ("however" has a low score while "December" has a high score).
• Imagability (4829): A measure of how easily one can conjure a mental image of a word ("equanimity" has a low score and "beach" has a high score).
• Age of acquisition (31104): A measure of how old people are on average when they first learn the word. We use the expanded set from Kuperman et al. [44].
• SUBTL (74K): A measure of the frequency with which a word is used in daily life.

Scores for concreteness, familiarity, imagability, and age of acquisition were derived from surveys and crowdsourcing, and the results from multiple studies were aggregated and made publicly available.4 Participants were asked to rate words on a numerical scale (generally between 1-5 or 1-7, depending on the study) and scores were averaged across ratings. The SUBTL word scores were not derived from crowdsourcing, but instead from television and film subtitles. Word frequencies were calculated from 51 million words across 8388 film and television episodes [8].

4 Available at: http://websites.psychology.uwa.edu.au/school/MRCDatabase/mrc2.html

3.2.6 Repetitiveness (5)

We vectorized the utterances using TF-IDF and measured the cosine similarity between utterances. We then recorded the proportion of distances below three thresholds (0, 0.3, 0.5), as well as the minimum and average cosine distance.

3.2.7 Information Units (info-units) (40)

Croisile et al. [16] compiled a list of 23 items that can be discerned in the Cookie Theft Picture. These "information units" can be either actions or nouns; examples include jar, cookie, boy, kitchen, boy taking, and woman drying. For each information unit we extracted two features: a binary feature indicating whether the subject has mentioned the item (or one of its synonyms in WordNet), and a frequency count of how many times an item has been mentioned. Three info-units (e.g., "woman indifferent to the children") from Croisile et al. [16] were not included in this work due to the lack of specificity of the info-unit.

For each information unit, we used WordNet to create a set of synonyms, hypernyms and hyponyms that could be used to identify the item. We manually removed inappropriate unigrams (e.g., "irrigate" for "water"). For an information unit to be considered recognized, one of the unigrams in the set must appear in the transcript. In the case of the four action information units (boy taking, water overflowing, mother washing, stool falling), both unigrams must be present in a single utterance and the words must be tagged with the appropriate POS tag. The list of info units and their synonyms is shown in Table 3.2.

3.2.8 Acoustic (172)

Mel-frequency Cepstral Coefficients (MFCCs) are frequently used in speech processing and represent spectral information from the speech signal, using a scale known as the "mel-frequency scale," which is chosen to mimic the way humans perceive audio. MFCCs are calculated by segmenting the signal into short frames and taking the (discrete) Fourier transform of each segment.
The MFCCs are then calculated from the "mel log powers" of the first 14 coefficients calculated by the Fourier transform of each segment.5 Each segment from the original signal then produces 14 MFCCs, resulting in 14 MFCC distributions. We then calculate the mean, variance, skewness, and kurtosis of the first 14 MFCCs, representing spectral information from the speech signal. We did the same for the velocity and acceleration, where velocity is calculated as the delta between consecutive time steps and acceleration as the double-deltas.

5 See http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ for more detail.

3.3 Feature Selection

We follow the recommendation of Fraser et al. [26] and performed a feature selection preprocessing step. Within each training fold, we selected for inclusion the k features that have the highest absolute correlation with the training labels. The reported performance is the maximum average across all 1 ≤ k ≤ D, where D is the number of features. Figure 4.4 shows the importance of feature selection, with the maximum performance achieved for most models between 35-50 features.

3.4 Models

We used five models from the SKLearn python package, as seen in Table 3.3. We also considered including a multilayer perceptron classifier, but it was later excluded due to badly overfitting the data (despite efforts spent hyperparameter tuning).

Info Unit    Synonyms
Boy          boy, son, brother, male child
Girl         girl, daughter, sister, female child
Woman        woman, mom, mother, lady, parent, female, adult, grownup
Kitchen      kitchen, room
Exterior     exterior, outside, garden, yard, outdoors, backyard, driveway, path, tree, bush
Cookie       cookie, biscuit, cake, treat
Jar          jar, container, crock, pot
Stool        stool, seat, chair, ladder
Sink         sink, basin, washbasin, washbowl, washstand, tap
Plate        plate
Dishcloth    dishcloth, dishrag, rag, cloth, napkin, towel
Water        water, dishwater, liquid
Window       window, frame, glass
Cupboard     cupboard, closet, shelf
Dishes       dish, dishes, cup, cups, counter
Curtains     curtain, curtains, drape, drapes, drapery, blind, blinds, screen, screens
Steal        take, steal, taking, stealing
Fall         fall, falling, slip, slipping
Wash         wash, dry, clean, washing, drying, cleaning
Overflow     overflow, spill, overflowing, spilling

Table 3.2: A list of info units and their synonyms.

3.5 Evaluation

A 10-fold cross validation procedure was used to evaluate each model, where multiple interviews from a given patient were confined either to the training fold or the test fold, but not both. Feature selection took place within each fold: the highest mean performance was returned across all 1 ≤ k ≤ D features. Bar plots show the 90% CI across all folds. We report F-measure, accuracy, and Area Under the Curve (AUC) for most experiments except where the test set is too small to report AUC.

Model                    Hyperparameters
Logistic Regression      L2 regularization, alpha = 1.0
K Nearest Neighbors      K = 5
Random Forest            Trees = 100, max depth = 3
Gaussian Naive Bayes     n/a
Support Vector Machine   kernel = 'rbf'
Dummy Classifier         Most frequent label in training fold

Table 3.3: Models and their hyperparameters.

Chapter 4

Evaluating Novel Feature Sets

In this chapter we propose and evaluate two novel feature sets: spatial neglect and discourse features, described in Sections 4.1 and 4.2 respectively. We consider three variations on the spatial neglect features - halves, strips, and quadrants - and show that halves, the most biologically plausible variant, improves the performance of the best classifier. We also show that discourse features have no effect on improving classification accuracy across a range of models.

This chapter builds on the earlier research discussed in Section 2.2.1 by evaluating two new feature sets not present in prior work. Our goal in this chapter is to demonstrate the efficacy (or lack thereof) of either or both of the novel feature sets and recommend (or not) their use to future researchers.
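The per-fold selection step described in Section 3.3, which is reused in the experiments of this chapter, can be sketched as follows. The helper `select_top_k` is illustrative rather than code from the thesis; it ranks features by absolute Pearson correlation with the labels, computed on the training fold only so that no test information leaks into the selection:

```python
import numpy as np

def select_top_k(X_train, y_train, k):
    """Return indices of the k features whose absolute Pearson
    correlation with the training labels is highest. Computed on the
    training fold only, to avoid leaking test information."""
    X = np.asarray(X_train, dtype=float)
    y = np.asarray(y_train, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs((Xc * yc[:, None]).sum(axis=0) / denom)
    corr = np.nan_to_num(corr)  # constant features get correlation 0
    return np.argsort(-corr)[:k]
```

Sweeping k from 1 to D and keeping the best cross-validated score then reproduces the "maximum average across all 1 ≤ k ≤ D" procedure of Section 3.3.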
We discuss both feature sets in detail below.

4.1 Spatial Neglect

Spatial neglect (also: "hemispatial neglect," "unilateral neglect," "hemineglect," "unilateral spatial neglect") is the phenomenon of reduced awareness on one side of the visual field which often occurs as a result of brain damage. Spatial neglect differs from "hemianopia," or blindness over half the field of vision, in that a patient with neglect still has sensation and is able to, for example, detect a bright light on their neglected side. Depending on the extent of the condition, a patient with neglect may fail to notice people or large objects on one side of space, may only shave or apply makeup on the non-neglected side, or may only draw or examine one half of an image. The left image in Figure 4.1 shows a drawing of a clock by a patient with spatial neglect [34, 55]. The right image in Figure 4.1 shows the eye movements of a patient with spatial neglect who was asked to identify the letter T among L's.

Figure 4.1: Left: A clock drawn by a patient with left-side spatial neglect. Right: Eye movements of a patient with left-side spatial neglect. The patient was asked to search for the letter T among Ls. Red dots are fixations and yellow lines are saccadic movements between fixations. Images from Husain [34].

Previous studies have shown patients with Alzheimer's disease and related dementias (ADOD) exhibit signs of spatial neglect,1 but surprisingly, none of the previous work in automatic detection of dementia has included features which measure neglect [14, 36, 37, 52, 53, 77]. We propose four new features which measure attention, concentration, repetition and perception in different visual fields. We discuss how each feature is calculated in Section 4.1.2 and discuss the three different spatial partitions we consider in Section 4.1.1.

1 A study by Kasai et al. [39] disputes these findings.
They show that the results from one measure of neglect, the "line bisection (LB) task," did not significantly correlate with the results from another measure of neglect, the "left category copying of the Rey-Osterrieth Complex Figure Task (RCFT)," and conclude that Alzheimer's patients do not show left unilateral spatial neglect but instead exhibit peripheral inattention (i.e., neglect on both sides). However, their data also show that patients with AD make "no copying errors" on the left side of the image more often than on the right (their Figure 3), and also show there are no significant differences between AD and control patients in the LB task anyway, so a lack of correlation may be unsurprising.

4.1.1 Spatial Partitions

We divided the Cookie Theft image into halves, strips, and quadrants, as seen in Figure 4.2. For each division we created a set of the info-units contained within it, shown in Table 4.1, and used these sets to calculate measures of spatial neglect. An info-unit is included in all divisions it spans (e.g., "girl" is included in the SW and NW quadrants), meaning an info-unit can appear in multiple divisions. We consider each partitioning scheme to be its own feature set. Models are trained using the features described in Section 3.2 and one of the halves, strips, or quadrants sets. For each set we also tried adding quadratic features (e.g., attention², concentration², attention × concentration), but in some cases the performance decreased. We report the results of the optimal performance of each feature set, with or without quadratic features.

Figure 4.2: We divide the Cookie Theft image into halves (red), strips (blue), and quadrants (green), and create sets of info-units within each division. For example, the "girl" info-unit is in the left half, far-left strip, and SW and NW quadrants.

4.1.2 Spatial Neglect Features

To measure spatial neglect, for each division we computed four simple metrics using the counts of each info-unit, as described in Section 3.2.7.
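Concretely, the four metrics can be computed per division from raw token counts. The sketch below is a hypothetical helper (the function name and abbreviated lexicon are ours, not the thesis's code), following the definitions given in Section 4.1.2:

```python
from collections import Counter

def neglect_features(tokens, division_units):
    """Four spatial-neglect metrics for one division of the image."""
    counts = Counter(t for t in tokens if t in division_units)
    n_i = sum(counts.values())   # total mentions of info-units in this division
    u_i = len(counts)            # unique info-units mentioned
    n_all = len(tokens)          # total words in the patient's response
    return {
        "attention": n_i,
        "concentration": n_i / n_all if n_all else 0.0,
        "repetition": u_i / n_i if n_i else 0.0,
        "perception": u_i / len(division_units),
    }

left_half = {"boy", "girl", "cookie", "jar", "stool"}  # abbreviated toy lexicon
feats = neglect_features(
    "the boy takes a cookie while the girl reaches for a cookie".split(),
    left_half,
)
```

In practice each info-unit mention would first be normalized through the synonym lists of Table 3.2 before counting.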
Let S_i be the set of info-units in division i, n_i be the count of mentions of any info-unit in S_i, u_i be the number of unique info-units mentioned in S_i, and n_all be the total number of words in the patient's response. Then for division i,

    attention_i     = n_i                 (4.1)
    concentration_i = n_i / n_all         (4.2)
    repetition_i    = u_i / n_i           (4.3)
    perception_i    = u_i / |S_i|         (4.4)

4.2 Discourse Features

One measure of coherence that has been absent from previous work comes from discourse analysis. In a coherent passage, a reader can clearly discern how one sentence relates to the next. A given sentence may explain or elaborate upon a previous sentence (as this one is doing), or act as background for a future sentence. Such relations can be formed at an intra-sentential level as well, with Elementary Discourse Units (EDUs) being clause-like units of text which can be related to one another by discourse relations. Discourse parsing is the task of segmenting a piece of text into its EDUs and then forming a discourse tree with edges corresponding to discourse relations, as seen in Figure 4.3. Features related to the discourse structure of a passage can then be extracted from the discourse tree, as discussed in Section 4.2.2.

Previous work has shown a disparity in the overall discourse ability of patients with ADOD compared to healthy controls [7, 12, 21].
Those with ADOD show a greater impairment in global coherence, have more disruptive topic shift and greater use of empty phrases, and produce fewer cohesive ties than controls [18–20, 46].

Halves
  Left          boy, girl, cookie, jar, stool, cupboard, steal, fall, kitchen
  Right         woman, exterior, sink, plate, dishcloth, water, window, dishes, curtains, wash, overflow, cupboard, kitchen
Strips
  Far-left      girl, cookie, jar, stool, cupboard, steal, kitchen
  Center-left   boy, cookie, stool, steal, fall, kitchen, cupboard
  Center-right  woman, exterior, sink, plate, dishcloth, water, window, dishes, curtains, wash, overflow, kitchen, cupboard
  Far-right     exterior, window, dishes, curtains, kitchen, cupboard
Quarters
  NE            woman, exterior, plate, dishcloth, wash, window, curtains, kitchen
  SE            woman, sink, water, dishes, overflow, cupboard, kitchen
  NW            girl, cookie, jar, cupboard, steal, boy, kitchen
  SW            girl, stool, fall, cupboard, kitchen

Table 4.1: List of info-units within each division.

Discourse parsing has been useful in determining overall coherence in other domains such as essay scoring; thus, we hypothesized that it would also be useful for AD detection [22]. Most recently, Abdalla et al. [1] looked at differences in discourse structure between AD and controls in two data sets, DementiaBank and the Carolinas Conversations Collection.2
They found significant differences in elaboration and attribution between the two groups, in both data sets. However, it remains unclear if the inclusion of discourse features improves the accuracy of classifiers.

4.2.1 Discourse Parser: CODRA

CODRA, or "a COmplete probabilistic Discriminative framework for performing Rhetorical3 Analysis," is a discourse parser which combines discourse segmenting (partitioning raw text into EDUs) with discourse parsing (the problem of forming a discourse tree from a sequence of EDUs). Most existing parsers perform structure prediction and relation labeling separately, and are therefore unable to make use of sequential dependencies between text segments. CODRA addresses this limitation by implementing a joint model for the two tasks. In addition, CODRA improves on previous discourse parsers by performing inter- and intra-sentential parsing separately with two different probabilistic models. This allows for optimal parsing using the probabilities generated by the forward-backward algorithm on Conditional Random Fields (CRFs). CODRA significantly outperformed the previous state of the art on two data sets in 2015 [38].

2 http://carolinaconversations.musc.edu/about/
3 "Rhetorical parsing" and "Discourse parsing" are used interchangeably in the literature.

Figure 4.3: Discourse tree for the two sentences "But he added: 'Some people use the purchasers' index as a leading indicator, and some use it as a coincident indicator. But the thing it's supposed to measure - manufacturing strength - is missed altogether last month.'" Each sentence contains three EDUs. EDUs correspond to leaves of the tree and discourse relations correspond to edges. (Figure adapted from [38])

4.2.2 Discourse Features

We used CODRA to segment the speech into EDUs and identify the relations between them [38].
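Given a flat list of relation labels read off a discourse tree, count-based features can be computed directly. The sketch below is a hypothetical helper, not part of CODRA, and the exact label spellings are our assumption:

```python
from collections import Counter

RELATIONS = [
    "attribution", "background", "cause", "comparison", "condition",
    "contrast", "elaboration", "enablement", "evaluation", "explanation",
    "joint", "manner-means", "same-unit", "summary", "temporal",
    "textual-organization", "topic-change", "topic-comment",
]

def discourse_features(relations):
    """Counts, ratios, and type-to-token ratio over discourse relation labels."""
    counts = Counter(relations)
    total = len(relations)
    feats = {f"count_{r}": counts.get(r, 0) for r in RELATIONS}
    feats.update(
        {f"ratio_{r}": (counts.get(r, 0) / total if total else 0.0)
         for r in RELATIONS}
    )
    feats["relation_type_token_ratio"] = len(counts) / total if total else 0.0
    return feats

feats = discourse_features(["elaboration", "elaboration", "joint"])
```

Tree-level features such as depth and EDUs per utterance would be read off the parse tree itself rather than this flat label list.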
We counted the number of occurrences of each of the 18 discourse relations ("attribution", "background", "cause", "comparison", "condition", "contrast", "elaboration", "enablement", "evaluation", "explanation", "joint", "manner-means", "same-unit", "summary", "temporal", "textual organization", "topic change", "topic comment"), the depth of the discourse tree, the average number of EDUs per utterance, the ratio of each discourse relation to the total number of discourse relations, and the discourse relation type-to-token ratio.

4.3 Experimental Design

We follow the experimental design discussed in Chapter 3 (see Figure 3.1). Unlike Fraser et al. [26] and Fraser et al. [27], we do not include features related to the number or duration of pauses in the speech. We tried using forced alignment to determine the time intervals for each word, but found the sound quality was too poor to get reliable results. We also include a single non-linguistic demographic feature, age, alongside the linguistic features, with the justification that a non-invasive diagnostic tool would be able to elicit this information easily from a patient.

4.4 Results

To evaluate the relative strength of the four new feature sets (spatial neglect from either halves, strips, or quarters, plus discourse features), we first evaluated the performance of the system without the new feature sets to establish a baseline.

4.4.1 Baseline Classification Performance

Figure 4.4 shows the F-measure across a range of models as we vary the number of included features. Features are ordered by their absolute correlation with the labels in the training fold and are added in decreasing order (i.e., the feature with the highest correlation is added first, then the second highest, and so on). The coloured shaded regions show 90% confidence intervals. Consistent with Fraser et al. [27], most models peak around 35-50 features, and logistic regression outperforms the other models (F-measure: 0.824, 90% CI=0.798-0.850).
This plot highlights the importance of the feature selection step, as the performance of logistic regression would drop below 0.75 had all features been included. Figure 4.5 shows the F-measure, accuracy, and AUC for each model at its optimal number of features. All models substantially outperform the baseline across all three metrics.

Figure 4.4: F-measure for different models as we vary the number of features included. The dark line shows the mean F-measure across each of the 10 folds, and 90% CIs are shown in the shaded regions. Features are added in decreasing order of their absolute correlation with the labels in the training fold. Most models reach their maximum performance between 35-50 features and then decline in performance as more features are included. This shows the need to include a feature selection step before training each model.

We also ran an ablation study where each of the feature groups listed in Section 3.2 was removed ("ablated"), feature selection was redone, and the model was retrained. We then obtained a measure of the importance of a particular feature group based on how much the performance changed as a result of the ablation. In Figure 4.6 we see the change in F-measure across all models when a feature group is ablated. A more significant decrease in performance indicates a more important feature group, and the error bars show the 90% CI across the 10 folds. The number at the end of each label indicates how many features were contained within the group.

Acoustic, demographic, parts of speech and info-units are the most important

Figure 4.5: We show mean F-measure, accuracy, and Area Under the Curve (AUC) for each model at its optimal number of features (i.e., the peak performance in Figure 4.4).
Error bars for each model show the 90% CI across all 10 folds. Logistic regression performs best (ACC: 0.822, 90% CI=0.795-0.848; AUC: 0.894, 90% CI=0.867-0.921; F-measure: 0.824, 90% CI=0.798-0.850) and has the tightest error bars across all models.

feature groups, causing a decrease in performance across all models. The other five groups either cause a minor decrease in performance in some models or, in some cases, even improve the model upon being ablated. This counterintuitive result was also seen in Fraser et al. [26]. The top performing model from Figure 4.4 (logistic regression, blue) decreases in performance in every case. The single demographic feature, age, is highly important, as are the diagnostic-test-specific info-unit features. Vocabulary richness, psycholinguistic, and repetitiveness are the least important feature groups on this data set. Fraser et al. [26] also found vocabulary richness and repetitiveness unimportant, suggesting that "it is the words themselves, and not the number of different words being used, that is important." However, contrary to our results, they found psycholinguistic features to be very important. This could be due to differences between the data sets. Besides being smaller in size (n=40) and involving a different subtype of dementia (primary progressive aphasia), their task was a narrative retelling task where patients were asked to retell the story of Cinderella. Differences in psycholinguistic markers such as concreteness or imagability might be more apparent in that case than in the task studied here, where patients are all asked to describe the same image.

Last, we investigated the relative importance of each feature by showing the mean feature score across all 10 folds. Within each fold, the features are sorted based on their absolute correlation with the training labels.
The score for feature i in fold j, given feature rank r_ij, is calculated as:

    score_ij = 1 − r_ij / 50,   if r_ij ≤ 50
             = 0,               otherwise         (4.5)

Figure 4.7 shows the mean feature score and 90% CI across all 10 folds. A score of 1.0 indicates the feature was ranked first in all folds (ranks are indexed from 0), while a score of 0.0 indicates a feature was not selected within the first 50 features in any fold. The threshold of 50 was chosen because most models reached their maximum performance by 50 features. Note that unlike the ablation analysis, the feature score is agnostic to the model, as it is derived only from the correlations between features and labels.

As with Figure 4.6, features from the acoustic, demographic, parts of speech and info-unit groups scored highly, as did context-free-grammar features. One feature from the psycholinguistic group, "imagability," scored highly, but this is likely due to its correlation with the info-units, which are all highly imageable. Noun phrase to personal pronoun, a context-free-grammar feature which measures personal pronoun usage, also scores highly, in agreement with previous literature demonstrating that patients with dementia have an increased rate of pronoun usage [30]. An unanticipated result is that the most highly ranked feature is not age but mean word length,

[Figure 4.6 groups the features as: Cfg (44), Syntactic Complexity (27), Psycholinguistic (5), Vocabulary Richness (4), Repetitiveness (5), Acoustics (172), Demographic (1), Parts of Speech (16), Info-Units (40).]

Figure 4.6: The mean change in performance across models when a feature group is removed and the model is retrained. A greater decrease in performance indicates a more significant feature group. The number of features within each group is listed in parentheses after each group name. The Acoustic, Demographic, Parts of Speech and Information Content groups are important while Syntactic Complexity, Psycholinguistic and Vocabulary Richness are not.
Large error bars indicate that the change in performance varies quite significantly between folds.

which has a perfect score of 1.0. Increased pronoun usage and a bias towards shorter, less imageable words suggest vague and non-specific speech.

4.4.2 Classification Performance With Novel Feature Sets

Now that we have verified the performance of our system using features from prior work, we evaluate the halves, strips, quarters and discourse feature sets. Each feature set was added in isolation (i.e., separately from the other three sets) to the existing features, and the feature selection and model training steps were rerun. Quadratic cross terms were added to each feature set, but resulted in worse performance for strips, quarters and discourse. Thus these results do not use quadratic terms for those sets. Figure 4.8 shows the performance of each model with the additional feature set. Adding the halves features improves the F-measure of the best classifier from 0.824 (90% CI=0.798-0.850) to 0.846 (90% CI=0.813-0.878). The strips set has the second largest improvement to logistic regression, improving the F-measure to 0.833 (90% CI=0.801-0.866). Quarters and discourse had a negligible effect on most models.

Figure 4.9 shows the change in performance with the added feature sets. The effect of the halves features on the suboptimal models is mixed: halves improves K-Nearest Neighbours (KNN), and the confidence interval around the SVM is too large to be considered reliable. Halves hurts the other two models, Random Forests and Gaussian Naive Bayes. However, when the quadratic terms are removed, the performance of Random Forests and Gaussian Naive Bayes is no longer decreased by the inclusion of the halves features. A plot of the change in performance without quadratic features is shown in the supplemental material, as are plots for AUC and accuracy with the new features.

Last, we see how the feature score for halves compares to other features in Figure 4.10.
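For reference, the fold-averaged feature score of equation 4.5 reduces to a one-line helper (hypothetical code, using 0-indexed ranks as in the thesis):

```python
def mean_feature_score(ranks, k=50):
    """Average the per-fold score 1 - r/k (0 if r > k) over all folds."""
    return sum((1 - r / k) if r <= k else 0.0 for r in ranks) / len(ranks)

# A feature ranked first in every fold scores 1.0;
# a feature never selected within the top k scores 0.0.
```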
The highest scoring feature across all features is perception: rightside, which measures the fraction of info-units the patient recognized on the right side of the image. concentration: rightside, attention: rightside, and perception: leftside also score highly and have smaller confidence intervals than most other features. In Figure 4.11 we see box plots for the four highest scoring features: perception: rightside, mean word length, age, and noun phrase to personal pronoun, with control interviews in blue. Respondents with dementia are less perceptive on their right side than healthy controls, they use more pronouns and shorter words on average, and they are older.

4.5 Discussion

In this chapter we proposed and evaluated four feature sets: three that measure spatial neglect across different partitions of the Cookie Theft image, and discourse features which measure the overall coherence of a patient's response. We showed that by partitioning the Cookie Theft image into two halves and measuring four simple metrics of spatial neglect (attention, concentration, repetition, and perception, plus their quadratic cross terms), we improve the F-measure of the best classifier, logistic regression, by 2.2 points, from 82.4% (90% CI=79.8-85.0) to 84.6% (90% CI=81.3-87.8). One spatial neglect feature, Perception: Rightside, was more highly correlated with a dementia diagnosis than all other features, including age. Improvements were seen in a number of models, although the addition of quadratic cross terms hurt some suboptimal models. Thus, the inclusion of quadratic cross terms should be considered model dependent.

Interestingly, the strips partition also improved the performance of logistic regression (although not as much as halves) while quadrants did not. This finding agrees with the medical literature, which has shown patients with AD tend to exhibit spatial neglect on one side of their visual field [14, 36, 37, 52, 53, 77].
Spatial neglect is not known to cause inattention to the top or bottom of an image, and therefore the quadrants partition did not improve classification performance. Our system was also able to detect other known linguistic deficits of AD patients, namely that they tend to use personal pronouns and shorter words more often than healthy counterparts.

Our main negative finding was that discourse features do not improve classification accuracy across the five models we tested. This is likely due to the structure of the Cookie Theft description task. Unlike the narrative retelling task of the Wechsler Logical Memory I/II test, which involves a patient retelling a short story, the Cookie Theft description task is more of a "checklist" of potential items to be noticed. Therefore there is less opportunity for a patient's response to be more or less coherent than a healthy control's. We therefore conclude that while discourse features are not useful in discriminating dementia from controls on the Cookie Theft test, they may be useful in longer and less structured narratives, such as the blog data set discussed in Chapter 6.
In that context, a speaker has an opportunity to use a larger set of discourse relations to connect one statement to the next.

[Figure 4.7 ranks the individual features by importance score, coloured by feature group: cfg, syntactic_complexity, parts_of_speech, repetitiveness, information_content, psycholinguistic, acoustics, demographic.]

Figure 4.7: Feature importance score is calculated by equation 4.5.
A score of 1.0 indicates the feature was selected first in all 10 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Feature ranking does not depend on any particular model and is based only on the correlation between the feature and the binary labels. Mean word length, age, and noun phrase to personal pronoun are the highest scoring features on the DementiaBank data set.

Figure 4.8: For each of the new feature sets we show the mean F-measure across five models. We compare against 'baseline', which is the performance of the existing system without the new feature set. Halves improves the best model, logistic regression, from 0.824 (90% CI=0.798-0.850) to 0.846 (90% CI=0.813-0.878). Strips improves logistic regression as well, to 0.833 (90% CI=0.801-0.866), although not as much as halves. Quarters and discourse have a negligible effect on the performance of the best classifier.

Figure 4.9: For each of the new feature sets we show the change in mean F-measure across five models when the new feature set is added. While halves improves the performance of the best classifier (logistic regression), it has mixed results on the suboptimal classifiers. Large error bars indicate the change in performance varies quite drastically between folds.
Discourse features have no effect.

[Figure 4.10 ranks the individual features by importance score with the halves features included; Perception: Rightside, Attention: Rightside, Perception: Leftside, Concentration: Rightside and Attention: Leftside now appear among the top-ranked features.]

Figure 4.10: Feature importance score is calculated as shown in equation 4.5, with the addition of the halves features.
A score of 1.0 indicates the feature was selected first in all 10 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Perception: Rightside receives an almost perfect score, scoring more highly than mean word length, age, and noun phrase to personal pronoun from Figure 4.7. Three other halves features, Concentration: Rightside, Attention: Rightside and Perception: Leftside, also score highly.

Figure 4.11: Box plots of the top four features from Figure 4.10. Top left shows right-side perceptivity, top right shows age, bottom left shows noun phrase to personal pronoun (a measure of how often the patient uses personal pronouns), and bottom right is mean word length. Those with dementia are less perceptive on the right side of their visual field than controls, as well as being older and more likely to use personal pronouns and shorter words.

Chapter 5

Detecting Mild Cognitive Impairment with Domain Adaptation

While much work has focused on Alzheimer's disease (AD), comparatively little attention has been paid to mild cognitive impairment (MCI). Given that it is a more heterogeneous condition and is associated with less impairment than AD, people with MCI may not receive medical attention until they develop a more profound cognitive impairment. Thus, there are fewer available spoken language samples from people with MCI than from patients with AD. The relative paucity of MCI data compared to AD therefore makes it difficult to build a diagnostic model for MCI.
Given that people with MCI have a greater potential benefit from further assessment and therapy than those who have progressed to dementia, a model that could make optimal use of the limited available data could be very useful.

This chapter demonstrates how domain adaptation can be used to exploit available AD data, thereby improving detection of MCI from spoken language samples. We compare two simple domain adaptation algorithms, AUGMENT and CORAL, and show that AUGMENT improves upon all baselines. These algorithms are discussed in detail in Sections 5.1.1 and 5.1.2. Our work differs from the previous work on MCI described in Section 2.2 in several ways. We use the feature set proposed by Fraser et al. [26], which is larger than the feature sets of Roark et al. [63], Tóth et al. [76], and Satt et al. [65]. Unlike Roark et al. [63], we used MCI data collected from DementiaBank (as described in Section 3.1.1), where patients undergo a picture description task rather than a narrative retelling task. Most significantly, the goal of our study was different: while previous research in this area has focused on MCI detection (either with manual transcriptions or using automatic speech recognition (ASR)), the goal here was to demonstrate the viability of using a domain adaptation algorithm to overcome the lack of MCI data [65, 76]. We begin with a brief discussion of domain adaptation.

5.1 Domain Adaptation

Domain adaptation is a general term for a variety of techniques aimed at exploiting resources in one domain (the source domain) in order to improve performance on some task in a second domain (the target domain).
This is typically done when the target domain has little or no labelled data, while the source domain has a relatively large amount of labelled data, as well as existing models trained on that data. Typically the source data have been annotated for some phenomenon of interest, and the target data relate to another phenomenon that is very similar.

The issue of domain adaptation has received increasing attention in recent years. In work by Chelba and Acero [13], the source model is used to derive priors for the weights of the target model. They employ this technique with a maximum entropy model and apply it to the task of automatic capitalization of uniformly-cased data. They report that adaptation yields a relative improvement of 25-30% in the target domain.

Blitzer et al. [6] introduced Structural Correspondence Learning (SCL), in which relationships between features in the two domains are determined by finding correlations with so-called pivot features, which are features exhibiting similar behaviour in both domains. They used SCL to improve the performance of a parser applied to biomedical data, but trained on Wall Street Journal data.

Daume [17] introduced an approach wherein each feature is copied so that there is a source version, a target version and a general version of the feature. He showed that this straightforward approach could yield improvement on a variety of NLP sequence labeling problems, such as named entity recognition, shallow parsing and POS tagging. More recently, Sun et al. [72] proposed CORAL, a method which aligns the second-order statistics of the source and target domain. We have implemented these two approaches, and describe them in more detail below.

5.1.1 AUGMENT

Daume III's AUGMENT domain adaptation algorithm is simple ("frustratingly" so [17]) and has been shown to be effective on a wide range of data sets.
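In NumPy, the feature copying just described amounts to stacking zero-padded blocks. This is a minimal sketch (our own code, not the thesis's implementation), with blocks ordered common, target-only, source-only:

```python
import numpy as np

def augment(X_src, X_tgt):
    """Return [common | target-only | source-only] copies of the features."""
    src = np.hstack([X_src, np.zeros_like(X_src), X_src])
    tgt = np.hstack([X_tgt, X_tgt, np.zeros_like(X_tgt)])
    return np.vstack([src, tgt])

X_aug = augment(np.ones((4, 3)), 2 * np.ones((2, 3)))  # (6, 9) augmented matrix
```

The augmented matrix is then passed unchanged to any standard learner.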
It augments the feature space by making a "source-only", "target-only", and "common" copy of each feature, as seen below.

    [ X_s ]            [ X_s   0     X_s ]
    [ X_t ]     ⇒      [ X_t   X_t   0   ]         (5.1)
     (n × d)                 (n × 3d)

Here X_s ∈ R^(n_s × d) and X_t ∈ R^(n_t × d) are matrices of source and target data, where each of the n rows is an observation, each of the d columns is a feature, n = n_t + n_s, and n_t ≪ n_s. We create three copies of each column: a source-only column with zeros in the target rows, a target-only column with zeros in the source rows, and the original column with both target and source entries left untouched. This augmented data set is then fed into a standard learning algorithm.

The motivation for this transformation is intuitive. If a column contains a feature (such as mean word length) which correlates with a diagnosis in both the target and source data (i.e., MCI and AD), a learning algorithm will increase the weight on the common column and reduce the weights on the target-only and source-only copies, thereby reducing their importance in the model. However, if a feature correlates with a diagnosis only in the MCI data, a learning algorithm can increase the weight of the target-only column (which contains zeros for all the source data) and reduce the weights of the original and source-only columns, thereby ensuring the feature will be less relevant to the model when applied to Alzheimer's data. By expanding the feature space and padding with zeros, a model can learn whether to apply a given feature on zero, one, or both data sets.

Although not explicitly stated in the original paper, AUGMENT assumes the model learns a weight vector (e.g. logistic regression, SVM) so as to select the appropriate copy of the feature. Because of this we expect that models that classify without learning weights (e.g.
KNN, Naive Bayes, Random Forests) will not improve under AUGMENT's feature transformation.

5.1.2 CORAL

CORAL (CORrelation ALignment) is another recently proposed "frustratingly easy" [72] domain adaptation algorithm that works by aligning the covariances of the source and target features. The algorithm first normalizes the source data to zero mean and unit variance, and then a whitening transform is performed on the source data to remove the correlation between the source features. A whitening transform is a linear transformation of the feature space such that the covariance of the transformed feature space is the identity matrix. We use PCA whitening on the source data as follows:

    Σ_s  = E[X_s X_s^T] − E[X_s] E[X_s]^T = Q D Q^T
    W    = Q D^(−1/2) Q^T
    X̂_s  = W X_s

That is, we first take the eigenvalue decomposition of the covariance matrix of the (zero-meaned) source data. Then we form the whitening matrix from the eigenvectors and the inverse square roots of the eigenvalues. This results in the whitened source data having an identity covariance (since the source data is zero-meaned, E[X_s] = 0 and the mean terms vanish):

    cov(X̂_s) = E[X̂_s X̂_s^T] − E[X̂_s] E[X̂_s]^T
             = E[W X_s X_s^T W^T]
             = W E[X_s X_s^T] W^T
             = W Σ_s W^T
             = Q D^(−1/2) Q^T · Q D Q^T · Q D^(−1/2) Q^T
             = Q D^(−1/2) D^(1/2) D^(1/2) D^(−1/2) Q^T
             = Q Q^T
             = I

Finally, the source matrix is "recoloured" with the correlations from the target data using colour matrix W_t:

    Σ_t  = E[X_t X_t^T] − E[X_t] E[X_t]^T = Q_t D_t Q_t^T
    W_t  = Q_t D_t^(1/2) Q_t^T
    X̃_s  = W_t X̂_s

    cov(X̃_s) = E[X̃_s X̃_s^T]
             = E[W_t X̂_s X̂_s^T W_t^T]
             = W_t E[X̂_s X̂_s^T] W_t^T
             = W_t I W_t^T
             = Q_t D_t^(1/2) Q_t^T · Q_t D_t^(1/2) Q_t^T
             = Q_t D_t Q_t^T
             = Σ_t

These three steps are shown in Figure 5.1. A model is then trained on the recoloured source data and used to classify the target data.

Figure 5.1: The CORAL algorithm is shown in three steps. The target and source data sets consist of three features: x, y, z. In a) the source data and target data are normalized to unit variance and zero mean, but have different covariances.
b) The source data is whitened to remove the correlations between features. c) The source data is recoloured with the target domain's correlations and the two data sets are aligned. A classifier is then trained on the re-aligned source data. (Figure adapted from [72])

5.2 Data Set

The DementiaBank data set, described in detail in Section 3.1.1, contains 43 MCI samples from 19 patients, 257 possible/probable AD samples, and 242 control samples. We split the data set into "target" data (86 rows, 43 MCI, 41 control) and "source" data (458 rows, 257 possible/probable AD, 201 control). Multiple interviews from a single control patient were confined to either the target or the source data set, but not both.

5.3 Baseline, Experiments, Results

We followed the experimental design described in Chapter 3, using an augmented feature space for AUGMENT and CORAL. We compare against three domain adaptation baselines. Target only trains the model using only target data; source only trains a model using only source data but evaluates on the target data. In the relabeled source model, we pool the target data and source data in the training folds and relabel AD to MCI. Along with the domain adaptation baselines we included one baseline model, majority class, which predicts the majority class in the training fold.

The test set contained only MCI data. In the AUGMENT, CORAL, and relabeled approaches, each fold of the training set contained a combination of MCI+AD data, while the source only baseline contained only AD in the training fold. Our goal was to verify whether the accuracy achieved by using these domain adaptation methods outperforms the accuracy achieved by using MCI data alone.

Figure 5.2: Comparison of two domain adaptation methods, AUGMENT and CORAL, against three domain adaptation baselines and one model baseline (dummy classifier which predicts the majority class in the training fold).
Mean F-measure and 90% CI are shown across 10 folds. Only target data appears in the test fold. AUGMENT with logistic regression outperforms all baselines. CORAL does not improve either model above the majority class baseline.

In Figure 5.2 we show the F-measure for models with a weight vector (SVM and logistic regression). The AUGMENT domain adaptation algorithm with logistic regression performed best (F-measure: 0.717, 90% CI=0.562-0.871), beating all three domain adaptation baselines and the dummy classifier. AUGMENT also improved the SVM classifier over baselines, although the performance (F-measure: 0.664, 90% CI=0.533-0.796) did not match logistic regression. CORAL does not improve either model beyond the simple majority class baseline model, and for logistic regression it results in worse performance than the target only domain adaptation baseline.

Figure 5.3 shows both methods on three models (Naive Bayes, Random Forests, K-Nearest Neighbours (KNN)) which do not classify using a weight vector. As expected, AUGMENT fails to improve any of these models and actually makes their performance worse than the target only baseline. This underscores the importance of only using the AUGMENT method with a model that is able to select, via the weight vector, which of the three copies of a feature to use.

5.4 Discussion

This chapter showed how we can use a simple domain adaptation algorithm, AUGMENT, to use AD data to overcome the scarcity of MCI data. Using AUGMENT we improved the F-measure of logistic regression from 66.7% (90% CI=50.5-82.9), using MCI data only, to 71.7% (90% CI=56.2-87.1), using both MCI and AD data. AUGMENT requires a simple modification of the target and source feature space, and can be easily extended to incorporate source data from multiple domains.

We also showed that AUGMENT only works with classifiers that learn a weight vector. This is an important caveat that was not explicitly stated in the original paper by [17].
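The transformation itself is only a few lines of code. Below is a minimal numpy sketch of the augmentation in Equation 5.1; the function name and column ordering are our own choices, and any implementation that applies the same three copies consistently at training and test time is equivalent.

```python
import numpy as np

def augment_features(X_src, X_tgt):
    """EasyAdapt-style augmentation: copy each feature into
    [common | target-only | source-only] columns, padding the
    copy that does not apply to a given domain with zeros."""
    zeros_src = np.zeros_like(X_src)
    zeros_tgt = np.zeros_like(X_tgt)
    X_src_aug = np.hstack([X_src, zeros_src, X_src])  # [X_s, 0, X_s]
    X_tgt_aug = np.hstack([X_tgt, X_tgt, zeros_tgt])  # [X_t, X_t, 0]
    return X_src_aug, X_tgt_aug
```

The stacked rows of the two augmented matrices can then be fed to any weight-learning classifier (e.g. scikit-learn's LogisticRegression); target test samples are augmented in the same way as the target training data.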
Practitioners should be cautious about applying AUGMENT to models that do not learn a weight vector, such as KNN or Gaussian Naive Bayes, because doing so can actually decrease performance.

The main negative result was the performance of the CORAL domain adaptation method with logistic regression (F-measure: 56.5%, 90% CI=40.3-72.8), which is worse than the target-only method. In other words, using CORAL results in worse performance than not doing domain adaptation at all.

Figure 5.3: Performance of two domain adaptation methods, AUGMENT and CORAL, on classifiers that do not learn a weight vector. AUGMENT does poorly in this setting because the models are unable to choose between the "target only", "source only", or "both" version of each feature.

It has previously been found that CORAL does not always work well with boolean features such as bag-of-words features [72]. Info-units, which we see in Figures 4.6 and 4.7 to be strong predictors of dementia, are largely boolean.

Chapter 6

Detecting Dementia From Written Text

Chapters 4 and 5 have focused on spoken language collected from patients undergoing a clinical examination. This data is expensive to collect and does not accurately reflect how patients use language in daily life. Perhaps most importantly, as millennials and "iGens" continue to age and use the internet, the predominant source of language samples from those with dementia will not be spoken, but written.

There were 173 million blogs on the web in 2011, only twenty years after the first website was launched in 1991 [70]. As these bloggers enter their senescence, a percentage of them will be diagnosed with dementia, and a percentage of those will continue to use the internet. There will therefore be a growing data set available in the form of tweets, blog posts, and social media comments with which to train a classifier.
Provided these writers have a verified clinical diagnosis of dementia, such a data set would be large, inexpensive to acquire, easy to process, and would require no manual transcriptions. Unlike spoken speech, written text will contain fewer instances of the subject being "flustered" by potential word-finding difficulties and other time-dependent performance issues. This might therefore make it possible to detect subtle lexical, grammatical, or pragmatic issues that may be missed in spoken language.

There are downsides to using written language samples as well. Unlike spoken language, written text can be edited or revised by oneself or others. People with dementia may have "good days" and "bad days," and may write only on days when they are feeling lucid. Thus, written samples may be biased towards more intact language. Furthermore, researchers do not have an audio recording to accompany the text, and patients are not constrained to a single topic; people with dementia may have greater facility discussing familiar topics. A non-standardized data set will also prevent the collection of common test-specific linguistic features such as info-units. However, working with a very large data set may be able to mitigate the effects of these limitations. Additionally, since substantial amounts of data can be collected for the same person, more accurate, user-specific longitudinal predictions might be possible.

In this chapter we present the first attempt at automatically detecting whether a blog post was written by an individual with dementia. We followed the general methodology described in Chapter 3 with a different data set, described in Section 6.1. The goal was to determine if this task is possible given the constraints listed above, and also to determine if the features most discriminating in the written case are the same as in the spoken case.
We make our data set publicly available at https://github.com/vadmas/blog_corpus.

6.1 Data Set

We scraped the text of 2805 posts from 6 public blogs, as described in Table 6.1. Three blogs were written by persons with dementia (first blogger: male, Alzheimer's disease (AD), age 72; second blogger: female, AD, age 61; third blogger: male, Dementia with Lewy Bodies, age 65) and three were written by family members of persons with dementia, to be used as controls (all female, ages unknown). Other demographic information, such as education level, was unavailable. From each of the three dementia blogs, we manually filtered out all texts not written by the owner of the blog (such as fan letters) and posts containing more images than text. This left us with 1654 samples written by persons with dementia and 1151 from healthy controls. Control blogs were written by children, spouses, or caregivers of seniors with dementia and were selected to control for topic and previous level of writing experience.

URL (http://*.blogspot.ca)  Posts  Mean words         Start Date  Diagnosis  Gender/Age
living-with-alzhiemers      344    263.03 (s=140.28)  Sept 2006   AD         M, 72 (approx)
creatingmemories            618    242.22 (s=169.42)  Dec 2003    AD         F, 61
parkblog-silverfox          692    393.21 (s=181.54)  May 2009    Lewy Body  M, 65
journeywithdementia         201    803.91 (s=548.34)  Mar 2012    Control    F, unknown
earlyonset                  452    615.11 (s=206.72)  Jan 2008    Control    F, unknown
helpparentsagewell          498    227.12 (s=209.17)  Sept 2009   Control    F, unknown

Table 6.1: Blog information as of April 4th, 2017

6.2 Experimental Design

We followed the general methodology described in Chapter 3, using the blog data set instead of the DementiaBank data set. We use the features described in Section 3.2, with the exception of the acoustic and info-unit feature groups, which were not available for blog data. In total we extract 102 features from each blog post with a binary label indicating whether or not the author has dementia.
We performed a 9-fold cross validation across all pairs of blogs with opposite labels. Each test fold contains all posts from one dementia blog and one control blog, and the posts from the remaining four blogs are used in the training fold. As with the previous experiments, we run a feature selection step within each training fold, as described in Section 3.3. We report Accuracy (ACC), F-measure (FMS), and Area Under the Curve (AUC) for each model and compare against a dummy classifier that predicts the majority class label in the training fold.

6.3 Results

Figures 6.1 and 6.2 show the feature selection curve and the final peak classification performance, respectively. Unlike with DementiaBank, all models reach near-optimal performance at around 10 features, after which performance either levels off or improves slightly as more features are added. The best model on the blog data set is K-Nearest Neighbours (KNN) (ACC: 0.728, 90% CI=0.687-0.769, AUC: 0.761, 90% CI=0.714-0.807, FMS: 0.785, 90% CI=0.746-0.823), which slightly beats logistic regression (ACC: 0.724, 90% CI=0.677-0.770, AUC: 0.759, 90% CI=0.689-0.829, FMS: 0.785, 90% CI=0.743-0.827) and has tighter error bars. All models beat the baseline AUC of 0.50.

We ran the same ablation analysis on the blog data set as we performed on DementiaBank (Section 4.4). The results are shown in Figure 6.3. Unlike with the DementiaBank data set, psycholinguistic features are the most important feature group, with their ablation causing the performance of all models to drop significantly. Somewhat unexpectedly, the removal of the other feature groups causes a slight improvement in the best classifier, KNN, although the improvement is within the error bars in all cases and is not seen in logistic regression, the near-optimal classifier.

Figure 6.4 shows the scores for each feature, as calculated by Equation 4.5.
A score of 1.0 indicates the feature was selected first in all 9 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. SUBTL word score, which is a measure of how frequently a word is used in daily life, is the most highly correlated with a dementia diagnosis across all 9 folds. The number of sentences per post is also highly correlated with a diagnosis. As with the DementiaBank data set, both mean word length and noun phrase to personal pronoun also score highly. We also observe that the error bands are larger for most features than in Figure 4.7, indicating the correlation between feature and label has a greater dependence on the particular training fold (higher variance in the bias-variance tradeoff) than in the DementiaBank data set.

Finally, Figure 6.5 shows the box plots of the four highest scoring features: SUBTL word score, number of sentences, mean word length, and noun phrase to personal pronoun. The top left box plot shows that bloggers with dementia (red) have a higher SUBTL word score on average. SUBTL is a measure of how frequently a word is used in daily life, with a higher score indicating a more ordinary word and a lower score indicating a less common one. Scores are derived from television and film subtitles. In the six blogs in our data set, bloggers with dementia use more ordinary (i.e. more frequently occurring) words than their control counterparts. They also tend to write shorter blog posts, and in agreement with the DementiaBank data set, use shorter words and more personal pronouns.

Figure 6.1: We show Area Under the Curve (AUC) for each model as we vary the number of features. Error bars for each model show 90% CI across all 9 folds. We use two plots so error bars are distinguishable.
All models beat the dummy classifier (majority class), with KNN achieving the best performance (ACC: 0.728, 90% CI=0.687-0.769, AUC: 0.761, 90% CI=0.714-0.807, FMS: 0.785, 90% CI=0.746-0.823).

Figure 6.2: We show mean Accuracy (ACC), F-measure (FMS), and Area Under the Curve (AUC) for each model at its optimum number of features (i.e. the peak performance in Figure 6.1). Error bars for each model show 90% CI across all 9 folds. All models beat the dummy classifier (majority class), with KNN achieving the best performance (ACC: 0.728, 90% CI=0.687-0.769, AUC: 0.761, 90% CI=0.714-0.807, FMS: 0.785, 90% CI=0.746-0.823).

[Figure 6.3 plots the change in AUC per model when each feature group is ablated: Cfg (44), Syntactic_Complexity (27), Psycholinguistic (5), Vocabulary_Richness (4), Repetitiveness (5), Parts_Of_Speech (16).]

Figure 6.3: As with Figure 4.6, we show the mean change in performance across models when a feature group is removed and the model is retrained. A greater decrease in performance indicates a more significant feature group. The number of features within each group is listed in parentheses after each group name.
Unlike with the DementiaBank data set, all feature groups are important to the prediction accuracy, with the removal of the psycholinguistic group having the greatest deleterious effect across all models.

[Figure 6.4 ranks each of the 102 features by score, coloured by feature set: cfg, vocabulary_richness, syntactic_complexity, parts_of_speech, repetitiveness, psycholinguistic.]

Figure 6.4: Feature importance score for the blog data set, as calculated by Equation 4.5. A score of 1.0 indicates the feature was selected first in all 9 folds, while a score of 0.0 indicates the feature was not selected within the top 50 features in any fold. Feature ranking does not depend on any particular model and is based only on the correlation between the feature and the binary labels.
SUBTL Word Score, Number of Sentences, Mean Word Length, and Noun Phrase to Personal Pronoun are the highest scoring features on the data set.

Figure 6.5: Box plots of the four highest scoring features in Figure 6.4: SUBTL Word Score (top left), Mean Word Length (top right), Noun Phrase to Personal Pronoun (bottom left), and Number of Sentences (bottom right). Blogs written by persons with dementia are red and controls are blue. As in the spoken case, persons with dementia use personal pronouns more often and use shorter words on average. Bloggers with dementia also have a higher SUBTL score (indicating an impoverished vocabulary) and write shorter posts.

6.4 Discussion

This chapter demonstrated how dementia can be automatically detected from written text in the form of blog posts. We collected a data set of 2805 blog posts written by either persons with dementia or family members of persons with dementia. We then extracted 102 lexical features from each post and evaluated the performance of five classifiers in detecting whether the author of a post from an unseen blog has dementia. KNN beat the other models, and the baseline classifier, with an AUC of 0.761 (90% CI=0.714-0.807).

We also observed that bloggers with dementia tend to use fewer uncommon words (as indicated by the higher SUBTL score), and write shorter posts using shorter words, on average.
This finding is interesting because it would be difficult for a human reading a single post to detect the simplified language (provided the post was coherent, which from inspection all were), but a higher SUBTL score and shorter word length could be detected automatically given a collection of posts.

Given that the data set consisted of only six blogs, a larger data set is necessary to discern whether the change in language we've identified here is in fact due to dementia, or due to the idiosyncrasies of this particular data set. For example, we detected increased pronoun usage by bloggers with dementia compared to controls. This agrees with previous work and our results from Chapter 4, but it could also be due to the fact that the bloggers in the dementia blogs were writing about their own experience while the authors in the control blogs were writing about someone else's experience, and hence may use fewer pronouns.

Despite the limitations of a comparatively small data set and the difficulties associated with analyzing written text (including the author's ability to make revisions and make use of third-party editors), we have shown it is possible to detect dementia from written text. This opens the door to making use of the upcoming deluge of online text written by seniors suffering from cognitive decline as data on which to train machine learning models.

Chapter 7

Conclusion

Early detection of dementia is important, not only for patients, for whom a diagnosis is the first step to receiving adequate support, but for researchers, who say that early detection will be crucial to finding a cure [3]. There have recently been successes in using machine learning and natural language processing techniques to automatically detect dementia from speech. This thesis has made three main contributions towards this effort.

First, we proposed a novel set of biologically motivated features that we call spatial neglect features.
These features measure whether the respondent is more perceptive on one side of their visual field than the other. We showed their inclusion increases the F-measure of logistic regression from 82.4% to 84.6%. This achieves a new state of the art on the DementiaBank data set, beating the previous state of the art of 81.92%. We considered three different partitions of the Cookie Theft image (halves, strips, and quarters) and found that halves performs best, in agreement with previous findings in the medical literature. Previous work has found that patients with AD show differences in discourse structure, and so we also evaluated the effect of discourse features on model performance, but found they had no effect on the DementiaBank data set.

Second, we demonstrated how a simple domain adaptation algorithm can be used to overcome the lack of available mild cognitive impairment (MCI) data. We compared two "frustratingly simple" domain adaptation algorithms that used AD data to improve the accuracy of MCI detection, and found that AUGMENT beats all baselines and improves the F-measure from 66.7% using only MCI data to 71.7% using MCI + AD.

Last, we evaluated our framework on written data in the form of blog posts. It is not obvious that a system that can detect dementia from spoken language could do the same for written language, given that one can make revisions to text but cannot do so for recorded extemporaneous speech. We show that a range of models can predict whether the author of a blog post has dementia at a rate far above baselines. KNN achieved a maximum AUC of 0.761, beating the baseline of 0.50 by a wide margin. Additionally, we make the blog corpus used in our experiments publicly available for future researchers.

Besides the main contributions listed above, we made some observations that will be useful to practitioners and help guide future work.
For practitioners, we recommend the use of spatial neglect features (with a halves partition) as they increase the performance of most models, but we recommend including them both with and without quadratic terms in order to determine which performs best. In the case of logistic regression, the addition of the quadratic terms improved the performance significantly, but for Random Forests and Gaussian Naive Bayes the quadratic terms hurt the performance significantly. This is likely due to the fact that many of the quadratic features were uninformative and some models are less capable of dealing with an excess of uninformative features than others.¹

¹For example, Random Forests randomly chooses a subset of the features at each node, meaning that the inclusion of a large number of uninformative features reduces the probability that an informative feature will be selected at each node. Similarly, the Naive Bayes classifier assumes conditional independence between all features, so including uninformative features (e.g. p(f_d | y_n = 0) ≈ p(f_d | y_n = 1)) will hurt the probability of a label being classified correctly by biasing all probabilities towards 0.5.

Another observation regards the use of AUGMENT for domain adaptation. As noted in Section 5.3, AUGMENT performs well when a model is able to select between the three copies of a feature (c.f. Section 5.1.1) via a weight vector. Models which are unable to do so, such as Random Forests, Gaussian Naive Bayes, and KNN, are negatively impacted by the augmentation of the feature space. Practitioners should bear this in mind when they use AUGMENT with their models.

We also reiterate that practitioners should consider whether a substantial portion of their features are sparse or binary before using CORAL. Sun et al. [72] also found CORAL performed poorly on text data sets, and they hypothesized it was due to the lack of correlation between the sparse bag-of-words features.
We also suspect the poor performance is due to centering the features as a preprocessing step, as centering destroys sparsity. Better results with CORAL may also be obtained with word embeddings, as they are less sparse.

7.1 Future Work

There are multiple directions we would like to take this work in the future. We discuss the spoken and written data sets separately for clarity.

7.1.1 Spoken

One aspect of the AUGMENT domain adaptation algorithm that was not used in this work is its ability to accommodate data from multiple source domains. This is done via a trivial extension to the standard feature space augmentation (c.f. Section 5.1.1) and would allow us to include source data from patients with vascular dementia, dementia with Lewy bodies, and other non-Alzheimer's dementias along with the AD data used in Chapter 5. Incorporating source data from multiple pathologies could potentially improve our diagnostic capabilities, but this has yet to be shown.

Rather than using AUGMENT to leverage data from multiple pathologies, we could use it to leverage data from multiple diagnostic exams. For example, we could potentially improve upon the results in Section 4.1 by using AUGMENT with data collected from the Narrative Retelling task from the Wechsler Logical Memory I/II test, or the blog data we discuss in Chapter 6. In these settings discourse features, which Chapter 4 showed were not predictive on the DementiaBank data set, may also be more useful given the narrative structure of the speech samples.

We suspect CORAL performed poorly on the DementiaBank data set because of the boolean info-unit features. A small modification to CORAL, where we align (c.f. Section 5.1.2) only the non-boolean features, could improve CORAL's performance.
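For reference, the whitening-and-recolouring procedure of Section 5.1.2 can be sketched in a few lines of numpy; aligning only the non-boolean features amounts to calling this on the appropriate column subset. The function name and the eps regularizer (added so the inverse square root stays well defined) are our own; rows here are observations, so the symmetric transforms are applied on the right.

```python
import numpy as np

def coral_align(X_src, X_tgt, eps=1e-6):
    """Sketch of CORAL: whiten the source features, then recolour
    them with the target covariance. eps regularizes near-zero
    eigenvalues so D**-0.5 is well defined."""
    Xs = X_src - X_src.mean(axis=0)   # centering (note: destroys sparsity)
    Xt = X_tgt - X_tgt.mean(axis=0)
    d = Xs.shape[1]
    # Eigendecompositions of the two (regularized) covariance matrices
    Ds, Qs = np.linalg.eigh(np.cov(Xs, rowvar=False) + eps * np.eye(d))
    Dt, Qt = np.linalg.eigh(np.cov(Xt, rowvar=False) + eps * np.eye(d))
    W = Qs @ np.diag(Ds ** -0.5) @ Qs.T    # whitening matrix for the source
    W_t = Qt @ np.diag(Dt ** 0.5) @ Qt.T   # recolouring matrix from the target
    return Xs @ W @ W_t
```

After this transform, the covariance of the returned source matrix matches the target covariance up to the eps regularization, so a classifier trained on it sees target-like feature correlations.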
Another potentially interesting area of future work would be to merge AUGMENT and CORAL into a single algorithm by adding a "CORAL-aligned" copy of each feature to the AUGMENT feature space.

7.1.2 Written

In Chapter 6 we confirmed it is possible to detect signs of dementia from blog posts. There were a few limitations of our approach that we would like to address in future work. First, the small size of our data set meant we were unable to differentiate between subtypes of dementia (e.g., Dementia with Lewy Bodies and AD). This is not desirable because different pathologies have different symptoms (cf. Tables 2.2 and 2.1). We would like to collect a larger data set to allow us to control for types of dementia, as well as demographic information such as age, gender, and education level, information that was not present for all the blogs in our study.

Another limitation of the above work is the unstructured nature of the text. Unlike with the DementiaBank data set, none of the bloggers were constrained to a single topic, beyond the general topic of "living with dementia". We could potentially improve our results by performing a topic clustering preprocessing step on the blog posts. After clustering we could either train a classifier separately for each cluster or include topic membership as a feature.

Topic clustering would also help us to better understand the differences we found in some linguistic markers between bloggers (cf. Figures 6.4 and 6.5). We observed bloggers with dementia have a higher SUBTL score (indicating an impoverished vocabulary) and shorter average word length compared to healthy controls. These findings need further investigation to confirm whether they are in fact due to dementia-induced aphasia, as the medical literature would predict. In Masrani et al. [50] we looked at the longitudinal change of one of these metrics, the SUBTL word score, to see if the bloggers became more symptomatic as the disease progressed.
Results were inconclusive, however, with the longitudinal trend of the SUBTL score moving in the direction opposite to what we expected. With topic clustering, we could track the longitudinal changes of certain linguistic markers within each topic, as well as the longitudinal changes in the topics themselves, to better understand the differences in writing style between the writers with and without dementia.

Finally, we hope to explore how aphasia manifests itself on different online platforms. Language is shaped by its environment, and linguistic features that are useful in classifying blog posts may not be useful for classifying tweets. Today only 34% of seniors use social media [10]. That number will surely rise as the Internet generation reaches adulthood and continues to use instant messengers, to comment on Facebook posts, and to converse on online forums. It therefore behooves us to understand how to detect signs of cognitive decline in these settings.

Bibliography

[1] M. Abdalla, F. Rudzicz, and G. Hirst. Rhetorical structure and Alzheimer's disease. Aphasiology, pages 1–20, 2017. → pages 28

[2] S. Ahmed, A.-M. F. Haigh, C. A. de Jager, and P. Garrard. Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease. Brain, 136(12):3727–3737, 2013. → pages 11

[3] A. Association. 2016 Alzheimer's disease facts and figures. https://www.alz.org/documents_custom/2016-facts-and-figures.pdf, 2016. Accessed: 2017-11-13. → pages 1, 5, 7, 62

[4] A. P. Association et al. Diagnostic and statistical manual of mental disorders (DSM-5). American Psychiatric Pub, 2013. → pages 6

[5] J. Birks and R. J. Harvey. Donepezil for dementia due to Alzheimer's disease. The Cochrane Library, 2006. → pages 9

[6] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. pages 120–128, July 2006. → pages 44

[7] L. X. Blonder, E. D. Kort, and F. A. Schmitt. Conversational discourse in patients with Alzheimer's disease.
Journal of Linguistic Anthropology, 4(1):50–71, 1994. → pages 27

[8] M. Brysbaert and B. New. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41:977–990, 2009. → pages 20

[9] R. S. Bucks, S. Singh, J. M. Cuerden, and G. K. Wilcock. Analysis of spontaneous, conversational speech in dementia of Alzheimer type: Evaluation of an objective technique for analysing lexical performance. Aphasiology, 14(1):71–91, 2000. → pages 19

[10] P. R. Center. Technology use among seniors. http://www.pewinternet.org/2017/05/17/technology-use-among-seniors/, 2017. Accessed: 2017-12-06. → pages 66

[11] L. W. Chambers, C. Bancej, and I. McDowell. Prevalence and monetary costs of dementia in Canada. The Alzheimer Society of Canada, 2016. → pages 1

[12] S. B. Chapman, H. K. Ulatowska, K. King, J. K. Johnson, and D. D. McIntire. Discourse in early Alzheimer's disease versus normal advanced aging. American Journal of Speech-Language Pathology, 4(4):124–129, 1995. → pages 27

[13] C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382–399, 2006. → pages 44

[14] M. M. Cherrier, M. F. Mendez, M. Dave, and K. M. Perryman. Performance on the Rey-Osterrieth complex figure test in Alzheimer disease and vascular dementia. Cognitive and Behavioral Neurology, 12(2):95–101, 1999. → pages 25, 36

[15] M. A. Covington and J. D. McFall. Cutting the Gordian knot: The moving-average type–token ratio (MATTR). Journal of Quantitative Linguistics, 17(2):94–100, 2010. → pages 19

[16] B. Croisile, B. Ska, M.-J. Brabant, A. Duchene, Y. Lepage, G. Aimard, and M. Trillet. Comparative study of oral and written picture description in patients with Alzheimer's disease. Brain and Language, 53(1):1–19, 1996. → pages 20

[17] H. Daumé. Frustratingly easy domain adaptation. 2007. → pages 44, 45, 50

[18] B. H. Davis.
So, you had two sisters, right? Functions for discourse markers in Alzheimer's talk. In Alzheimer Talk, Text and Context, pages 128–145. Springer, 2005. → pages 28

[19] K. Dijkstra, M. S. Bourgeois, R. S. Allen, and L. D. Burgio. Conversational coherence: Discourse analysis of older adults with and without dementia. Journal of Neurolinguistics, 17(4):263–283, 2004. → pages

[20] C. Ellis, A. Henderson, H. H. Wright, and Y. Rogalski. Global coherence during discourse production in adults: A review of the literature. International Journal of Language & Communication Disorders, 2016. → pages 28

[21] D. G. Ellis. Coherence patterns in Alzheimer's discourse. Communication Research, 23(4):472–495, 1996. → pages 27

[22] V. W. Feng. RST-style discourse parsing and its applications in discourse analysis. PhD thesis, University of Toronto, 2015. → pages 28

[23] L. K. Ferreira and G. F. Busatto. Neuroimaging in Alzheimer's disease: Current role in clinical practice and potential future applications. Clinics, 66:19–24, 2011. → pages 9

[24] S. H. Ferris and M. Farlow. Language impairment in Alzheimer's disease and benefits of acetylcholinesterase inhibitors. Clinical Interventions in Aging, 8:1007, 2013. → pages 2

[25] T. S. Field, V. Masrani, G. Murray, and G. Carenini. Improving diagnostic accuracy of Alzheimer's disease from speech analysis using markers of hemispatial neglect. Alzheimer's & Dementia: The Journal of the Alzheimer's Association, 13(7):P157–P158, 2017. → pages iv, 4

[26] K. C. Fraser, G. Hirst, N. L. Graham, J. A. Meltzer, S. E. Black, and E. Rochon. Comparison of different feature sets for identification of variants in progressive aphasia. ACL, page 17, 2014. → pages 12, 21, 30, 32, 33, 43

[27] K. C. Fraser, J. A. Meltzer, and F. Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2015. → pages 2, 12, 17, 30

[28] S. Gao, H. C. Hendrie, K. S. Hall, and S. Hui.
The relationships betweenage, sex, and the incidence of dementia and alzheimer disease: ameta-analysis. Archives of general psychiatry, 55(9):809–815, 1998. →pages 17[29] E. Giles, K. Patterson, and J. R. Hodges. Performance on the boston cookietheft picture description task in patients with early dementia of thealzheimer’s type: missing information. Aphasiology, 10(4):395–408, 1996.→ pages 15[30] D. B. Hier, K. Hagenlocker, and A. G. Shindler. Language disintegration indementia: Effects of etiology and severity. Brain and language, 25(1):117–133, 1985. → pages 3369[31] G. Hirst and V. Wei Feng. Changes in style in authors with alzheimer’sdisease. English Studies, 93(3):357–370, 2012. → pages 13[32] G.-Y. R. Hsiung, A. Donald, J. Grand, S. E. Black, R. W. Bouchard, S. G.Gauthier, I. Loy-English, D. B. Hogan, A. Kertesz, K. Rockwood, et al.Outcomes of cognitively impaired not demented at 2 years in the canadiancohort study of cognitive impairment and related dementias. Dementia andgeriatric cognitive disorders, 22(5-6):413–420, 2006. → pages 11[33] A. Hunt, P. Scho¨nknecht, M. Henze, U. Seidl, U. Haberkorn, andJ. Schro¨der. Reduced cerebral glucose metabolism in patients at risk foralzheimer’s disease. Psychiatry Research: Neuroimaging, 155(2):147–154,2007. → pages 9[34] M. Husain. Hemineglect. Scholarpedia, 3(2):3681, 2008. → pages ix, 25[35] A. D. International. Dementia statistics.https://www.alz.co.uk/research/statistics. Accessed: 2017-11-30. → pages 1[36] S. Ishiai, R. Okiyama, Y. Koyama, and K. Seki. Unilateral spatial neglect inalzheimer’s disease a line bisection study. Acta neurologica scandinavica,93(2-3):219–224, 1996. → pages 25, 36[37] S. Ishiai, Y. Koyama, K. Seki, S. Orimo, N. Sodeyama, E. Ozawa, E. Lee,M. Takahashi, S. Watabiki, R. Okiyama, et al. Unilateral spatial neglect inad significance of line bisection performance. Neurology, 55(3):364–370,2000. → pages 25, 36[38] S. Joty, G. Carenini, and R. T. Ng. 
Codra: A novel discriminative frameworkfor rhetorical analysis. Computational Linguistics, 2015. → pages x, 29[39] M. Kasai, J. Ishizaki, and K. Meguro. Alzheimer’s patients do not show leftunilateral spatial neglect but exhibit peripheral inattention and simplification.Dementia & Neuropsychologia, 1(4):374–380, 2007. → pages 25[40] S. Kemper, L. H. Greiner, J. G. Marquis, K. Prenovost, and T. L. Mitzner.Language decline across the life span: findings from the nun study.Psychology and aging, 16(2):227, 2001. → pages 13[41] B. Klimova and K. Kuca. Speech and language impairments in dementia.Journal of Applied Biomedicine, 14(2):97–103, 2016. → pages viii, 6, 7[42] A. Ko¨nig, A. Satt, A. Sorin, R. Hoory, O. Toledo-Ronen, A. Derreumaux,V. Manera, F. Verhey, P. Aalten, P. H. Robert, et al. Automatic speech70analysis for the assessment of patients with predementia and alzheimer’sdisease. Alzheimer’s & Dementia: Diagnosis, Assessment & DiseaseMonitoring, 1(1):112–124, 2015. → pages 13[43] A. Kumar et al. Dementia: An overview. Journal of Drug Delivery andTherapeutics, 3(3):163–167, 2013. → pages viii, 7, 8[44] V. Kuperman, H. Stadthagen-Gonzalez, and M. Brysbaert.Age-of-acquisition ratings for 30,000 english words. Behavior ResearchMethods, 44(4):978–990, 2012. → pages 19[45] L. Kurlowicz and M. Wallace. The mini mental state examination (mmse).https://www.mountsinai.on.ca/care/psych/on-call-resources/on-call-resources/mmse.pdf, 1999. Accessed: 2017-11-10. → pages 9[46] M. Laine, M. Laakso, E. Vuorinen, and J. Rinne. Coherence andinformativeness of discourse in two dementia types. Journal ofNeurolinguistics, 11(1):79–87, 1998. → pages 28[47] K. M. Langa and D. A. Levine. The diagnosis and management of mildcognitive impairment: a clinical review. Jama, 312(23):2551–2561, 2014.→ pages 10, 11[48] X. Le, I. Lancashire, G. Hirst, and R. Jokel. Longitudinal detection ofdementia through lexical and syntactic changes in writing: a case study ofthree british novelists. 
Literary and Linguistic Computing, 26(4):435–461,2011. → pages 13[49] J. S. Lin, E. O’Connor, R. C. Rossom, L. A. Perdue, B. U. Burda,M. Thompson, and E. Eckstrom. Screening for cognitive impairment inolder adults: an evidence update for the us preventive services task force.2013. → pages 11[50] V. Masrani, G. Murray, T. Field, and G. Carenini. Detecting dementiathrough retrospective analysis of routine blog posts by bloggers withdementia. BioNLP, pages 232–237, 2017. → pages 4, 65[51] V. Masrani, G. Murray, T. S. Field, and G. Carenini. Domain adaptation fordetecting mild cognitive impairment. In Canadian Conference on ArtificialIntelligence, pages 248–259. Springer, 2017. → pages iv, 4[52] M. F. Mendez, M. M. Cherrier, and J. S. Cymerman. Hemispatial neglect onvisual search tasks in alzheimer’s disease. Cognitive and BehavioralNeurology, 10(3):203–208, 1997. → pages 25, 3671[53] A. Milner, M. Harvey, R. Roberts, and S. Forster. Line bisection errors invisual neglect: misguided action or size distortion? Neuropsychologia, 31(1):39–49, 1993. → pages 25, 36[54] S. O. Orimaye, J. S.-M. Wong, and K. J. Golden. Learning predictivelinguistic features for alzheimer’s disease and related dementias using verbalutterances. In Proc. 1st Workshop. Computational Linguistics and ClinicalPsychology (CLPsych), 2014. → pages 12[55] A. Parton, P. Malhotra, and M. Husain. Hemispatial neglect. Journal ofNeurology, Neurosurgery & Psychiatry, 75(1):13–21, 2004. → pages 25[56] E. R. Peskind, S. G. Potkin, N. Pomara, B. R. Ott, S. M. Graham, J. T. Olin,S. McDonald, M. M.-M.-. S. Group, et al. Memantine treatment in mild tomoderate alzheimer disease: a 24-week randomized, controlled trial. TheAmerican journal of geriatric psychiatry, 14(8):704–715, 2006. → pages 9[57] R. C. Petersen. Mild cognitive impairment as a diagnostic entity. Journal ofinternal medicine, 256(3):183–194, 2004. → pages 10[58] R. C. Petersen. Mild cognitive impairment. 
CONTINUUM: LifelongLearning in Neurology, 22(2, Dementia):404–418, 2016. → pages 11[59] M. J. Prince. World Alzheimer Report 2015: the global impact of dementia:an analysis of prevalence, incidence, cost and trends. London, 2015. →pages 1[60] T. S. Ramachandran, S. Zachariah, and V. Agrawal. Alzheimer diseaseimaging. E-medicine. medscape/article, 336281, 2012. → pages 9[61] V. Rentoumi, L. Raoufian, S. Ahmed, C. A. de Jager, and P. Garrard.Features and machine learning classification of connected speech samplesfrom patients with autopsy proven alzheimer’s disease with and withoutadditional vascular pathology. Journal of Alzheimer’s Disease, 42(s3), 2014.→ pages 2, 12[62] K. P. Riley, D. A. Snowdon, M. F. Desrosiers, and W. R. Markesbery. Earlylife linguistic ability, late life cognitive function, and neuropathology:findings from the nun study. Neurobiology of aging, 26(3):341–347, 2005.→ pages 13[63] B. Roark, M. Mitchell, J.-P. Hosom, K. Hollingshead, and J. Kaye. Spokenlanguage derived measures for detecting mild cognitive impairment. IEEE72Transactions on Audio, Speech, and Language Processing, 19(7):2081–2090, 2011. → pages 12, 44[64] T. Salsbury, S. A. Crossley, and D. S. McNamara. Psycholinguistic wordinformation in second language oral discourse. Second Language Research,27(3):343–360, 2011. → pages 19[65] A. Satt, A. Sorin, O. Toledo-Ronen, O. Barkan, I. Kompatsiaris,A. Kokonozi, and M. Tsolaki. Evaluation of speech-based protocol fordetection of early-stage dementia. In INTERSPEECH, pages 1692–1696,2013. → pages 13, 44[66] E. Schwam and Y. Xu. Cognition and function in alzheimer’s disease:identifying the transitions from moderate to severe disease. Dementia andgeriatric cognitive disorders, 29(4):309, 2010. → pages 2[67] A. Shahid, K. Wilkinson, S. Marcu, and C. M. Shapiro. Mini-mental stateexamination (mmse). In STOP, THAT and One Hundred Other Sleep Scales,pages 223–224. Springer, 2011. → pages 9[68] D. A. Snowdon. 
Aging and alzheimer’s disease: lessons from the nun study.The Gerontologist, 37(2):150–156, 1997. → pages 13[69] A. Society. Ebixa (also known as memantine hydrochloride). http://www.alzheimer.ca/sites/default/files/files/national/drugs/drug ebixa 2008 e.pdf,2008. Accessed: 2017-11-13. → pages 9[70] Statistica. Number of blogs worldwide from 2006 to 2011 (in millions).https://www.statista.com/statistics/278527/number-of-blogs-worldwide/,2012. Accessed: 2017-11-23. → pages 52[71] G. B. Stokin, J. Krell-Roesch, R. C. Petersen, and Y. E. Geda. Mildneurocognitive disorder: an old wine in a new bottle. Harvard review ofpsychiatry, 23(5):368, 2015. → pages 9, 10[72] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domainadaptation. In AAAI, 2016. → pages xii, 45, 46, 48, 51, 63[73] G. Szatloczki, I. Hoffmann, V. Vincze, J. Kalman, and M. Pakaski. Speakingin alzheimers disease, is that an early sign? importance of changes inlanguage abilities in alzheimers disease. Frontiers in aging neuroscience, 7,2015. → pages 973[74] D. F. Tang-Wai and N. L. Graham. Assessment of language function indementia. Geriatrics, 11(2):103–110, 2008. → pages 6[75] P. N. Tariot, M. R. Farlow, G. T. Grossberg, S. M. Graham, S. McDonald,I. Gergel, M. S. Group, et al. Memantine treatment in patients with moderateto severe alzheimer disease already receiving donepezil: a randomizedcontrolled trial. Jama, 291(3):317–324, 2004. → pages 9[76] L. To´th, G. Gosztolya, V. Vincze, I. Hoffmann, G. Szatlo´czki, E. Biro,F. Zsura, M. Pa´ka´ski, and J. Ka´lma´n. Automatic detection of mild cognitiveimpairment from spontaneous speech using asr. In Sixteenth AnnualConference of the International Speech Communication Association, 2015.→ pages 13, 44[77] A. Venneri, R. Pentore, B. Cotticelli, and S. Della Sala. Unilateral spatialneglect in the late stage of alzheimer’s disease. Cortex, 34(5):743–752,1998. 
Appendix A

Supporting Materials

Table A.1: List of all features.

Parts-of-speech (16): Number of Nouns, Number of Verbs, Number of Not-in-Dictionary, Mean Word Length, Number of Adverbs, Number of Adjectives, Number of Determiners, Number of Interjections, Number of Inflected Verbs, Number of Coordinate Conjunctions, Number of Subordinate Conjunctions, Ratio Noun-to-Verb, Ratio Noun-to-Noun+Verb, Ratio Pronoun-to-Noun, Ratio Coordinate-to-Subordinate Conjunctions, Light Verb Count

Context-free-grammar rules (45) (using Penn Treebank POS tags): ADVP to RB, INTJ to UH, NP to DT NN, NP to PRP, ROOT to FRAG, VP to AUX, VP to AUX ADJP, VP to AUX VP, VP to VBD NP, VP to VBG, VP to VBG PP, CONJP, TTR, UCP, VP, Avg NP Type Length Embedded, Avg NP Type Length Non-Embedded, Avg PP Type Length Embedded, Avg PP Type Length Non-Embedded, Avg VP Type Length Embedded, Avg VP Type Length Non-Embedded, WHADJP, WHAVP, WHNP, WHPP, X, FRAG, INTJ, LST, NP Type Rate, VP Type Rate, PP Type Rate, P Proportion, NP Proportion, VP Proportion, NAC, NP, NX, PP, PRN, PRT, QP, RRC, ADJP, ADVP

Syntactic Complexity (27): Mean Word Length, Mean Words per Utterance, Mean Length of Sentence, Mean Length of T-unit, Mean Length of Clauses, Disfluency Frequency, Total Number of Words, Number of Utterances, Tree Height, Complex Nominals per T-unit, Complex Nominals per Clause, Coordinate Phrases per Clause, Coordinate Phrases per T-unit, Complex T-unit Ratio, Clauses per Sentence, Clauses per T-unit, Dependent Clauses per Sentence, Clauses per T-unit, T-units per Sentence, Verb Phrases per T-unit, Number of Complex Nominals, Number of Coordinate Phrases, Number of Dependent Clauses, Number of Sentences, Number of T-units, Number of Words, Number of Clauses

Vocabulary Richness (3): MATTR, Brunet Index, Honoré Statistic, Type-to-Token Ratio

Psycholinguistic (5): AoA Score, Concreteness Score, Familiarity Score, Imageability Score, SUBTL Word Score

Repetitiveness (5): Min Cos Dist, Proportion Below Threshold 0, Proportion Below Threshold 0.3, Proportion Below Threshold 0.5, Avg Cos Dist

Acoustic (172): kurtosis, mean, skewness, and variance of mfcc_n (for 1 ≤ n ≤ 13) and of the velocity (vel) and acceleration (accel) of each coefficient; kurtosis, mean, skewness, and variance of energy and of energy velocity and acceleration; fundamental frequency mean and variance

Information Units (info-units) (40): ObjectCookie (keyword), ObjectCupboard (keyword), ObjectCurtains (keyword), ObjectDishcloth (keyword), ObjectDishes (keyword), ObjectJar (keyword), ObjectPlate (keyword), ObjectSink (keyword), ObjectStool (keyword), ObjectWater (keyword), ObjectWindow (keyword), PlaceExterior (keyword), PlaceKitchen (keyword), SubjectBoy (keyword), SubjectGirl (keyword), SubjectWoman (keyword), ActionBoyTaking (keyword), ActionStoolFalling (keyword), ActionWaterOverflowing (keyword), ActionWomanDryingWashing (keyword), plus a binary variant of each of the twenty features above

Demographic (1): Age

Discourse Features (39): Comparison, EDU Rate, Topic-Change, Summary, Topic-Comment, Same-Unit, Evaluation, Contrast, Elaboration, Attribution, TextualOrganization, Cause, Explanation, Enablement, Joint, Depth, Background, Temporal, Condition, Manner-Means; Comparison Ratio, Topic-Change Ratio, Summary Ratio, Topic-Comment Ratio, Same-Unit Ratio, Evaluation Ratio, Contrast Ratio, Elaboration Ratio, Attribution Ratio, TextualOrganization Ratio, Cause Ratio, Explanation Ratio, Enablement Ratio, Joint Ratio, Background Ratio, Temporal Ratio, Condition Ratio, Manner-Means Ratio; Discourse Type-Token Ratio

Halves Features (9): Attention: Leftside, Concentration: Leftside, Repetition: Leftside, Perception: Leftside, Attention: Rightside, Concentration: Rightside, Repetition: Rightside, Perception: Rightside, Number of Switches from LS to RS

[Figure: change in F-measure by feature set (strips, halves, quadrant, discourse) for LogReg, SVM, KNN, RandomForest, and GausNaiveBayes]
Figure A.1: Plot showing the performance of the halves feature set without quadratic terms. The performance of Random Forest and Gaussian Naive Bayes is not hurt in this case as it is in Figure 4.9. The performance of logistic regression also decreases without the quadratic terms.

[Figure: accuracy by feature set (baseline, strips, halves+quadratic, quadrant, discourse) for the five classifiers]
Figure A.2: Accuracy of models with new feature sets.

[Figure: change in accuracy by feature set for the five classifiers]
Figure A.3: Change in accuracy of models with new feature sets.

[Figure: AUC by feature set for the five classifiers]
Figure A.4: AUC of models with new feature sets.

[Figure: change in AUC by feature set for the five classifiers]
Figure A.5: Change in AUC of models with new feature sets.