Exploring Neural Models for Predicting Dementia from Language

by

Weirui Kong

B.Eng., Zhejiang University, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

August 2019

© Weirui Kong, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Exploring Neural Models for Predicting Dementia from Language

submitted by Weirui Kong in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Giuseppe Carenini, Computer Science (Supervisor)
Thalia Field, Faculty of Medicine (Supervisor)
Richard Lester, Faculty of Medicine (Additional Examiner)

Abstract

In this thesis we explore the effectiveness of neural models that require no task-specific features for automatic dementia prediction. The problem is to classify Alzheimer's disease (AD) from recordings of patients undergoing the Boston Diagnostic Aphasia Examination (BDAE). First we use a multimodal neural model to fuse linguistic features and acoustic features, and investigate the performance change compared to simply concatenating these features. Then we propose a novel coherence feature generated by a neural coherence model, and evaluate the predictiveness of this new feature for dementia prediction. Finally we apply an end-to-end neural method which is free from feature engineering and achieves a state-of-the-art classification result on a widely used dementia dataset.
We further interpret the predictions made by this neural model from different angles, including model visualization and statistical tests.

Lay Summary

Early prediction of neurodegenerative disorders such as Alzheimer's disease (AD) and related dementias is important in developing early medical and social supports, and may identify ideal stages for testing novel therapeutics aimed at preventing disease progression. Changes in speech and language patterns can occur in dementia in its earliest stages and may worsen as the disease progresses. This has led to recent attempts to create automatic methods that predict dementia through language analysis. In addition to features extracted from language samples, previous works have improved prediction accuracy by introducing some task-specific features. But task-specific features prevent the model from generalizing to other tests. Our work focuses on building classification models without any task-specific features. We explore three approaches and find one such model which achieves state-of-the-art performance. We also perform detailed analyses to interpret how the best performer makes a prediction.

Preface

All of the work presented henceforth was conducted in the Laboratory for Computational Intelligence in the Department of Computer Science at the University of British Columbia, in collaboration with Dr. Thalia Field at the UBC Faculty of Medicine. I was the lead researcher, responsible for coding, data preprocessing, result analysis, plots, concept formation and first drafts of the manuscripts. Dr. Giuseppe Carenini and Dr. Hyeju Jang were responsible for concept formation, draft edits, interpreting the results and suggestions for improvement. Dr. Thalia Field was responsible for editing medical-related material.
The baseline model was implemented by Vaden Masrani, a PhD student at UBC.

A version of Chapter 6 has been accepted to the proceedings of the Machine Learning for Healthcare Conference 2019 [Weirui Kong, Hyeju Jang, Giuseppe Carenini, Thalia Field. A Neural Model for Predicting Dementia from Language.]. I was the first author, responsible for all major areas of concept formation, experiment design and analysis, as well as the majority of manuscript composition. Hyeju Jang and Thalia Field contributed to manuscript edits, refining the paper to a large extent. Giuseppe Carenini was the supervisory author on this project and was involved throughout the project in concept formation and manuscript edits.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction
  1.1 Contributions
    1.1.1 Fusing Features from Different Modalities
    1.1.2 A Novel Feature: Coherence Score
    1.1.3 An End-to-end Neural Model: Hierarchical Attention Networks
  1.2 Reproducibility
  1.3 Thesis Overview
2 Related Work
  2.1 Computational Approaches to Dementia Prediction
  2.2 Multimodal Learning
  2.3 Coherence Models

3 Datasets
  3.1 DementiaBank

4 Multimodal Embedding for Feature Fusion
  4.1 Our Joint Embedding Method
  4.2 Experiment
    4.2.1 Experiment Settings
    4.2.2 Experiment Results
  4.3 Discussion

5 A Novel Feature: Coherence Score
  5.1 A Neural Coherence Model
  5.2 Experiment
    5.2.1 Experiment Settings
    5.2.2 Experiment Results
  5.3 Discussion

6 An End-to-end Neural Model: Hierarchical Attention Networks
  6.1 Hierarchical Attention Networks
  6.2 Experiment
    6.2.1 Experiment Settings
    6.2.2 Experiment Results on DementiaBank
    6.2.3 Analysis of Effects of Dataset Size
    6.2.4 Analysis of Attention
    6.2.5 Evaluation on the Blog Corpus
  6.3 Discussion

7 Conclusions and Future Work
Bibliography

List of Tables

Table 3.1 Demographics of the DementiaBank dataset.
Table 4.1 Results of the multimodal feature embedding evaluation. Numbers in the model name indicate how many features are used after feature selection. Numbers in parenthesis show the change in performance compared to the corresponding baseline.
Table 4.2 Results of the shared representation evaluation.
Table 5.1 Statistics on the DementiaBank, WSJ and VIST datasets. We compute the number of samples as # Doc., and the average number of sentences per document as Avg. # Sen.
Table 5.2 Results on the effectiveness of the coherence feature. The performance metric is accuracy. Numbers in parenthesis show the change in performance. L&A features denote linguistic and acoustic features.
Table 6.1 Features used by traditional methods. Info: information unit features. Spatial: spatial neglect features.
Table 6.2 Binary classification with 10-fold cross-validation. Note that results of Fraser's model and Masrani's model are from the original papers.
Table 6.3 Contingency table (numbers in parenthesis are expectation values).
Table 6.4 Blog information as of April 4th, 2017.
Table 6.5 Binary classification with 9-fold cross-validation on blog corpus.

List of Figures

Figure 3.1 The Cookie Theft picture.
Figure 4.1 Neural feature fusion frameworks.
Figure 5.1 A document sample and the corresponding entity grid table. The figure is taken from the original paper [32].
S denotes subject, O denotes object, X denotes other, and - means the word is absent from the sentence.
Figure 5.2 Neural coherence model. The figure is taken from the original paper [32].
Figure 5.3 Distributions of coherence scores for AD patients and healthy controls. The x-axis denotes coherence score and the y-axis denotes probability density.
Figure 6.1 Hierarchical attention network for dementia prediction.
Figure 6.2 Test accuracy by varying training data proportions.
Figure 6.3 Visualization of attention.
Figure 6.4 Attention frequency vs. random frequency.

Glossary

AD  Alzheimer's disease
BDAE  Boston Diagnostic Aphasia Examination
CCA  Canonical Correlation Analysis
CNN  convolutional neural networks
HAN  Hierarchical Attention Networks
MMSE  Mini Mental State Examination
ML  machine learning
MFCC  Mel-frequency Cepstral Coefficient
NLP  natural language processing
OPTIMA  Oxford Project to Investigate Memory and Aging
RNN  recurrent neural network

Acknowledgments

First of all, I would like to thank my supervisors Dr. Giuseppe Carenini and Dr. Thalia Field, who supported me all the way. You are not only first-class scientists, but also kind friends to me. Thank you for your advice and encouragement. I am grateful for working with you, and for the opportunity to learn from you.

I must also express my gratitude to Dr. Hyeju Jang. You taught me a lot about scientific writing and provided tons of detailed improvements for our work. Your enthusiasm for research motivated me.

I want to thank my dear friends, who stood by my side in good and hard times.

To Xiaoxuan Lou, Bowei E and Tonghe Wang, you are the best roommates ever.
We lived and studied together, and I cherish each and every moment. Although we are now in different countries, all the things we have been through build up our brotherhood and nothing can break the bond between us.

To my badminton teammates Mengqi Li, Haocong Shi, Angli Xue, Weigeng Yang, Hang Zhou and Ye Fan, it's a pleasure to fight for trophies with you. I'll always remember how you comforted me when my mistake led to losing the game-deciding point, and the tremendous cheer from you for my nice smashes.

To my childhood buddies Xingyan Chen, Bo Hu, Ziyi Liu, Wangyang Dai, Zhengri Xiong, Hongrui Tu and Yanxin Zhu, you are all special to me. We grew up together in a small town and shared so many pleasant memories. I really appreciate having treasures like you.

Finally, Rong Hu and Qing Kong, my parents, have been trying to give their everything to me. They never went to university, but have the power to create a world full of love for me. I am very fortunate and proud to be your son.

Dedication

To my grandma, who means more to me than anything else. Wishing you good health and happiness. Every milestone ahead in my life, I want to celebrate with you.

Chapter 1
Introduction

Dementia is a progressive cognitive impairment caused by neurodegenerative disease, which affects more than 46.8 million people around the world [1]. Among diverse types of dementia, Alzheimer's disease (AD), which accounts for 60% - 80% of all dementia diagnoses, is among the most financially costly diseases in developed countries [8]. Although there is not yet a cure for AD, research suggests that novel therapeutics will be most effective if given early in the disease course [36].

However, predicting AD, especially in its early stages, is difficult. A diagnosis of dementia involves clinical opinion based on functional status, cognitive performance on standardized tests and resource-intensive specialized tests, such as lumbar puncture or advanced neuroimaging [30].
In developing countries, access to some or all of these resources may not be available, and this is reflected in the higher-than-average rates of undiagnosed dementia in those regions. Around the world, only approximately 25% of the 46.8 million people with dementia receive a formal diagnosis [1]. Therefore, a non-invasive diagnostic tool that is inexpensive and easy to administer is of great importance to dementia patients, especially those in developing countries.

One promising direction is to design a tool that can assist in the prediction of preclinical disease by using automated analysis of language. Language is one of the first faculties afflicted by the disease, and subtle changes in language are observed a year or more before dementia is diagnosed, according to longitudinal studies on people with AD [2]. These changes include, for example, low grammatical complexity, limited vocabulary and frequent word-finding problems [21].

Given that linguistic deficits are early signs of dementia, researchers have developed dementia prediction systems based on language by applying machine learning (ML) and natural language processing (NLP). Most prior work built computational models on the DementiaBank dataset [7], a publicly available dataset that contains audio recordings and transcripts of participants (people with dementia and healthy controls) describing the Cookie Theft picture (Figure 3.1). Prior work used not only acoustic features and various linguistic features but also task-specific features such as information units. Information units [9] are objects and actions appearing in the picture (e.g., mother, stool, overflowing, and drying), which are usually pre-defined by human experts. Information unit features measure how well a participant captures key concepts in the picture.
Based on these task-specific features as well as linguistic and acoustic features, prior models using traditional classification methods such as logistic regression have been shown to give reliable AD prediction [12, 28]. Although the task-specific features are effective for dementia prediction, one major disadvantage is that they are specific to a particular picture. If participants are asked to describe a different picture, the information units in the picture need to be re-defined.

Advances in neural networks, especially the recent deep neural models, reduce the need for feature engineering. The interactions between neurons, the hierarchical network structure, and an appropriate loss function make deep models capable of tackling complex tasks even with raw data as input. The power of neural models is likely to make up for the absence of some well-designed features. In this thesis, we explore neural models for dementia prediction without using task-specific features. We delve into three different directions, and our contributions can be summarized as below.

1.1 Contributions

1.1.1 Fusing Features from Different Modalities

We propose to use a neural network model for combining task-agnostic multimodal features from prior work. Previous work [12, 28] extracted various linguistic and acoustic features for dementia prediction. Then, they combined all these features by simple concatenation. In this thesis, we combine the two groups of features by using the neural multimodal embedding framework [22]. We demonstrate that combining multimodal features in this way achieves performance comparable to prior work that uses feature selection.

1.1.2 A Novel Feature: Coherence Score

We extract a new type of task-agnostic feature with a neural network model.
Previous literature [10, 11, 23] has shown that people with dementia tend to have problems with discourse coherence, including impairment in global coherence, disruptive topic shift, frequent use of filler phrases, and less use of connective words. Linguistic features in previous studies capture language deficits with respect to certain aspects of discourse coherence at a lexical and syntactic level (e.g., word repetitiveness, syntactic complexity, and vocabulary richness). However, no prior work has attempted to investigate the semantic level of discourse coherence for predicting dementia: specifically, the patterns of how people repeat entities to make a coherent speech. In this thesis, we compute discourse coherence scores based on entity transition patterns by using the neural coherence model [32], and apply them for dementia prediction.

1.1.3 An End-to-end Neural Model: Hierarchical Attention Networks

We propose to use a neural network model in an end-to-end manner to avoid any task-specific features and alleviate the problem of manual feature engineering. We apply a neural framework called Hierarchical Attention Networks (HAN) [38] to the task, and obtain results comparable to traditional models that use task-specific features. By including a demographic feature (age), our model achieves state-of-the-art performance, improving on the classification accuracy of the top-performing traditional method, which also uses age, from 84.4% to 86.9%. With the attention mechanism in HAN, we analyze the model predictions, and provide some insights on their interpretation.

We also apply HAN to a dementia blog corpus, and discuss the results in comparison to prior work.
In essence, on this corpus of written text, the neural method is not a competitive solution.

1.2 Reproducibility

The code to reproduce the experiments in Chapter 4 and Chapter 5 is available at https://github.com/arankong/dementia classifier, and the code to reproduce all results and the corresponding plots in Chapter 6 is at https://github.com/arankong/han.

1.3 Thesis Overview

The rest of the thesis is organized as follows: in Chapter 2 we review prior studies that focus on traditional ML and NLP approaches for dementia prediction. We also provide background on multimodal learning and discourse coherence models. After that, in Chapter 3 we provide an overview of the datasets used for our experiments. Then, the three contributions are each described in their own chapters (4, 5, and 6, respectively). Lastly, in Chapter 7 we conclude and suggest some future directions.

Chapter 2
Related Work

Computational approaches for automatic dementia prediction have received growing attention in recent years. In this chapter, we first discuss previous computational works on dementia prediction (Section 2.1). Then, to provide some background on the neural models used in our studies, we review multimodal learning methods (Section 2.2) and discourse coherence models (Section 2.3).

2.1 Computational Approaches to Dementia Prediction

Prior research has shown that NLP and ML techniques that exploit various features can predict dementia by classifying dementia patients from healthy controls.

Ahmed et al. [3] proposed features that were helpful for identifying dementia from speech, using data collected in the Oxford Project to Investigate Memory and Aging (OPTIMA) study. They found that language was progressively impaired as the disease progressed and suggested using semantic, lexical content and syntactic complexity features for classification.

Orimaye et al.
[33] used diverse machine learning methods with lexical and syntactic features to distinguish between dementia patients and healthy adults on the DementiaBank dataset [7]. They compared five different classifiers, including support vector machines (SVMs), naive Bayes, decision trees, neural networks and Bayesian networks, and reported that SVMs showed the best performance, with an F-score of 74%.

In another study, Al-Hameed et al. [4] extracted acoustic features from the audio files of the DementiaBank dataset, building a regression model to predict Mini Mental State Examination (MMSE) scores used for dementia prediction (ranging from 0 to 30). This work used only acoustic features, and their regression model predicted MMSE scores with a mean absolute error of less than 4.

Fraser et al. [12] explored a broad spectrum of both linguistic and acoustic features, demonstrating the necessity of feature selection. They found that optimal performance was obtained when 35-50 features were used, and that performance dropped off dramatically with a feature set size larger than 50. They achieved an accuracy of 81.92% in distinguishing individuals with AD from those without.

As briefly mentioned in the introduction, the DementiaBank dataset is associated with a set of human-defined information units representing key components of the Cookie Theft picture, such as subjects, objects, locations and actions [9]. Building on the information units, Masrani [28] proposed a novel feature group called spatial neglect features. They vertically split the picture into two halves and computed features that measure spatial neglect, e.g., the count of mentions of any information unit in each region.
Combining their new feature group with linguistic, acoustic and information unit features and the demographic feature (age), followed by a feature selection step, they achieved an accuracy of 84.4%.

Our study differs from previous approaches in that we aim to build models without any task-specific features, while achieving comparable or even better performance.

2.2 Multimodal Learning

Our attempt to fuse features from different modalities is inspired by the research field of multimodal learning. Multimodal representations can be divided into two types, i.e., joint representations and coordinated representations. The idea of a joint representation is to build one common representation for different modalities. The simplest way is to concatenate features from different modalities, which is also known as early fusion. But this naive concatenation does not help us gain any insights into the data. A more advanced method is to train a multimodal autoencoder [31], which adopts an unsupervised training scheme to learn a shared representation from different modalities. Besides, Zadeh et al. [39] built a tensor out of the features from three modalities, and used the 3D tensor as the joint representation.

The other type of multimodal representation, namely the coordinated representation, aims at building a separate representation for each modality while placing certain constraints on these representations. The constraints include similarity-based methods (e.g., cosine distance), structure constraints (e.g., orthogonality) and correlation maximization like Canonical Correlation Analysis (CCA) [17].
In particular, a learning approach called joint embedding has been very successful in building coordinated representations of two modalities [22].

An ablation study on the DementiaBank dataset shows that the classification accuracy of logistic regression with linguistic features is 0.740, whereas the accuracy drops to 0.713 when both linguistic features and acoustic features are used, indicating that there are more irrelevant features in the acoustic feature group. Therefore, we prefer coordinated multimodal representations, one for the linguistic modality and one for the acoustic modality, over a single shared representation. Specifically, we use a training process similar to that of Kiros et al. [22].

2.3 Coherence Models

Modelling document coherence is an active area of NLP. There are various coherence modelling methods: entity-based models, graph-based models [16], models based on discourse relations [26], models based on distributed sentence representations [25], etc. We use a neural entity-based coherence model [32] for coherence feature extraction because it is conceptually simple and has obtained high performance in many coherence evaluation tasks. In entity-based coherence models, a document is about entities (nouns can serve as entity candidates) and coherence is created by repeated entity mentions [15]. Barzilay and Lapata [6] proposed an entity grid model that computes a coherence score based on entity transitions (the grammatical role switches across sentences). Nguyen and Joty [32] made a neural version of the entity grid model. They transformed the grammatical role of each entity in the grid into a distributed representation, and used convolutional neural networks (CNN) [24] to capture entity transition patterns. Their model achieved high performance in several tasks like sentence ordering and summary coherence rating.
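To make the entity grid idea concrete, the sketch below builds a grid and counts local role transitions for a toy two-sentence document. It assumes the grammatical roles have already been annotated (the real pipeline derives them with a syntactic parser); `build_entity_grid` and `transition_counts` are illustrative names, not part of the implementation in [32] or [6].

```python
from collections import Counter

# Each sentence maps entity -> grammatical role: "S" (subject), "O" (object),
# "X" (other). Entities not mentioned in a sentence get "-".
def build_entity_grid(sentences):
    entities = sorted({e for sent in sentences for e in sent})
    return {e: [sent.get(e, "-") for sent in sentences] for e in entities}

def transition_counts(grid, k=2):
    """Count length-k role transitions (grid columns, read top to bottom)."""
    counts = Counter()
    for roles in grid.values():
        for i in range(len(roles) - k + 1):
            counts[tuple(roles[i:i + k])] += 1
    return counts

# Toy document: "The boy took a cookie. The cookie fell."
sents = [{"boy": "S", "cookie": "O"}, {"cookie": "S"}]
grid = build_entity_grid(sents)    # {'boy': ['S', '-'], 'cookie': ['O', 'S']}
trans = transition_counts(grid)    # the O -> S transition occurs once
```

The original entity grid model turns these transition counts into a probability distribution over transition types; the neural version instead embeds the roles and lets a CNN learn which transition patterns signal coherence.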
We use this neural coherence model to generate coherence scores for the DementiaBank samples.

Chapter 3
Datasets

We use two dementia datasets in this work: one consists of samples of spoken language and the other of samples of written language. We detail the spoken one, the DementiaBank dataset, below; it is used throughout this thesis (in Chapters 4, 5 and 6). The dataset of written samples is used only in Chapter 6 and will be introduced in Section 6.2.5. Besides these, we also use two large corpora for training the neural coherence model, and we introduce them along with the coherence model in Section 5.2.1.

3.1 DementiaBank

The DementiaBank corpus was collected for the study of communication in dementia, between 1983 and 1988 at the University of Pittsburgh [7]. It contains interview recordings and manually-transcribed transcripts of English-speaking participants describing the Cookie Theft picture (Figure 3.1). The participants are categorized into dementia patient and healthy control groups. Of the 309 dementia samples, 257 are classified as possible/probable AD, and the remaining samples as other types of dementia. Our study uses only the 257 AD samples and the 242 healthy elderly control samples. Statistics about the DementiaBank samples used in this study are listed in Table 3.1.

Figure 3.1: The Cookie Theft picture.

Table 3.1: Demographics of the DementiaBank dataset.

Diagnosis  Samples  Mean Words       Mean Age
AD         257      104.98 (s=59.8)  71.72 (s=8.47)
Control    242      113.56 (s=58.5)  63.95 (s=9.16)

Chapter 4
Multimodal Embedding for Feature Fusion

In this chapter, we describe our experiment on using a neural network model to combine existing multimodal features. Prior work [28] showed the effectiveness of a variety of linguistic and acoustic features for dementia prediction. To obtain a combined multimodal feature representation, they used simple concatenation as the fusion mechanism, which is easy to implement.
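This concatenation (early fusion) amounts to nothing more than joining the per-sample feature vectors end to end; a minimal sketch, assuming the linguistic and acoustic features have already been extracted (`early_fusion` is an illustrative name):

```python
def early_fusion(linguistic, acoustic):
    """Early fusion: concatenate each sample's linguistic and acoustic vectors."""
    return [l + a for l, a in zip(linguistic, acoustic)]

ling = [[0.1, 0.2], [0.3, 0.4]]   # toy linguistic features, 2 samples
acou = [[0.5], [0.6]]             # toy acoustic features, same 2 samples
fused = early_fusion(ling, acou)  # [[0.1, 0.2, 0.5], [0.3, 0.4, 0.6]]
```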
However, this mechanism can wind up being very high-dimensional, and can be less effective when features have different frame rates [27]. Here, we propose to use a joint embedding method based on pairwise ranking.

In Section 4.1 we explain our joint embedding method using pairwise ranking in detail. After that, in Section 4.2 we compare the performance of this feature fusion scheme for dementia prediction against the simple concatenation method. We discuss the results in Section 4.3.

4.1 Our Joint Embedding Method

To combine features from different modalities, we use a joint embedding method adapted from [22]. The main idea in this method is pairwise ranking: a matched pair of linguistic and acoustic embeddings, i.e., the pair of linguistic and acoustic embeddings derived from the same data sample, should be closer than random pairs. To implement this idea, our model is composed of three parts: building linguistic representations (embeddings), building acoustic representations, and coordinating the two representations. Figure 4.1 shows the model architecture.

Figure 4.1: Neural feature fusion frameworks. (a) Proposed joint embedding method. (b) Pairwise ranking loss to minimize the distance between matched pairs of samples.

First, to build linguistic embeddings, we begin by extracting linguistic features from texts as in [28]. These features (N = 99 in total) include parts-of-speech (N = 15), context-free-grammar rules (N = 43), syntactic complexity (N = 27), vocabulary richness (N = 4), psycholinguistic (N = 5), and repetitiveness (N = 5) features. Then, we project these features into an embedding space of dimension m by using an encoder. This encoder performs a linear transformation (R^99 -> R^m), and hence consists of one linear layer without an activation function.
We choose 50 as the embedding size m for our experiment because in [28], the performance of dementia prediction dropped drastically when more than 50 features were used at the feature selection step.

In the same fashion, we build acoustic embeddings. We extract acoustic features from the audio recordings following [28]. The acoustic features (N = 172 in total) were derived from the speech samples by using the Mel-frequency Cepstral Coefficient (MFCC) technique [20]. We use another encoder (R^172 -> R^d) for linearly transforming the acoustic features into acoustic feature embeddings. The embedding size d is also set to 50. (Note that every sample in the DementiaBank dataset provides one matched linguistic and acoustic embedding pair.)

After obtaining linguistic and acoustic embeddings, we coordinate these embeddings by using a loss function based on pairwise ranking, as in [22]. As shown in Figure 4.1b, given a matched pair of embeddings (linguistic_i, acoustic_i), the similarity in the embedding space between linguistic_i and acoustic_i should be greater than the similarity between linguistic_i and any other acoustic embedding acoustic_j, and greater than the similarity between acoustic_i and any other linguistic embedding linguistic_j. For each matched pair, we compute the loss against all other pairs. The loss function for matched pairs (linguistic_i, acoustic_i) and (linguistic_j, acoustic_j) is defined as

L(θ) = max{0, α - s(linguistic_i, acoustic_i) + s(linguistic_i, acoustic_j)}
     + max{0, α - s(linguistic_i, acoustic_i) + s(linguistic_j, acoustic_i)},

where θ denotes the learnable parameters of our encoders, α is an arbitrary positive margin, i ≠ j, and s(x, x′) is a similarity measure. We use cosine similarity as the similarity measure in our experiment.

After training this neural model, which consists of the two encoders, with the pairwise ranking loss, we obtain coordinated linguistic and acoustic embeddings from the raw linguistic and acoustic features.
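The linear encoders and the ranking loss can be sketched in pure Python, with cosine similarity plugged in as in the text. This is an illustrative sketch, not the trained model: in practice the encoder weights are optimized by gradient descent in a neural network framework, and `encode` and `pairwise_ranking_loss` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def encode(x, weights):
    """One linear layer, no activation: multiply feature vector x by a weight matrix."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]

def pairwise_ranking_loss(ling_emb, acou_emb, alpha=0.2):
    """Sum the margin loss over all mismatched pairs (i != j).

    Each matched pair should be more similar than any mismatched pair
    by at least the margin alpha."""
    total = 0.0
    n = len(ling_emb)
    for i in range(n):
        matched = cosine(ling_emb[i], acou_emb[i])
        for j in range(n):
            if i != j:
                total += max(0.0, alpha - matched + cosine(ling_emb[i], acou_emb[j]))
                total += max(0.0, alpha - matched + cosine(ling_emb[j], acou_emb[i]))
    return total
```

When matched embeddings coincide and mismatched ones are orthogonal, every hinge term is zero, which is exactly the configuration the training pushes the encoders toward.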
These embeddings are then used for dementia prediction.

4.2 Experiment

4.2.1 Experiment Settings

The proposed method results in two embeddings: linguistic and acoustic. To evaluate the proposed method, we use the resulting embeddings as features for dementia prediction. Specifically, we build logistic regression classifiers using three different feature groups. The EMBEDDED-L model uses only the linguistic embeddings and the EMBEDDED-A model uses the acoustic embeddings alone. The third model, EMBEDDED-L&A, uses both linguistic and acoustic embeddings by concatenating them. Note that we do not conduct feature selection for any of the models above, since our features are already compact.

We compare these embedding models against three corresponding baselines that use task-agnostic features (i.e., linguistic and acoustic features) from prior work: BASELINE-L, BASELINE-A, and BASELINE-L&A. BASELINE-L uses only the original linguistic features. BASELINE-A uses the original acoustic features. BASELINE-L&A combines the two types of features by concatenating them. We perform feature selection for these baseline models as in [28], using Pearson's correlation coefficients to select the top k features.

In addition, we also compare our models to a neural network baseline, EMBEDDED-SHARED, that builds one shared representation for the linguistic and acoustic features. To do that, we first combine the two groups of features by concatenating them, and then use an autoencoder to construct a shared embedding. The encoder embeds the concatenated features (N = 271, the total size of the two feature groups) into a hidden space of dimension h. We use a linear layer with the ReLU activation function for the encoder (R^271 -> R^h). Then the decoder (R^h -> R^271) reconstructs the combined features from the hidden representation. We compute the L2-norm as the reconstruction loss. The vector in the hidden space is regarded as
The vector in the hidden space is regarded as one shared representation of the linguistic and acoustic features, and we use this vector as features for dementia prediction. We set the dimension of the hidden space to 50 and 100, denoted by EMBEDDED-SHARED 50 and EMBEDDED-SHARED 100 respectively.

We perform 10-fold cross validation following the practice in prior work [12, 28]. Because the weights of our encoders are randomly initialized, we report the average performance over ten different runs as the performance of our model.

4.2.2 Experiment Results

Our experiment results are listed in Table 4.1 and Table 4.2. We first compare models that use embeddings against models that use raw features. As shown in Table 4.1, models using embeddings from our joint method (EMBEDDED-L, EMBEDDED-A, and EMBEDDED-L&A) consistently outperform models using the corresponding raw features (BASELINE-L, BASELINE-A, and BASELINE-L&A). This suggests that embeddings from our method contain more predictive information than raw features.

This pattern is also observed after performing feature selection on the baseline models using raw features. Our models EMBEDDED-L and EMBEDDED-A improve over BASELINE-L 50 and BASELINE-A 50, which select 50 important features using Pearson's correlation coefficients. EMBEDDED-L and EMBEDDED-A also outperform BASELINE-L BEST and BASELINE-A BEST, which show the best performances among all k values for feature selection. In addition, EMBEDDED-L&A shows improvement over baselines selecting 50 features (BASELINE-L&A 50) and 100 features (BASELINE-L&A 100). EMBEDDED-L&A also shows performance comparable to BASELINE-L&A BEST, which selects 47 features. These results indicate that our joint embedding method generates linguistic and acoustic embeddings whose benefit is similar in degree to the effect of feature selection.

Table 4.1: Results of the multimodal feature embedding evaluation. Numbers in the model name indicate how many features are used after feature selection.
Numbers in parentheses show the change in performance compared to the corresponding baseline.

Models                             Accuracy          F-score
Baseline-L no feature selection    0.728             0.738
Baseline-L 50                      0.723             0.732
Baseline-L best (k = 60)           0.740             0.747
Embedded-L 50                      0.746 (+0.006)    0.749 (+0.002)
Baseline-A no feature selection    0.567             0.578
Baseline-A 50                      0.499             0.522
Baseline-A best (k = 152)          0.601             0.623
Embedded-A 50                      0.615 (+0.014)    0.625 (+0.002)
Baseline-L&A no feature selection  0.665             0.671
Baseline-L&A 50                    0.699             0.702
Baseline-L&A 100                   0.635             0.653
Baseline-L&A best (k = 47)         0.709             0.719
Embedded-L&A 100                   0.708 (-0.001)    0.708 (-0.011)

Table 4.2 shows the comparison between EMBEDDED-L&A and baselines using both linguistic and acoustic features. The BASELINE-L&A models use raw linguistic and acoustic features with or without feature selection, and the EMBEDDED-SHARED models use a simple autoencoder to transform the concatenated features into a shared embedding. As seen from the results, our model EMBEDDED-L&A outperforms all other baseline models using both groups of features. The neural baselines, EMBEDDED-SHARED 50 and EMBEDDED-SHARED 100, are not as competitive as our joint embedding method.

Table 4.2: Results of the shared representation evaluation.

Models                             Accuracy   F-score
Baseline-L&A no feature selection  0.665      0.671
Baseline-L&A 50                    0.699      0.702
Baseline-L&A 100                   0.635      0.653
Embedded-shared 50                 0.677      0.679
Embedded-shared 100                0.666      0.669
Embedded-L&A 100                   0.708      0.708

4.3 Discussion

Our experiment results show that the linguistic and acoustic embeddings generated by our joint embedding method are more informative than the raw features or the selected important features. The results suggest that the pairwise ranking idea behind our method preserves more predictive information when performing dimension reduction in our neural architecture than feature selection does, especially at the same feature dimension.

However, the improvements over the baselines using feature selection with the best k are modest, especially for dementia prediction using only linguistic features. When using both linguistic and acoustic features, the baseline using the best k features slightly outperforms our model. This suggests that our joint embedding method can be considered an alternative to searching for the best k in feature selection.

Chapter 5
A Novel Feature: Coherence Score

Neural network models can also be used for devising a new feature type for dementia prediction. In this chapter, we experiment with discourse coherence for dementia prediction, using an existing neural network based NLP approach for computing discourse coherence.

People with dementia have been reported to show impairment in discourse coherence, such as disruptive topic shifts, frequent use of filler phrases, and less use of connective words [10, 11, 23]. The linguistic features used in previous computational studies for dementia prediction capture language deficits, including some aspects of discourse coherence, at a lexical and syntactic level (e.g., word repetition, syntactic complexity, and vocabulary richness). However, no prior work has attempted to use the overall coherence level of a speech sample for dementia prediction. In this chapter, we compute discourse coherence scores using a state-of-the-art neural coherence model, and apply them to dementia prediction.

In Section 5.1, we briefly explain the coherence model we use for obtaining the discourse coherence feature. In Section 5.2, we evaluate this new type of feature for dementia prediction.
In Section 5.3, we discuss the results.

5.1 A Neural Coherence Model

We compute discourse coherence scores using the neural network based coherence model proposed by Nguyen and Joty [32], which operationalizes the local coherence of a discourse segment based on the Centering theory [15]. The Centering theory claims that certain entities mentioned in an utterance are more central than others, and that this property imposes constraints on a speaker's use of different types of referring expressions. It also argues that the compatibility between the centering properties of an utterance and the choice of referring expression affects the coherence of discourse. Based on the theory, Nguyen and Joty [32]'s model calculates coherence scores that measure how well sentences are bound together to deliver a meaning as a whole, by capturing entity transition patterns. We use their model because it shows the state-of-the-art performance among systems that implement discourse coherence based on the Centering theory.

To capture entity transition patterns, the model first requires an entity grid table of the input data. A transition of one entity is defined as the grammatical role switches of the entity across sentences. For example, in a document that consists of three sentences, if an entity is mentioned as the subject of the first sentence, the object of the second sentence, and not mentioned in the third sentence, its transition can be denoted as {S, O, -}. The transitions of all entities in a document are converted into an entity grid table (see Figure 5.1), and used as the input to the neural coherence model.

The neural coherence model is based on a convolutional neural network (CNN) [24] and computes coherence scores of a text in an end-to-end fashion (see Figure 5.2). The intuition behind this model is that each convolutional filter tries to detect a specific transition pattern (e.g., {S-S-O-X} for a coherent text) which is informative for determining the coherence level.
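As a toy illustration of the entity-grid input (the grid contents below are hypothetical, not taken from the dataset), each entity's row can be unrolled into the short transition strings that the convolutional filters scan:

```python
# Hypothetical entity grid: one grammatical role per sentence for each entity.
# S = subject, O = object, X = other, - = absent.
grid = {
    "boy":    ["S", "O", "-"],
    "cookie": ["O", "-", "X"],
}

def transitions(roles, k=2):
    """All length-k transitions of one entity across consecutive sentences."""
    return ["".join(roles[i:i + k]) for i in range(len(roles) - k + 1)]

print(transitions(grid["boy"]))          # ['SO', 'O-']
print(transitions(grid["cookie"], k=3))  # ['O-X']
```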
The CNN layer performs the convolution operation on the transitions of each entity independently, followed by max-pooling and a linear layer that generates a real-valued score. During training, a pair of documents, i.e., the original document (considered coherent) and a randomly permuted version of it (considered incoherent), are fed to the coherence model at the same time. The model outputs two scores, φ(original|θ) and φ(permuted|θ). A pairwise ranking loss defined as

L(θ) = max{0, 1 − φ(original|θ) + φ(permuted|θ)}

forces the model to produce a higher score for the original document.

Figure 5.1: A document sample and the corresponding entity grid table. The figure is taken from the original paper [32]. S denotes subject, O denotes object, X denotes other, and - means the word is absent from the sentence.

Figure 5.2: Neural coherence model. The figure is taken from the original paper [32].

After training the model, we compute coherence scores for the data samples in the DementiaBank dataset. Then, we use the scores as features for predicting dementia.

5.2 Experiment

5.2.1 Experiment Settings

To train the coherence model, we use three different corpora: the DementiaBank dataset, the Wall Street Journal (WSJ) dataset [34], and the Visual Storytelling (VIST) dataset [19]. First, we use the training data of DementiaBank for building the coherence model. However, the DementiaBank dataset might be too small for learning the deep neural model. Therefore, we also try larger datasets. We use the WSJ dataset for training as in [32]. Additionally, to experiment with a dataset that is more similar to DementiaBank than WSJ in terms of language style, we use the VIST dataset, in which participants were asked to write a story based on a sequence of pictures. The statistics of the three datasets used for model training are shown in Table 5.1. For the large datasets WSJ and VIST, we make a 50%:10%:40% split for training, validation and test, following [32].
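The permutation-based training scheme can be sketched as follows; the sentences and the two φ scores below are placeholders (in the real model, the scores come from the CNN):

```python
import random

def permuted(sentences, seed=0):
    """An 'incoherent' version of a document: its sentences in shuffled order."""
    out = list(sentences)
    random.Random(seed).shuffle(out)
    return out

def ranking_loss(score_original, score_permuted):
    """L(theta) = max{0, 1 - phi(original) + phi(permuted)}."""
    return max(0.0, 1.0 - score_original + score_permuted)

doc = ["the boy is on a stool", "the stool is tipping", "the sink overflows"]
print(permuted(doc))           # same sentences, rearranged order
print(ranking_loss(2.0, 0.5))  # original already ranked higher by the margin: 0.0
print(ranking_loss(0.5, 2.0))  # permuted scored higher: 2.5
```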
For DementiaBank, we perform 10-fold cross validation.

Table 5.1: Statistics on the DementiaBank, WSJ and VIST datasets. We compute the number of samples as # Doc., and the average number of sentences per document as Avg. # Sen.

Dataset        # Doc.   Avg. # Sen.
DementiaBank   499      12.8
WSJ            2431     21.8
VIST           50197    5

To follow the pairwise ranking training scheme, we generate 20 random permutations for each document. One permuted version consists of all the sentences of an original document in a rearranged order. The original document is treated as coherent, and its permutations are regarded as incoherent. Then, we use a pair of an original document and one of its permuted versions as input for training.

We set the model hyperparameters as suggested by the original paper [32]: the number of filters is 150, the max-pooling size is 6, and the entity embedding size is 100. The dropout ratio is 0.5, the mini-batch size is 64, and the optimizer is RMSprop [18]. We use early stopping with a patience of 5 epochs.

All models trained on the three datasets reached a test accuracy higher than 75%. Based on these trained models, we compute coherence scores for DementiaBank, and use the scores as features for dementia prediction.

To investigate the effectiveness of the proposed feature, we use logistic regression for classification. We evaluate the performance when the coherence score is the only feature, and when it is combined with other task-agnostic features (i.e., linguistic and acoustic features). Our baselines include a majority class baseline, a model using only linguistic features, a model using only acoustic features, and a model using both linguistic and acoustic features.

5.2.2 Experiment Results

Table 5.2 reports the results. The first row in the table represents our baseline models without the coherence feature. In particular, 0.515 is the accuracy of the majority class classifier. The coherence score, when used as the only feature, improves the accuracy by as much as 4%.
When trained on DementiaBank, the coherence feature gives a boost of 0.4% when combined with linguistic features. However, in the other cases it has no effect, or even hurts the performance, when combined with other task-agnostic features.

Table 5.2: Results on the effectiveness of the coherence feature. The performance metric is accuracy. Numbers in parentheses show the change in performance. L&A features denote linguistic and acoustic features.

Models         No other features   Linguistic        Acoustic          L&A features
Baseline       0.515               0.740             0.601             0.713
DementiaBank   0.555 (+0.04)       0.744 (+0.004)    0.599 (-0.002)    0.711 (-0.002)
WSJ            0.543 (+0.028)      0.734 (-0.006)    0.605 (+0.006)    0.709 (-0.004)
VIST           0.527 (+0.012)      0.734 (-0.006)    0.603 (+0.004)    0.713 (-0)

5.3 Discussion

Our new coherence feature does not perform well for dementia prediction, although using the feature outperforms the majority class baseline. Here, we investigate the coherence score feature more closely.

Our assumption behind using the coherence feature is that AD patients would give a less coherent picture description than healthy elderly people. To verify this assumption, we examine the relationship between coherence scores and AD. Figure 5.3 shows the distributions of coherence scores for AD patients and healthy controls.

Figure 5.3: Distributions of coherence scores for AD patients and healthy controls. The x-axis denotes coherence score and the y-axis denotes probability density.

From the graphs, we can see that the distributions of the two groups look alike in all three cases, which indicates that the coherence scores do not have much predictive power to distinguish AD patients from healthy controls. Our coherence model is based on the idea that referring to the same entities shows patterns related to discourse coherence. However, describing the Cookie Theft picture does not seem to require mentioning the same entity repeatedly very many times.
Therefore, it is possible that dementia does not greatly affect the type of coherence that the model can capture when describing the Cookie Theft picture.

Chapter 6
An End-to-end Neural Model: Hierarchical Attention Networks

In this chapter we detail our experiments using Hierarchical Attention Networks (HAN) [38] for dementia prediction. HAN is an end-to-end neural network model, which avoids any feature engineering. It has been very successful in several text categorization tasks, such as sentiment estimation [40] and topic classification [37].

In Section 6.1 we introduce the original HAN model and our modified version. Then, in Section 6.2 we evaluate the HAN models and baselines on the DementiaBank dataset, and test their performance when using only small portions of the dataset. In particular, in Section 6.2.4 we analyze the information captured by the attention mechanism. We further compare the performance of the HAN model and one traditional model on a written text dataset in Section 6.2.5. Finally, in Section 6.3 we discuss our work on applying HAN to dementia prediction.

6.1 Hierarchical Attention Networks

Figure 6.1 illustrates the overall architecture of the HAN model for dementia prediction. The model input is the sequence of words from one interview sample (i.e., a description of the Cookie Theft picture). The model output is the probability distribution over two categories, AD and healthy. The model consists of a word sequence encoder, a word-level attention layer, a sentence encoder and a sentence-level attention layer.

We briefly introduce the functionality of each layer; for more details, refer to [38]. The word encoder uses a bidirectional GRU [5], an efficient implementation of a recurrent neural network (RNN). It encodes each word in one sentence into a hidden vector, given the context of the other words in the sentence. Then the word-level attention layer puts different weights on each word vector, producing a weighted hidden vector of the sentence.
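As a rough sketch of this attention step (the context vector u is a learned parameter in the actual model; all values here are illustrative), the weighted hidden vector is a softmax-weighted combination of the encoder outputs:

```python
import numpy as np

def attention_pool(H, u):
    """Softmax-weighted sum of hidden vectors H (n, d) using context vector u (d,).

    The same operation is applied at the word level (word vectors ->
    sentence vector) and at the sentence level (sentence vectors ->
    document vector).
    """
    scores = H @ u
    weights = np.exp(scores - scores.max())   # softmax, numerically stable
    weights /= weights.sum()
    return weights @ H, weights

H = np.array([[1.0, 0.0], [0.0, 1.0]])        # two hidden vectors
vec, w = attention_pool(H, np.array([5.0, 0.0]))
print(w[0] > w[1])  # True: the first vector receives most of the attention
```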
Once we have all the sentence vectors of the input sample, we feed them into another bidirectional GRU, i.e., the sentence encoder. This sentence encoder, along with the sentence-level attention layer, builds a weighted vector (denoted by v in Figure 6.1) for the whole document, which is the latent representation of an input sample obtained by applying the attention mechanism at both the word level and the sentence level. Finally, a linear layer projects v to a 2-dimensional vector, on which a softmax operation is performed. The output is the probability distribution over AD and healthy. The negative log likelihood of the correct label is used as the training loss.

Figure 6.1: Hierarchical attention network for dementia prediction.

We evaluate the performance of two models: one is the original HAN model, and the other incorporates demographic information by concatenating v with the age of the interviewee. Since the scale of age ([50, 90] in our dataset) is much larger than the values of the elements of v (typically in [-1, 1]), we standardize the age, making it zero mean and unit variance, before concatenating it with v.

6.2 Experiment

6.2.1 Experiment Settings

As in the previous two chapters, we perform 10-fold cross validation. The reported performance is an average across the 10 folds. For evaluation metrics, we compute prediction accuracy, precision, recall and F-score.

Age is an important predictor of dementia according to Gao et al. [14]. Our demographic-based baseline uses only the ages of participants as features, to demonstrate the predictiveness of the age feature. In addition to this simple demographic-based baseline, five models are tested for comparison: the model by Fraser et al. [12]; the model by Masrani [28], which obtained the best results among previous studies; a bidirectional GRU model; and the two HAN based models mentioned before. In Table 6.1 we list the different feature groups leveraged by the traditional methods.

Table 6.1: Features used by traditional methods. Info: information unit features.
Spatial: spatial neglect features.

Dataset         Method               Linguistic  Acoustic  Info  Spatial  Age
DementiaBank    Age only             ×           ×         ×     ×        ✓
                Fraser et al. [12]   ✓           ✓         ✓     ×        ×
                Masrani [28]         ✓           ✓         ✓     ✓        ✓
Dementia Blog   Masrani et al. [29]  ✓           ×         ×     ×        ×

The bidirectional GRU model has the same structure as the word encoder of our HAN model, including the word-level attention. Instead of using a sentence encoder, it builds a document representation via a max-pooling operation across sentence embeddings. The document representation is fed to a linear layer and a softmax function to produce the prediction. We consider this bi-GRU model as a baseline to investigate the effect of the hierarchical architecture of the HAN model.

To ensure the best results, all five approaches involve model selection on the training data, within each step of the 10-fold cross validation procedure. The first two traditional models select the k features with the highest Pearson's correlation coefficients between each feature and the binary class in the training set; this subset of features is used for building the classifier. For the bi-GRU baseline and the HAN based models, within the training set we further reserve 10% of the samples for validation. We then train a model on the remaining training samples for many iterations, storing the model parameters after each iteration. The validation data is used for selecting the model that achieves the lowest validation loss.

For the hyperparameters of the HAN models, we set the word embedding dimension to 300 and the GRU dimension to 100. The word embeddings are initialized randomly. For training, we use SGD (stochastic gradient descent) with a momentum of 0.9 and a learning rate of 0.1. The bi-GRU baseline uses the same settings as the HAN models. These hyperparameters are not fine-tuned.

6.2.2 Experiment Results on DementiaBank

Table 6.2 summarizes the results. The HAN model achieves a performance of 0.815 in both accuracy and F-score.
When combined with the age feature, the HAN-AGE model results in a remarkable boost in performance: a 2.5% improvement in accuracy and a 3% improvement in F-score over Masrani [28]. In addition, the HAN model shows a significant increase in performance compared to the bi-GRU baseline, demonstrating the higher capacity gained by leveraging the hierarchy in HAN.

Table 6.2: Binary classification with 10-fold cross-validation. Note that the results of Fraser's model and Masrani's model are from the original papers.

Model                   Accuracy  Precision  Recall  F-score
Baseline (age only)     0.595     0.591      0.729   0.653
Fraser et al. (no age)  0.820     -          -       -
Masrani (with age)      0.844     -          -       0.846
bi-GRU baseline         0.748     0.750      0.811   0.768
HAN                     0.815     0.839      0.818   0.815
HAN-AGE                 0.869     0.859      0.904   0.876

6.2.3 Analysis of Effects of Dataset Size

In general, training deep neural network models requires large amounts of data. To investigate whether the HAN models are robust to the size of the training data, we evaluated the two HAN-based models and Masrani's model from the last experiment with different proportions of the dataset. We also included a logistic regression classifier with age as the only feature. Figure 6.2 reports test accuracy when we repeated the previous experiment with 5%, 15%, 25%, 50% and 75% of the original DementiaBank dataset. For each proportion setting, we ran 5 independent experiments (randomly selecting the target subset of the data) and computed the mean and standard deviation. We can see that age is very informative, since a majority class classifier would have an accuracy around 0.5. Note that the performance of HAN drops dramatically when limited training data is used, whereas the HAN-AGE model is much less sensitive to the size of the training data. The HAN-AGE model maintains a relatively high performance even with only 5% of the data samples.

6.2.4 Analysis of Attention

During the training process, the attention mechanism makes the HAN model learn which words are important in predicting a given label.
To explore the information captured by the attention mechanism, we first performed a qualitative analysis by visualizing the hierarchical attention layers on a small subset of our data (see Figure 6.3). In the visualization, each line represents a sentence. Blue denotes the sentence attention weight and red denotes the word attention weight. Figure 6.3 shows that the model tends to select words like overflowing, stool, mother, and drying, and their corresponding sentences. Interestingly, these words belong to the set of information units defined by human experts for the Cookie Theft picture.

Figure 6.2: Test accuracy by varying training data proportions.

To analyze how much of the information captured by the attention mechanism overlaps with the human-defined information units, we performed a further quantitative analysis. In particular, we performed a statistical test to investigate whether HAN pays more attention to information unit words than to other words. To do this, we considered two categories to which every word token in our dataset belongs: (i) whether the word is in the set of information unit words, and (ii) whether the word is the most attended in its sentence. We then went through all the word tokens in the dataset and counted the frequencies of these two categories. Table 6.3 shows the resulting contingency table. Now the χ2 test can tell us whether the two categories are dependent on each other.
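The χ2 statistic reported with Table 6.3 can be recomputed directly from the observed counts (1481, 7599, 4889, 56270); this short sketch reproduces the value of 663:

```python
# Observed counts from the contingency table (Table 6.3).
observed = [[1481, 7599],    # information-unit words
            [4889, 56270]]   # non-information-unit words

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / n.
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
print(round(chi2))  # 663
```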
More technically, it can tell us whether there is a statistically significant difference between the expected frequencies (in parentheses) and the observed frequencies in the two categories. (A word token is a specific occurrence of a word type in a text; for instance, the text "the boy is telling the girl but the girl is not listening" contains 8 word types and 12 word tokens.)

Figure 6.3: Visualization of attention. (a) Sample id: 059-2, Diagnosis: control, Prediction: control. (b) Sample id: 007-3, Diagnosis: AD, Prediction: AD.

Table 6.3: Contingency table (numbers in parentheses are expected values).

                      Most emphasized  Not most emphasized  Total
Information unit      1481 (823)       7599 (8257)          9080
Non-information unit  4889 (5547)      56270 (55612)        61159
Total                 6370             63869                70239

The result, χ2 = 663, p < 0.00001, shows that the two categories are dependent on each other, i.e., being an information unit does affect the attention level, with the number of information unit words that are the most emphasized (1481) being much bigger than its expected value (823). So HAN appears to be able to capture information similar to that specified by human experts.

An interesting question that is still open is whether the attention model is uniformly paying more attention to all the information unit words, or is focusing on a specific subset of them. To answer this question, we define and compute the attention frequency and the random frequency for each of the 20 human-defined information units. More specifically, for an information unit word, the attention frequency was computed as the number of times it was the word with the highest word attention weight in a sentence.
Let S_w denote the set of all sentences containing word w, and let weight(c, s) be the attention weight of word token c in sentence s. We can then formalize the computation of the attention frequency for word type w as

Attention-Frequency(w) = Σ_{s ∈ S_w} I[w = argmax_c weight(c, s)],

where I is an indicator function.

In contrast, the random frequency was computed as the expected number of times the word would have the highest word attention weight if weights were assigned randomly within each sentence. It is therefore defined as

Random-Frequency(w) = Σ_{s ∈ S_w} 1/|s|,

where |s| is the length of sentence s. The rationale is that if attention weights are assigned at random, a word in a sentence will have the highest attention with probability 1/|s|.

In Figure 6.4, the x-axis shows the 20 human-defined information units and the y-axis shows their respective frequencies. The results indicate that the model does not attend to all information unit words uniformly. It strongly attends to words like woman, window, stool, sink, water, wash, cookie, exterior and plate, but pays less attention to words like dishes, boy, girl, etc. than would be expected from their random appearance. Currently, we do not have a satisfactory explanation for why the word attention model attends more to that specific subset of information unit words.

Figure 6.4: Attention frequency vs. random frequency.

6.2.5 Evaluation on the Blog Corpus

We evaluate HAN on the Dementia Blog Corpus to see how it performs on written language. The Dementia Blog Corpus was created by Masrani et al. [29] by collecting blog posts written by authors with and without dementia. In particular, they scraped the text of 2805 posts from 6 public blogs up to April 4th, 2017. Three blogs were written by dementia patients, and three written by family members of dementia patients were used as controls. There are a total of 1654 samples written by persons with dementia and 1151 from healthy controls.
Table 6.4 summarizes the statistics of the Dementia Blog dataset.

Table 6.4: Blog information as of April 4th, 2017.

URL (http://*.blogspot.ca)  Posts  Mean Words         Diagnosis  Gender/Age
living-with-alzheimers      344    263.03 (s=140.28)  AD         M, 72 (approx)
creatingmemories            618    242.22 (s=169.42)  AD         F, 61
parkblog-silverfox          692    393.21 (s=181.54)  Lewy Body  M, 65
journeywithdementia         201    803.91 (s=548.34)  Control    F, unknown
earlyonset                  452    615.11 (s=206.72)  Control    F, unknown
helpparentsagewell          498    227.12 (s=209.17)  Control    F, unknown

We compare our model to that of Masrani et al. [29], who built and tested traditional models for predicting dementia on the blog dataset using only the linguistic features, as shown in Table 6.1. We used 9-fold cross validation as in [29], where each test fold contains all posts from one dementia blog and one control blog, and the posts from the remaining four blogs were used in the training fold. The model selection process was carried out as described in Section 6.2.1.

Table 6.5: Binary classification with 9-fold cross-validation on the blog corpus.

Model                Accuracy  F-score
Majority class       0.590     0.742
Masrani et al. [29]  0.724     0.785
HAN                  0.579     0.582

The experiment results are summarized in Table 6.5. The traditional model demonstrates that dementia can also be automatically predicted from written text in the form of blog posts. However, the HAN model fails in this task. A key difference that may explain this result is that the samples in DementiaBank are descriptions of one single picture, and so are all about the same topic (i.e., the same objects and events, resulting in a corpus vocabulary of 1828 word types). In contrast, samples from the blog data cover a large variety of topics, ranging from regular medical appointments to re-connecting with an old friend on Facebook (with a much larger vocabulary size of 27413). The HAN model succeeded in focusing on informative concepts shown in the Cookie Theft picture, with the help of the attention layers.
However, for blog data there are no such concepts shared across all blog posts, and the data are likely not sufficient to cover the much larger vocabulary, resulting in the extremely poor performance of HAN. In contrast, the traditional machine learning method is quite effective on blog posts, likely because its large human-engineered set of features also includes features that are not lexically based (i.e., based on words), but instead capture task-independent aspects of language such as syntactic constituents and syntactic complexity.

To further explore the large difference in performance between the neural and traditional methods on blog data, we conducted an additional experiment. Unlike the original split setting, where all posts from the same blog are contained either in the training fold or in the test fold, here we shuffle all the posts regardless of which blogs they belong to, and divide them into 10 folds for cross validation. In this scenario, posts from the same blog will very likely appear in both the training and testing data, creating a form of data contamination. Not surprisingly, the HAN model is very accurate on this artificial task, with an average accuracy and F-score as high as 0.934 and 0.944, respectively. This could be because HAN captures the writing style and topics of each blogger rather than informative patterns for dementia prediction.

6.3 Discussion

We extend previous work based on traditional machine learning methods and engineered features by applying a neural model to language samples of elderly people to classify dementia patients from healthy controls. When not including the demographic feature (age), HAN matches the performance of the best model without age.
By incorporating age as extra information, the model not only achieves state-of-the-art performance on the DementiaBank dataset, but can also give decent prediction accuracy even when trained with a small portion of the available data. Visualization and statistical analysis reveal that the attention mechanism of the model manages to capture key concepts similar to the information unit features specified by human experts. Meanwhile, the blog experiment results indicate that HAN is not a universal classifier for predicting dementia from language. In a task where samples are not all about a single topic, a traditional model that exploits linguistic features (e.g., syntactic complexity, context-free grammar rules) is a better choice than HAN.

Chapter 7
Conclusions and Future Work

Early prediction of dementia is extremely important, as researchers believe that early diagnosis will be key to slowing and stopping the disease. Currently, a diagnosis is based on clinical expertise and cognitive screening tests, which have limited accuracy in the earlier stages of disease, or on invasive and resource-intensive testing, such as lumbar puncture or specialized neuroimaging. In this study, we tackled the problem of predicting dementia from language. In particular, we explored neural network models in the direction of avoiding any task-specific features, so that they could be easily generalized to other language datasets of dementia. This thesis has made three main contributions towards this effort.

First, we proposed a joint embedding approach to combine two multimodal task-agnostic feature groups, i.e., linguistic and acoustic features. The experiment results on the DementiaBank dataset showed that our models using the pairwise ranking scheme give performances comparable to baseline models using feature selection.

Secondly, we proposed a novel discourse coherence feature, which is also task-agnostic, for dementia prediction.
Unlike previous linguistic features that tried to detect language differences at a lexical and syntactic level, the new coherence feature aimed at capturing the language changes caused by AD at a higher level. We applied a neural coherence model [32] based on the Centering theory to generate coherence scores for the DementiaBank dataset. The logistic regression classifier using the coherence score as the only feature outperformed the majority class baseline by 4%. However, the coherence feature did not perform well when used together with other task-agnostic features. Our analysis indicated that AD patients and healthy controls do not show much difference with respect to this type of coherence on the task of describing the Cookie Theft picture, possibly because the task does not seem to require mentioning the same entity repeatedly very many times.

Lastly, we applied the Hierarchical Attention Networks (HAN) framework [38], which does not require any feature engineering, to dementia prediction. Our experiments on the DementiaBank dataset showed that HAN obtained results comparable to traditional models that use task-specific features, and that the modified HAN-AGE model achieved new state-of-the-art classification performance. In our analysis of the attention mechanism, we found that the words emphasized by the attention model overlapped with, but differed from, the information units defined by human experts. Further investigation to explain this difference is left as future work. Moreover, we evaluated the HAN model on a dementia blog dataset. Interestingly, the same neural model did not work well on this corpus of written text, suggesting that dementia prediction from language may require different methods depending on the genre of the source language.

Although our task-agnostic methods were only tested on an English dataset describing the Cookie Theft picture, they could be generalized to other cultures and languages.
It would be particularly useful to apply such methods in developing countries, which have an even more pressing need for inexpensive solutions.

Currently, a key limitation of predicting dementia from language is the scarcity of related datasets. The DementiaBank dataset seems to contain sufficient data (257 AD samples and 242 controls) to train neural text categorization models like HAN. However, there are only 5 vascular dementia samples and no samples at all for other types of dementia (e.g., dementia with Lewy bodies). Automatic prediction of different sub-types of dementia will not be possible until more data is collected.

Moreover, one interesting area of future work would be collecting a dataset with other modalities. Specifically, in a picture description task we could record the facial expressions of participants through a camera and their eye movements through an eye tracker. Similar ideas have been explored by Fraser et al. [13] and Poria et al. [35]. With data sources containing more than just speech, we could, for instance, extract new features and apply multimodal learning methods to this new dataset, and might potentially achieve even better performance than what is reported in this thesis.

Bibliography

[1] A.D. International. Dementia statistics. https://www.alz.co.uk/research/statistics, 2015. Accessed: 2019-2-13. → page 1

[2] S. Ahmed, C. A. de Jager, A.-M. Haigh, and P. Garrard. Semantic processing in connected speech at a uniformly early stage of autopsy-confirmed Alzheimer's disease. Neuropsychology, 27(1):79, 2013. → page 2

[3] S. Ahmed, A.-M. F. Haigh, C. A. de Jager, and P. Garrard. Connected speech as a marker of disease progression in autopsy-proven Alzheimer's disease. Brain, 136(12):3727–3737, 2013. → page 5

[4] S. Al-Hameed, M. Benaissa, and H. Christensen. Detecting and predicting Alzheimer's disease severity in longitudinal acoustic data. In Proceedings of the International Conference on Bioinformatics Research and Applications 2017, pages 57–61. ACM, 2017.
→ page 6

[5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. → page 25

[6] R. Barzilay and M. Lapata. Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34, 2008. → page 7

[7] J. T. Becker, F. Boller, O. L. Lopez, J. Saxton, and K. L. McGonigle. The natural history of Alzheimer's disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594, 1994. → pages 2, 5, 9

[8] L. W. Chambers, C. Bancej, and I. McDowell. Prevalence and monetary costs of dementia in Canada: population health expert panel. Alzheimer Society of Canada in collaboration with the Public Health Agency, 2016. → page 1

[9] B. Croisile, B. Ska, M.-J. Brabant, A. Duchene, Y. Lepage, G. Aimard, and M. Trillet. Comparative study of oral and written picture description in patients with Alzheimer's disease. Brain and Language, 53(1):1–19, 1996. → pages 2, 6

[10] B. H. Davis. So, you had two sisters, right? Functions for discourse markers in Alzheimer's talk. In Alzheimer Talk, Text and Context, pages 128–145. Springer, 2005. → pages 3, 17

[11] C. Ellis, A. Henderson, H. H. Wright, and Y. Rogalski. Global coherence during discourse production in adults: A review of the literature. International Journal of Language & Communication Disorders, 51(4):359–367, 2016. → pages 3, 17

[12] K. C. Fraser, J. A. Meltzer, and F. Rudzicz. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422, 2016. → pages 2, 3, 6, 14, 25, 26

[13] K. C. Fraser, K. L. Fors, D. Kokkinakis, and A. Nordlund. An analysis of eye-movements during reading for the detection of mild cognitive impairment. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1016–1026, 2017. → page 36

[14] S. Gao, H. C. Hendrie, K. S. Hall, and S. Hui.
The relationships between age, sex, and the incidence of dementia and Alzheimer disease: a meta-analysis. Archives of General Psychiatry, 55(9):809–815, 1998. → page 25

[15] B. J. Grosz, S. Weinstein, and A. K. Joshi. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225, 1995. → pages 7, 18

[16] C. Guinaudeau and M. Strube. Graph-based local coherence modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 93–103, 2013. → page 7

[17] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004. → page 7

[18] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. → page 21

[19] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, J. Devlin, A. Agrawal, R. Girshick, X. He, P. Kohli, D. Batra, et al. Visual storytelling. In 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2016), 2016. → page 20

[20] James Lyons. Mel frequency cepstral coefficient (MFCC) tutorial. http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/, 2013. Accessed: 2018-11-18. → page 12

[21] D. Kempler. Language changes in dementia of the Alzheimer type. Dementia and Communication, pages 98–114, 1995. → page 2

[22] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014. → pages 3, 7, 11, 13

[23] M. Laine, M. Laakso, E. Vuorinen, and J. Rinne. Coherence and informativeness of discourse in two dementia types. Journal of Neurolinguistics, 11(1-2):79–87, 1998. → pages 3, 17

[24] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 7, 18

[25] J. Li and E. Hovy. A model of coherence based on distributed sentence representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2039–2048, 2014. → page 7

[26] Z. Lin, H. T. Ng, and M.-Y. Kan. Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 997–1006. Association for Computational Linguistics, 2011. → page 7

[27] Louis-Philippe Morency and Tadas Baltrusaitis. Tutorial on multimodal machine learning. https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf, 2017. Accessed: 2019-1-16. → page 11

[28] V. Masrani. Detecting dementia from written and spoken language. Master's thesis, University of British Columbia, 2018. → pages 2, 3, 6, 11, 12, 14, 25, 26, 27

[29] V. Masrani, G. Murray, T. Field, and G. Carenini. Detecting dementia through retrospective analysis of routine blog posts by bloggers with dementia. BioNLP 2017, pages 232–237, 2017. → pages 26, 32, 33

[30] F. Nensa, K. Beiderwellen, P. Heusch, and A. Wetter. Clinical applications of PET/MRI: current status and future perspectives. Diagnostic and Interventional Radiology, 20(5):438, 2014. → page 1

[31] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 689–696, 2011. → page 6

[32] D. T. Nguyen and S. Joty. A neural local coherence model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1320–1330, 2017. → pages ix, 3, 7, 18, 19, 20, 21, 35

[33] S. O. Orimaye, J. S.-M. Wong, and K. J. Golden. Learning predictive linguistic features for Alzheimer's disease and related dementias using verbal utterances.
In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 78–87, 2014. → page 5

[34] D. B. Paul and J. M. Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, pages 357–362. Association for Computational Linguistics, 1992. → page 19

[35] S. Poria, E. Cambria, N. Howard, G.-B. Huang, and A. Hussain. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing, 174:50–59, 2016. → page 36

[36] H. Posner, R. Curiel, C. Edgar, S. Hendrix, E. Liu, D. A. Loewenstein, G. Morrison, L. Shinobu, K. Wesnes, and P. D. Harvey. Outcomes assessment in clinical trials of Alzheimer's disease and its precursors: readying for short-term and long-term clinical trial needs. Innovations in Clinical Neuroscience, 14(1-2):22, 2017. → page 1

[37] A. Tsaptsinos. Lyrics-based music genre classification using a hierarchical attention network. arXiv preprint arXiv:1707.04678, 2017. → page 24

[38] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016. → pages 3, 24, 25, 36

[39] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250, 2017. → page 7

[40] L. Zhang, S. Wang, and B. Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253, 2018. → page 24