@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Wu, Ming-An"@en ; dcterms:issued "2015-10-24T07:46:36"@en, "2015"@en ; vivo:relatedDegree "Master of Science - MSc"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description "User-adaptive visualization can provide intelligent personalization to aid the user in the information processing. The adaptations, in the form as simple as helpful highlighting, are applied based on user’s characteristics and preferences inferred by the system. Previous work has shown that binary labels of user’s cognitive abilities relevant for processing information visualizations could be predicted in real time, by leveraging user gaze patterns collected via a non-intrusive eye-tracking device. The classification accuracies reported were in the 59–65% range, which is statistically more accurate than a majority-class classifier, but not of great practical significance. In this thesis, we expand on previous work by showing that significantly higher accuracies can be achieved by leveraging summative statistics on a user’s pupil size and head distance to the screen measurements, also collected by an eye tracker. Our experiments show that these results hold for two datasets, providing evidence of the generality of our findings. We also explore the sequential nature of gaze movement by extracting common substring patterns and using the frequency of these patterns as features for classifying user’s cognitive abilities. Our sequence features are able to classify more accurately than the majority-class baseline, but unable to outperform our best classification model with the summative eye-tracking features."@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/55086?expand=metadata"@en ; skos:note "Inferring User Cognitive Abilities from Eye-Tracking DatabyMing-An WuB.Sc., The University of British Columbia, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)The University of British Columbia(Vancouver)October 2015c©Ming-An Wu, 2015AbstractUser-adaptive visualization can provide intelligent personalization to aid the userin the information processing. The adaptations, in the form as simple as helpfulhighlighting, are applied based on user’s characteristics and preferences inferredby the system. Previous work has shown that binary labels of user’s cognitive abil-ities relevant for processing information visualizations could be predicted in realtime, by leveraging user gaze patterns collected via a non-intrusive eye-trackingdevice. The classification accuracies reported were in the 59–65% range, which isstatistically more accurate than a majority-class classifier, but not of great practicalsignificance. In this thesis, we expand on previous work by showing that signifi-cantly higher accuracies can be achieved by leveraging summative statistics on auser’s pupil size and head distance to the screen measurements, also collected byan eye tracker. Our experiments show that these results hold for two datasets, pro-viding evidence of the generality of our findings. 
We also explore the sequentialnature of gaze movement by extracting common substring patterns and using thefrequency of these patterns as features for classifying user’s cognitive abilities. Oursequence features are able to classify more accurately than the majority-class base-line, but unable to outperform our best classification model with the summativeeye-tracking features.iiPrefaceThis thesis stems from the Advanced Tools for User-Adaptive Visualization project.The project team identified the general direction of this research. The datasetsused in this thesis, described in Chapter 3, were collected by other project mem-bers, and they are cited where appropriate. The data processing, classificationexperiments, and statistical analyses were done by myself. Cristina Conati pro-vided supervision and guidance for this research. Other project members, includ-ing Giuseppe Carenini, Dereck Toker, Sébastien Lallé, and Matthew Gingerich,provided feedback during the development of the research.A version of Chapter 4 appears in a conference paper manuscript1.1M. Wu and C. Conati. Inferring User Cognitive Abilities during Visualization Processing UsingComprehensive Eye-Tracking Data.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Individual Differences in Visualization Effectiveness . . . . . . . 42.2 User Model Acquisition . . . . . . . . . . . . . . . . . . . . . . . 52.3 Gaze in User Characteristics Classification . . . . . . . . . . . . . 62.4 Pupillometry and Posture in User Modelling . . . . . . . . . . . . 62.5 Gaze Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . 7iv3 Eye-Tracking Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Summative Features . . . . . . . . . . . . . . . . . . . . . . . . . 123.2.1 Gaze Features . . . . . . . . . . . . . . . . . . . . . . . . 123.2.2 Pupil Features . . . . . . . . . . . . . . . . . . . . . . . 153.2.3 Head Distance Features . . . . . . . . . . . . . . . . . . . 163.3 AOI Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 Revisiting Summative Features . . . . . . . . . . . . . . . . . . . . . 174.1 Classification Targets . . . . . . . . . . . . . . . . . . . . . . . . 174.2 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . 174.3 Model Evaluation Design . . . . . . . . . . . . . . . . . . . . . . 194.4 Comparisons of Classifiers and Feature Sets . . . . . . . . . . . . 194.4.1 Comparison of Classifier Performance . . . . . . . . . . . 224.4.2 Comparison of Feature Set Performance . . . . . . . . . . 
224.4.3 Performance of the Best Classification Model . . . . . . . 244.4.4 Comparison with Previous Work on Gaze Features . . . . 274.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 285 Exploring Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 305.1 AOI Modifications . . . . . . . . . . . . . . . . . . . . . . . . . 305.2 Sequence Patterns as Features . . . . . . . . . . . . . . . . . . . 325.2.1 Pattern Frequency and Cognitive Abilities . . . . . . . . . 325.2.2 Pattern Selection Criteria . . . . . . . . . . . . . . . . . . 335.3 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . 355.4 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365.4.1 Evaluation on Pattern Selection Criteria . . . . . . . . . . 365.4.2 Evaluation on Sequence Feature Performance . . . . . . . 405.5 Variations of Sequence Feature . . . . . . . . . . . . . . . . . . . 455.5.1 Collapsing Sequences . . . . . . . . . . . . . . . . . . . 455.5.2 Extracting Long AOI Visits . . . . . . . . . . . . . . . . 455.5.3 Annotating Sequence with Duration . . . . . . . . . . . . 465.5.4 Language Model . . . . . . . . . . . . . . . . . . . . . . 46v5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 476 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . 486.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . 486.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496.2.1 Replication on Additional Datasets . . . . . . . . . . . . 496.2.2 Using Additional Gaze Features . . . . . . . . . . . . . . 496.2.3 Re-examining Binary Split . . . . . . . . . . . . . . . . . 506.2.4 More Sophisticated Sequence Processing . . . . . . . . . 506.2.5 System Prototype Implementation . . . . . . . . . . . . . 51Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 58B Additional Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 69B.1 Visualization Expertise and Locus of Control . . . . . . . . . . . 69B.2 Over-time Accuracy for Best Non-Pupil Feature Set . . . . . . . . 72viList of TablesTable 3.1 Summary of dataset parameters. . . . . . . . . . . . . . . . . . 12Table 3.2 List of summative features and groups. . . . . . . . . . . . . . 13Table 4.1 Main effects and the interaction effects in the six ANOVAs. . . . 20Table 4.2 Ranking of feature set by accuracy for the random forest classifier. 22Table 4.3 Comparison of peak accuracies between the previous experi-ments and the current results. . . . . . . . . . . . . . . . . . . 25Table 4.4 Feature importance scores for the pupil and head distance toscreen features. . . . . . . . . . . . . . . . . . . . . . . . . . 27Table 4.5 Comparisons of classification accuracies with the gaze featuresbetween previous experiments and current implementation. . . 28Table 5.1 Examples of Sequence Support (SS) and Average Pattern Fre-quency (APF) calculations. . . . . . . . . . . . . . . . . . . . . 34Table 5.2 Contingency table used in the χ2 test for SS. . . . . . . . . . . 35Table 5.3 Examples of the samples used in the t-test for APF. . . . . . . . 35Table 5.4 Sequence pattern counts under the feature selection criteria. . . 36Table 5.5 Main effects and the interaction effects in the six ANOVAs. . . . 38Table 5.6 Pairwise comparisons of random forest and logistic regression,using the S1+S2 feature selection criterion. . . . . . . . . . . . 
40Table 5.7 Main effect of feature set in the six ANOVAs. . . . . . . . . . . 42Table 5.8 Comparisons of the sequence feature set against the baseline. . 42Table 5.9 Comparisons of the summative feature sets with the sequencefeature set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43viiTable 5.10 Comparisons of adding sequence features to summative featuresets with summative feature sets alone. . . . . . . . . . . . . . 43Table 5.11 Rankings of feature set by classification accuracy for the ran-dom forest classifier. . . . . . . . . . . . . . . . . . . . . . . . 45Table A.1 List of sequence patterns that pass the S1 pattern selection cri-terion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58Table B.1 Ranking of feature set by accuracy for the random forest classi-fier in classifying visualization expertise and locus of control. . 70viiiList of FiguresFigure 3.1 The bar chart and the radar chart used in the BAR+RADAR study. 10Figure 3.2 The bar chart used in the INTERVENTION study. . . . . . . . . 11Figure 3.3 Heatmap of the fixations on the bar chart in the INTERVEN-TION study. . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Figure 3.4 Illustration of saccade measures. . . . . . . . . . . . . . . . . 14Figure 4.1 Accuracies of classifying user characteristics using the threeclassifiers and six feature sets. . . . . . . . . . . . . . . . . . 21Figure 4.2 Accuracies of classifying user characteristics using the six fea-ture sets with the random forest classifier. . . . . . . . . . . . 23Figure 4.3 Trends in classification accuracy for random forest with P+HD,as a function of amount of observed data in a trial. . . . . . . 24Figure 4.4 Class accuracies of logistic regression with GAZE feature setand random forest with P+HD feature set. . . . . . . . . . . . 26Figure 5.1 Area of Interest (AOI) modification for the INTERVENTIONdataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31Figure 5.2 AOI modification for the radar chart in the BAR+RADAR dataset. 31Figure 5.3 Comparison of pattern selection criteria between S1 and S1+S2. 39Figure 5.4 Comparison of classifiers with the patterns selected by S1+S2. 41Figure 5.5 Comparison of classification performance between sequencepattern features and summative features. . . . . . . . . . . . . 44Figure B.1 Accuracies of classifying expertise and locus of control in theINTERVENTION dataset. . . . . . . . . . . . . . . . . . . . . 71ixFigure B.2 Trends in classification accuracy for random forest with thebest non-pupil feature set, as a function of amount of observeddata in a trial. . . . . . . . . . . . . . . . . . . . . . . . . . . 
73xList of AbbreviationsAPF Average Pattern FrequencyANOVA Analysis of VarianceAOI Area of InterestEMDAT Eye Movement Data Analysis Toolkit, a library for processing and ex-tracting features from eye-tracking dataSS Sequence SupportWM Working MemoryxiAcknowledgmentsThere are several people that I would like to thank.I am grateful for the supervision of Cristina Conati, who helped me with herexperience and offered me advice and guidance throughout my study.I thank Gail Murphy, my second reader, for her speedy review and helpfulcomments.Thank you to Giuseppe Carenini, Dereck Toker, Sébastien Lallé, and MatthewGingerich for all their help.Thank you to my fellow students, Zongxu Mu, Rui Ge, Yifan Peng, Ben Zhu,Shuochen Su, Yidan Liu, and Jianing Yu for all the fun times.Thank you to Kailun Zhang, who always supported me, encouraged me, andcheered me up when needed with her sense of humour.I owe a big thank you to my parents, Zhenzhen An and Changhua Wu, fortheir unconditional support and encouragement they have given me throughout theyears.My research was funded by the Natural Science and Engineering ResearchCouncil of Canada.xiiChapter 1IntroductionIn this digital age, data is collected in all aspects of human life, from scientific re-search to personal health and finance, and people are tasked with interpreting andanalyzing an increasing amount of data. Leveraging the natural strengths and lim-itations of human physiology, cognition, and psychology, tools have been createdto help people understand their data. Information visualization is one of such tools.It offloads cognition to the perceptual system, thus frees up a person’s limited cog-nitive and memory resources for solving higher-level tasks.Although information visualization systems have been created for a wide va-riety of data and task domains, research has shown that individual differences canaffect task performance and visualization preferences. Previous work has startedexploring user-adaptive visualization systems to incorporate these individual dif-ferences into the design and further improve the tools that people use for their dataanalysis tasks.For a user-adaptive system to provide personal adaptations for each user, thesystem must get to know its users. The system builds a user model that containsthe relevant attributes of a user, which it then uses to perform the appropriate adap-tations to meet the user’s needs. The relevant user attributes may include demo-graphics, interests, preferences, cognitive abilities, emotional states, etc.This thesis focuses on building the part of the user model related to user’s cog-nitive abilities. Specifically, we are interested in classifying, in real time, threeuser characteristics – perceptual speed, visual working memory (WM), and verbal1WM – from user’s gaze behaviours. Observing a user from from an eye trackerprovides a non-intrusive mean of understanding the behaviour of the user, thus al-lows the system to build a user model without complete cooperation from the user.Methods that require cooperation from the user can be time inefficient and evendisruptive. Previous work achieved promising results in classifying a number ofuser characteristics, and our goal is to further improve the classification accuracy.1.1 Research QuestionsTo improve the accuracy of classifying user characteristics, we experiment with anumber of eye-tracking feature categories along with a set of classification algo-rithms. 
Previous work [17, 41] used summative gaze features, e.g., fixation dura-tion and saccade length, and achieved promising classification accuracies. We areinterested in improving the accuracy by using additional measurements from theeye tracker, such as pupil size and head distance to screen. Also, given the previ-ous findings of gaze movement patterns that occur more or less frequently for userswith certain characteristics, we test the performance of these sequence patterns asfeatures in classification tasks.Our research questions are:1. Can we improve the accuracy of classifying user’s cognitive abilities thanprevious work achieved using summative gaze features, by using additionalmeasures of user’s pupil size and head distance to screen also collected bythe eye tracker?2. Can sequential patterns of user’s gaze movement be used to further improvethe classification of user’s cognitive abilities?1.2 ContributionsThe main contributions of the thesis include the finding of a stronger summativeeye-tracking feature set and classifier combination that greatly improves the accu-racy of classifying user’s cognitive abilities. We incorporate additional measuresfrom the eye-tracker than previous work, and we compare the performance of dif-ferent feature set combinations. Our experiment results show that the pupil size2and head distance to screen features with the random forest classifier performs sig-nificantly better than other summative features we considered.The second contribution is the exploration of patterns in gaze movement se-quence. We propose and analyze methods for extracting patterns from the gazesequences. We show that patterns extracted from the sequences are useful featuresfor classifying user’s cognitive abilities, though they do not perform as well as thesummative eye-tracking features.1.3 OverviewThis thesis discusses experiments of classifying several user cognitive abilities withdifferent sets of eye-tracking measures. Background and previous work related tothis research are introduced in Chapter 2. The eye-tracking data and feature setsare described in Chapter 3. Chapter 4 discusses improvements made in classifyingwith summative features. Chapter 5 explores patterns in the gaze movement aspotential features in the classification tasks. Finally, we discuss future work andconclude this thesis in Chapter 6.3Chapter 2Related WorkPrevious research has shown that individual differences, such as in cognitive abili-ties, can influence the effectiveness of information visualizations. These individualdifferences can be accounted for through the designing of user-adaptive visualiza-tion systems. Such systems require building user models that describe the usercharacteristics, according to which the system can then perform the appropriateadaptations. To acquire user characteristic information, previous work has em-ployed user’s gaze behaviour, which was shown to be influenced by their cognitiveabilities, to predict these characteristics using machine learning techniques. 
Inan effort to improve the previous results, this work also draws from the research inuser modelling via pupillometry and posture, and we leverage differential sequencemining technique to explore patterns in the sequential gaze movement and expandour feature set.2.1 Individual Differences in Visualization EffectivenessSince early human-computer interaction research that pointed to the advantages ofconsidering individual user characteristics in interface design [2, 13], the effect ofindividual differences on visualization effectiveness has been studied in a variety ofcontexts, and various types of individual differences have been found to influencea user’s experience during visualization tasks. Chen [6] found a strong correlationbetween associative memory and user performance of a spatial-semantic search4task in a three-dimensional information visualization system. Velez et al. [48]found not only a large diversity of spatial ability in the population, but also a corre-lation between spatial ability and the task performance with a three-dimensional vi-sualization. Conati and Maclaren [8] found that user’s perceptual speed influencesthe relative effectiveness of two alternative visualizations for a given visualizationtask. Ziemkiewicz et al. [51] discovered that the personality trait known as locus ofcontrol correlates with user’s performance on a visualization that employs a con-tainer metaphor. Toker et al. [45], Carenini et al. [5], and Conati et al. [9] showedthat cognitive abilities like perceptual speed, visual WM, and verbal WM can impactboth user’s visualization preference and task completion time; these three studiesinvolved different information visualization systems, indicating the generality ofthe effects. Other work has shown that these cognitive abilities impact the process-ing of specific visualization elements. For instance, Toker et al. [46] found thatusers with lower levels of perceptual speed need greater visual effort to processlegend and labels in bar charts.These findings demonstrate the need for considering individual differences invisualization design, suggesting that users could benefit from receiving personal-ized intervention that can facilitate their processing as needed during interaction.2.2 User Model AcquisitionOne of the strategies for incorporating user characteristics to improve usability, asidentified by Innocent [22], is designing an adaptive interface system. An essentialcomponent of such system is the user model, which contains knowledge about theuser that is useful for adaptation. Depending on the application context, the rele-vant user information may include user characteristics (e.g., demographic, person-ality, cognitive ability, prior experience), past behaviours (e.g., browsing history,application usage frequency), and even emotional state [2, 18, 24].There are two types of method for learning information about users (i.e., build-ing the user model): explicit self-reports and -assessment by the user and non-explicit inputs [24]. As Jameson [24] argues, using non-explicit inputs has theadvantage of being less obtrusive and requiring less effort from the user. Sensingdevices, e.g., eye trackers, can provide such non-explicit inputs, though sophisti-5cated data processing is required to extract meaningful features. 
Feature extractionfrom low-level sensor data is typically done with machine learning and patternrecognition techniques [31].2.3 Gaze in User Characteristics ClassificationPrevious work has showed the impact of several user characteristics on their gazebehaviours. Toker et al. [46] and Toker and Conati [44] found that, during informa-tion visualization processing, user’s perceptual speed, visual WM, and verbal WMcan influence several parameters of their gaze, such as fixation duration, number offixations within a specific area of interest, and transitions between areas of inter-est. For example, users with weaker verbal working memory had relatively morefixations on the textual elements of the visualization interfaces.Based on these findings on the relationship between cognitive abilities andgaze, two previous works have attempted to reveal these individual differencesthrough user’s gaze behaviours during their interaction with the visualizations. Ste-ichen et al. [41] and Gingerich and Conati [17] developed machine learning clas-sifiers that were able to predict several cognitive abilities based on gaze features.They achieved 59–64% accuracies at predicting the binary levels of perceptualspeed, visual WM, and verbal WM, using gaze measures as the features. Our goalis to improve the accuracies in these classification tasks, using alternative classifieralgorithms and additional features; we compare against their results in details inlater chapters.2.4 Pupillometry and Posture in User ModellingRecently, there has been increasing interest in using pupillometry for user model-ing to track user states. Iqbal et al. [23] studied the effects of task type and taskswitching on the fluctuation of user’s mental workload, which was measured bytheir pupil size. Prendinger et al. [36] attempted to use pupil size as a predictor ofuser’s preferences for visually presented objects, but pupil features were ineffec-tive in their task. In the study by Martínez-Gómez and Aizawa [32], pupillometryprovides some of the most discriminative features in recognizing language skillsand level of understanding in reading tasks. Lallé et al. [30] used pupil size among6other features in predicting user’s learning curve and achieved encouraging results.The distance between user’s head to the screen has been used in user modelingas a proxy measure for posture. Jaques [25] found that head distance to screenis effective in predicting boredom during interaction with an intelligent tutoringsystem, consistently with results that others achieved using more advanced posture-tracking device, such as the Tekscan R© Body Pressure Measurement System usedin the experiment by D’Mello et al. [12].Both pupillometry and posture have been linked to cognitive load in psychol-ogy research. Pupil size was found to increase as cognitive load increases [21, 26].As for posture, some found that an increase in cognitive load results in a decreasein postural stability [33, 37], while others observed the opposite [11, 38]. As likelyindicators of changes in cognitive load, pupillometry and posture are of interest tous because of the direct connection between cognitive load and working memorycapacity.2.5 Gaze Sequence MiningIn addition to the summative gaze measures, previous work has also demonstratedthe potential of leveraging the sequential nature of gaze movement in acquiringuser characteristics. Steichen et al. [43] discovered gaze movement patterns thatoccur at different frequencies depending on user’s cognitive abilities. 
They appliedthe differential sequence mining technique, a method originally proposed by Kin-nebrew and Biswas [27], on sequences of fixation locations, recorded while theuser is processing an information visualization. For each of perceptual speed, vi-sual WM, and verbal WM, between three to seven short patterns (of length threeto seven fixations) were found to occur at statistically significantly different fre-quencies for users with different levels of cognitive abilities. Part of our work isbuilt directly on this result, thus we will discuss about this work in more details inChapter 5.Other work on gaze movement sequence involves comparing the similarity offull sequences. For example, the ScanMatch method, developed by Cristino et al.[10], is based on the Needleman-Wunsch algorithm for comparing DNA sequencesin the bioinformatics context. The method produces a score that describes the7similarity between two full sequences, by considering their spatial and temporalproperties, and the score can then be used for clustering of the sequences. TheeyePatterns tool was developed by West et al. [49] for comparing and visualizingthe similarity of sequences, while also providing basic support for finding patternsin subsequences.8Chapter 3Eye-Tracking DataThis chapter introduces the two eye-tracking datasets that we use in the user char-acteristic classification experiments.3.1 DatasetsFor experimenting with classifying user’s cognitive abilities, we used two eye-tracking datasets from two previous user studies: the BAR+RADAR study [45] andthe INTERVENTION study [5]. Both studies were controlled experiments, duringwhich participants were asked to solve various visualization tasks, while their gazewas tracked by a Tobii T120 eye tracker embedded in the computer monitor. Thetasks in each study were designed based on a set of primitive analysis task types,identified as largely capturing common activities during information visualizationprocessing, e.g., retrieving the value of a specific data point, finding the data pointwith an extreme value in the dataset [1]. Areas of interest (AOIs) are defined onthe visualization interfaces: five AOIs for BAR+RADAR (Figure 3.1), and six forINTERVENTION (Figure 3.2). The fixation heatmap in Figure 3.3 gives an exampleof the distribution of participants’ fixations on a task. Table 3.1 summarizes theparameters of the datasets, including number of participants, number of trials, thevisualizations and task types deployed in the study.In each study, participants were tested for the following cognitive abilities,using standard tests available from the literature:9Figure 3.1: The bar chart (top) and the radar chart (bottom) used in theBAR+RADAR study, with five AOIs in each. Figures are from Steichenet al. [42].10Figure 3.2: The bar chart used in the INTERVENTION study, with six AOIs.Figure 3.3: Heatmap of the fixations on the bar chart in the INTERVENTIONstudy, aggregated by all users on one task. Created by SEQIT [50].11Table 3.1: Summary of dataset parameters.Bar+Radar InterventionVisualizations Bar chart and radar charts. Bar chart with four types ofassistive interventions.Task Types Retrieve value, filter, com-pute derived value, find ex-tremum, sort.Retrieve value, compute de-rived value.Participants 35 62Trials per user 28 80AOIs 5 61. perceptual speed, a measure of speed when performing simple perceptualtasks [14];2. visual working memory, a measure of temporary storage and manipulationcapacity of visual and spatial information [16];3. 
verbal working memory, a measure of temporary storage and manipulationcapacity of verbal information [47].3.2 Summative FeaturesIn our work, we use three main groups of summative eye-tracking features: Gaze,Pupil, and Head Distance. They are described in details below and summarized inTable 3.2.3.2.1 Gaze FeaturesAn eye tracker captures user’s gaze points, i.e., the locations on the screen wherethe user is looking, at a constant frequency (120 Hz in our datasets). Nearby gazesamples are clustered into fixations, i.e., the predicted focus of attention points; twoconsecutive fixations are connected by a saccade, which is a rapid eye movement.We generated gaze features based on the ones used in the two previous exper-iments [17, 42]. From the raw eye-tracking data, we extracted fixations using the12Table 3.2: List of summative features and groups.Group FeaturesGaze (AOI-independent) Fixation rate, Time spent. Sum, mean, and standarddeviation of: fixation duration, saccade length, ab-solute and relative saccade angle.Gaze (AOI-specific) Within each AOI: fixation rate, longest fixation, timeto first fixation, time to last fixation, total and pro-portional Time spent, total and proportional timespent, total and proportional number of transitionsto every other AOI.Pupil Min, max, mean, standard deviation of pupil size,starting pupil size, end pupil size.Head Distance Min, max, mean, standard deviation of head dis-tance, starting head distance, end head distance.Tobii I-VT Fixation Filter [34], which produces outputs of higher quality than theolder Tobii fixation filter used in previous experiments.Note: One distinct difference of the fixation filters is that the I-VT Fixationfilter sets the minimum fixation duration to 60 ms by default, consistent with cog-nition and vision research that suggests a minimum threshold for the duration ofinformation processing during a fixation [29], and previous eye-tracking studieshave been setting the minimum fixation duration to at least 60 ms [28, 40]. With-out this threshold, a fixation can be as short as 8 ms in our datasets.Then, we processed the data into a set of basic measures calculated using theopen-source Eye Movement Data Analysis Toolkit (EMDAT). After removing tri-als with low-quality samples (e.g., prolonged gaps caused by fixations outside thescreen), EMDAT generated gaze features by calculating various summary statistics(e.g., sum, mean, and standard deviation) over the following measures in a trial.With fixations, EMDAT calculates these statistics:• Fixation rate: number of fixations per milliseconds.• Number of fixations: total number of fixations within a time interval, e.g., atrial in the user study.13• Fixation duration: sum of all fixation durations, mean and standard deviationof individual fixation durations.With saccade, EMDAT calculates the sum, mean, and standard deviation ofthese measures:• Saccade length: the distance between the two fixations that the saccade con-nects (the d in Figure 3.4).• Absolute saccade angle: the angle between the saccade and the horizontalaxis (the x in Figure 3.4).• Relative saccade angle: the angle between two consecutive saccades (the yin Figure 3.4).yxd 2341Figure 3.4: Illustration of saccade measures. Blue circles represent fixations,ordered by the number on the circle. 
d: saccade length; x: absolutesaccade angle; y: relative saccade angle.These statistics can be computed for gaze points over the entire interface, gen-erating the gaze features labelled as AOI-independent in the first row of Table 3.2,or they can be computed over specific AOIs in the interface, generating the AOI-specific gaze features listed in the second row of Table 3.2 and described below.AOI-related FeaturesArea of Interest (AOI) is used in interface analyses to provide meaningful contex-tual information on user’s attention. For each AOI, EMDAT calculates the followingmeasures:14• Fixation rate: number of fixations in the AOI per millisecond spent in theAOI.• Longest fixation: the longest duration of any fixation in the AOI.• Time to first/last fixation: the time at which the first/last fixation in the AOIoccurs.• Number of fixations: total number of fixations in the AOI and its proportionto all fixations.• Time spent: total time spent in the AOI and its proportion to the total time ofthe trial.• Transition to every other AOI: number of saccades from this AOI to everyother AOI and its proportional to all saccades originated from this AOI.3.2.2 Pupil FeaturesFor every recorded sample, the eye tracker captures the diameter of user’s pupil.Pupil size is very sensitive to environmental lighting changes, thus both of the stud-ies that generated our datasets were conducted in windowless rooms with uniformlighting to control for this factor.Using EMDAT, we computed a set of summary statistics on user pupil size, suit-able for describing fluctuations of this measure over the course of each task [30].These include the minimum, the maximum, the mean, and the standard deviation ofuser’s pupil sizes during each trial. We also included the measure of a user’s pupilat the beginning and the end of each trial (start and end pupil size in Table 3.2),because of the phenomenon of changing mental workload at task boundaries [23].Differently from many other works, we used the raw pupil size instead of thepercentage change in pupil size (PCPS), which is adjusted from a baseline pupilsize under resting conditions. The two measures describe different aspects of pupilbehaviour – raw pupil size captures individual differences in this measure, whereasPCPS isolates the change in pupil size – both have the potential to explain certainuser characteristics. We would have considered both measures if not because restpupil size was not measured in the BAR+RADAR study.153.2.3 Head Distance FeaturesFor every recorded sample, the eye tracker also captures the distance from user’shead to the screen, or more precisely, from the user’s eyes to the camera on theeye tracker attached to the screen. The participants in both studies were in a sit-ting position in front of the computer, so this distance measure represents user’sposture in a single dimension: when the user is leaning forward, the distance isshorter than if the user is sitting back. We averaged the distances from the two eyesand used EMDAT to compute the same set of statistics as for pupil size, includingthe minimum, the maximum, the mean, and the standard deviation of user’s headdistances during each trial, as well as the first and the last distance measurementsthat describe the user’s posture at the beginning and the end of each trial.3.3 AOI SequencesTo capture the sequential nature of gaze movement, we created AOI sequences thatdescribe the spatial paths of user’s gaze while processing the interface. 
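The pupil-size and head-distance feature groups described above are plain summary statistics over the eye tracker's per-sample output (with the two eyes' distance readings averaged first). As a rough illustration only, not the actual EMDAT implementation (the column names and the per-trial data frame are hypothetical), they could be computed along these lines:

```python
import pandas as pd

def pupil_and_distance_features(trial: pd.DataFrame) -> dict:
    """Summary statistics of pupil size and head distance for one trial,
    mirroring the Pupil and Head Distance groups in Table 3.2.
    Assumes hypothetical columns 'pupil_size' and 'head_distance',
    one row per eye-tracker sample, ordered by time."""
    features = {}
    for column, prefix in [("pupil_size", "pupil"), ("head_distance", "distance")]:
        values = trial[column].dropna()                # skip invalid samples
        features[prefix + "_min"] = values.min()
        features[prefix + "_max"] = values.max()
        features[prefix + "_mean"] = values.mean()
        features[prefix + "_std"] = values.std()
        features[prefix + "_start"] = values.iloc[0]   # measurement at trial start
        features[prefix + "_end"] = values.iloc[-1]    # measurement at trial end
    return features
```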
Each element in a sequence corresponds to a fixation and appears in the sequence as the AOI in which that fixation is located. For example, if the user first reads the question, then looks at the legend, and finally moves to the chart, the AOI sequence might look like “Question-Question-Question-Legend-Legend-Chart-Chart-Chart”. The repetition of an AOI reveals how many consecutive fixations the user made within that region. To ensure that all fixations on the screen appear in the sequence, including those outside of the pre-defined AOIs, we modified the AOI definitions when creating the sequences (Section 5.1).

Chapter 4
Revisiting Summative Features

In this chapter, we experiment with classifying several user characteristics using the summative eye-tracking features introduced in Chapter 3. Previous work on classifying user characteristics from gaze data achieved accuracies significantly above the baselines [17, 41]. We extend that work by investigating new summative features and an alternative classifier in an effort to improve classification accuracy. We begin this chapter by introducing the classification models we build, including the classifier algorithms and the different feature sets. We then report the statistical analyses conducted to evaluate their performance.

4.1 Classification Targets

The classification targets are the binary labels of each user characteristic (perceptual speed, visual WM, verbal WM). As in previous work, the binary labels were generated by grouping the participants in each dataset into the “High” and the “Low” groups for each user characteristic, via a median split on the respective test scores.

4.2 Classification Models

To perform these supervised machine learning classifications, we built a number of models with various classifier algorithms and summative eye-tracking feature sets. Three machine learning classifier algorithms were chosen for the experiment.

1. LOGISTIC REGRESSION is the classifier with the best performance in the previous studies that classify user characteristics with gaze data [17, 41].
2. RANDOM FOREST is an effective and widely used classifier, and it performed the best in our initial test with the pupil size and head distance features. The random forest we built contains 100 trees, a value we found to balance performance and model training time in our initial test.
3. MAJORITY CLASS is included to provide a baseline for evaluating the other two classifiers. For each of the binary classification labels, it consistently predicts the more frequent class in the training set, without using any knowledge of the eye-tracking measures.

Using the three groups of summative features in Chapter 3, we constructed six feature sets.

1. GAZE feature set contains the features in the gaze group, including both the AOI-independent features and the AOI-specific variants.
2. PUPIL feature set contains the features in the pupil group.
3. HEAD DISTANCE feature set contains the features in the head distance group.
4. P+HD feature set combines the PUPIL and the HEAD DISTANCE feature sets, which have six features each. We want to compare the performance of these twelve features against that of the 100+ features in the GAZE feature set.
5. G+HD feature set combines the GAZE and the HEAD DISTANCE feature sets. We selected this combination by excluding the features in the pupil group, which are susceptible to lighting and other conditions and thus might not always be readily available.
6.
G+P+HD feature set combines all of the summative features: GAZE, PUPIL,and HEAD DISTANCE feature sets.The combination of GAZE and PUPIL feature sets is omitted from the analysis,because we did not find a compelling practical reason for it as we did for the othercombinations.184.3 Model Evaluation DesignWe designed a formal evaluation to study the impact of classifier and feature seton the accuracy of classifying user characteristic labels using the BAR+RADARand the INTERVENTION datasets. For each user characteristic measure in the twodatasets, we ran a two-way repeated measure analysis of variance (ANOVA), withclassifier and feature set as the factors and classification performance as the de-pendent measure. The three classifiers are MAJORITY CLASS, LOGISTIC REGRES-SION, RANDOM FOREST, and the six feature sets are GAZE (G), PUPIL (P), HEADDISTANCE (HD), P+HD, G+P+HD. We used the mean accuracy of 10 runs of 10-fold cross-validation as the metric for classification performance; this process al-lows both evening out the effect of random partitioning and also measuring thevariability of classifier’s performance.It should be noted that, for model accuracy, we used the accuracy of the classi-fication at the end of each trial, i.e., after the classifier has seen all of the eye-trackerdata generated by a user during that trial. The two previous classification experi-ments also reported the over-time accuracies of predicting with partially observedtrials and obtained “peak accuracy” before the end of the trial [17, 41]. We chosenot to compare the over-time accuracies to simplify the analyses on classifier andfeature set, given that, based on previous results, we expected a model with a higherend-of-trial accuracy to likely have higher over-time accuracies too [41]. However,in the results, we present the over-time accuracy results for the best performingclassifier and feature set for reference.Following the two previous classification experiments in the comparison, weconducted the train-test split in cross-validation by randomly assigning the trials,without grouping trials based on users. We justified this decision by consideringthe relatively small number of participants in each dataset as a subset of the largetraining data expected in practical implementation, in which there are more userswith similar characteristics and behaviours.4.4 Comparisons of Classifiers and Feature SetsFor both datasets and the three user characteristics of perceptual speed, visual WM,and verbal WM, the main effects of classifier, feature set, and the interaction effects19Table 4.1: Main effects and the interaction effects in the six ANOVAs, one foreach dataset and user characteristic combination. 
PS is short for percep-tual speed.Dataset Characteristic Effect F-Ratio η2G p-valueBar+Radar PS Classifier F2,18 = 14429 .995 p < .001Feature Set F5,45 = 664 .951 p < .001C × FS F10,90 = 673 .975 p < .001Verbal Classifier F2,18 = 13051 .994 p < .001Feature Set F5,45 = 495 .939 p < .001C × FS F10,90 = 484 .965 p < .001Visual Classifier F2,18 = 14664 .994 p < .001Feature Set F5,45 = 737 .967 p < .001C × FS F10,90 = 566 .968 p < .001Intervention PS Classifier F2,18 = 163593 .999 p < .001Feature Set F5,45 = 3731 .991 p < .001C × FS F10,90 = 1544 .991 p < .001Verbal Classifier F2,18 = 47100 .998 p < .001Feature Set F5,45 = 2410 .988 p < .001C × FS F10,90 = 2332 .992 p < .001Visual Classifier F2,18 = 55898 .999 p < .001Feature Set F5,45 = 2214 .998 p < .001C × FS F10,90 = 1972 .991 p < .001are all statistically significant, with large effect sizes1 (Table 4.1).The result shows that the choice of classifier and feature set has significanteffect on the accuracy of classifying these user characteristics. To examine the sig-nificant interaction effects between classifier and feature set that qualify the maineffects, we look at the specific pairwise comparisons with Tukey’s HSD tests inthe following sections. Tukey’s HSD test is a multiple comparison procedure thatperforms t-tests and controls the familywise error rate.1The effect size measure used is the generalized eta squared (η2G), which is suitable for repeatedmeasure analyses [3]. The effect size is considered to be large when the value is above .5, medium ifbetween .3 and .5, and small if between .1 and .3.20Bar+Radar Intervention0.00.10.20.30.40.50.60.70.80.91.00.00.10.20.30.40.50.60.70.80.91.00.00.10.20.30.40.50.60.70.80.91.0Perceptual SpeedVerbal WMVisual WMG P HD P+HD G+HD G+P+HDG P HD P+HD G+HD G+P+HDG P HD P+HD G+HD G+P+HDG P HD P+HD G+HD G+P+HDG P HD P+HD G+HD G+P+HDG P HD P+HD G+HD G+P+HDFeature SetClassifier Majority Class Logistic Regression Random Forest Figure 4.1: Accuracies of classifying the three user characteristics in the twodatasets using the three classifiers and six feature sets. Error bars are95% confidence intervals.214.4.1 Comparison of Classifier PerformanceTukey’s HSD pairwise comparisons for classifier within each feature set show thatboth LOGISTIC REGRESSION and RANDOM FOREST perform significantly betterthan the MAJORITY-CLASS baseline. In the case of LOGISTIC REGRESSION withGAZE feature set, this result is consistent with the findings from the precedingstudies. More importantly, the accuracies of RANDOM FOREST are statisticallysignificantly higher than those of LOGISTIC REGRESSION, with the only exceptionbeing predicting verbal WM with the GAZE feature set in the BAR+RADAR study.This result can be visualized in Figure 4.1, where the bars representing RANDOMFOREST are higher, with non-overlapping error bars, than the bar to their left repre-senting LOGISTIC REGRESSION, except in the leftmost cluster of bars (representingGAZE) of the center-left bar graph.4.4.2 Comparison of Feature Set PerformanceWe focus on looking at how different features sets perform with RANDOM FOREST,the winning classifier in the previous section. The performance of RANDOM FOR-EST classifier benefits from having the PUPIL and HEAD DISTANCE features. Thisis clearly shown in Table 4.2 and Figure 4.2, the former of which summarizes theoutcome of the relevant pairwise comparisons for each classification label in eachdataset. 
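As a concrete illustration of the evaluation design in Sections 4.2–4.3, the three classifiers and the 10 runs of 10-fold cross-validation could be set up in scikit-learn roughly as follows. This is only a sketch under stated assumptions: the synthetic X and y stand in for a real feature matrix and binary labels, and apart from the 100 trees the hyper-parameters are library defaults rather than the thesis's actual settings.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

classifiers = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=100),
}

def mean_cv_accuracy(clf, X, y, runs=10, folds=10):
    """Mean accuracy over `runs` repetitions of `folds`-fold cross-validation.
    Trials are shuffled on every run and are not grouped by user, matching
    the evaluation design described above."""
    scores = []
    for run in range(runs):
        cv = KFold(n_splits=folds, shuffle=True, random_state=run)
        scores.extend(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))
    return float(np.mean(scores))

# Synthetic stand-in: one row of summative features per trial, and a binary
# High/Low label for one cognitive ability (e.g., perceptual speed).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.integers(0, 2, size=200)

for name, clf in classifiers.items():
    print(name, round(mean_cv_accuracy(clf, X, y), 3))
```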
As the table shows, GAZE is always the worse performing dataset, with theTable 4.2: Ranking of feature set by accuracy for random forest classifier.Feature sets with differences in accuracy not significant at α = 0.05 areunderlined. The best performing feature set without pupil features are inbold.Dataset User Characteristic Ranking of Feature Sets by AccuracyBar+Radar PS P+HD > G+P+HD > G+HD > HD > P > GVerbal WM P+HD > G+P+HD > HD > G+HD > G > PVisual WM P+HD > G+P+HD > HD > P > G+HD > GIntervention PS P+HD > G+P+HD > G+HD > HD > P > GVisual WM P+HD > G+P+HD > G+HD > G > P > HDVerbal WM P+HD > G+P+HD > G+HD > P > HD > G22PS Verbal WM Visual WM PS Verbal WM Visual WM 0.00.10.20.30.40.50.60.70.80.91.0G P HDP+HDG+HDG+P+HD G P HDP+HDG+HDG+P+HD G P HDP+HDG+HDG+P+HD G P HDP+HDG+HDG+P+HD G P HDP+HDG+HDG+P+HD G P HDP+HDG+HDG+P+HDFeature SetAccuracyBar+Radar InterventionFigure 4.2: Accuracies of classifying the three user characteristics in the twodatasets using the six feature sets with the random forest classifier. Errorbars are 95% confidence intervals.only exception being for predicting verbal WM where PUPIL or HEAD DISTANCEalone does not beat GAZE.The P+HD feature set, i.e., the combination of pupil and head distance features,is the best performing feature set for RANDOM FOREST in all six ANOVAs, p <.001. Adding the GAZE to P+HD, however, reduces the accuracy, likely due to theoverfitting from having relatively large number of features in the G+P+HD featureset, given the limited number of data instances in each of the datasets.Next, we compare feature sets without pupil size features, which can be diffi-cult to measure reliably in practice due the sensitivity of pupil size to changes inlighting and other environment conditions. For RANDOM FOREST, adding HEADDISTANCE to GAZE improves the accuracies in all six classification tasks (i.e.,G+HD > G in Table 4.2). In the BAR+RADAR dataset, HEAD DISTANCE alonecan achieve accuracy higher than that of G+HD, possibly because the latter mighthave overfitted the data in the BAR+RADAR dataset, which is smaller than the IN-23TERVENTION dataset. In any case, in each dataset, some of the new feature setswithout pupil features that we explored have accuracies in the 70–75% range, and14–21% above the baseline, which is already a substantial improvement over theaccuracies between 59% and 65%, achieved by using gaze features only in previousstudies.4.4.3 Performance of the Best Classification ModelThe random forest classifier with the combination of pupil and head distance fea-tures produces the best accuracies of classifying all three user characteristics inboth datasets. We further evaluated its performance in terms how classification ac-curacy changes with the amount of data available for a trial (over-time accuracy).l lllll llllll lllll l0.700.750.800.8510% 20% 30% 40% 50% 60% 70% 80% 90% 100%Observed DataAccuracyUser Characteristic l Perceptual Speed Verbal WM Visual WMDataset Bar+Radar InterventionFigure 4.3: Trends in classification accuracy for random forest with P+HD, asa function of amount of observed data in a trial. Vertical axis does notstart from zero to show the trend in details.24Table 4.3: Comparison of peak accuracies between the previous experimentsand the current results. MC: majority-class baseline; Imp.: improvementabove the baseline.Previous CurrentDataset Characteristic MC Peak Imp. 
MC Peak Imp.Bar+Radar PS 50% 60% 10% 50% 82% 32%Verbal 52% 59% 7% 51% 80% 29%Visual 55% 64% 9% 51% 85% 34%Intervention PS 51% 65% 14% 55% 81% 26%Verbal 60% 64% 4% 59% 83% 24%Visual 54% 64% 10% 54% 83% 29%Over-time accuracy essentially simulates the behaviour of the model within a user-adaptive system that continuously observes its user, for opportunities of helpfuland personalized INTERVENTIONs. As shown in Figure 4.3, all six classificationtasks reach 71% accuracy after seeing 10% of trial data, and the accuracies risesto at least 79% at the half-way mark of the trial, indicating the feasibility of earlyprediction of user characteristics. The trends in over-time accuracies show that ourbest classifier performs better as it observes more trial data, and peak accuraciesoccur at the end of the trial. This was not the case in previous studies. Theseresults suggest that, while salient gaze patterns that can aid classification mighthappen early on during task processing, a more complete representation of a user’sbehavioural signatures for pupil size and head distance over the whole trial makebetter predictors of user cognitive abilities.The current results are strong improvements over the previous attempts of pre-dicting perceptual speed, visual WM, and verbal WM. As shown in Table 4.3, theprevious peak accuracies were between 59% and 65%, with 4–13% improvementsabove the baseline; our peak accuracies are in the 80–85% range, and 24–32%above the baseline.Class Accuracy ComparisonA good classifier should predict each class equally well, so we compare the best-performing random forest with the P+HD feature set and the previously used lo-25PS Verbal WM Visual WM PS Verbal WM Visual WM 0.00.10.20.30.40.50.60.70.80.91.0AccuracyClassification Model Logistic Regression with Gaze Random Forest with P+HDBar+Radar InterventionFigure 4.4: Class accuracies of logistic regression with GAZE feature set andrandom forest with P+HD feature set. The top and the bottom ends ofeach bar represent the recall scores of the two classes, and the middleline represents the overall classification accuracy.gistic regression with the GAZE feature set in term of their per-class classificationperformance, to see if our best model made any sacrifices while achieving the highaccuracy. We measured the class accuracies (equivalent to the recall score) for pre-dicting each class. Results in Figure 4.4 show that for all six classification tasks,our best model not only leads in accuracies, but generally also gives comparable oreven smaller variations in per-class accuracies.26Table 4.4: Mean Gini importance scores, in percentage, of the features in theP+HD feature set used by the random forest classifier, averaged acrossthe six classification tasks.Feature ImportanceMean pupil size 10.10Start head distance 9.03Maximum pupil size 8.96Maximum head distance 8.67End head distance 8.19Mean head distance 8.17Minimum head distance 8.04Start pupil size 7.95End pupil size 7.93Minimum pupil size 7.73Standard deviation of distance 7.64Standard deviation of pupil size 7.58Most Predictive FeaturesThe twelve features in our best model contribute roughly equally to the classifier’sperformance. The Gini importance scores of the features for the random forestclassifier are fairly even (Table 4.4), and neither the pupil size nor the head distanceto screen feature set is more dominant than the other (sum of importance scores ofeach set is approximately 50%). 
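The Gini importance scores reported in Table 4.4 are the values a random forest exposes directly; the following is a minimal sketch of how such a ranking could be produced (the feature names and the synthetic data are placeholders, not the thesis's actual script):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "pupil_min", "pupil_max", "pupil_mean", "pupil_std",
    "pupil_start", "pupil_end",
    "distance_min", "distance_max", "distance_mean", "distance_std",
    "distance_start", "distance_end",
]

# Synthetic stand-in for the P+HD feature matrix and a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = rng.integers(0, 2, size=500)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Gini (mean decrease in impurity) importances sum to 1; multiply by 100
# to express them as percentages, as in Table 4.4.
ranking = sorted(zip(feature_names, forest.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:15s} {100 * score:5.2f}%")
```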
Given that each feature set do not perform as wellalone, this shows that pupil features and head distance features are complementaryto each other.4.4.4 Comparison with Previous Work on Gaze FeaturesTo make conclusions about improvements over previous works, where logistic re-gression with the gaze features was found to perform significantly better than thebaseline, we look at our results specifically for the same model. Tukey’s HSDpairwise comparisons show that, within the GAZE feature group, the accuracy ofLOGISTIC REGRESSION is significantly higher than the MAJORITY-CLASS base-line, p < .001, for all three user characteristics in the two datasets, consistent with27Table 4.5: Comparisons of classifications with gaze features between previ-ous experiments and current implementation. Imp.: improvement abovethe baseline. Previous results for BAR+RADAR were by Steichen et al.[42]; for INTERVENTION by Gingerich and Conati [17].Previous CurrentDataset Characteristic MC LR Imp. MC LR Imp.Bar+Radar PS 50% 55% 5% 51% 63% 12%Verbal 52% 59% 7% 52% 59% 7%Visual 55% 57% 2% 52% 54% 2%Intervention PS 51% 61% 10% 50% 66% 16%Verbal 60% 62% 2% 59% 66% 7%Visual 54% 62% 8% 54% 60% 6%the results in the two previous studies [17, 41].Upon a closer look at the classification results (Table 4.5), we find that our im-plementation achieved accuracies comparable or higher than previous experiments.Due to the lack of details of the previous implementations, we can only suspect thatthe differences come from the different implementations of the classifier algorithm:the two previous experiments were implemented in Java with the Weka machinelearning library [19]; we wrote our scripts in Python with the scikit-learn machinelearning library [35]. The differences are acceptable for our evaluations, as whencomparing with LOGISTIC REGRESSION in our experiment, RANDOM FOREST hasa higher target to match, so we can safely make conclusions about improvementsover previous works if random forest outperforms our current implementation oflogistic regression.4.5 Summary and DiscussionIn this chapter, we investigated various summative eye-tracking features and ma-chine learning classifiers and showed that a random forest classifier leveragingpupil and head distance to screen features can significantly improve classificationaccuracy from previous work. This classifier achieved above-80% accuracies, of-fering greater practical significance for real-time classification than the previously28obtained sub-65% accuracies.A limitation of this result is that pupillometry features can be difficult to use inreal-world settings, because pupil size is sensitive to environmental changes (e.g.,changes in ambient lighting or in the amount of light emitted from the screen).However, we see our result as a proof of concept that adds to the mounting evi-dence on the potential value of pupillometry for user modelling and user-adaptiveinterface systems, calling for more research on how to make models based on pupilsize features resilient to environmental changes.In the meanwhile, our results also showed an improvement over previous workwithout using pupil features. Random forest classifier with head distance to screenand gaze features achieved accuracies at or above 70%, showing the potential ofposture-related information for this modelling task.29Chapter 5Exploring SequencesWe hope to improve the accuracies of classifying user characteristics by exploringthe sequential nature of user eye gaze patterns. 
Previous work using a differential sequence mining technique found several sequential gaze behaviours that are indicative of specific user characteristics [43]. In this chapter, we build classifiers that use patterns in gaze movement as features to predict user characteristics.

5.1 AOI Modifications

We modified the AOI definitions in the two datasets when we constructed the sequences. Given the five AOIs defined previously, there were fixations that did not fall in any of the AOIs, so they were absent from the sequence. We added a None AOI so that these fixations also appear in the sequences. By inspecting None-AOI fixations, which are ineffective in processing information, we might uncover different attention patterns between users with different cognitive abilities. Also, as the Low AOI corresponds to the area in the visualization charts without meaningful information, we merged it with the None AOI to simplify the sequence. Figure 5.1 and Figure 5.2 give examples of the AOI modifications for the bar chart and for the radar chart, respectively.

Figure 5.1: AOI modification for the INTERVENTION dataset. Previous AOI definition is on the left; modified AOI definition is on the right.
Figure 5.2: AOI modification for the radar chart in the BAR+RADAR dataset. Previous AOI definition is on the left; modified AOI definition is on the right. Colour-coding of the AOIs follows that of Figure 5.1.

5.2 Sequence Patterns as Features

We describe the user's gaze movement on the interface as the sequence of the AOIs they fixate on during a task, so there is a one-to-one mapping between the fixations and the elements in the sequence. We define a sequence pattern as a substring of the full sequence. Based on the findings from our previous work on sequence patterns [43], we use pattern frequencies as machine learning features for classifying user characteristics.

5.2.1 Pattern Frequency and Cognitive Abilities

Our previous work¹ compared the frequency of sequence patterns between users with different levels of cognitive abilities [43]. We analyzed the BAR+RADAR dataset and identified a number of patterns that occur significantly more or less frequently depending on the user's cognitive abilities. For example, we found that users with low perceptual speed had significantly more “High-Low-Label” patterns, i.e., transitions between the High AOI and the Label AOI with an intermediate fixation in the Low AOI, which is situated between the High and the Label AOIs. As another example, the pattern “Text-Text-Text-Text-Text”, i.e., highly repeated fixations in the Text AOI, occurs significantly more frequently for users with low verbal WM, suggesting that they need to spend more time processing the textual information.

The lengths of the patterns used in this prior work range from three to seven. Patterns longer than seven elements appear very infrequently given the trial length in the dataset; patterns of length two are already covered by the summative features as transitions between two AOIs. Here, we want to develop a new approach to understand the value of more complex patterns and to see if longer sequences provide additional predictive power.
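To make the notion of a sequence pattern concrete, the following small sketch (not the thesis's actual pipeline) enumerates every contiguous substring of length three to seven in an AOI sequence and counts how often each one occurs:

```python
from collections import Counter

def pattern_counts(aoi_sequence, min_len=3, max_len=7):
    """Count every contiguous substring (pattern) of length min_len..max_len
    in one AOI sequence, given as a list with one AOI name per fixation."""
    counts = Counter()
    for length in range(min_len, max_len + 1):
        for start in range(len(aoi_sequence) - length + 1):
            counts[tuple(aoi_sequence[start:start + length])] += 1
    return counts

# Example sequence from Section 3.3: one element per consecutive fixation.
sequence = ["Question", "Question", "Question", "Legend", "Legend",
            "Chart", "Chart", "Chart"]
counts = pattern_counts(sequence)
print(counts[("Question", "Question", "Question")])  # 1
print(counts[("Question", "Legend", "Legend")])      # 1
```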
For these reasons, we continue using patterns of length three to seven in this study.

¹ As the second author of the paper, I performed the statistical tests and analyses on the differences in pattern frequency between users with different user characteristics.

5.2.2 Pattern Selection Criteria

The number of possible unique patterns depends on the number of AOIs and the pattern length:

    number of unique patterns = \sum_{l=3}^{7} l^{a},

where l is the pattern length and a is the AOI count. When there are five AOIs (as in the BAR+RADAR dataset), there are 28,975 possible unique patterns, and for six AOIs (as in the INTERVENTION dataset), the number is 184,755. Given that both our datasets have fewer than 5,000 trials, using all of the patterns as features would certainly cause overfitting, so feature selection is necessary.

Instead of resorting to generic feature selection algorithms, we narrow down the list of patterns based on how frequently they occur and on the difference in frequency between groups with different user characteristics, e.g., high vs. low perceptual speed. We use two frequency measures, Sequence Support (SS) and Average Pattern Frequency (APF). SS is the proportion of sequences that contain the pattern; APF is the average number of occurrences of the pattern. For each relevant user group g, e.g., users with high perceptual speed, SS and APF of pattern p are calculated as follows:

    SS(p, g) = (number of sequences in g that contain p) / (number of sequences in g), and

    APF(p, g) = (total number of occurrences of p in sequences of g) / (number of sequences in g).

In our dataset, SS describes how common a pattern is among a group of users, and APF indicates the repetitiveness of the pattern. Table 5.1 provides an example. The pattern "A-B" appears in one of the two sequences in group 1 and in both sequences in group 2, so group 2 has a higher SS value for "A-B". For APF, the pattern "A-B" appears three times in group 1 and twice in group 2, so group 1 has a higher APF value for "A-B".

Table 5.1: Examples of SS and APF calculations.

               Group 1        Group 2
Sequence 1     A-B-A-B-A-B    A-B
Sequence 2     A-C            A-B
SS("A-B")      1/2 = 0.5      2/2 = 1.0
APF("A-B")     3/2 = 1.5      2/2 = 1.0

Both measures have been used in previous sequential pattern mining applications, such as for action sequences [27] and for gaze sequences [43]. Based on SS and APF, we propose two selection criteria for filtering the pattern list.

1. S1: filter by a minimum SS threshold, i.e., patterns that appear in enough sequences are selected as features. Kinnebrew and Biswas [27], who originally proposed this differential sequence mining technique, used 50% as the SS threshold for their action sequence dataset. Differently from us, they built patterns that could skip over one element in the sequence (known as a "gap"), creating more patterns from a given sequence. Following our previous work on gaze sequences [43], we use 40% as the SS threshold to compensate for not allowing "gaps" in the sequence patterns. Therefore, by S1, patterns that appear in at least 40% of the sequences of either user group are selected.

2. S2: filter by a statistically significant difference in SS or APF. We perform statistical tests to evaluate the statistical significance of the difference in the SS and APF measures. The statistical test for SS is Pearson's χ2 test, which is commonly used to test for independence of an event occurring in two populations.
In this test, we compare the number of sequences with and without a specific pattern in the two groups, which is essentially what SS measures, to see whether the occurrence of the pattern is associated with the group. Table 5.2 shows an example of the contingency table used for the χ2 test. For APF, we use Welch's t-test, which tests whether two populations have significantly different means. Here, the two samples are the per-sequence pattern frequencies in the two groups (Table 5.3); thus, the means of the samples are the APF measures. Patterns whose frequency difference is considered statistically significant by either test are selected as features. The α level for significance is 0.05, and we do not correct for multiple comparisons because we are not drawing any conclusions from the differences, but merely using the tests as a tool to select prominent features. (A short code sketch of both criteria is given in Section 5.4.1 below.)

Table 5.2: Contingency table used in the χ2 test for SS. Example values are given based on the example in Table 5.1 for pattern "A-B".

                                         Group 1    Group 2
Number of sequences with pattern p       1          2
Number of sequences without pattern p    1          0

Table 5.3: Examples of the samples used in the t-test for APF; values are based on the example in Table 5.1 for pattern "A-B". For example, in Group 1, "A-B" occurs 3 times in Sequence 1 and does not occur in Sequence 2, thus the sample for Group 1 is (3, 0).

           t-test sample
Group 1    (3, 0)
Group 2    (1, 1)

5.3 Classification Models

For evaluating the sequence pattern features in classifying the binary labels of user characteristics, we select logistic regression and random forest as the classifiers for the formal analysis because, in our initial tests, they were the two best-performing classifiers among the algorithms available in the scikit-learn machine learning library [35], including SVM, Naive Bayes, AdaBoost, etc. For the baseline accuracy, we use a majority-class classifier, which consistently predicts the more frequent class for each user characteristic.

The features used in the classifications are the frequency values of the selected sequence patterns. When using cross-validation to evaluate the classification models, we perform feature selection within the training set in each fold of cross-validation: the patterns are selected based on the pattern selection criteria (as described in Section 5.2.2), which we formally evaluate in the following section.

Table 5.4: Sequence pattern counts under the feature selection criteria. Standard deviations are in parentheses.

                                          Number of patterns
Dataset        Characteristic   Total           S2            S1          S1+S2
Bar+Radar      PS               14279 (67)      444 (21)      61 (1.0)    41 (2.8)
               Verbal WM        14275 (121)     368 (18)      54 (1.3)    19 (1.3)
               Visual WM        14726 (103)     313 (23)      55 (0.8)    17 (3.4)
Intervention   PS               44271 (67)      1811 (41)     39 (0.5)    30 (0.8)
               Verbal WM        44269 (79)      1616 (31)     38 (1.1)    31 (1.1)
               Visual WM        44264 (140)     1391 (47)     35 (0.8)    14 (1.7)

5.4 Evaluations

We present two evaluations of the sequence patterns. First, we examine the effect of the pattern selection criteria on reducing the size of the feature set and on the performance of classifiers using the selected features. Then, we compare the best classification results with sequence pattern features against those with summative eye-tracking features.

5.4.1 Evaluation on Pattern Selection Criteria

The two pattern selection criteria, S1 and S2, introduced in Section 5.2.2 are designed to reduce the number of sequence pattern features to an appropriate amount given the datasets.
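To make the two criteria concrete, the sketch below applies S1 and S2 to per-sequence pattern counts (for example, as produced by the hypothetical pattern_counts helper sketched earlier). The 40% SS threshold and the α = 0.05 level follow the criteria described in Section 5.2.2; this is a minimal illustration, not the exact implementation used in our experiments.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

def select_patterns(counts_g1, counts_g2, ss_threshold=0.4, alpha=0.05):
    """counts_g1 / counts_g2: one dict per sequence in each user group,
    mapping pattern -> number of occurrences in that sequence.
    Returns (patterns passing S1, patterns passing S1 and S2)."""
    all_patterns = set().union(*counts_g1, *counts_g2)
    s1, s1_s2 = [], []
    for p in sorted(all_patterns):
        f1 = np.array([c.get(p, 0) for c in counts_g1])  # per-sequence frequencies
        f2 = np.array([c.get(p, 0) for c in counts_g2])
        ss1, ss2 = np.mean(f1 > 0), np.mean(f2 > 0)      # Sequence Support per group
        if max(ss1, ss2) < ss_threshold:                 # S1: minimum SS in either group
            continue
        s1.append(p)
        # S2a: chi-squared test on the 2x2 contingency table underlying SS
        # (a degenerate table, e.g., a pattern present in every sequence of both
        # groups, would need guarding in a real implementation)
        table = [[int(np.sum(f1 > 0)), int(np.sum(f2 > 0))],
                 [int(np.sum(f1 == 0)), int(np.sum(f2 == 0))]]
        _, p_ss, _, _ = chi2_contingency(table)
        # S2b: Welch's t-test on per-sequence frequencies (whose means are the APFs)
        _, p_apf = ttest_ind(f1, f2, equal_var=False)
        if p_ss < alpha or p_apf < alpha:
            s1_s2.append(p)
    return s1, s1_s2
```

Whether the surviving feature set is small enough for the amount of data available is a separate question, which we turn to next.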
Although there is no specific guideline as to what the optimal number of features is, Harrell [20] proposed a rule of thumb that the number of features should be smaller than min(n1, n2)/10 or min(n1, n2)/20, where n1 and n2 are the class sizes of the binary labels. With this range as a reference, we would like to see whether the criteria can reduce the number of features so that the models neither easily overfit the training data nor underfit with too few features.

To estimate how many patterns are selected by each criterion, we ran a 10-fold cross-validation for each user characteristic in each dataset. For each fold, we performed pattern selection using S1, S2, and S1+S2 on the training set. In Table 5.4, we report the mean and standard deviation of the number of patterns selected in the 10 folds.

To get a sense of the patterns that are being selected, we apply S1 and S1+S2 to the entire dataset and present the patterns selected by S1 or S1+S2 with their SS and APF values in Table A.1.

Given the class sizes in each dataset, the numbers of features suggested by Harrell [20] are 43 for BAR+RADAR and 150 for INTERVENTION. Selection criterion S2, which selects patterns with significantly different occurrence frequencies between the two groups of a user characteristic, does not reduce the size of the feature set enough and risks overfitting given the size of the datasets. Selection criterion S1, which selects patterns that appear in at least 40% of the sequences, filters the feature set down to 54–61 features for the BAR+RADAR dataset, slightly more than the suggested 43 features. For the INTERVENTION dataset, S1 brings the number of features down to 35–39, below the suggested number of 150. Combining S1 and S2 narrows the list down even further, to fewer than the suggested 43 features for BAR+RADAR, though the size of the resulting feature set varies between user characteristics. For example, in both datasets, there are fewer features for visual WM (17 and 14) than for perceptual speed (41 and 30). The patterns selected by S1+S2 are likely to become important features in classification, but S1+S2 may leave too few features for the INTERVENTION dataset. It is unclear which of S1 and S1+S2 would perform better in classification tasks, so we test the performance of both sets with logistic regression and random forest classifiers.

Classification Performance of Pattern Selection Criteria

For each user characteristic measure in the two datasets, a two-way repeated-measures ANOVA was used to evaluate the performance of the pattern selection criteria. The two factors are classifier (LOGISTIC REGRESSION, RANDOM FOREST) and feature set (S1, S1+S2). As the dependent measure, we use the performance of the classification models measured by 10 runs of 10-fold cross-validation, similar to the method used in Chapter 4. As mentioned above, feature selection is done in each fold of the cross-validation (sketched below), which is different from Chapter 4, where we deemed feature selection unnecessary because of the smaller number of features in the previous models.
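A minimal sketch of this evaluation loop, assuming the per-sequence pattern counts and binary labels are already available and reusing the hypothetical select_patterns helper from the earlier sketch (the classifier settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def run_accuracies(counts, labels, n_runs=10, n_folds=10, seed=0):
    """counts: list of per-sequence pattern-count dicts; labels: binary labels.
    Pattern selection (S1+S2) is redone on the training portion of every fold,
    so the held-out fold never influences which patterns become features."""
    labels = np.asarray(labels)
    run_means = []
    for run in range(n_runs):
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + run)
        fold_accs = []
        for train_idx, test_idx in folds.split(np.zeros(len(labels)), labels):
            train_counts = [counts[i] for i in train_idx]
            g1 = [c for c, y in zip(train_counts, labels[train_idx]) if y == 0]
            g2 = [c for c, y in zip(train_counts, labels[train_idx]) if y == 1]
            # S1+S2 on training data only (an empty selection would need a fallback)
            _, selected = select_patterns(g1, g2)
            X = np.array([[c.get(p, 0) for p in selected] for c in counts], dtype=float)
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[train_idx], labels[train_idx])
            fold_accs.append(clf.score(X[test_idx], labels[test_idx]))
        run_means.append(float(np.mean(fold_accs)))  # one mean accuracy per run
    return run_means
```

The per-run mean accuracies returned here are the dependent measure entering the ANOVAs reported below.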
Performing pattern selection within the training set of each fold prevents each test set from influencing the pattern selection results and contaminating the training process.

Table 5.5: Main effects and the interaction effects in the six ANOVAs.

Dataset        Characteristic   Effect       F-ratio          η2G      p-value
Bar+Radar      PS               Classifier   F(1,9) = 49.1    0.539    p < .001
                                Selection    F(1,9) = 13.4    0.307    p = .005
                                C×S          F(1,9) = 0.16    0.002    p = .699
               Verbal WM        Classifier   F(1,9) = 194     0.722    p < .001
                                Selection    F(1,9) = 8.85    0.25     p = .016
                                C×S          F(1,9) = 0.89    0.024    p = .369
               Visual WM        Classifier   F(1,9) = 24.4    0.37     p < .001
                                Selection    F(1,9) = 0.76    0.028    p = .406
                                C×S          F(1,9) = 1.77    0.051    p = .216
Intervention   PS               Classifier   F(1,9) = 7.91    0.196    p = .020
                                Selection    F(1,9) = 1.57    0.035    p = .242
                                C×S          F(1,9) = 0.40    0.009    p = .542
               Verbal WM        Classifier   F(1,9) = 35.6    0.332    p < .001
                                Selection    F(1,9) = 11.9    0.356    p = .007
                                C×S          F(1,9) = 5.86    0.168    p = .038
               Visual WM        Classifier   F(1,9) = 1.73    0.025    p = .221
                                Selection    F(1,9) = 24.6    0.35     p < .001
                                C×S          F(1,9) = 0.02    0.001    p = .902

To summarize the results of the six ANOVAs, presented in Table 5.5: the main effect of classifier is statistically significant in five of the six ANOVAs, with medium to large effect sizes (η2G > .3) in four of them. The main effect of pattern selection criterion is statistically significant in four of the six ANOVAs, all with small to medium effect sizes. The interaction effect between classifier and pattern selection criterion is statistically significant in only one ANOVA, with a small effect size.

Given the relatively small effect sizes for the selection criterion factor, applying S2 in addition to S1 does not lead to a practically significant change in the classification accuracies (as shown in Figure 5.3). We argue that, with similar performance and many fewer features, S1+S2 is the preferred pattern selection criterion, as it leads to a simpler classification model.

Figure 5.3: Comparison of pattern selection criteria between S1 and S1+S2 in the six ANOVAs for the three user characteristics in the two datasets. Error bars are 95% confidence intervals. LR: logistic regression; RF: random forest.

Table 5.6: Pairwise comparisons of random forest (RF) and logistic regression (LR), using the S1+S2 feature selection criterion.

                                        RF.S1+S2 – LR.S1+S2
Dataset        User Characteristic   Difference   Std. Error   p-value
Bar+Radar      Perceptual Speed       0.52%       0.0039       p = .552
               Verbal WM             −2.97%       0.0040       p < .001
               Visual WM              0.66%       0.0053       p = .426
Intervention   Perceptual Speed      −0.43%       0.0017       p = .046
               Verbal WM             −0.82%       0.0017       p < .001
               Visual WM              0.13%       0.0023       p = .939

Classifier Performance Evaluation

Given our choice of S1+S2 as the feature selection criterion, we conduct post hoc pairwise comparisons between the LOGISTIC REGRESSION and RANDOM FOREST classifiers. Results in Table 5.6 and Figure 5.4 show that LOGISTIC REGRESSION performed significantly better than RANDOM FOREST for sequence pattern features in three of the six ANOVAs. In the other three ANOVAs, the performances of LOGISTIC REGRESSION and RANDOM FOREST do not differ significantly.
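For reference, the following is a minimal sketch of how the repeated-measures part of this analysis can be run with statsmodels; the accuracy values below are synthetic placeholders standing in for the per-run means produced by an evaluation loop like the one sketched above, and the post hoc Tukey HSD comparisons are not shown.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic example data: one mean 10-fold accuracy per (run, classifier, selection)
# cell, i.e., a balanced within-subjects design with the run index as the "subject".
rng = np.random.default_rng(0)
rows = [
    {"run": run, "classifier": clf, "selection": sel,
     "accuracy": rng.normal(0.57 if clf == "LR" else 0.55, 0.01)}
    for run in range(10)
    for clf in ("LR", "RF")
    for sel in ("S1", "S1+S2")
]
df = pd.DataFrame(rows)

# Two-way repeated-measures ANOVA; with 10 runs this yields the F(1, 9) degrees
# of freedom reported in Table 5.5.
result = AnovaRM(df, depvar="accuracy", subject="run",
                 within=["classifier", "selection"]).fit()
print(result.anova_table)
```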
Based on these comparisons, we conclude that LOGISTIC REGRESSION is the better-performing classifier for sequence features, and it becomes our classifier of choice for sequence features with the S1+S2 feature selection criterion.

Figure 5.4: Comparison of classifiers with the patterns selected by S1+S2 in the six ANOVAs for the three user characteristics in the two datasets. Error bars are 95% confidence intervals.

5.4.2 Evaluation on Sequence Feature Performance

We compare the sequence pattern features against a majority-class baseline and the summative feature sets introduced in Chapter 4.

For each user characteristic measure in the two datasets, we conducted a one-way repeated-measures ANOVA to evaluate the performance of the classification models. The levels of the factor are five feature sets plus a baseline:

1. MAJORITY CLASS, for comparing against a baseline that does not use any of the eye-tracking features;

2. SEQUENCE, the feature set that contains the sequence pattern features, selected by S1+S2;

3. GAZE and G+S, for comparing with the feature set used in previous work and the effect of adding the sequence pattern features to gaze features;

4. P+HD and P+HD+S, for comparing with the best-performing feature set in the analysis of summative features in Chapter 4 and the effect of adding the sequence pattern features.

For each feature set, the classifier we chose is the better-performing classifier between logistic regression and random forest in our initial tests. Except for SEQUENCE, which works better with the logistic regression classifier, the other four feature sets all fare better with the random forest classifier. As in the previous evaluations, we use the mean accuracy of 10 runs of 10-fold cross-validation as the metric for classification performance.

Results in Table 5.7 show statistically significant main effects of feature set in all six ANOVAs, with large effect sizes. This result is expected, as we know that using the summative features (i.e., GAZE, P+HD) can beat the MAJORITY-CLASS baseline. For further comparisons, we conducted pairwise comparisons with Tukey's HSD tests.

Table 5.7: Main effect of feature set in the six ANOVAs, one for each dataset and characteristic combination.

Dataset        Characteristic     Effect        F-Ratio             η2G
Bar+Radar      Perceptual Speed   Feature Set   F(5,45) = 2124      0.996
               Verbal WM          Feature Set   F(5,45) = 2000      0.995
               Visual WM          Feature Set   F(5,45) = 3644      0.997
Intervention   Perceptual Speed   Feature Set   F(5,45) = 12639     0.999
               Verbal WM          Feature Set   F(5,45) = 11499     0.999
               Visual WM          Feature Set   F(5,45) = 19095     0.999

Table 5.8: Comparisons of the sequence feature set against the baseline.

                                        Sequence – MC
Dataset        User Characteristic   Difference   Std. Error   p-value
Bar+Radar      Perceptual Speed       7.40%       0.0035       p < .001
               Verbal WM              6.40%       0.0035       p < .001
               Visual WM             −0.10%       0.0034       p = .999
Intervention   Perceptual Speed       6.70%       0.0013       p < .001
               Verbal WM              3.30%       0.0013       p < .001
               Visual WM              4.00%       0.0013       p < .001

The SEQUENCE feature set beats the MAJORITY-CLASS baseline in five of the six ANOVAs, as shown in Table 5.8 and illustrated by the blue bars in Figure 5.5. This means that the sequence pattern frequencies are predictive of the user characteristics. However, the accuracy achieved by the sequence feature set is significantly lower than the accuracies of the feature sets that include only summative features (Table 5.9).
Adding sequence features to GAZE does not lead to statistically significant changes in the accuracy (Table 5.10 and green bars in Figure 5.5). Adding sequence features to P+HD lowers the accuracy significantly in four of the six ANOVAs and leads to no significant change in the other two (Table 5.10 and red bars in Figure 5.5). Table 5.11 summarizes the performance ranking of the feature sets.

Table 5.9: Comparisons of the summative feature sets with the sequence feature set.

Dataset        User Characteristic   Difference   Std. Error   p-value
Gaze – Sequence
Bar+Radar      Perceptual Speed       3.40%       0.0035       p < .001
               Verbal WM              2.10%       0.0035       p < .001
               Visual WM              6.70%       0.0034       p < .001
Intervention   Perceptual Speed       6.90%       0.0013       p < .001
               Verbal WM              8.30%       0.0013       p < .001
               Visual WM              7.30%       0.0013       p < .001
P+HD – Sequence
Bar+Radar      Perceptual Speed      23.80%       0.0035       p < .001
               Verbal WM             22.20%       0.0035       p < .001
               Visual WM             33.00%       0.0034       p < .001
Intervention   Perceptual Speed      19.90%       0.0013       p < .001
               Verbal WM             20.80%       0.0013       p < .001
               Visual WM             24.80%       0.0013       p < .001

Table 5.10: Comparisons of adding sequence features to summative feature sets with summative feature sets alone.

Dataset        User Characteristic   Difference   Std. Error   p-value
G+S – Gaze
Bar+Radar      Perceptual Speed       0.90%       0.0035       p = .095
               Verbal WM              0.10%       0.0035       p = .999
               Visual WM             −0.40%       0.0034       p = .836
Intervention   Perceptual Speed       0.10%       0.0013       p = .974
               Verbal WM              0.30%       0.0013       p = .264
               Visual WM              0.30%       0.0013       p = .235
P+HD+S – P+HD
Bar+Radar      Perceptual Speed      −2.90%       0.0035       p < .001
               Verbal WM             −4.30%       0.0035       p < .001
               Visual WM             −1.40%       0.0034       p < .001
Intervention   Perceptual Speed       0.05%       0.0013       p = .999
               Verbal WM             −1.30%       0.0013       p < .001
               Visual WM              0.05%       0.0013       p = .999

Figure 5.5: Comparison of classification performance between sequence pattern features and summative features in the six ANOVAs for the three user characteristics in the two datasets. Error bars are 95% confidence intervals.

Table 5.11: Rankings of feature set by classification accuracy for the random forest classifier. Feature sets with a difference in accuracy not significant at α = .05 are underlined.

Dataset        User Characteristic   Ranking of Feature Set by Accuracy
Bar+Radar      Perceptual Speed      P+HD > P+HD+S > G+S > G > S > MC
               Verbal WM             P+HD > P+HD+S > G+S > G > S > MC
               Visual WM             P+HD > P+HD+S > G+S > G > S > MC
Intervention   Perceptual Speed      P+HD > P+HD+S > G+S > G > S > MC
               Verbal WM             P+HD > P+HD+S > G+S > G > S > MC
               Visual WM             P+HD > P+HD+S > G+S > G > S > MC

5.5 Variations of Sequence Features

We attempted a number of variations of the sequence pattern features in the hope of improving classification performance. None of them showed a notable improvement over the sequence pattern features, so we did not perform formal analyses. We document these attempts here to aid future work.

5.5.1 Collapsing Sequences

Collapsing sequences is a technique used in the previous study [43]. The full sequences, also called "expanded" sequences, are prone to noise: two patterns with a single difference are considered distinct, e.g., "A-B-B-B-A" and "A-B-B-B-B-A" (with an extra 'B') are two independent patterns. Thus, we created "collapsed" sequences by merging each group of consecutive fixations within the same AOI into an AOI visit.
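A minimal sketch of this collapsing step, assuming sequences are represented as lists of AOI labels:

```python
from itertools import groupby
from typing import List

def collapse(sequence: List[str]) -> List[str]:
    """Merge each run of consecutive fixations in the same AOI into a single
    AOI visit, e.g. A-B-B-B-A and A-B-B-B-B-A both become A-B-A."""
    return [aoi for aoi, _ in groupby(sequence)]

assert collapse(list("ABBBA")) == list("ABA")
assert collapse(list("ABBBBA")) == list("ABA")   # the extra 'B' no longer matters
```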
For example, both patterns mentioned earlier would be collapsed into "A-B-A", with three AOI visits. Then, the same pattern selection criteria were applied to the "collapsed" sequences, and the frequency of each selected pattern was extracted and used as a feature in the user characteristic classifications.

5.5.2 Extracting Long AOI Visits

Although the collapsing technique simplifies the sequences, collapsed sequences still contain every single AOI visit, regardless of the duration of the visit. We hypothesized that there might be differences in the overall task-solving strategies that users take, so we extracted the ten longest AOI visits in each sequence, i.e., those where the user spends most of their time and attention, and formed a new sequence of length ten. The new sequence is still ordered temporally, not ranked by duration. The ten longest AOI visits were selected because the median trial length is 21 AOI visits, so we removed (on average) half of the AOI visits in a trial.

5.5.3 Annotating Sequences with Duration

In addition to fixation location, another important factor that differentiates one fixation from another is fixation duration, which is absent from the AOI sequences. Without such information, a fixation that processes information for one second and another that lasts 100 ms appear the same to the classification algorithm. For a more sophisticated classification, we annotated the sequences with fixation durations and applied the sequential pattern mining algorithm developed by Fournier-Viger et al. [15]. The algorithm performs pattern clustering based not only on the items in the pattern but also on the duration associated with each item. For example, the sequence pattern A-A-A (i.e., three consecutive fixations in "A") can appear in three distinct duration-annotated pattern clusters:

• A (60–450) - A (60–475) - A,
• A (60–450) - A (500–3000) - A,
• A (500–3000) - A - A.

The ranges in the parentheses are the ranges of the fixation durations in milliseconds. The three pattern clusters can be described in terms of duration as short-short-any, short-long-any, and long-any-any, respectively. Frequencies of these patterns were calculated and used as features in user characteristic classifications.

5.5.4 Language Model

We experimented with natural language processing techniques for classifying user characteristics from sequences. We treated each element of the sequence as a word and each sequence as a sentence. All of the sequences belonging to one user group (e.g., high perceptual speed) form a corpus. Given the two corpora for each user characteristic, we built several n-gram models, with n ranging from 2 to 6, using NLTK: The Natural Language Toolkit [4]. We used two smoothing techniques to handle unseen n-grams: Good-Turing and Lidstone's smoothing, as implemented in NLTK. The bigram and trigram models performed best among the n-gram models, but their classification accuracies are inferior to those of logistic regression with the pattern frequency features.

5.6 Summary and Discussion

We explored using patterns in user gaze sequences to classify user characteristics. While the results show that sequence pattern features are predictive in these classifications (by beating the majority-class baseline), these features are not as powerful as the summative eye-tracking features discussed in Chapter 4.

Several reasons might have contributed to the weaker performance of the sequence features.
First, although statistically significant, the differences in pattern frequencies between user groups have small effect sizes² (maximum φ of .168 for SS, maximum Cohen's d of .399 for APF), so a classifier would need to consider a set of such patterns together to make a well-informed prediction. Given the limited length of individual trials in the dataset, the number of predictive patterns in each trial might be insufficient. Second, having more training data could help build better classifiers. The variability in gaze sequences is high even within a single user, so observing more user behaviours could potentially lead to a more knowledgeable classifier and allow leveraging more powerful classifiers, e.g., neural networks, that need large amounts of training data. More training data would also allow us to increase the quality threshold when removing low-quality gaze sequences from the dataset, reducing the noise in the data. Lastly, the quality of the sequences also depends on the accuracy of the eye tracker in determining the exact fixation location. Given the current accuracy level of the eye tracker and the closely located interface elements in the two studies, misclassified sequence elements are to be expected. As eye tracking becomes more precise, we might be able to achieve better classification performance leveraging gaze sequence information.

² Interpretation for φ: small: .1, medium: .3, high: .5. Interpretation for Cohen's d [7]: small: .2, medium: .5, high: .8.

Chapter 6
Conclusion and Future Work

The long-term goal of this research is to create user-adaptive visualizations that aid information processing by personalizing the interaction intelligently to meet a user's specific abilities and needs. In this thesis, we focused on the problem of classifying a user's perceptual speed, visual WM, and verbal WM in real time because of their known impact on performance and preference in visualization processing. We built on previous work and improved the classification accuracies by using additional features of pupil size and head distance to screen. We also explored the sequential nature of a user's gaze movement and analyzed patterns in the gaze sequences as features in the user cognitive ability classification tasks.

6.1 Research Questions

The series of classification experiments we conducted helped us answer our two main research questions.

Q1: Can we improve the accuracy of classifying a user's cognitive abilities that previous work achieved using summative gaze features, by using additional measures of the user's pupil size and head distance to the screen, also collected by the eye tracker?

Our investigation of various eye-tracking features and machine learning classifiers showed that a random forest classifier with pupil and head distance to screen features can significantly improve the accuracy of classifying a user's cognitive abilities over previous work. This classifier achieved above-80% accuracies, offering greater practical significance for real-time classification than the previously obtained sub-65% accuracies. The model also performed well when given only partial observations of the trials, reaching above-70% accuracies in the first 10% of the trial, and the accuracies continuously improved as more of the trial was observed.

Q2: Can sequential patterns of a user's gaze movement be used to further improve the classification of a user's cognitive abilities?

We used patterns extracted from the gaze movement sequences as features to classify user cognitive abilities.
These features achieved accuracies significantly above the majority-class baseline but inferior to those achieved by the summative features. Adding sequence pattern features to the summative features did not improve the accuracies achieved by the summative features alone. Therefore, we were unable to further improve the classification of a user's cognitive abilities using sequential patterns of gaze movement.

6.2 Future Work

We identify a number of potential directions for future work.

6.2.1 Replication on Additional Datasets

We replicated the classification tasks on two datasets, providing initial evidence of the generality of the results. The classification models can be applied to additional datasets. For example, a study on the ValueChart visualization also recorded eye-tracking data and tested for users' cognitive abilities [9].

6.2.2 Using Additional Gaze Features

The latest eye-tracking data processing software (e.g., Tobii Studio) is able to provide more advanced saccade measures. In our study, the path of a saccade is simplified to the straight line from one fixation to the next, whereas in reality the path of a saccade can take various shapes, which the newer software can detect. Thus, more fine-grained saccade measures can be used as summative features, such as gaze acceleration and saccade distance (as opposed to saccade displacement).

6.2.3 Re-examining the Binary Split

In previous work and in this work, each user cognitive ability was split into two classes based on the median of a scoring metric. This was done mainly to balance the sample sizes of the two labels during machine learning classification. We discuss a shortcoming of this method and two possible ways to address it.

1. Users with cognitive ability scores close to the median are labelled as one of the two classes, even though their actual abilities are similar. To prevent users with close-to-average cognitive abilities from diluting the representative features of the more extreme cases, one can perform model training only on users with scores more than one standard deviation from the mean.

2. A different approach is to apply unsupervised learning to cluster users based on their eye-tracking features, then examine the common characteristics among the users in each cluster. For example, if one cluster contains users mostly with low test scores for perceptual speed and low-to-medium scores for visual working memory, then the appropriate interface adaptation can be provided to them based on this characterization.

6.2.4 More Sophisticated Sequence Processing

The patterns we extracted from gaze movement sequences are simple substrings. We experimented with a number of alternatives in Section 5.5, and there are others that could be tried.

1. For every x seconds (e.g., x = 1, 3, or 5), take the AOI in which the user spends the majority of the time during that x-second period and create a sequence from those AOIs.

2. In Section 5.5.3, we annotated the sequences with fixation duration. Alternatively, other attributes can be annotated onto the sequences as well, such as pupil size and head distance to the screen.

Our exploration of gaze movement sequences has been preliminary. More advanced sequence mining methods in other fields (e.g., bioinformatics) often require more sophisticated algorithms and computations. Concepts from existing techniques in those fields can be borrowed and adapted to work with gaze movement sequences.

6.2.5 System Prototype Implementation

The promising results in this thesis pave the way for developing user-adaptive visualizations.
Given how perceptual speed, visual WM, and verbal WM impact per-formance of processing visualizations and how users with lower levels of thesecognitive abilities could be helped by adaptive interventions [44, 46], the next stepis to devise prototypes that integrate classifiers for detecting visualization perfor-mance [17, 42] and our models for predicting user cognitive abilities in real time todeliver adaptive interventions. When their performance with a given visualizationis detected to need help, the system applies interventions that target the specificneeds of users, based on their relevant cognitive abilities, to assist them and pro-vide a better experience.51Bibliography[1] R. Amar, J. Eagan, and J. Stasko. Low-level components of analytic activityin information visualization. In IEEE Symposium on InformationVisualization, pages 111–117. IEEE, 2005. → pages 9[2] N. M. Aykin and T. Aykin. Individual differences in human-computerinteraction. Computers & Industrial Engineering, 20(3):373–379, 1991. →pages 4, 5[3] R. Bakeman. Recommended effect size statistics for repeated measuresdesigns. Behavior Research Methods, 37(3):379–384, 2005. → pages 20[4] S. Bird. NLTK: the natural language toolkit. In Proceedings of theCOLING/ACL on Interactive Presentation Sessions, pages 69–72.Association for Computational Linguistics, 2006. → pages 47[5] G. Carenini, C. Conati, E. Hoque, B. Steichen, D. Toker, and J. Enns.Highlighting interventions and user differences: Informing adaptiveinformation visualization support. In Proceedings of the SIGCHIConference on Human Factors in Computing Systems, pages 1835–1844.ACM, 2014. → pages 5, 9[6] C. Chen. Individual differences in a spatial-semantic virtual environment.Journal of the American Society for Information Science, 51(6):529–542,2000. → pages 4[7] J. Cohen. Statistical power analysis for the behavioral sciences. Academicpress, 2013. → pages 47[8] C. Conati and H. Maclaren. Exploring the role of individual differences ininformation visualization. In Proceedings of the Working Conference onAdvanced Visual Interfaces, pages 199–206. ACM, 2008. → pages 552[9] C. Conati, G. Carenini, E. Hoque, B. Steichen, and D. Toker. Evaluating theimpact of user characteristics and different layouts on an interactivevisualization for decision making. In Computer Graphics Forum,volume 33, pages 371–380. Wiley Online Library, 2014. → pages 5, 49[10] F. Cristino, S. Mathôt, J. Theeuwes, and I. D. Gilchrist. ScanMatch: A novelmethod for comparing fixation sequences. Behavior Research Methods, 42(3):692–700, 2010. → pages 7[11] M. C. Dault, J. S. Frank, and F. Allard. Influence of a visuo-spatial, verbaland central executive working memory task on postural control. Gait &Posture, 14(2):110–116, 2001. → pages 7[12] S. D’Mello, R. W. Picard, and A. Graesser. Toward an affect-sensitiveautotutor. IEEE Intelligent Systems, (4):53–61, 2007. → pages 7[13] D. E. Egan. Individual differences in human-computer interaction. InM. Helander, editor, Handbook of Human-Computer Interaction, pages543–568. North-Holland, Amsterdam, 1988. → pages 4[14] R. B. Ekstrom, J. W. French, H. H. Harman, and D. Dermen. Manual for kitof factor referenced cognitive tests. Educational Testing Service Princeton,NJ, 1976. → pages 12[15] P. Fournier-Viger, R. Nkambou, and E. M. Nguifo. A knowledge discoveryframework for learning task models from user interactions in intelligenttutoring systems. In MICAI 2008: Advances in Artificial Intelligence, pages765–778. Springer, 2008. → pages 46[16] K. Fukuda and E. K. 
Vogel. Human variation in overriding attentionalcapture. The Journal of Neuroscience, 29(27):8726–8733, 2009. → pages 12[17] M. J. Gingerich and C. Conati. Constructing models of user and taskcharacteristics from eye gaze data for user-adaptive informationhighlighting. In Twenty-Ninth AAAI Conference on Artificial Intelligence,2015. → pages 2, 6, 12, 17, 18, 19, 28, 51, 69[18] S. Greenberg and I. H. Witten. Adaptive personalized interfaces – a questionof viability. Behaviour & Information Technology, 4(1):31–45, 1985. →pages 5[19] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.Witten. The WEKA data mining software: an update. ACM SIGKDDExplorations Newsletter, 11(1):10–18, 2009. → pages 2853[20] F. E. Harrell. Regression modeling strategies: with applications to linearmodels, logistic regression, and survival analysis. Springer Science &Business Media, 2013. → pages 36, 37[21] J. Hyönä, J. Tommola, and A.-M. Alaja. Pupil dilation as a measure ofprocessing load in simultaneous interpretation and other language tasks. TheQuarterly Journal of Experimental Psychology, 48(3):598–612, 1995. →pages 7[22] P. Innocent. Towards self-adaptive interface systems. International Journalof Man-Machine Studies, 16(3):287–299, 1982. → pages 5[23] S. T. Iqbal, P. D. Adamczyk, X. S. Zheng, and B. P. Bailey. Towards anindex of opportunity: understanding changes in mental workload during taskexecution. In Proceedings of the SIGCHI Conference on Human Factors inComputing Systems, pages 311–320. ACM, 2005. → pages 6, 15[24] A. Jameson. Adaptive interfaces and agents. Human-Computer Interaction:Design Issues, Solutions, and Applications, 105, 2009. → pages 5[25] N. Jaques. Predicting affect in an intelligent tutoring system. 2014. → pages7[26] D. Kahneman and J. Beatty. Pupil diameter and load on memory. Science,154(3756):1583–1585, 1966. → pages 7[27] J. Kinnebrew and G. Biswas. Identifying learning behaviors bycontextualizing differential sequence mining with action features andperformance evolution. In Educational Data Mining, 2012. → pages 7, 33,34[28] O. V. Komogortsev, D. V. Gobert, S. Jayarathna, D. H. Koh, and S. M.Gowda. Standardization of automated analyses of oculomotor fixation andsaccadic behaviors. IEEE Transactions on Biomedical Engineering, 57(11):2635–2645, 2010. → pages 13[29] E. Kowler. Eye movements and their role in visual and cognitive processes.Number 4. Elsevier Science Limited, 1990. → pages 13[30] S. Lallé, D. Toker, C. Conati, and G. Carenini. Prediction of users’ learningcurves for adaptation while using an information visualization. InProceedings of the 20th International Conference on Intelligent UserInterfaces, pages 357–368. ACM, 2015. → pages 6, 1554[31] P. Langley. User modeling in adaptive interface. Springer, 1999. → pages 6[32] P. Martínez-Gómez and A. Aizawa. Recognition of understanding level andlanguage skill using measurements of reading behavior. In Proceedings ofthe 19th International Conference on Intelligent User Interfaces, pages95–104. ACM, 2014. → pages 6[33] E. A. Maylor, S. Allison, and A. M. Wing. Effects of spatial and nonspatialcognitive activity on postural stability. British Journal of Psychology, 92(2):319–338, 2001. → pages 7[34] A. Olsen. The Tobii I-VT fixation filter. Tobii Technology, 2012. → pages 13[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 
Scikit-learn:Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. → pages 28, 35[36] H. Prendinger, A. Hyrskykari, M. Nakayama, H. Istance, N. Bee, andY. Takahasi. Attentive interfaces for users with disabilities: eye gaze forintention and uncertainty estimation. Universal Access in the InformationSociety, 8(4):339–354, 2009. → pages 6[37] J. K. Rankin, M. H. Woollacott, A. Shumway-Cook, and L. A. Brown.Cognitive influence on postural stability a neuromuscular analysis in youngand older adults. The Journals of Gerontology Series A: Biological Sciencesand Medical Sciences, 55(3):M112–M119, 2000. → pages 7[38] M. A. Riley, A. A. Baker, and J. M. Schmit. Inverse relation betweenpostural variability and difficulty of a concurrent short-term memory task.Brain Research Bulletin, 62(3):191–195, 2003. → pages 7[39] J. B. Rotter. Generalized expectancies for internal versus external control ofreinforcement. Psychological Monographs: General and Applied, 80(1):1,1966. → pages 69[40] J. Salojärvi, K. Puolamäki, J. Simola, L. Kovanen, I. Kojo, and S. Kaski.Inferring relevance from eye movements: Feature extraction. In Proceedingsof the NIPS 2005 Workshop on Machine Learning for Implicit Feedback andUser Modeling, page 45, 2005. → pages 1355[41] B. Steichen, G. Carenini, and C. Conati. User-adaptive informationvisualization: Using eye gaze data to infer visualization tasks and usercognitive abilities. In Proceedings of the 18th International Conference onIntelligent User Interfaces, pages 317–328. ACM, 2013. → pages 2, 6, 17,18, 19, 28[42] B. Steichen, C. Conati, and G. Carenini. Inferring visualization taskproperties, user performance, and user cognitive abilities from eye gaze data.ACM Transactions on Interactive Intelligent Systems (TiiS), 4(2):11, 2014.→ pages 10, 12, 28, 51[43] B. Steichen, M. M. A. Wu, D. Toker, C. Conati, and G. Carenini. Te, te, hi,hi: Eye gaze sequence analysis for informing user-adaptive informationvisualizations. In User Modeling, Adaptation, and Personalization, pages183–194. Springer, 2014. → pages 7, 30, 32, 33, 34, 45[44] D. Toker and C. Conati. Eye tracking to understand user differences invisualization processing with highlighting interventions. In User Modeling,Adaptation, and Personalization, pages 219–230. Springer, 2014. → pages6, 51[45] D. Toker, C. Conati, G. Carenini, and M. Haraty. Towards adaptiveinformation visualization: on the influence of user characteristics. In UserModeling, Adaptation, and Personalization, pages 274–285. Springer, 2012.→ pages 5, 9[46] D. Toker, C. Conati, B. Steichen, and G. Carenini. Individual usercharacteristics and information visualization: connecting the dots througheye tracking. In proceedings of the SIGCHI Conference on Human Factorsin Computing Systems, pages 295–304. ACM, 2013. → pages 5, 6, 51[47] M. L. Turner and R. W. Engle. Is working memory capacity task dependent?Journal of memory and language, 28(2):127–154, 1989. → pages 12[48] M. C. Velez, D. Silver, and M. Tremaine. Understanding visualizationthrough spatial ability differences. In IEEE Visualization, pages 511–518.IEEE, 2005. → pages 5[49] J. M. West, A. R. Haake, E. P. Rozanski, and K. S. Karn. eyePatterns:software for identifying patterns and similarities across fixation sequences.In Proceedings of the Symposium on Eye Tracking Research & Applications,pages 149–154. ACM, 2006. → pages 856[50] M. M. A. Wu and T. Munzner. SEQIT: Visualizing sequences of interest ineye tracking data. 
To Appear in Proceedings of the IEEE Conference onInformation Visualization, 2015. → pages 11[51] C. Ziemkiewicz, R. J. Crouser, A. R. Yauilla, S. L. Su, W. Ribarsky, andR. Chang. How locus of control influences compatibility with visualizationstyle. In IEEE Conference on Visual Analytics Science and Technology(VAST), pages 81–90. IEEE, 2011. → pages 557Appendix ASupporting MaterialsTable A.1: List of sequence patterns that pass the S1 pattern selection crite-rion. The table contains the SS and APF values for each user group, thep-value of the statistical comparison, and the effect sizes of the differ-ences (φ for the χ2 test for SS, Cohen’s d for the t-test for APF). Patternsthat are selected by S1+S2 is labelled as “TRUE” in the last column. The“Question” AOI in the INTERVENTION dataset is abbreviated as “Q”.SS APFPattern Low High p φ Low High p d Passes S2Bar+Radar, Perceptual SpeedHigh-High-High 90% 82% .001 .109 11.67 8.83 .000 .254 TRUEHigh-High-High-High 81% 69% .000 .131 8.96 6.60 .001 .237 TRUEHigh-High-High-High-High 72% 57% .000 .149 7.06 5.05 .001 .223 TRUEHigh-High-High-High-High-High 61% 49% .000 .119 5.68 3.96 .002 .209 TRUEHigh-High-High-High-High-High-High 51% 40% .002 .107 4.65 3.14 .003 .200 TRUEHigh-High-High-High-High-High-None 49% 38% .001 .110 0.72 0.57 .012 .171 TRUEHigh-High-High-High-High-None 58% 45% .000 .134 0.96 0.74 .002 .207 TRUEHigh-High-High-High-None 66% 57% .007 .091 1.30 1.08 .014 .167 TRUEHigh-High-High-High-None-None 41% 35% .086 .058 0.52 0.45 .141 .100 FALSEHigh-High-High-None 76% 70% .063 .063 1.83 1.54 .011 .173 TRUEcontinued . . .58. . . continuedSS APFPattern Low High p φ Low High p d Passes S2High-High-High-None-None 53% 48% .165 .047 0.77 0.69 .183 .090 FALSEHigh-High-Labels 50% 40% .004 .098 0.78 0.63 .019 .160 TRUEHigh-High-None 87% 82% .074 .061 2.77 2.51 .092 .114 FALSEHigh-High-None-High 47% 46% .707 .013 1.06 0.94 .299 .071 FALSEHigh-High-None-None 67% 65% .651 .015 1.14 1.13 .865 .012 FALSEHigh-High-None-None-None 48% 50% .520 .022 0.66 0.70 .436 .053 FALSEHigh-None-High 56% 56% .970 .001 1.71 1.48 .127 .104 FALSEHigh-None-High-High 49% 44% .208 .043 1.13 0.94 .084 .117 FALSEHigh-None-None 77% 77% .823 .008 1.82 2.08 .051 .132 FALSEHigh-None-None-None 61% 65% .273 .037 1.04 1.27 .013 .169 TRUEHigh-None-None-None-None 44% 54% .003 .102 0.63 0.88 .000 .247 TRUEHigh-None-None-None-None-None 33% 44% .003 .102 0.44 0.60 .004 .198 TRUELabels-High-High 42% 32% .007 .092 0.60 0.48 .039 .140 TRUELabels-None-None 53% 48% .236 .040 0.85 0.81 .566 .039 FALSENone-High-High 86% 82% .162 .047 2.91 2.67 .155 .097 FALSENone-High-High-High 76% 71% .063 .063 1.90 1.63 .023 .155 TRUENone-High-High-High-High 67% 57% .004 .098 1.31 1.14 .069 .123 TRUENone-High-High-High-High-High 57% 46% .002 .107 0.95 0.79 .030 .147 TRUENone-High-High-High-High-High-High 47% 38% .007 .092 0.70 0.58 .066 .125 TRUENone-High-High-None 41% 46% .207 .043 0.70 0.77 .358 .062 FALSENone-High-None 49% 56% .053 .066 1.11 1.29 .127 .104 FALSENone-High-None-None 32% 42% .005 .096 0.48 0.72 .001 .237 TRUENone-None-High 75% 80% .065 .063 1.97 2.29 .026 .151 TRUENone-None-High-High 63% 68% .161 .048 1.22 1.34 .191 .089 FALSENone-None-High-High-High 51% 52% .918 .003 0.81 0.79 .752 .021 FALSENone-None-High-High-High-High 42% 37% .174 .046 0.58 0.53 .343 .064 FALSENone-None-High-None 34% 40% .047 .067 0.52 0.69 .012 .170 TRUENone-None-None 92% 96% .033 .072 6.61 9.74 .000 .395 TRUEcontinued . . .59. . . 
continuedSS APFPattern Low High p φ Low High p d Passes S2None-None-None-High 58% 68% .003 .101 1.12 1.44 .002 .212 TRUENone-None-None-High-High 46% 52% .073 .061 0.69 0.83 .045 .136 TRUENone-None-None-None 77% 87% .000 .129 4.21 6.77 .000 .399 TRUENone-None-None-None-High 44% 56% .000 .118 0.69 1.00 .000 .265 TRUENone-None-None-None-High-High 31% 40% .005 .094 0.42 0.59 .002 .211 TRUENone-None-None-None-None 60% 73% .000 .135 2.73 4.76 .000 .387 TRUENone-None-None-None-None-High 34% 43% .005 .095 0.47 0.71 .000 .242 TRUENone-None-None-None-None-None 42% 57% .000 .148 1.75 3.33 .000 .371 TRUENone-None-None-None-None-None-None 31% 46% .000 .161 1.15 2.36 .000 .350 TRUENone-None-None-Text 37% 44% .043 .069 0.49 0.65 .004 .196 TRUENone-None-Text 64% 61% .425 .027 1.05 1.06 .951 .004 FALSENone-None-Text-Text 42% 35% .028 .075 0.56 0.45 .034 .144 TRUENone-Text-None 49% 48% .972 .001 0.72 0.78 .370 .061 FALSENone-Text-Text 69% 57% .000 .122 1.18 0.91 .000 .239 TRUENone-Text-Text-Text 54% 42% .001 .116 0.77 0.57 .001 .235 TRUENone-Text-Text-Text-Text 41% 33% .015 .082 0.52 0.39 .006 .189 TRUEText-None-None 60% 63% .342 .032 1.05 1.08 .706 .026 FALSEText-None-None-None 39% 51% .001 .116 0.52 0.73 .000 .246 TRUEText-Text-None 62% 53% .013 .085 1.06 0.78 .000 .268 TRUEText-Text-Text 68% 53% .000 .144 4.26 3.01 .000 .238 TRUEText-Text-Text-None 47% 36% .001 .111 0.64 0.44 .000 .264 TRUEText-Text-Text-Text 53% 43% .002 .105 3.17 2.22 .002 .211 TRUEText-Text-Text-Text-Text 46% 35% .003 .102 2.43 1.67 .005 .193 TRUEBar+Radar, Verbal Workig MemoryHigh-High-High 86% 86% .934 .003 10.57 9.94 .405 .057 FALSEHigh-High-High-High 76% 74% .542 .021 8.09 7.48 .371 .061 FALSEHigh-High-High-High-High 66% 63% .345 .032 6.35 5.77 .352 .063 FALSEcontinued . . .60. . . continuedSS APFPattern Low High p φ Low High p d Passes S2High-High-High-High-High-High 57% 53% .302 .035 5.12 4.53 .300 .070 FALSEHigh-High-High-High-High-High-High 47% 44% .544 .021 4.18 3.62 .278 .074 FALSEHigh-High-High-High-High-High-None 44% 43% .908 .004 0.64 0.64 .990 .001 FALSEHigh-High-High-High-High-None 52% 51% .698 .013 0.84 0.86 .797 .017 FALSEHigh-High-High-High-None 64% 59% .143 .050 1.21 1.17 .621 .034 FALSEHigh-High-High-None 74% 72% .514 .022 1.69 1.68 .936 .005 FALSEHigh-High-High-None-None 50% 51% .663 .015 0.72 0.73 .829 .015 FALSEHigh-High-Labels 44% 46% .448 .026 0.68 0.73 .439 .052 FALSEHigh-High-None 84% 85% .923 .003 2.55 2.72 .269 .075 FALSEHigh-High-None-High 45% 47% .684 .014 0.96 1.04 .436 .053 FALSEHigh-High-None-None 63% 68% .144 .049 1.09 1.18 .268 .075 FALSEHigh-High-None-None-None 48% 50% .559 .020 0.64 0.72 .203 .086 FALSEHigh-None-High 53% 59% .091 .057 1.48 1.71 .130 .103 FALSEHigh-None-High-High 45% 49% .250 .039 0.98 1.09 .312 .069 FALSEHigh-None-None 75% 79% .147 .049 1.86 2.03 .208 .085 FALSEHigh-None-None-None 61% 65% .238 .040 1.09 1.22 .175 .092 FALSEHigh-None-None-None-None 47% 52% .143 .050 0.73 0.78 .485 .047 FALSELabels-None-None 52% 48% .262 .038 0.88 0.78 .187 .090 FALSELabels-None-None-None 40% 36% .237 .040 0.59 0.51 .154 .097 FALSENone-High-High 83% 84% .776 .010 2.69 2.88 .247 .079 FALSENone-High-High-High 73% 74% .791 .009 1.76 1.77 .969 .003 FALSENone-High-High-High-High 62% 62% .999 .000 1.23 1.22 .923 .007 FALSENone-High-High-High-High-High 51% 52% .983 .001 0.84 0.89 .542 .041 FALSENone-High-High-High-High-High-High 41% 43% .656 .015 0.63 0.65 .833 .014 FALSENone-High-High-None 41% 45% .326 .033 0.66 0.81 .036 .142 TRUENone-High-None 51% 54% .371 .030 1.10 1.31 .081 .119 FALSENone-None-High 
77% 78% .876 .005 2.14 2.12 .862 .012 FALSENone-None-High-High 66% 66% .974 .001 1.28 1.28 .992 .001 FALSEcontinued . . .61. . . continuedSS APFPattern Low High p φ Low High p d Passes S2None-None-High-High-High 52% 52% .985 .001 0.82 0.79 .572 .038 FALSENone-None-None 94% 94% .974 .001 9.00 7.38 .003 .201 TRUENone-None-None-High 63% 63% .878 .005 1.31 1.25 .596 .036 FALSENone-None-None-High-High 50% 49% .765 .010 0.75 0.76 .859 .012 FALSENone-None-None-None 82% 83% .908 .004 6.18 4.82 .002 .208 TRUENone-None-None-None-High 48% 52% .327 .033 0.87 0.83 .603 .035 FALSENone-None-None-None-None 67% 66% .797 .009 4.34 3.17 .001 .220 TRUENone-None-None-None-None-None 51% 48% .329 .033 3.03 2.06 .001 .224 TRUENone-None-None-None-None-None-None 42% 35% .060 .064 2.15 1.37 .001 .223 TRUENone-None-None-Text 45% 36% .008 .089 0.67 0.47 .000 .244 TRUENone-None-Text 67% 59% .024 .076 1.16 0.95 .005 .191 TRUENone-Text-None 55% 42% .000 .127 0.90 0.60 .000 .302 TRUENone-Text-Text 66% 60% .089 .058 1.10 0.99 .174 .092 FALSENone-Text-Text-Text 53% 43% .002 .104 0.72 0.62 .088 .116 TRUENone-Text-Text-Text-Text 43% 31% .001 .114 0.52 0.40 .016 .163 TRUEText-None-None 70% 54% .000 .168 1.27 0.88 .000 .338 TRUEText-None-None-None 52% 39% .000 .124 0.78 0.47 .000 .374 TRUEText-Text-None 62% 53% .012 .085 0.97 0.87 .150 .098 TRUEText-Text-Text 66% 55% .001 .112 4.48 2.81 .000 .319 TRUEText-Text-Text-None 46% 37% .007 .092 0.58 0.50 .108 .109 TRUEText-Text-Text-Text 54% 42% .001 .115 3.47 1.94 .000 .339 TRUEText-Text-Text-Text-Text 47% 34% .000 .129 2.76 1.36 .000 .357 TRUEBar+Radar, Visual Workig MemoryHigh-High-High 86% 85% .618 .017 9.69 10.80 .145 .099 FALSEHigh-High-High-High 74% 76% .625 .017 7.36 8.19 .223 .083 FALSEHigh-High-High-High-High 62% 66% .239 .040 5.77 6.34 .354 .063 FALSEHigh-High-High-High-High-High 55% 55% .954 .002 4.63 5.01 .501 .046 FALSEcontinued . . .62. . . 
continuedSS APFPattern Low High p φ Low High p d Passes S2High-High-High-High-High-High-High 47% 44% .458 .025 3.75 4.04 .573 .038 FALSEHigh-High-High-High-High-High-None 44% 43% .801 .009 0.63 0.65 .653 .031 FALSEHigh-High-High-High-High-None 50% 53% .405 .028 0.79 0.90 .138 .101 FALSEHigh-High-High-High-None 59% 64% .186 .045 1.11 1.27 .091 .115 FALSEHigh-High-High-High-None-None 34% 42% .036 .071 0.44 0.53 .062 .127 TRUEHigh-High-High-None 72% 74% .567 .019 1.61 1.76 .162 .095 FALSEHigh-High-High-None-None 47% 54% .039 .070 0.67 0.79 .045 .136 TRUEHigh-High-Labels 46% 44% .741 .011 0.69 0.72 .568 .039 FALSEHigh-High-None 83% 86% .161 .048 2.55 2.73 .257 .077 FALSEHigh-High-None-High 47% 46% .785 .009 0.95 1.05 .340 .065 FALSEHigh-High-None-None 63% 69% .081 .059 1.09 1.19 .221 .083 FALSEHigh-High-None-None-None 47% 51% .207 .043 0.66 0.70 .523 .043 FALSEHigh-None-High 56% 55% .933 .003 1.57 1.61 .783 .019 FALSEHigh-None-High-High 48% 45% .538 .021 0.96 1.11 .152 .097 FALSEHigh-None-None 74% 80% .076 .060 1.87 2.02 .263 .076 FALSEHigh-None-None-None 59% 67% .015 .083 1.11 1.20 .332 .066 TRUEHigh-None-None-None-None 47% 51% .290 .036 0.72 0.79 .303 .070 FALSEHigh-None-None-None-None-None 36% 41% .164 .047 0.49 0.56 .208 .086 FALSELabels-None-None 51% 50% .863 .006 0.78 0.87 .226 .082 FALSENone-High-High 83% 85% .402 .028 2.63 2.93 .074 .121 FALSENone-High-High-High 72% 74% .568 .019 1.65 1.88 .052 .132 FALSENone-High-High-High-High 60% 63% .404 .028 1.14 1.32 .057 .129 FALSENone-High-High-High-High-High 50% 53% .332 .033 0.80 0.93 .074 .121 FALSENone-High-High-High-High-High-High 42% 43% .973 .001 0.60 0.68 .238 .080 FALSENone-High-High-None 41% 46% .163 .047 0.71 0.76 .507 .045 FALSENone-High-None 50% 55% .193 .044 1.21 1.20 .959 .004 FALSENone-None-High 74% 81% .023 .077 1.99 2.27 .053 .131 TRUENone-None-High-High 64% 67% .374 .030 1.21 1.34 .143 .099 FALSEcontinued . . .63. . . 
continuedSS APFPattern Low High p φ Low High p d Passes S2None-None-High-High-High 51% 52% .877 .005 0.77 0.83 .342 .064 FALSENone-None-High-High-High-High 37% 41% .242 .040 0.53 0.58 .315 .068 FALSENone-None-None 92% 96% .024 .076 7.50 8.83 .014 .166 TRUENone-None-None-High 60% 66% .070 .061 1.19 1.37 .077 .120 FALSENone-None-None-High-High 48% 51% .357 .031 0.72 0.79 .280 .073 FALSENone-None-None-None 79% 86% .015 .082 5.03 5.94 .039 .140 TRUENone-None-None-None-High 48% 52% .211 .042 0.79 0.90 .164 .094 FALSENone-None-None-None-None 62% 70% .029 .074 3.45 4.03 .103 .110 TRUENone-None-None-None-None-High 35% 41% .082 .059 0.55 0.63 .227 .082 FALSENone-None-None-None-None-None 49% 50% .807 .008 2.36 2.71 .238 .080 FALSENone-None-None-Text 40% 42% .695 .013 0.52 0.61 .084 .117 FALSENone-None-Text 63% 63% .991 .000 0.97 1.14 .025 .152 TRUENone-None-Text-Text 37% 41% .238 .040 0.46 0.55 .077 .120 FALSENone-Text-None 46% 51% .205 .043 0.68 0.81 .038 .141 TRUENone-Text-Text 60% 66% .070 .061 0.94 1.15 .007 .182 TRUENone-Text-Text-Text 45% 51% .069 .062 0.62 0.71 .115 .107 FALSEText-None-None 57% 67% .003 .102 0.91 1.22 .000 .274 TRUEText-None-None-None 41% 49% .020 .079 0.53 0.72 .001 .228 TRUEText-Text-None 54% 61% .043 .069 0.78 1.05 .000 .253 TRUEText-Text-Text 58% 63% .094 .057 3.10 4.15 .003 .198 TRUEText-Text-Text-None 37% 45% .012 .085 0.46 0.61 .003 .201 TRUEText-Text-Text-Text 45% 51% .107 .055 2.24 3.13 .004 .196 TRUEText-Text-Text-Text-Text 37% 44% .047 .067 1.66 2.43 .004 .195 TRUEIntervention, Perceptual SpeedHigh-High-High 77% 73% .004 .047 14.23 11.19 .000 .218 TRUEHigh-High-High-High 63% 60% .146 .024 12.52 9.82 .000 .207 TRUEHigh-High-High-High-High 56% 54% .301 .017 11.30 8.83 .000 .202 TRUEcontinued . . .64. . . continuedSS APFPattern Low High p φ Low High p d Passes S2High-High-High-High-High-High 53% 52% .438 .013 10.30 8.03 .000 .199 TRUEHigh-High-High-High-High-High-High 52% 50% .422 .013 9.45 7.32 .000 .197 TRUEHigh-High-Input 40% 45% .010 .043 0.47 0.50 .117 .052 TRUEHigh-High-None 44% 37% .000 .064 0.79 0.56 .000 .211 TRUEHigh-High-Q 43% 40% .100 .027 0.53 0.51 .394 .028 FALSEInput-Input-Input 50% 37% .000 .128 1.27 0.78 .000 .264 TRUEInput-Q-Q 45% 45% .910 .002 0.64 0.62 .488 .023 FALSELegend-High-High 51% 52% .608 .009 0.78 0.74 .276 .037 FALSELegend-High-High-High 42% 39% .122 .026 0.59 0.51 .003 .102 TRUELegend-Legend-High 51% 53% .160 .023 0.74 0.76 .577 .019 FALSELegend-Legend-High-High 44% 45% .353 .015 0.62 0.60 .477 .024 FALSELegend-Legend-Legend 72% 74% .385 .014 3.52 3.54 .843 .007 FALSELegend-Legend-Legend-High 37% 41% .036 .035 0.46 0.49 .250 .038 TRUELegend-Legend-Legend-Legend 57% 57% .982 .000 2.21 2.20 .932 .003 FALSELegend-Legend-Legend-Legend-Legend 41% 41% .913 .002 1.37 1.37 .944 .002 FALSELegend-Legend-None 41% 34% .000 .076 0.54 0.42 .000 .177 TRUELegend-Q-Q 36% 42% .001 .055 0.50 0.58 .004 .097 TRUENone-High-High 50% 41% .000 .091 0.87 0.62 .000 .223 TRUENone-Q-Q 50% 45% .001 .053 0.72 0.58 .000 .174 TRUENone-Q-Q-Q 43% 39% .007 .045 0.56 0.46 .000 .147 TRUEQ-Input-Input 41% 30% .000 .109 0.48 0.33 .000 .256 TRUEQ-Legend-Legend 48% 54% .000 .061 0.73 0.85 .000 .130 TRUEQ-Legend-Legend-Legend 41% 45% .009 .044 0.53 0.62 .000 .121 TRUEQ-Q-Legend 47% 55% .000 .076 0.72 0.87 .000 .147 TRUEQ-Q-Legend-Legend 42% 50% .000 .080 0.60 0.75 .000 .167 TRUEQ-Q-Legend-Legend-Legend 35% 41% .000 .067 0.45 0.55 .000 .139 TRUEQ-Q-None 50% 42% .000 .079 0.73 0.56 .000 .197 TRUEQ-Q-Q 93% 95% .011 .042 12.15 11.73 .166 .047 TRUEcontinued . . .65. . . 
continuedSS APFPattern Low High p φ Low High p d Passes S2Q-Q-Q-Legend 39% 48% .000 .092 0.54 0.69 .000 .179 TRUEQ-Q-Q-Legend-Legend 35% 43% .000 .090 0.45 0.59 .000 .187 TRUEQ-Q-Q-None 40% 35% .003 .050 0.53 0.43 .000 .139 TRUEQ-Q-Q-Q 88% 92% .000 .063 10.04 9.66 .165 .047 TRUEQ-Q-Q-Q-Legend 32% 41% .000 .086 0.41 0.51 .000 .154 TRUEQ-Q-Q-Q-Q 83% 87% .001 .054 8.44 8.09 .167 .046 TRUEQ-Q-Q-Q-Q-Q 78% 81% .026 .037 7.16 6.84 .167 .046 TRUEQ-Q-Q-Q-Q-Q-Q 71% 74% .046 .033 6.10 5.80 .163 .047 TRUEIntervention, Verbal Working MemoryHigh-High-High 75% 75% .591 .009 13.78 11.72 .000 .145 TRUEHigh-High-High-High 63% 61% .272 .018 12.06 10.32 .000 .131 TRUEHigh-High-High-High-High 56% 55% .469 .012 10.83 9.33 .000 .121 TRUEHigh-High-High-High-High-High 53% 52% .409 .014 9.83 8.51 .001 .114 TRUEHigh-High-High-High-High-High-High 52% 51% .565 .010 8.98 7.79 .002 .109 TRUEHigh-High-Input 40% 45% .007 .045 0.46 0.50 .072 .061 TRUEHigh-High-None 42% 39% .225 .020 0.77 0.59 .000 .162 TRUEHigh-High-Q 47% 38% .000 .089 0.60 0.45 .000 .212 TRUEHigh-Q-Q 40% 31% .000 .092 0.51 0.37 .000 .216 TRUEInput-Input-Input 39% 46% .000 .065 0.88 1.08 .001 .110 TRUEInput-Q-Q 43% 47% .019 .039 0.61 0.64 .272 .037 TRUELegend-High-High 51% 52% .546 .010 0.78 0.74 .148 .050 FALSELegend-High-High-High 40% 41% .608 .009 0.57 0.53 .125 .052 FALSELegend-Legend-High 51% 53% .312 .017 0.76 0.74 .443 .026 FALSELegend-Legend-High-High 44% 45% .848 .003 0.63 0.59 .219 .042 FALSELegend-Legend-Legend 73% 73% .625 .008 3.55 3.52 .858 .006 FALSELegend-Legend-Legend-Legend 57% 57% .765 .005 2.17 2.24 .517 .022 FALSELegend-Legend-Legend-Legend-Legend 40% 41% .455 .012 1.29 1.43 .089 .057 FALSEcontinued . . .66. . . continuedSS APFPattern Low High p φ Low High p d Passes S2Legend-Legend-Q 42% 35% .000 .066 0.59 0.46 .000 .161 TRUELegend-Q-Q 43% 37% .000 .064 0.63 0.49 .000 .161 TRUENone-High-High 48% 43% .002 .052 0.85 0.64 .000 .184 TRUENone-Q-Q 52% 44% .000 .079 0.74 0.57 .000 .197 TRUENone-Q-Q-Q 45% 38% .000 .073 0.59 0.44 .000 .210 TRUEQ-Legend-Legend 52% 51% .377 .015 0.86 0.75 .002 .108 TRUEQ-Legend-Legend-Legend 44% 42% .176 .022 0.63 0.54 .002 .108 TRUEQ-Q-High 42% 37% .002 .051 0.55 0.45 .000 .143 TRUEQ-Q-Legend 52% 51% .463 .012 0.87 0.76 .002 .108 TRUEQ-Q-Legend-Legend 47% 46% .307 .017 0.74 0.64 .001 .116 TRUEQ-Q-None 50% 42% .000 .079 0.74 0.56 .000 .214 TRUEQ-Q-Q 95% 94% .788 .004 13.30 10.95 .000 .260 TRUEQ-Q-Q-Legend 45% 43% .223 .020 0.67 0.59 .006 .093 TRUEQ-Q-Q-Legend-Legend 41% 38% .142 .024 0.57 0.50 .005 .096 TRUEQ-Q-Q-None 42% 34% .000 .082 0.57 0.41 .000 .234 TRUEQ-Q-Q-Q 91% 90% .158 .023 10.99 9.02 .000 .238 TRUEQ-Q-Q-Q-Q 86% 85% .116 .026 9.21 7.57 .000 .214 TRUEQ-Q-Q-Q-Q-Q 82% 78% .011 .042 7.81 6.41 .000 .197 TRUEQ-Q-Q-Q-Q-Q-Q 74% 71% .022 .038 6.64 5.44 .000 .182 TRUEIntervention, Visual Working MemoryHigh-High-High 77% 73% .009 .043 13.05 12.15 .050 .065 TRUEHigh-High-High-High 63% 60% .137 .025 11.35 10.78 .185 .044 FALSEHigh-High-High-High-High 56% 54% .471 .012 10.14 9.78 .372 .030 FALSEHigh-High-High-High-High-High 53% 52% .688 .007 9.17 8.96 .582 .018 FALSEHigh-High-High-High-High-High-High 51% 51% .794 .004 8.34 8.23 .770 .010 FALSEHigh-High-Input 43% 42% .803 .004 0.50 0.48 .395 .028 FALSEHigh-High-None 42% 39% .020 .039 0.76 0.58 .000 .164 TRUEcontinued . . .67. . . 
High-High-Q 46% 38% .000 .087 0.60 0.44 .000 .221 TRUE
High-Q-Q 40% 30% .000 .106 0.51 0.35 .000 .259 TRUE
Input-Input-Input 45% 41% .025 .037 1.08 0.92 .009 .087 TRUE
Input-Q-Q 45% 45% .793 .004 0.64 0.62 .566 .019 FALSE
Legend-High-High 52% 50% .223 .020 0.79 0.73 .079 .059 FALSE
Legend-High-High-High 40% 40% .761 .005 0.57 0.53 .110 .053 FALSE
Legend-Legend-High 54% 51% .067 .030 0.80 0.71 .007 .090 TRUE
Legend-Legend-High-High 46% 43% .022 .038 0.65 0.57 .003 .098 TRUE
Legend-Legend-Legend 71% 75% .006 .046 3.44 3.61 .184 .044 TRUE
Legend-Legend-Legend-High 41% 38% .060 .031 0.51 0.45 .022 .077 TRUE
Legend-Legend-Legend-Legend 56% 58% .143 .024 2.14 2.26 .240 .039 FALSE
Legend-Legend-Legend-Legend-Legend 39% 42% .121 .026 1.32 1.41 .267 .037 FALSE
None-High-High 48% 43% .001 .056 0.83 0.64 .000 .167 TRUE
None-Q-Q 46% 48% .191 .022 0.63 0.65 .562 .019 FALSE
None-Q-Q-Q 40% 41% .510 .011 0.51 0.50 .605 .017 FALSE
Q-Legend-Legend 50% 52% .250 .019 0.79 0.80 .776 .009 FALSE
Q-Legend-Legend-Legend 42% 44% .332 .016 0.57 0.58 .793 .009 FALSE
Q-Q-High 42% 36% .001 .057 0.56 0.44 .000 .169 TRUE
Q-Q-Legend 50% 52% .167 .023 0.79 0.81 .546 .020 FALSE
Q-Q-Legend-Legend 46% 47% .403 .014 0.68 0.68 .769 .010 FALSE
Q-Q-None 47% 45% .228 .020 0.65 0.62 .213 .041 FALSE
Q-Q-Q 94% 95% .879 .003 12.38 11.52 .004 .096 TRUE
Q-Q-Q-Legend 43% 45% .170 .023 0.60 0.63 .289 .035 FALSE
Q-Q-Q-Legend-Legend 39% 40% .325 .016 0.52 0.54 .411 .027 FALSE
Q-Q-Q-Q 91% 90% .696 .006 10.21 9.50 .010 .087 TRUE
Q-Q-Q-Q-Q 86% 85% .260 .019 8.56 7.98 .021 .077 TRUE
Q-Q-Q-Q-Q-Q 81% 78% .048 .033 7.24 6.76 .040 .069 TRUE
Q-Q-Q-Q-Q-Q-Q 74% 71% .062 .031 6.15 5.75 .069 .061 FALSE

Appendix B
Additional Analyses

Given the promising results obtained with the summative eye-tracking features in Chapter 4, we conduct two follow-up analyses: one extends the results beyond classifying cognitive abilities, and the other details the performance of the best feature set without pupil features.

B.1 Visualization Expertise and Locus of Control

In addition to the three cognitive abilities, we applied our classification approach to two further user characteristics: visualization expertise and locus of control. Both were recorded in the INTERVENTION study: visualization expertise was measured as how frequently the participants used and created bar charts; locus of control, a personality trait describing the extent to which a person believes they are in control of events affecting them, was measured with the standard I-E scale [39]. In the BAR+RADAR study, locus of control was not recorded, and visualization expertise was surveyed separately as expertise with bar charts and expertise with radar charts, so for simplicity we focus this classification analysis on the INTERVENTION dataset only. A previous experiment on classifying visualization expertise and locus of control with the INTERVENTION dataset did not find a classifier that could outperform the majority-class baseline [17].

Table B.1: Ranking of feature sets by accuracy for the random forest classifier. Feature sets whose differences in accuracy are not significant at α = 0.05 are underlined. The best-performing feature set without pupil features is in bold.
User Characteristic       Ranking of Feature Sets by Accuracy
Visualization Expertise   P+HD > G+P+HD > G+HD > P > HD > G
Locus of Control          P+HD > G+P+HD > G+HD > P > HD > G

We followed the same evaluation process as in Chapter 4. First, we created binary classification labels for visualization expertise and locus of control using a median split.
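To make this labeling step concrete, here is a minimal sketch (not the thesis code): it derives binary Low/High labels from raw scores via a median split, using pandas with hypothetical column names and toy values. The thesis does not state how scores exactly at the median were handled, so this sketch assigns them to the Low class.

```python
# Minimal median-split sketch; column names and scores are hypothetical.
import pandas as pd

def median_split_labels(scores: pd.Series) -> pd.Series:
    """Binary Low/High labels from raw scores via a median split.

    Scores equal to the median fall into the Low class (an assumption;
    the thesis does not specify the tie-breaking rule).
    """
    return (scores > scores.median()).map({True: "High", False: "Low"})

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "vis_expertise": [2, 5, 3, 4, 1, 5],         # e.g., self-reported bar-chart use
    "locus_of_control": [8, 12, 10, 15, 9, 11],  # e.g., I-E scale score
})

for characteristic in ("vis_expertise", "locus_of_control"):
    users[characteristic + "_label"] = median_split_labels(users[characteristic])

print(users)
```

The resulting Low/High labels serve as the binary classification targets for the evaluation described next.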
Then, we conducted 10 runs of 10-fold cross-validation to obtain the classification accuracies of the same three classifiers (MAJORITY CLASS, LOGISTIC REGRESSION, RANDOM FOREST) and six feature sets (G, P, HD, P+HD, G+HD, G+P+HD). Finally, we performed a two-way ANOVA with classifier and feature set as the factors and the accuracy of classifying each of the two user characteristics as the dependent measure.

For both visualization expertise and locus of control, the main effects of classifier and feature set, as well as their interaction, are statistically significant with large effect sizes. The classification accuracy results are shown in Figure B.1, and the feature sets are ranked by accuracy in Table B.1.

For visualization expertise, LOGISTIC REGRESSION performs above the baseline only when gaze features are present, i.e., with the G, G+HD, and G+P+HD feature sets, whereas RANDOM FOREST outperforms both LOGISTIC REGRESSION and MAJORITY CLASS with all six feature sets. The best-performing feature set with RANDOM FOREST is again P+HD, i.e., the combination of pupil size and head distance to screen features.

For locus of control, both LOGISTIC REGRESSION and RANDOM FOREST beat the MAJORITY CLASS baseline, and RANDOM FOREST outperforms LOGISTIC REGRESSION with all six feature sets. The best-performing feature set with RANDOM FOREST is P+HD, the same as for visualization expertise and for the three cognitive abilities in Chapter 4.

[Figure B.1: Accuracies of the three classifiers and six feature sets for classifying expertise and locus of control in the INTERVENTION dataset. Error bars are 95% confidence intervals.]

B.2 Over-time Accuracy for Best Non-Pupil Feature Set

In addition to the analysis (in Section 4.4.3) of the over-time accuracies for the random forest classifier with the best-performing P+HD feature set, we also obtained the accuracy trend for the best feature set without pupil features on each of the six classification tasks (i.e., the three cognitive abilities in each of the two datasets). The best feature sets without pupil features are listed in Table 4.2, highlighted in bold.

As shown in Figure B.2, the six classification tasks reach 65–70% accuracy after seeing the first 10% of trial data, and the accuracies increase to 68–72% halfway through the trial. In three of the six tasks, the peak accuracies occur at the end of the trial, whereas in the other three tasks the peaks occur earlier, at the 50%, 70%, and 90% marks of the trial (the differences between the peak accuracies and the end-of-trial accuracies are less than 1%). The increasing trend is more stable in the INTERVENTION dataset, likely because the INTERVENTION dataset contains more data points than the BAR+RADAR dataset.

Overall, the accuracy of the best feature set without pupil features never exceeds that of the P+HD feature set (Figure 4.3) at any point in the trial.
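To illustrate how such an over-time evaluation can be set up, the sketch below (again not the thesis pipeline) estimates 10-fold cross-validation accuracy from increasing proportions of observed trial data. scikit-learn stands in for whatever toolkit was actually used, and extract_features is a hypothetical placeholder for recomputing the summative gaze and head-distance statistics from only the first fraction of each trial; a full replication would also repeat the cross-validation ten times with different fold assignments and average the accuracies, as described in Section B.1.

```python
# Sketch of partial-trial (over-time) accuracy estimation; extract_features,
# the trial list, and the labels are placeholders, not real thesis data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def extract_features(trials, fraction):
    """Hypothetical: summative features from the first `fraction` of each trial.

    A real implementation would truncate each trial's gaze and head-distance
    samples at `fraction` of its duration and recompute the summary statistics.
    """
    rng = np.random.default_rng(int(fraction * 100))  # placeholder data only
    return rng.normal(size=(len(trials), 12))

trials = list(range(120))       # placeholder trial identifiers
labels = np.array([0, 1] * 60)  # binary Low/High labels from the median split

for fraction in np.arange(0.1, 1.01, 0.1):
    X = extract_features(trials, fraction)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, labels, cv=cv, scoring="accuracy")
    print(f"{fraction:4.0%} of trial observed: mean accuracy = {scores.mean():.3f}")
```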
Still, in the absence of pupil size measures, gaze and head distance to screen features can support reasonably high accuracies even with partially observed trials.

[Figure B.2: Trends in classification accuracy for random forest with the best non-pupil feature set, as a function of the amount of observed data in a trial. The vertical axis does not start at zero, to show the trend in detail.]"@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2015-11"@en ; edm:isShownAt "10.14288/1.0165831"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Computer Science"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivs 2.5 Canada"@* ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/2.5/ca/"@* ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Inferring user cognitive abilities from eye-tracking data"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/55086"@en .