USER CHARACTERISTICS AND EYE TRACKING TO INFORM THE DESIGN OF USER-ADAPTIVE INFORMATION VISUALIZATIONS

by

Dereck Toker

B.A., The University of British Columbia, 2010
M.Sc., The University of British Columbia, 2013

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

September 2019

© Dereck Toker, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled: User Characteristics and Eye Tracking to Inform the Design of User-Adaptive Information Visualizations, submitted by Dereck Toker in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Examining Committee:
Cristina Conati, Supervisor
Giuseppe Carenini, Supervisory Committee Member
Tamara Munzner, Supervisory Committee Member
James Enns, Supervisory Committee Member
Carson Woo, University Examiner
Ron Rensink, University Examiner
Remco Chang, External Examiner

Abstract

Amidst an ever-increasing amount of digital information, information visualizations have become a fundamental tool to support tasks for discovering, presenting, and understanding the many underlying trends in this data. Ongoing efforts to improve the effectiveness of visualizations, however, have typically been limited to their design and evaluation following a one-size-fits-all model, meaning that they do not take into account the individual differences of their users. There is mounting evidence, though, that user differences such as cognitive abilities, personality traits, learning abilities, and preferences can significantly influence user performance and satisfaction during information visualization tasks, thus motivating a need for personalization.

In this thesis, our primary goal is to inform the design of user-adaptive visualizations, namely, visualizations that aim to recognize and adapt to each user's specific needs. We conducted three different user studies to address several key questions for designing user-adaptive visualizations: i) What characteristics of the user should be considered to drive adaptation? ii) How can a visualization system adequately adapt to these user characteristics? and iii) When should adaptations be delivered in order to maximize effectiveness and reduce intrusiveness?

In our first study, we tested the effectiveness of highlighting interventions on bar chart visualizations and examined the role that several cognitive abilities may have on visualization processing. Results from this study provide contributions showing that: highlighting relevant information in real time can be beneficial to bar chart processing; certain user characteristics may only warrant adaptation as task complexity increases; users with low Verbal Working Memory may need interventions that facilitate processing of the visualization's legend; and adapting to users' level of Evolving Skill with a visualization is possible using eye tracking to make real-time predictions of this user characteristic.

In our second and third studies, we investigate visualizations embedded in narrative text, referred to as Magazine Style Narrative Visualization (MSNV).
Results from these two studies provide contributions showing that: Verbal Working Memory and English Reading Ability can impact users' ability to effectively process MSNVs, supporting a need for adaptation; and, in particular, that low Reading Ability users might benefit from adaptations helping them locate relevant information in the visualizations.

Lay Summary

Our primary goal is to inform the design of user-adaptive information visualizations that can tailor their interaction to support each user's specific needs. We conduct several user studies with bar chart visualizations and visualizations embedded in narrative text to address three key questions for designing such systems: i) What characteristics of the user should be considered to drive adaptation? ii) How can visualizations adequately adapt to these user characteristics? iii) When should adaptations be delivered to maximize effectiveness and reduce intrusiveness? Our contributions identify several user characteristics that may warrant adaptation, and we analyze eye tracking data to provide insights on how such adaptations could be devised. We also show that adapting by dynamically highlighting relevant information can be beneficial to bar chart processing. In addition, we show that adapting to a user's Evolving Skill level with a visualization is possible by predicting it in real time from eye tracking data.

Preface

The research presented in this thesis was conducted in the Laboratory for Computational Intelligence [LCI] under the direction of Dr. Cristina Conati, and is part of the Intelligent User Interface Group [IUI] in the Department of Computer Science. The user studies presented herein were conducted under the UBC Office of Research Ethics certificate number H10-01320.

This thesis is presented as a 'manuscript' thesis, meaning that the core chapters, 2 through 7, are all based on published articles. For the introduction and conclusion chapters, 1 and 8, I contributed the writing and framing, with feedback provided by my supervisor (Dr. Cristina Conati) and my supervisory committee (Drs. Giuseppe Carenini, James Enns, and Tamara Munzner).

A version of Chapter 2 was published at CHI'14 [32]: Carenini, Conati, Hoque, Steichen, Toker, and Enns. (2014) Highlighting interventions and user differences: informing adaptive information visualization support. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, pages 1835-1844. The author order is alphabetical (i.e., equal contribution), and my contribution to the paper included: helping design the user study and highlighting interventions, running study participants, establishing and running the data analysis, reporting all of the results, and writing many parts of the paper. I also attended and presented the paper at the conference. The remaining authors contributed to the study design, overseeing the analysis, and writing.

A version of Chapter 3 was published at UMAP'14 [154]: Toker and Conati. (2014) Eye tracking to understand user differences in visualization processing with highlighting interventions. User Modeling, Adaptation, and Personalization. Lecture Notes in Computer Science (UMAP '14). As first author, I carried out the analysis, interpreted the results, and wrote the majority of the paper. I also attended the conference and presented the research. The secondary author(s) contributed to providing feedback on the analysis and paper writing.

A version of Chapter 4 was published at UMAP'17 [155]: Toker, Lallé, and Conati.
(2017) Leveraging Pupil Dilation Measures for Understanding Users' Cognitive Load During Visualization Processing. Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP '17). I attended the conference and presented this paper in the workshop on Human Aspects in Adaptive and Personalized Interactive Environments (HAAPIE). I carried out the analysis, interpreted the results, and wrote the majority of the paper. The secondary author(s) contributed to providing feedback on the analysis and paper writing.

A version of Chapter 5 was published at IUI'17 [160]: Toker, Lallé, and Conati. (2017) Pupillometry and Head Distance to the Screen to Predict Skill Acquisition During Information Visualization Tasks. Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI '17). As first author, I carried out the data analysis, interpreted the results, and wrote the majority of the paper. I attended the conference and also presented the work. The secondary author(s) contributed to providing feedback on the analysis and paper writing.

A version of Chapter 6 was accepted for publication at UMUAI: Toker, Conati, and Carenini. (2019) Gaze Analysis of User Characteristics in Magazine Style Narrative Visualizations. The Journal of Personalization Research: User Modeling and User-Adapted Interaction (UMUAI). As first author, I led the design of the user study, carried out the analyses, interpreted the results, and wrote the majority of the manuscript. The secondary author(s) contributed to providing feedback on the analysis and paper writing.

A version of Chapter 7 was published at UMAP'19 [161]: Toker, Moro, Simko, Bielikova, and Conati. (2019) Impact of English Reading Comprehension Abilities on Processing Magazine Style Narrative Visualizations and Implications for Personalization. Proceedings of the 27th Conference on User Modeling, Adaptation and Personalization (UMAP '19). As first author, I helped modify the user study I had previously designed so that it could be carried out in a different country. I also conducted the main data analyses, interpreted the results, and wrote a significant portion of the paper. The secondary author(s) contributed to providing feedback on the analysis and paper writing, as well as conducting the study in Bratislava.

Table of Contents

Abstract ....................... iii Lay Summary ....................... v Preface ....................... vi Table of Contents ....................... ix List of Tables ....................... xv List of Figures ....................... xix Acknowledgements .......................
xxii Dedication ..................................................................................................................................... xxiv Chapter 1: Introduction ................................................................................................................... 1 1.1 Motivation........................................................................................................................................ 1 1.2 User-Adaptive Interaction ............................................................................................................ 2 1.3 Thesis Structure Overview .......................................................................................................... 4 1.4 Thesis Contributions ..................................................................................................................... 7 1.4.1 What to Adapt To? .................................................................................................................... 7 1.4.2 How to Adapt? ......................................................................................................................... 10 1.4.3 When to Adapt? ....................................................................................................................... 12 1.5 Analysis Methodology ................................................................................................................ 14 Chapter 2: Impact of User Characteristics on Task Performance with Highlighting Interventions .................................................................................................................................... 19 2.1 Introduction ................................................................................................................................... 19 x  2.2 Related Work ................................................................................................................................. 22 2.3 User Study ...................................................................................................................................... 23 2.3.1 Experimental Tasks ................................................................................................................. 24 2.3.2 Highlighting Interventions .................................................................................................... 26 2.3.2.1 Selected Interventions ................................................................................................... 26 2.3.2.2 Intervention Timing ....................................................................................................... 27 2.3.3 User Characteristics Explored in the Study ....................................................................... 29 2.3.4 Study Procedure ....................................................................................................................... 30 2.4 Analysis of task performance .................................................................................................... 33 2.4.1 Results on Task Performance................................................................................................ 34 2.4.1.1 Main Effects ..................................................................................................................... 34 2.4.1.2 Interaction Effects .......................................................................................................... 
36 2.5 Analysis of Subjective Measures .............................................................................................. 40 2.6 Discussion and Conclusions ...................................................................................................... 42 Chapter 3: Analyzing Eye Tracking to Understand User Characteristics During Visualization Processing with Highlighting Interventions ................................................. 47 3.1 Introduction ................................................................................................................................... 47 3.2 Related Work ................................................................................................................................. 49 3.3 User Study ...................................................................................................................................... 51 3.4 Eye Tracking Pre-processing & Analysis ............................................................................... 53 3.4.1 Generate Low-Level Eye Tracking Features ..................................................................... 54 3.4.2 Generate Components using Dimensional Reduction .................................................... 55 3.4.3 Mixed Model Analysis ............................................................................................................ 58 xi  3.5 Results ............................................................................................................................................. 58 3.5.1 Impact of User Characteristics on Gaze Patterns ............................................................. 59 3.5.2 Impact of Interventions on Visualization Processing...................................................... 62 3.5.3 Impact of TaskType on AOI Processing ............................................................................. 63 3.6 Conclusions and Future Work .................................................................................................. 64 Chapter 4: Leveraging Pupil Measures for Understanding Users’ Cognitive Load During Visualization Processing with Highlighting Interventions ................................... 67 4.1 Introduction ................................................................................................................................... 68 4.2 Related Work ................................................................................................................................. 69 4.3 User Study ...................................................................................................................................... 70 4.4 Processing Pupil Data ................................................................................................................. 72 4.5 Results ............................................................................................................................................. 74 4.5.1 Effects of Intervention-Type ................................................................................................. 74 4.6 Conclusions & Future Work ...................................................................................................... 76 Chapter 5: Predicting Skill Acquisition from Eye Tracking Data During Visualization Processing ......................................................................................................................................... 
79 5.1 Introduction ................................................................................................................................... 79 5.2 Related Work ................................................................................................................................. 83 5.3 Dataset, Features, & Labels ........................................................................................................ 86 5.3.1 Eye Tracking Feature Sets ..................................................................................................... 87 5.3.2 Labeling Skill Acquisition ...................................................................................................... 90 5.4 Machine Learning Setup ............................................................................................................. 91 5.4.1 Model Baseline ......................................................................................................................... 93 xii  5.5 Results ............................................................................................................................................. 93 5.5.1 Predicting Skill Acquisition ................................................................................................... 94 5.5.2 Most Predictive Features ........................................................................................................ 98 5.6 Discussion and Conclusions .................................................................................................... 101 Chapter 6: Impact of User Characteristics on Performance and Gaze with Magazine Style Narrative Visualizations ................................................................................................... 105 6.1 Introduction ................................................................................................................................. 106 6.2 Related Work ............................................................................................................................... 110 6.2.1 Relevant Findings from Psychology .................................................................................. 110 6.2.2 User Characteristics in Visualization Research .............................................................. 112 6.2.3 User Adaptation ..................................................................................................................... 113 6.2.4 Eye Tracking in User Modeling for Information Visualizations ................................ 115 6.3 MSNV User Study ...................................................................................................................... 117 6.3.1 Study Procedure ..................................................................................................................... 117 6.3.2 MSNVs Used in the Study .................................................................................................... 118 6.3.3 Dependent Measures ............................................................................................................ 121 6.3.4 User Characteristics .............................................................................................................. 123 6.4 Effects of User Characteristics on MSNV User Experience .............................................. 128 6.4.1 Analysis & Results ................................................................................................................. 
128 6.5 Eye Tracking Analysis and Results ........................................................................................ 131 6.5.1 Generating Gaze Metrics ..................................................................................................... 131 6.5.2 Identifying Gaze Metrics Relevant to MSNV Performance ......................................... 134 6.5.2.1 Gaze Metrics Relevant to Time on Task ................................................................. 135 xiii  6.5.2.2 Gaze Metrics Relevant to Comprehension Accuracy ........................................... 136 6.5.3 Impact of User Characteristics on Gaze Metrics Relevant to Time on Task ........... 138 6.5.4 Specifying Finer-Grained AOIs on the Visualization .................................................... 141 6.5.5 Finer-Grained AOI Gaze Metrics Relevant to Time on Task....................................... 142 6.5.6 Effects of Reading Proficiency on Finer-Grained AOI Gaze Metrics .......................... 144 6.6 Conclusion & Future Work ...................................................................................................... 147 Chapter 7: Impact of English Reading Comprehension Abilities on MSNV Processing with Users from a Non-English Speaking Country .............................................................. 152 7.1 Introduction ................................................................................................................................. 153 7.2 Related Work ............................................................................................................................... 157 7.3 User Study .................................................................................................................................... 159 7.3.1 Study Procedure ..................................................................................................................... 160 7.3.2 Materials .................................................................................................................................. 162 7.3.2.1 Study Tasks .................................................................................................................... 162 7.3.2.2 Dependent Measures ................................................................................................... 162 7.3.2.3 Measures to Assess Reading Ability ........................................................................ 164 7.4 Impact of Reading Ability on Task Performance ................................................................ 166 7.4.1 Combining Users’ Reading Ability Scores ....................................................................... 167 7.4.2 Analysis of CERP on MSNV Performance ........................................................................ 169 7.5 Gaze Analysis of MSNV Processing ....................................................................................... 170 7.5.1 Computing Gaze Metrics ..................................................................................................... 171 7.5.2 Identifying Relevant Gaze Metrics .................................................................................... 173 7.5.3 Impact of CERP on Relevant Gaze Metrics ...................................................................... 175 xiv  7.6 Discussion and Conclusion ...................................................................................................... 
177 Chapter 8: Conclusion ................................................................................................................. 180 8.1 What to Adapt To? .................................................................................................................... 180 8.2 How to Adapt? ............................................................................................................................ 181 8.3 When to Adapt?.......................................................................................................................... 182 8.4 Limitations ................................................................................................................................... 183 8.5 Current & Future Work ............................................................................................................ 184 Bibliography ................................................................................................................................... 186 Appendix A ..................................................................................................................................... 206 Appendix B ..................................................................................................................................... 209  xv  List of Tables Table 1.1: Mapping of which publication each chapter is based on, as well as which study dataset was used, type of data is analyzed, analysis method and tool, along with which key design questions apply. Cells in gray indicate relevant publications that are not chapters in this thesis. ................................................................................................................................................................. 18 Table 2.1: Descriptive statistics of user characteristics. ........................................................................ 30 Table 2.2: Significant main effects on task performance (time on task). ........................................... 35 Table 2.3: Significant interaction effects for task performance (time on task). ............................... 37 Table 2.4: Significant effects for intervention usefulness ratings. (5-point Likert). ........................ 41 Table 3.1: PCA results for high-level family. ........................................................................................... 56 Table 3.2: PCA results for AOI-proportionate family. ........................................................................... 57 Table 3.3: PCA resutls for AOI transitions family. ................................................................................. 57 Table 5.1: Set of features generated using Tobii T-120 eye tracking setup and EMDAT processing. ....................................................................................................................................................... 88 Table 5.2: Effect of feature set combination on overall model performance averaged across all time-slices. Rows are arranged in descending order of classifier accuracy. Dashed lines separate models that are not statistically different from one another. .............................................................. 96 Table 5.3: Top 10 most predictive features across all time slices, along with directionality of the feature. *Negative values indicate the feature is lower during skill acquisition. ............................ 
99 Table 6.1: Summary statistics illustrating the variety of document characteristics across the 15 MSNVs administered in the user study. .................................................................................................. 120 xvi  Table 6.2: Summary statistics of the four measures of MSNV user experience obtained from the study. ............................................................................................................................................................... 123 Table 6.3: The set of nine user characteristics measured in the study............................................. 124 Table 6.4: Summary statistics showing the range of scores obtained for the user characteristics we measured in the study........................................................................................................................... 125 Table 6.5: Kendall’s Tau Correlation scores between the all of the user characteristics. ............ 127 Table 6.6: Results indicating which user characteristics have a significant effect on measures of MSNV user experience. The normalized model coefficient b indicates the size and directionality of the relationship. ....................................................................................................................................... 129 Table 6.7: Set of 17 Gaze metrics generated for each of the 4 AOIs. These metrics are generated by EMDAT for each user and each task. ................................................................................................. 133 Table 6.8: For each AOI, we report gaze metrics that were found to be significant with time on task. The normalized model coefficient b indicates the size and directionality of the relationship. Cells shaded in gray indicate metrics excluded from the analysis due to either lack of normality: i.e., time_to_first_fix for Text AOI; or high correlation, i.e., transitions to self (the same AOI). ...................................................................................................................................................... 136 Table 6.9: No gaze metrics had a significant relationship with comprehension accuracy. Three metrics yielded p-values < .05, however none remained significant after correcting for multiple comparisons. Gray cells indicate metrics excluded from the analysis due to lack of normality or high correlation. ........................................................................................................................................... 137 Table 6.10: Results showing in which AOIs a significant effect of user characteristics were found on the corresponding gaze metric. The normalized model coefficient b indicates the size xvii  and directionality of the relationship. Grey cells indicate gaze metrics non-relevant to time on task, and were thus not evaluated. ........................................................................................................... 139 Table 6.11: Results indicating which gaze metrics using finer-grained AOIs were found significant with time on task. Cells shaded in gray indicate metrics that were excluded from the analysis due to high correlation: i.e., transitions to self (the same AOI). The normalized model coefficient b indicates the size and directionality of the relationship.............................................. 144 Table 6.12: Results showing significant effects of ReadingP found on finer-grained AOI gaze metrics. 
The normalized model coefficient b indicates the size and directionality of the relationship. Grey cells indicate gaze metrics that are non-relevant to time on task, and were not evaluated. ................................................................................................................................................ 145 Table 7.1: Set of 17 gaze metrics generated for each of the 4 AOIs shown in Figure 7.5. ........... 172 Table 7.2: Gaze metrics in each AOI found to be significant with time on task. The coefficient b indicates the directionality of the relationship (+/−). Grey cells indicate metrics excluded due to high correlations (i.e., transitions within the same AOI are highly correlated to sum_fix_durations). ...................................................................................................................................... 174 Table 7.3: Results showing in which AOIs a significant effect of CERP was found on the corresponding gaze metric. Metrics in grey cells were not relevant to time on task. ................. 176 Table A.1: There were eight significant main effects (p < .05) reported in IUI’18 [156] of user characteristics on performance/subjective measures from MSNV Study 1. Only three main effects remained significant after the analysis was corrected. .......................................................... 207 Table B.1: Pearson correlations between all of the user characteristics investigated in Chapter 2. All correlations are not significant except for a strong positive correlation (r = 0.47, p < .01) xviii  between Expertise–Simple and Expertise–Complex; and a weak negative correlation (r = -0.27, p < .01) between Perceptual Speed and Locus of Control. ......................................................................... 209  xix  List of Figures Figure 1.1: Example bar visualization and task from the Intervention Study. ................................... 5 Figure 1.2: Example Magazine Style Narrative Visualization administered in both of the MSNV Studies. ................................................................................................................................................................ 5 Figure 2.1: Example bar graph visualization as used in the experimental tasks. ............................ 24 Figure 2.2: Interventions used in the study. ............................................................................................. 27 Figure 2.3: Performance score (time on task) for each intervention. All bar graphs are shown with 95% confidence intervals. .................................................................................................................... 35 Figure 2.4: Interaction between Interventions and Task Type on time on task. ............................. 38 Figure 2.5: Interaction between: Perceptual Speed and Task Type on time on task (left); VerbalWM and Task Type on time on task (right). ............................................................................... 38 Figure 2.6: Interaction between Delivery Time and Interventions on task performance (time on task). .................................................................................................................................................................. 39 Figure 2.7: Reported usefulness ratings (5-point Likert scale) for each Intervention. ................... 41 Figure 2.8: Reported usefulness ratings (5-point Likert scale) by VisualWM levels. ...................... 
41 Figure 3.1: Sample bar graph visualization and task administered in the study. ............................ 51 Figure 3.2: The four highlighting interventions evaluated in the study. .......................................... 52 Figure 3.3: Interaction effect between Perceptual Speed (High/Low) and TaskType (Retrieve Value/Compute Derived Value) on the prop_Labels component. ......................................................... 59 Figure 3.4: Interaction between Visual Working Memory (High/Low) and TaskType (Retrieve Value/Compute Derived Value) for two 'Input' AOI related components. ....................................... 60 xx  Figure 3.5: Main effect of intervention type on three different gaze components. ......................... 62 Figure 3.6: Main effect of TaskType (Retrieve Value/Compute Derived Value) on four of the five AOI-proportionate family components............................................................................................. 63 Figure 4.1: Sample bar graph visualization and task administered in the user study. ................... 70 Figure 4.2: The four different highlighting interventions evaluated in the user study. ................ 72 Figure 4.3: Main effect of Intervention-Type on users’ mean pupil size. .......................................... 75 Figure 4.4: Main effect of Intervention-Type on standard deviation of users’ pupil size.............. 76 Figure 5.1 Sample bar graph visualization and task administered to users during the study. ..... 87 Figure 5.2: Areas of Interest (AOI) defined over the interface. ............................................................ 89 Figure 5.3: Improvement in average trial completion time across the 80 tasks in the dataset (randomly administered for each user). The blue line separates trials into two general stages of skill acquisition:   during - skill with the visualization is in the state of being acquired since performance is still improving; and after - skill with the visualization has been acquired since performance change has stabilized. ........................................................................................................... 91 Figure 5.4: Predictive accuracy across time slices based on feature set combination. GAZE is shown using a dotted blue line and corresponds to the best model previously published in [162]. .................................................................................................................................................................. 94 Figure 6.1: An example of two references in a MSNV document, each consisting of a sentence in the body of narrative text and corresponding data points within the visualization. Source: The Economist - Dec 22, 2012 ...................................................................................................................... 107 Figure 6.2: One of 15 MSNVs administered in the user study. *Note: Red highlighting is shown to illustrate the concept of a reference. Highlighting was not provided to users in the study.. 120 xxi  Figure 6.3: Subjective and comprehension questions presented to users after reading each MSNV document. Note: users were not allowed to proceed without answering all of the questions. ........................................................................................................................................................ 122 Figure 6.4: The four AOIs we defined to capture MSNV processing, shown here for one of the documents administered in the user study. 
........................................... 133 Figure 6.5: A visualization in our MSNV dataset illustrating the four finer-grained AOIs we defined: Legend AOI (purple box), Labels AOI (orange region), R-Bars AOI (green boxes), and NR-Bars AOI (blue boxes). .......................... 142 Figure 6.6: Differences in transition behaviors within the visualization between users with low vs high ReadingP (median split) that result in slower performance. Bars are shown with 95% confidence intervals. .......................... 147 Figure 7.1: Example MSNV with two references to the embedded visualization (one underlined in green and one in red, for illustration purposes). .......................... 154 Figure 7.2: Eye tracker lab setup used for conducting the study in the non-English speaking country. .......................... 160 Figure 7.3: Questions presented to users after reading each MSNV document. .......................... 164 Figure 7.4: Distribution of users from the English Speaking Country and Non-English Speaking Country (NESC) populations according to their Combined English Reading Proficiency (CERP) scores. .......................... 169 Figure 7.5: The four AOIs we defined to capture MSNV processing, shown here for one of the MSNV documents administered in the user studies. ..........................

Acknowledgements

Thank you to all the special people in my life who made this work possible:
• The health and wellness professionals: my stylist Willy, my counsellors Olivia and Maija, my chiropractor Nora, and my family doctor Dr. Behra, for truly supporting me and taking the time to listen and ask how I am doing, and never making me feel rushed or dismissed.
• Kafka's Coffee, for your delicious pour-overs.
• Avocados.
• My ex-wife, for swapping kid weeks so that I could attend conferences to present my publications.
• That bartender guy at R&B Ale & Pizza House, for noticing that both sides of my sheet of paper had filled up with notes, and offering me two fresh sheets of paper, which became part of the results section in Chapter 6.
• My committee members Giuseppe, Tamara, and Jim.
• My supervisor Cristina, for challenging and guiding me through all of this, and for investing in a good chair and sit-stand desk when I was having RSI problems with my hands.
• My family members, for the years of love and support: Dad, for all those last-minute e-Transfers that kept me afloat when I was broke. Mom, for always believing in me, and being patient and always trying to help despite living in another province, and despite how busy I was. My sister Tristan, for changing my life for the better through Squamish Valley Music Festival, Astral Harvest, and Ponderosa (teepee!). I love you and thank you for what you've taught me about life. And to my Nana, you're the best grandparent I could have ever asked for growing up.
• My two kids, Leica and Jane, for teaching me the value of our time together on this earth.
• And lastly, to my partner Linnea, for your continuous support, especially during all the stretches when I was working long and late at the lab. Most importantly though, you have shown me what it feels like to truly love someone and be loved.

Dedication

For my Tampa (grand-père). You demonstrated that despite the risks, anything is achievable with vision and perseverance. In your lifetime, you worked astoundingly hard clearing land and farming to make way for a better life. You taught me that any endeavor is possible when you trust, love, and believe in yourself. In a sense, I have been able to do the same. The land I have cleared was not earth and stone, but academic research and education.

Chapter 1: Introduction

1.1 Motivation

With the ever-growing expansion of the social web, the Internet of Things, and the continued miniaturization of sensors, data is currently being collected at a nearly unimpeded rate in almost all aspects of human life. This data includes information from smartphones, autonomous cars, online games, shopping, web browsing, streaming, personal health, banking, meteorology, and education, just to name a few. In this world of ever-increasing digital information, information visualizations have become a fundamental tool to support tasks for exploring, presenting, discovering, and fostering understanding of the many underlying trends that may exist amidst all of this data. Information visualizations are "visual representations of datasets designed to help people carry out tasks more effectively" [128]. These tasks can include identifying actionable insights from the visualized data as well as communicating such insights more effectively using visualizations. There are many different types of information visualizations to choose from (e.g., bar graph, table, heatmap, line chart), with varying degrees of complexity and interactivity. Ongoing efforts to innovate and improve the effectiveness of visualizations, however, have typically been limited to their design and evaluation following a one-size-fits-all model, meaning that they do not take into account the individual differences of their users. There is mounting evidence, though, that user differences such as cognitive abilities, personality traits, learning abilities, and preferences can significantly influence user performance and satisfaction during information visualization tasks, e.g., [42,71,157,177]. Therefore, there is a great opportunity for personalization. In this thesis, our primary goal is to inform the design of user-adaptive visualizations, namely, visualizations that aim to recognize and adapt to each user's specific needs.

1.2 User-Adaptive Interaction

The primary objective of user-adaptive interaction is to support and improve the user's experience with a system by adapting the system's behavior to information acquired about individual users. Some notable research areas that leverage user-adaptive interaction include Intelligent Tutoring Systems, Virtual Assistants, Recommender Systems, Adaptive Hypermedia, and AI in Games. A requisite of user-adaptive interaction is having a data structure that stores relevant characteristics¹ about individual users, known as a user model.
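To make this concrete, the following is a minimal sketch of such a user model as a plain data structure, written in Python. All names and values here (e.g., verbal_wm, evolving_skill, the 0-to-1 scales) are hypothetical and purely illustrative; this is not the representation used in the systems or studies discussed in this thesis.

```python
# Minimal sketch of a user model: a data structure holding user traits
# (long-term properties, e.g., cognitive abilities) and user states
# (short-term properties, e.g., evolving skill). Names/values are hypothetical.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UserModel:
    user_id: str
    traits: Dict[str, float] = field(default_factory=dict)  # e.g., "verbal_wm": 0.62
    states: Dict[str, float] = field(default_factory=dict)  # e.g., "evolving_skill": 0.30

    def set_trait(self, name: str, value: float) -> None:
        """Record a trait measured at the outset of the interaction (test/questionnaire)."""
        self.traits[name] = value

    def update_state(self, name: str, value: float) -> None:
        """Update a state inferred while the user interacts (e.g., from eye tracking)."""
        self.states[name] = value

# Example usage: one trait from a pre-test, one state estimated at run time.
model = UserModel(user_id="P01")
model.set_trait("verbal_wm", 0.62)         # normalized score on an assumed 0-1 scale
model.update_state("evolving_skill", 0.30) # e.g., predicted from gaze features
```

The trait/state split mirrors the distinction introduced above; how such fields get filled in is the subject of the data gathering techniques discussed next.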
Typical data gathering techniques used to capture characteristics in the user model include: i) asking the user at the outset of the interaction, e.g., with tests or questionnaires; ii) tracking and interpreting the user's behavior during system interaction, e.g., mouse click data or eye tracking data; iii) requesting explicit feedback from the user during system interaction; or iv) observing contextual information about the user, e.g., device type, time of day, or geolocation. Once the user model has obtained information about a user's characteristics as a result of data gathering, it can then be evaluated to determine if and how to adapt the interaction accordingly.

¹ In this thesis, we refer to the information about the user contained in the user model as user characteristics. We define user characteristics (sometimes referred to more broadly as individual differences) as individual traits or states of the user. User traits refer to more stable or long-term properties of the user (e.g., cognitive abilities, personality traits, expertise, preferences), whereas user states are more short-term or circumstantial properties of the user (e.g., emotions, skill level, task goals).

Adaptations are ordinarily defined using one of two approaches. The first is rule-based filtering, consisting of if-then case statements. For example: if a user is lost, then provide an intervention to guide their attention; if a user's skill level increases, then enable select advanced system features. The other approach customarily used is collaborative filtering, where the known characteristics of a given user are compared to those of other users in order to find users that are similar. The assumption is that the matched users will also be similar in other characteristics presently unknown about the given user, and adaptations are provided according to this association. For example, a simple movie recommender system will match the interests of a given user to those of other users, and provide recommendations of new movies to watch by selecting highly rated ones from the matched profiles. For further in-depth examples of successful user-adaptive systems that leverage many of the techniques discussed above, see [87].
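The two ways of deciding on adaptations described above, rule-based filtering and collaborative filtering, can be illustrated with the short Python sketch below. The rules, thresholds, and toy data are hypothetical and chosen only to mirror the examples in the text; they are not the adaptation mechanisms studied in this thesis.

```python
# Illustrative sketch (hypothetical rules, thresholds, and data) of the two
# approaches for deciding on adaptations: rule-based and collaborative filtering.
from math import sqrt
from typing import Dict, List

def rule_based_adaptations(states: Dict[str, float]) -> List[str]:
    """Rule-based filtering: if-then rules over user-model states."""
    actions = []
    if states.get("lost", 0.0) > 0.5:        # if the user appears lost...
        actions.append("guide_attention")    # ...intervene to guide their attention
    if states.get("skill", 0.0) > 0.8:       # if the user's skill level has increased...
        actions.append("enable_advanced_features")
    return actions

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse user profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def collaborative_recommendation(target: Dict[str, float],
                                 profiles: Dict[str, Dict[str, float]],
                                 liked: Dict[str, List[str]]) -> List[str]:
    """Collaborative filtering: recommend items liked by the most similar known user."""
    most_similar = max(profiles, key=lambda uid: cosine(target, profiles[uid]))
    return liked[most_similar]

# Toy usage: a user who seems lost gets an attention-guiding intervention, and a
# movie recommendation is borrowed from the most similar existing profile.
print(rule_based_adaptations({"lost": 0.7, "skill": 0.4}))   # ['guide_attention']
profiles = {"u1": {"comedy": 1.0, "drama": 0.2}, "u2": {"horror": 0.9}}
print(collaborative_recommendation({"comedy": 0.8}, profiles,
                                   {"u1": ["Movie A"], "u2": ["Movie B"]}))
```

In practice the two approaches are often combined: rules encode domain knowledge about when an adaptation is appropriate, while similarity to other users fills in characteristics that have not yet been observed for the current user.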
In contrast to system-driven adaptation, where the system is responsible for making changes to the interface, personalization can also include adaptable or customizable interfaces, where the user is responsible for making changes to the system to better fit their needs [74]. Some examples include a user specifying preferences on their mobile device for filtering unwanted notifications and emphasizing important ones [88], or a user expressing their identity in massively multiplayer online games by customizing the appearance of their player avatar [17]. The main benefit of adaptable interfaces is that users are in full control; however, not all users are experienced enough or motivated to invest the effort needed to make customizations [115]. With regard to adaptive interfaces (the kind we design for in this thesis), even though users may have less control, a major benefit is that no extra effort is required from the user in order for the system to deliver adaptations [79]. In light of both approaches, researchers have also investigated mixed-initiative approaches, where elements of both adaptable and adaptive design are integrated. For example, a personalized toolbar in MSWord where the customization mechanism also contains system recommendations: users can select which features to include in the toolbar to suit their needs (adaptable), and when doing so, the system also provides recommended features to choose from (adaptive) [29]. Another more recent example is a visualization-based urban planning decision tool that displays an optimal visualization (either a deviation chart or a map) according to a user's relevant characteristics (adaptive) and, in addition, includes a customization mechanism that lets the user hide or display either or both of the visualizations (adaptable), allowing them to specify the type and amount of information they prefer to be displayed [107]. Although the work presented in this thesis is focused on designing visualizations that are adaptive, we envision the possibility of including a mixed-initiative approach in future work, e.g., by allowing the user to select the type of adaptive highlighting they prefer (see Chapter 2 for details on adaptive highlighting).

1.3 Thesis Structure Overview

This thesis is presented as a 'manuscript' thesis, meaning that the upcoming chapters (Chapters 2 through 7) are based on published peer-reviewed articles (see Table 1.1 for a mapping of included/excluded publications to thesis chapters). The thesis chapters are organized chronologically according to when the research was conducted. Our work produced a collection of datasets resulting from three different user studies we conducted. Each study used a repeated-measures design, where each user performed multiple tasks with the respective visualization interface.

Figure 1.1: Example bar visualization and task from the Intervention Study.

Intervention Study: An eye tracking user study we conducted at UBC to test the effectiveness of several forms of highlighting interventions on grouped bar chart visualizations, and the possible role that several user characteristics might have on this effectiveness (Figure 1.1). The work presented in Chapters 2 to 5 utilizes data generated from the Intervention Study.

Figure 1.2: Example Magazine Style Narrative Visualization administered in both of the MSNV Studies.

MSNV Study 1: An eye tracking study we conducted at UBC to examine how users process Magazine Style Narrative Visualizations (MSNVs), and the role that several user characteristics, including ones related to reading ability, might have on this processing (Figure 1.2). Chapter 6 presents and analyzes the data generated from MSNV Study 1.

MSNV Study 2: A direct replication of MSNV Study 1, conducted in a non-English speaking country (the Slovak Republic), aimed at identifying the extent to which the results from MSNV Study 1 would apply to a rather different pool of users. This study is noteworthy as one of the very few replications in the visualization literature. It addresses the call to increase the number of replication attempts in light of the replication crisis playing out in many fields, where repeating an experiment may not yield the results claimed in the initial study [102]. Chapter 7 presents and analyzes the data from MSNV Study 2, compares the results to those of MSNV Study 1, and then analyzes the combined datasets of the two studies.
Note that two of the user characteristics relating to reading ability discussed in Chapters 6 and 7 are referred to differently in the two chapters, even though they are the same. Specifically, Reading Proficiency in Chapter 6 is referred to in Chapter 7 by the test used to measure it, X_LEX; similarly, Verbal IQ in Chapter 6 is referred to in Chapter 7 by the test used to measure it, NAART. Last of all, we conclude in Chapter 8 with a summary of our contributions, followed by a discussion of the limitations of our research and a description of current and future work.

1.4 Thesis Contributions

Three key questions need to be addressed when designing for user-adaptive interaction:
1. What to adapt to? What user characteristics should be considered for inclusion in the user model to drive adaptation? In this thesis, we address this question primarily by analyzing measures of task performance from our user studies.
2. How to adapt? How can the system adequately adapt to these user characteristics? In this thesis, we address this question primarily by analyzing eye tracking data collected during our user studies.
3. When to adapt? When should adaptations be delivered in order to maximize effectiveness and reduce intrusiveness? In this thesis, we address this question by evaluating machine learning models built from eye tracking data collected during our user studies.

In the remainder of this section, we organize our thesis contributions according to each of these three design questions. For each one, we also provide relevant prior research to situate our work.

1.4.1 What to Adapt To?

The decision on which user characteristics to include in the user model depends on the tasks the adaptive system aims to support. For some tasks, this choice can be an obvious or intuitive one: for instance, adapting to the user's preferences or interests in recommender systems [1]; adapting to the user's evolving knowledge and learning in intelligent tutoring systems (ITS) [140]; or adapting to the user's task goals for virtual personal assistants [90]. As discussed in the previous subsection, some existing work in user-adaptive visualization has evaluated the effectiveness of providing adaptations based on intuitive choices of user characteristics, which include preferences [68,125,129], evolving domain knowledge [68], and task goals [63]. Alternatively, determining which user characteristics to adapt to can be achieved by relying on existing theories from psychology and other relevant disciplines. For example, there has been work on adapting to player-centric traits (e.g., conqueror, socializer) in video games [19] based on findings from neurobiology and personality theory that show the importance of these traits for player motivation. There has also been extensive research on adapting to user emotions (e.g., boredom, confusion) in ITS [116,139] based on findings in educational psychology that show the importance of affective states for learning [47]. With respect to the objective of this thesis, i.e., developing user-adaptive visualizations, the field of perceptual psychology provides important knowledge on cognitive abilities (e.g., perceptual and working memory abilities) that can play a role in the low-level perceptual tasks important for visualization processing (e.g., processing basic shapes, colors, or words) [53,165,168].
However, this knowledge does not enable a precise quantification of how much these cognitive abilities influence higher-level visualization processing, nor of whether this influence is significant enough to warrant personalization. For instance, although we know from perceptual psychology that Visual Working Memory influences users' ability to correctly retain color information in working memory [168], the theory explaining this effect does not tell us whether different levels of this cognitive ability actually make a substantial difference in how users perform with visualizations that rely on color encodings. Thus, researchers in user-adaptive visualization began to investigate which cognitive abilities warrant personalization with visualizations, and have provided initial evidence for several that have a substantial impact on task performance when working with specific visualizations. For instance, low Perceptual Speed (the ability to scan and compare figures or symbols, or carry out other very simple tasks involving visual perception) increases the time required to complete tasks involving 2D visualizations [167]. A user's level of Perceptual Speed can also predict which among alternative visualizations will work better, both in terms of task accuracy [42] and information recall [4]. Both high Spatial Memory (the ability to remember the configuration, location, and orientation of figural material) and high Disembedding (the ability to hold a given visual percept or configuration in mind so as to disembed it from other well-defined perceptual material) positively correlate with task accuracy [167], and high Associative Memory (the ability to recall one part of a previously learned but otherwise unrelated pair of items when the other part of the pair is presented) increases the amount of relevant information identified during visualization-based search tasks [36]. My master's thesis contributed to this line of work by identifying two other cognitive abilities, Verbal Working Memory (the quantity of verbal information, e.g., words, that can be temporarily maintained and manipulated in working memory) and Visual Working Memory (the quantity of visual information, e.g., shapes and colors, that can be temporarily maintained or manipulated in working memory), that also warrant personalization because of their impact on either preference or performance for three different types of visualizations [38,157,159].

Contributions: In this thesis, we contribute to the understanding of which user characteristics warrant personalization in user-adaptive visualization as follows:

• Chapter 2 provides initial results showing that the impact of user characteristics on visualization processing can depend on task complexity. Specifically, we show that Perceptual Speed, Verbal Working Memory, and Visual Working Memory each impact the processing of bar chart visualizations (i.e., users low in any of these characteristics are slower on task) only for more complex tasks. Therefore, user-adaptive visualizations should track task complexity, given that certain user characteristics may only warrant adaptation as task complexity increases.
Although adapting to the user’s level of skill with a visualization may seem obvious, our findings show the need to support unskilled users even with basic bar charts, i.e., relatively simple visualizations that are widespread and commonly used.  In Chapter 6, we broaden the investigation of which user characteristics to adapt to from stand-alone visualizations to visualizations embedded in narrative text as they are commonly found in magazines, blogs, text-books, technical reports, etc. These types of documents are commonly referred to as Magazine Style Narrative Visualization, or MSNV for short [146]. Our results support the need for adaptation in MSNVs by showing that Verbal Working Memory and English Reading Ability each impact users’ ability to effectively process MSNVs. 1.4.2 How to Adapt? The decision on how to provide adaptive support is largely guided by the domain, task, and outcome the system aims to support. For example, an ITS might adapt the difficulty of practice problems or provide hints in order to increase student learning [140]; in action role-playing video games, AI has been used in storytelling by adapting the types of encounters players are faced with to increase game enjoyment [153]. In user-adaptive visualization, the forms of adaptation investigated thus far have been limited to recommending optimal visualization(s) among a set of alternatives [63,68,125,129]. To the best of our knowledge, no work has looked at providing adaptive support for processing a specific visualization. In this thesis, we have studied adaptations that deliver this latter type of support. 11  Contributions:  Chapter 2 presents initial evidence that highlighting relevant information in real time, is beneficial to bar chart processing. We show significant improvements in task performance (i.e., time on task) for highlighting interventions that use bolding, arrows, or de-emphasis to dynamically highlight relevant bars in the visualization.   Chapter 4 presents evidence that monitoring pupil size as an estimate of cognitive load may be beneficial towards designing, testing, or validating such highlighting interventions, since monitoring indications of higher cognitive load could be used to filter out unsuitable interventions (as opposed to relying on task performance). Additionally, the aim of our research is to facilitate adaptations that help users according to their relevant user characteristics. Specifically, our goal is to design adaptations that address suboptimal visualization processing behaviors caused by limitations in some relevant cognitive abilities. Unfortunately, there is little work on exactly how such adaptations could be devised. There has been some research on using eye tracking to show that gaze patterns during visualization processing change depending on certain user characteristics, for instance depending on user expertise in the domain of the data being visualized [105,151,152]. However, that work only makes informal connections or none at all between these differences in gaze behaviors to objective or subjective measures of task performance. Building this connection is important to identify which gaze behaviors are conducive to suboptimal processing, and to devise meaningful adaptations that can correct them. My master’s thesis was the first to address this gap by devising a methodology to triangulate eye tracking and task performance data to identify where users were struggling during visualization processing based on their relevant user characteristics [158,159]. 
Our results provided insights on how to devise meaningful adaptations to support low Perceptual Speed users working with bar and radar 12  charts. Namely, we showed that users with low Perceptual Speed were slower on task because they devoted more effort processing the visualizations’ legend and labels, thus indicating a need for interventions designed to facilitate the access and processing of these two specific visualization components. Contributions: In this thesis, we leverage the methodology devised in [158,159] and generate findings that inform how to design adaptations for two other user characteristics:  Chapter 3 provides results showing that users with low Verbal Working Memory might also benefit from interventions that facilitate processing of the legend in bar charts.  In Chapters 6 and 7, we provide results showing that adaptive support (e.g., interventions to guide attention to the graph) may benefit users with low English Reading Ability while they are processing MSNVs (e.g., to make them faster) by helping them locate relevant information in the visualization. For example, compared to users with high English Reading Ability, we found that low English Reading Ability users needed more time to locate bars in the visualization mentioned by the MSNV’s narrative text, and they exhibited more back and forth transitions between these bars and extraneous bars not mentioned by the MSNV’s narrative text. 1.4.3 When to Adapt? There are three basic possibilities of when to adapt: at the outset of a new task, during the task, or at the end of the task. To illustrate with an example, in ITS [140] the initial layout or content of the learning interface can be tailored to the user at the outset of a task. Adaptations provided during the task can include providing hints or prompts in response to the user’s specific behaviors. Adaptations suitable at the end of a task can include a personalized summary of the 13  user’s performance as well as suggestions on how to improve for future tasks. In another example with recommender systems in e-commerce [1], an initial recommendation can be made at the outset of a task, when possibly less is known about the user’s interests. During the task, as the system collects more information about the user’s interests, more fine-grained recommendations can be provided. Finally, at the end of a task, when the user has potentially made a choice to purchase an item, a different set of recommendations for other types of products might be offered to the user. In user-adaptive visualization, adaptations have been mostly delivered at outset each new task with the visualization [68,125,129]. An exception is [63], where adaptations were provided during the task, as soon as the user’s task goal could be predicted by the adaptive system. Similar to [63], the aim of our research is to enable adaptations during task, as soon as user characteristics relevant for adaptation can be predicted by the user model. To this end, we investigate eye-tracking as a non-invasive data source for predicting user characteristics in real-time. Previous research outside of user-adaptive visualization has shown that eye tracking can be used to predict various user characteristics during task such as Mind Wandering [22], Student Learning [24,98], Reading Comprehension [43], and emotions including Boredom and Curiosity [89]. 
In the context of user-adaptive visualization, eye tracking data has been used to predict Domain Expertise [72], negative emotions such as Confusion [108] and Frustration [136], and several cognitive abilities including Perceptual Speed, Verbal Working Memory, and Visual Working Memory [149].

Contributions:

• Chapter 5 provides evidence that eye tracking data (gaze patterns, pupil size, head distance to the screen) can be used to predict a user's level of Evolving Skill (skilled vs. unskilled) with a visualization during tasks with bar charts.

1.5 Analysis Methodology

In this section, we provide a high-level overview of the chronology and evolution of our analysis methodology. The methods we employ are summarized at the end of this section in Table 1.1, which maps each relevant publication to the analysis methods used, as well as to its thesis chapter, user study, type of data analyzed, analysis tool, and the key design questions it addresses. For completeness, Table 1.1 also includes four publications that are not thesis chapters (marked 'MSc' or '–' in the Chapter column), consisting of: two pre-thesis publications that were part of my Master's work; and two publications during my PhD that were each superseded by a more recent publication updating the work (this will be explained in more detail below). One of the main distinguishing elements that guided the selection of analysis method was whether we were evaluating study performance data (e.g., time on task), eye tracking data (e.g., fixation rate, number of fixations on various salient elements of the visualizations, etc.), or in some cases both.

During my Master's work, I evaluated data collected from a user study with simple bar and radar charts, examining the potential role several user characteristics might play when users perform basic visualization tasks with them (referred to as the Bar/Radar study). We first analyzed performance data (time on task) to compare potential performance differences between the two visualizations, and to identify whether performance differences were linked to any of the user characteristics (i.e., addressing what to adapt to). As our analysis method, we employed a repeated-measures General Linear Model (GLM) in SPSS, which was an appropriate model and suited our needs. This work was published at UMAP'12 [157] (first row in Table 1.1). Next, we wanted to analyze eye tracking data collected during the study to see if users' gaze behaviors differed depending on their user characteristics (i.e., addressing how to adapt). However, we encountered a problem using GLM as our analysis method due to occasional missing gaze information caused by temporary loss of calibration (i.e., the user looked down or away from the screen during one of their tasks). In particular, GLM is not resilient to missing datapoints because the model requires data to be in wide format (i.e., all trials are listed in one data entry row per participant) and the entire row must be complete. Thus, when there is even just one invalid trial, the model is forced to discard all data for that participant. This can be costly in an experiment with some invalid trials, as is often the case when using unobtrusive eye trackers that do not constrain subjects' movements. Our solution was to employ Linear Mixed Effects Models, because they are robust to missing datapoints.
In particular, Mixed Models list data in long format, where each trial for each user is a different data entry row, and thus discarding invalid trials does not interfere with the model's ability to use the valid ones, allowing us to leverage the most from our eye tracking datasets [56]. Results from applying Mixed Models in SPSS to eye tracking data from the Bar/Radar study yielded the second publication resulting from my Master's work: CHI'13 [158] (second row in Table 1.1).

During the first part of my PhD, we conducted two similar analyses using the same methodology described above, but on the Intervention Study dataset (third and fourth rows in Table 1.1). First, we used GLM in SPSS to conduct an analysis of performance data, published at CHI'14 [32], on which Chapter 2 of this thesis is based, followed by an analysis of eye tracking data using Mixed Models in SPSS, published at UMAP'14 [154], on which Chapter 3 is based. We also used Mixed Models in SPSS to conduct a third analysis of the Intervention Study on a specific subset of eye tracking data we had yet to examine (i.e., pupil dilation measures), published at UMAP'17 [155], which constitutes Chapter 4 of this thesis (fifth row in Table 1.1).

Chapter 5 is the only chapter that employs a different methodology than the other chapters (i.e., it addresses when to adapt). Here, we used R to construct machine learning models from eye tracking data collected from the Intervention Study in order to predict relevant user characteristics, in this case a user's evolving Skill Level with a visualization. In our first attempt, we used gaze-based eye tracking measures and evaluated classification performance against a simple naïve baseline; this work was published at IUI'14 [162]. However, we revisited this work by including a broader set of eye tracking measures (gaze-based, with the addition of pupil dilation and head distance) and evaluated classification performance against a stricter baseline that we also devised. Chapter 5 is based on this latter, more recent work, which was published at IUI'17 [160]. The publication from IUI'14 is not included as a thesis chapter since it is essentially superseded by IUI'17 (sixth and seventh rows in Table 1.1).

In the latter part of my PhD, we conducted two studies with Magazine Style Narrative Visualizations (MSNVs): the first one, MSNV Study 1, at UBC; we then replicated this study in the Slovak Republic as MSNV Study 2. In terms of analysis methods, we initially attempted to leverage a different approach that could potentially unify our previous methodology, in which performance had been analyzed using GLM and, separately, eye tracking had been analyzed using Mixed Models. To this end, we explored the feasibility of using Structural Equation Models (SEM) in R, and selected this method to perform an initial analysis of performance and subjective data from MSNV Study 1, which was published as a short paper at IUI'18 [156]. However, we later discovered a crucial error in our model specification that compromised some of the reported results, and further learned that SEM would not be suitable for our purposes (a detailed account of the model specification error we encountered and how it was resolved is provided in Appendix A).
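To make the wide- versus long-format distinction described above concrete, the short R sketch below contrasts the two layouts for a toy eye tracking dataset (participant, trial, and measure names are invented for illustration; our own analyses at this stage were run in SPSS, not R):

```r
# Toy example: two participants, three trials each, one trial with lost calibration (NA).
library(dplyr)
library(tidyr)

long <- data.frame(
  user          = c("p1", "p1", "p1", "p2", "p2", "p2"),
  trial         = c(1, 2, 3, 1, 2, 3),
  fixation_rate = c(2.1, NA, 1.8, 2.4, 2.2, 2.0)  # p1 lost calibration on trial 2
)

# Wide format (one row per participant), as required by a repeated-measures GLM:
# the single missing trial makes p1's entire row incomplete, so p1 is dropped.
wide <- pivot_wider(long, names_from = trial, values_from = fixation_rate,
                    names_prefix = "trial_")
wide[complete.cases(wide), ]        # only p2 remains

# Long format (one row per trial), as used by a Mixed Model:
# only the single invalid row is discarded, and p1's valid trials are kept.
filter(long, !is.na(fixation_rate))
```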
With the assistance of several paid consultations offered by the UBC Department of Statistics consulting group (SCARL), we instead devised a combined methodology that allowed us to analyze both performance and gaze data (i.e., addressing both the questions of what and how to adapt) by using several Mixed Models in R. We then re-ran the analysis of performance data from MSNV Study 1 using Mixed Models in R, but also extended it to include eye tracking data, and the results of these analyses yielded the journal manuscript (UMUAI) on which Chapter 6 is based. Note that IUI'18 [156] is not included as a thesis chapter, since it was superseded by UMUAI (second- and third-to-last rows in Table 1.1). Lastly, we leveraged our combined Mixed Model methodology in R to analyze performance data collected from MSNV Study 2, and to analyze both performance and eye tracking data from the combined datasets of MSNV Study 1 and MSNV Study 2. This final work was published at UMAP'19 [161] and is the basis of Chapter 7.

Chapter | Publication          | Study           | Data analyzed                   | Method           | Tool | Design question(s)
MSc     | [157] UMAP'12        | Bar/Radar       | Task performance                | GLM              | SPSS | What
MSc     | [158] CHI'13         | Bar/Radar       | Eye tracking                    | Mixed Model      | SPSS | How
2       | [32] CHI'14          | Intervention    | Task performance                | GLM              | SPSS | What, How
3       | [154] UMAP'14        | Intervention    | Eye tracking                    | Mixed Model      | SPSS | How
4       | [155] UMAP'17        | Intervention    | Eye tracking                    | Mixed Model      | SPSS | How
–       | [162] IUI'14         | Intervention    | Task performance, eye tracking  | Machine Learning | R    | When
5       | [160] IUI'17         | Intervention    | Task performance, eye tracking  | Machine Learning | R    | What, When
–       | [156] IUI'18         | MSNV-1          | Task performance                | SEM              | R    | What
6       | under review (UMUAI) | MSNV-1          | Task performance, eye tracking  | Mixed Model      | R    | What, How
7       | [161] UMAP'19        | MSNV-1 & MSNV-2 | Task performance, eye tracking  | Mixed Model      | R    | What, How

Table 1.1: Mapping of the publication each chapter is based on, the study dataset used, the type of data analyzed, the analysis method and tool, and the key design questions addressed. Rows marked 'MSc' or '–' in the Chapter column (shown in grey in the original table) are relevant publications that are not chapters in this thesis.

Chapter 2: Impact of User Characteristics on Task Performance with Highlighting Interventions

Preface – In this chapter², we present a study with grouped bar charts to investigate the impact of user characteristics on visualization processing and to evaluate a variety of visual prompts, called 'interventions', that are applied to a visualization to help users process it. This chapter addresses the question of how to adapt by showing that some of the tested interventions perform better than a condition in which no intervention is provided, both in terms of task performance as well as subjective user ratings. We also address the question of what to adapt to by providing evidence that the impact of several user characteristics on visualization processing can depend on task complexity.

² The content of this chapter was published as [32]: Carenini, Conati, Hoque, Steichen, Toker, and Enns. (2014) Highlighting Interventions and User Differences: Informing Adaptive Information Visualization Support. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14).

2.1 Introduction

Recent advances in visualization research have shown that individual user needs, abilities and preferences can have a significant impact on user performance and satisfaction during visualization usage (e.g., [42,57,68,157]). It is therefore important to investigate the potential of
20  user-adaptive visualizations, i.e., visualization techniques and systems that support the provision of visual information personalized to each user’s needs and differences. The benefits of user-adaptive interaction have been shown in a variety of human-computer interaction tasks and applications such as operation of menu based interfaces, web search, desktop assistance, and human learning [87]. There are three key decisions that need to be made when designing a user-adaptive system: (1) what to adapt to, i.e., understanding which user features should be considered for adaptation, including stable, long-term user traits (e.g., cognitive abilities, personality, etc.), as well as transitory, short-term states (e.g., current task, cognitive load, attention); (2) when to adapt, i.e., understanding when it is appropriate and/or necessary to provide adaptive support to the user; (3) how to adapt, i.e., understanding how adaptation should be provided. In this chapter, we focus primarily on this latter question in the context of designing user-adaptive visualizations. In information visualization, the only research we are aware of targeting the question of how to adapt is on recommending alternative visualizations based on specific user, data, or task features (e.g., [63,68]). By contrast, in this chapter we focus on adaptive interventions aimed at improving the effectiveness of the visualization a user is currently working with. In particular, we evaluate a set of four alternative highlighting interventions aimed at supporting analytical interaction by directing the user's attention to a specific subset of data within a visualization, while still retaining the context of the data as a whole [54]. Highlighting can be extremely useful in any scenario in which an agent (a system or a human) needs to communicate to a user several points about a possibly large and complex dataset. For instance, in a dataset of car sales, two key points could be that “more cars were sold 21  this year in China than in India” and that “Europe sales have been decreasing in the last 3 years”. In these scenarios, the ability to highlight subsets of the data would naturally support a more effective communication. While the whole dataset can be compactly conveyed with an appropriate visualization, the information relevant to each point can be synchronously highlighted as the key points are sequentially expressed in language (written or spoken). For instance, in our example, sales for China and India would be highlighted first, followed by sales for European countries in the last 3 years. The ability to generate highlighting interventions would be especially useful in computer-human communication, for instance, when a system has automatically analyzed and derived insights from a complex dataset (e.g., [35,178] ), and needs to communicate this to a user. This functionality may also be beneficial to support a user in inspecting human-generated presentations combining visualizations with textual material, which are quite common in documents ranging from newspaper articles to scientific papers. If a system could track what part of the text the reader is currently reading, and infer the corresponding point made (as it is being investigated in [50]), such a system could use one (or more) of the interventions evaluated in this chapter to highlight the relevant visualization elements [40]. 
The interventions that we evaluated in our study were inspired by the analytical interaction techniques presented in Few [54] and by a taxonomy of post-hoc strategies for visual prompts presented by Mittal [124]. While both [54] and [124] provide valuable descriptions and taxonomies of different techniques, to the best of our knowledge there is no formal evaluation of which interventions may be most useful, both in general and under particular task/user contexts. The user study presented in this chapter aims to answer the following research questions on the effectiveness of the four interventions that we target: 22  1. Can highlighting interventions improve information visualization processing? 2. Is there an intervention that is the most effective? 3. Are questions 1 & 2 above affected by individual user characteristics, by task complexity, and by when the interventions are delivered? Generally speaking, if we find an intervention that is the most effective, it should be used whenever a system needs to draw the user’s attention to a subset of the data. However, if intervention effectiveness is found to depend on the task and/or the user, the results of our study could inform adaptive highlighting for visualization support. 2.2 Related Work Three key decisions are involved in supporting user-adaptive interaction: what to adapt to, when to adapt and how to adapt. Deciding what to adapt to involves identifying which individual user features influence interaction performance enough to justify adaptation.  In visualization, there are already results on the impact of a number of user characteristics on user performance and satisfaction. For example, user performance across different visualizations and task types has been linked to the cognitive measures of perceptual speed and spatial visualization [42,157,167], as well as to the personality trait of locus of control [71,177]. Also, the cognitive abilities of visual/verbal working memory, as well as visualization expertise, have been shown to impact user satisfaction [157]. Addressing the decision of when to adapt involves formalizing adaptation strategies that identify those situations in which the benefits of providing adaptive interventions outweigh their cost (e.g., disrupting the interaction). When to adapt has been extensively investigated in fields such as Intelligent Tutoring [174] or Availability Management Systems [87]. In visualization, to our knowledge, [63] is the only work that actively monitors real-time user 23  behavior in order to infer the need for intervention, although [40] describes a user study designed to capture instances of user confusion during visualization. Addressing the question of how to adapt, which is the focus of this chapter, has been studied outside information visualization to support, for example, display notifications [11], or hints provision [127]. In information visualization, researchers have so far focused on adaptivity that relates to suggesting alternative visualizations based on specific user or task features [63,68]. By contrast, in this chapter we focus on interventions that relate to the current visualization.  Highlighting interventions are the most relevant techniques to our goal of devising dynamic interventions, because by definition they can be added to an existing visualization as needed to emphasize a specific aspect. Our sources of inspiration for highlighting interventions were Few [54] and Mittal [124]. 
Mittal [124] was especially useful, as it presents a taxonomy of post-hoc strategies for visual prompts, which is based on a detailed analysis of previous visualization literature [16,103], and on the analysis of several thousand charts in newspapers, magazines and business/governmental reports. 2.3 User Study We conducted a user study to investigate the effectiveness of four different highlighting interventions that can be used to emphasize specific aspects of a visualization. We also look at how this effectiveness may be impacted by task complexity, user differences, and delivery time. To keep the number of conditions manageable, we only studied one visualization: bar graphs (see Figure 2.1 for an example). We focused on bar graphs for three reasons. First, bar graphs are one of the most ubiquitous and effective information visualization techniques. Second, 24  there is already research showing that performance with and preferences for this basic form of visualization is influenced by individual differences such as perceptual speed, visual working memory, and verbal working memory [42,157,167]. Thus, it can be beneficial to investigate how to provide visual interventions for different users who may be working sub-optimally with bar graphs. Finally, as we argue at the end of the chapter, results on bar graphs are likely to generalize to other information visualizations. 2.3.1 Experimental Tasks Tasks were performed via dedicated software. Each task consisted of presenting the participant with a bar graph along with a textual question relating to the displayed data. Participants would select their answer from a horizontal list of radio buttons and click 'Submit' to advance to the next task (see Figure 2.1). The study questions related to comparing individuals against a group average (data points in the bar graph) on a set of dimensions (data series in the bar graph). Since all tasks require comparisons to the average, the corresponding bar was given a fixed color (black) and position (leftmost) across all tasks (see Figure 2.1). In contrast, the color of the other bars was varied from task to task, and selected at random from a set of four color schemes optimized using ColorBrewer [75].  Figure 2.1: Example bar graph visualization as used in the experimental tasks. 25  For variety, the task questions were drawn from four different data sets: i) student grades by course; ii) movie revenue by city; iii) pet food nutritional values by vitamin and mineral content; and iv) company growth rates by business department. All tasks involved the same number of data points (six, including the average) and series dimensions (eight).  Task complexity was varied by making subjects perform two different types of task, chosen from a set of primitive data analysis tasks that Amar et al. [5] identified as "largely capturing people’s activities while employing information visualization". The first task type was Retrieve Value (RV), described by [5] as “Given a set of specific cases, find attributes of those cases”. This is one of the simplest task types in the Amar hierarchy, and thus it was selected to exemplify tasks of lower complexity. In our study, RV tasks required to retrieve a specific individual in the target domain and compare it against the group average (e.g., “Is Club Universe's revenue in Paris below the average movie revenue in that city?”). 
The second task type we chose was Compute Derived Value (CDV), defined in [5] as “Given a set of data cases, compute an aggregate numeric representation of those data cases”. In our study, CDV tasks required users to first perform a set of comparisons, and then compute an aggregate of the comparison outcomes (e.g., “In how many departments is BioRestore above the average growth and Microfirm is below it? ”). CDV tasks in our study are more complex than RV tasks because they require users to i) perform significantly more comparisons, ii) remember the comparison outcomes, and iii) compute an aggregate from the remembered comparisons. 26  2.3.2 Highlighting Interventions 2.3.2.1 Selected Interventions Figure 2.2 shows the four highlighting interventions that were evaluated in the study, which are designed to guide a user's focus to a specific subset of data within the bar graph while still retaining the context of the data as a whole [54]. In our study, these interventions would highlight those bars that were relevant to answer the current question. For instance, the question “In how many cities are both Shark Swamp and Speed Freak above the average movie revenue?”, the interventions would highlight the bars for Shark Swamp's revenue, Speed Freak's revenue, and the average movie revenue in each city. Bolding (Figure 2.2, top left) draws a thickened border around the relevant bars3. De-Emphasis (Figure 2.2, top right) fades all non-relevant bars. Average Reference Lines (Figure 2.2, bottom left) draws a horizontal line going from the top of the left-most bar (representing the average) to the last relevant bar, to facilitate comparison. Connected Arrows (Figure 2.2, bottom right) involves a series of connected arrows pointing downwards to the relevant bars.                                                   3 Notice that Bolding highlights the average only by thickening the bar, because of its black color. This is arguably not a serious confound, because the average is always relevant in the study tasks, and it was already made to stand out through its constant and distinctive color as well as a consistent leftmost position. 27   Figure 2.2: Interventions used in the study. We focused on only four highlighting interventions to keep the number of conditions and trials within the limit of the available resources for this study. We selected the four interventions in Figure 2.2 because they were the most suitable to support our target tasks, as compared, for instance, to highlighting by Color Change (left out because color is already used to encode information in our visualizations), or to Annotate Values, i.e., adding their specific values on top of selected bars (left out because it can interfere with perceived bar height, and negatively affect our tasks). Participants performed each of the two task types described earlier with each of the four highlighting interventions in Figure 2.2, as well as with no intervention, as a baseline. 2.3.2.2 Intervention Timing If highlighting interventions were to be used to provide real-time adaptive support, they would be superimposed on a visualization while the user is looking at it. This could possibly be 28  disruptive, even if the adaptive system had a reliable mechanism to decide when the interventions should appear based on user needs. 
In this study, we wanted to evaluate the relative effectiveness of the selected interventions without this confound, as well as gain initial insights on whether this relative effectiveness changes when the interventions are provided dynamically. Thus, we added an experimental factor that varied when the interventions would be shown, consisting of two conditions, Time zero (T0) and Time x (TX). In the T0 condition, the interventions are included in the bar graph from the beginning of the task, to evaluate them without the possible confound of the disruption that can be caused by a dynamic superimposition.  In contrast, the TX condition aims to gauge interventions’ effectiveness when they are added dynamically. At the time of the study, however, we had no criterion implemented to decide when an intervention should appear based on a user’s needs. Thus, we adopted a procedure designed to minimize the potential intrusiveness of an unjustified superimposition of visual prompts. Essentially, the idea is to add the visual prompts to the target bar graph as soon as the user has had a chance to look at both the bar graph and the related task question. This constraint is enforced in the TX condition by the following steps, which leverage the real-time gaze information provided by a Tobii T120 Eye-tracker installed on the experimental machine: 1. The bar graph appears, without the task question (and without intervention). It stays visible until a user has had a total of 5 eye fixations on the graph or more than 5 seconds have passed. 2. The graph disappears and the question text appears. The question stays visible until the user has had at least 6 fixations on it (2 fixations each in the first third, in the middle, and in the last third of the text), or more than 5 seconds have passed. 29  3. The graph reappears. At this point, the graph and question text are both visible.  4. After 500ms, the selected intervention is added. This slight delay aims to ensure that users recognize that the intervention is an added component to the graph they had seen so far. Participants saw each intervention on each task type with both the T0 and the TX delivery strategy, thus generating 20 experimental conditions: 2 task types (RV vs. CDV), times 2 delivery times (T0 vs. TX), times 5 interventions (including no intervention). It should be noted that participants are expected to be slower in the TX delivery condition because of the delay before both graph and text are visible on the screen (which is necessary to complete the task). What we aim to understand with these two conditions is whether delivery time affects the relative effectiveness of the interventions. 2.3.3 User Characteristics Explored in the Study The user characteristics investigated in this study include three cognitive abilities (perceptual speed, verbal and visual working memory), two measures of user visualization expertise with using bar graphs, as well as one personality trait (locus of control).  Perceptual speed (a measure of speed when performing simple perceptual tasks), visual working memory (a measure of storage and manipulation capacity of visual and spatial information), and verbal working memory (a measure of storage and manipulation capacity of verbal information) were selected because they were repeatedly shown to influence visualization performance or user satisfaction in studies involving bar graphs [42,157,167].  
Besides cognitive abilities, the study in [157] also looked at the impact of visualization expertise, but results were inconclusive, possibly because they measured expertise via self-report questions asked after the experimental tasks. In this study, we aim to provide a more 30  reliable investigation of the impact of user visualization expertise, not only on bar graph processing but also on the effectiveness of our visual interventions. We use two separate measures for expertise, captured in a pre-questionnaire: one that gauges user familiarity with simple bar graphs (expertise-simple) and one with complex ones (expertise-complex), elicited as described in the next section.  Locus of control (a measure of the degree to which individuals perceive outcomes as either a result from their own behavior, or from forces that are external to themselves) has been shown to impact user performance with visualizations other than bar graphs [71,177], e.g., list-like visualizations and visualizations with a strong containment metaphor. With this study we wanted to ascertain whether locus of control may also have an impact while interacting with simpler visualizations such as bar graphs and on the effectiveness of our visual interventions. 2.3.4 Study Procedure 62 subjects ranging in age from 18 to 42 participated in the experiment. We selected the number of participants by performing a power analysis [51] a priori on the parameters of our experimental design, defined to detect a small effect size of at least  ηρ2 = .01 with 0.8 power.  Table 2.1: Descriptive statistics of user characteristics. 31  Participants were mostly recruited via dedicated systems at our university. This resulted in a variety of student participants from diverse backgrounds (e.g., Psychology, Forestry, Computer Science, Finance, Fine Art, German, Commerce). We also recruited 7 non-student participants such as a non-profit community connector, 3D artist, and air combat systems officer. Table 2.1 presents summary statistics on the user characteristics data collected from the study. A correlation analysis over our 6 user characteristics4 shows no significant correlations, except for a strong positive correlation (r = 0.47, p < .01) between expertise-simple and expertise-complex, and a weak negative correlation (r = -0.27, p < .01) between perceptual speed and locus of control. Because the expertise measures are highly correlated, we retain only expertise-complex as our measure of expertise for further analysis, given its higher variance. The experiment was a within-subjects study, fitting in a single session lasting at most 90 minutes. There were 20 experimental conditions: 2 task types (RV vs. CDV), times 2 delivery times (T0 vs. TX), times 5 interventions (including no intervention). Participants were instructed to complete the tasks as quickly and accurately as possible. To account for within-subject variance, each participant repeated each condition 4 times, which is a well-established procedure in perceptual psychology experiments measuring performance in terms of time and accuracy [135,166]. Thus, there were a total of 80 trials per participant. To avoid participants getting bored, each of the four domains described earlier were randomly assigned to each task. Participants began by filling out a pre-study questionnaire asking for demographic information as well as self-reported expertise with simple and complex bar graphs. 
Expertise-                                                  4 The full set of correlation scores are reported in Appendix B. 32  simple was elicited with the question 'How often do you look at simple Bar Graphs', followed by a basic bar graph with 8 bars (values for one series over 8 dimensions); Expertise-complex was elicited with the question 'How often do you look at complex Bar Graphs', followed by a graph with 48 bars (values for 6 series over 8 dimensions), as used for the experimental tasks. Both questions had five answer options: i) Never, ii) Rarely (several times a year), iii) Occasionally (several times a month), iv) Frequently (several times a week), v) Very frequently (several times a day). Participants then completed standard computer-based tests for Locus of Control [143], Verbal Working Memory [165], Visual Working Memory [59], and a paper-based test for Perceptual Speed [49]. Next, participants underwent a training phase to expose them to bar graphs, the study tasks, and the highlighting interventions. Then participants underwent a calibration phase for the eye-tracker, before starting the study trials. Participants then performed 40 of the 80 study trials, followed by a 5 minute break. After the break, the eye-tracker was re-calibrated and the participant performed the remaining 40 trials. The 80 trials were fully randomized in terms of experimental conditions (i.e., task complexity, intervention delivery time, interventions). The experimental software was fully automated and ran in a web-browser, with the visualizations and interventions programmed using the D3 visualization framework [26]. Lastly, participants took a post-questionnaire asking for their evaluations of each intervention’s usefulness, as well as their relative preferences. The questionnaire included:  10 rating statements in the form of “I found the X intervention useful for performing Y tasks”, for each intervention and task type (i.e., simple vs. complex). The statements were rated on a Likert scale from 1 to 5. 33   2 ranking statements in the form of “Please rank your preference of interventions for simple/complex tasks (order from 1 to 5) (1: most preferred, 5: least preferred)”. 2.4 Analysis of task performance We look at both task completion time and task accuracy as performance measures. Completion time was normally distributed (M=19.5s, SD=10.2), whereas task accuracy indicated a ceiling effect with 91.4% correct answers, possibly due to the tasks being generally easy to solve, or due to participants focusing on generating the correct answer, while sacrificing their time on task. The ceiling effect on accuracy arguably makes a separate analysis of this performance measure not very informative. We nevertheless did not want to discard accuracy altogether, because trials that were answered incorrectly should be penalized accordingly. We opted to use a combined score for task performance, known as Inverse Efficiency Score [163]. Given that participants repeated each experimental condition 4 times, task performance is calculated by averaging completion time for the trial repetitions that were performed correctly, and then dividing this score by the percentage of correct repetitions5. Task performance values thus calculated can be essentially interpreted as completion times penalized for incorrect trials (the lower the percentage of correct trials, the higher the adjusted average time on these trials). Thus, performance is reported in seconds and a higher score represents a lower performance. 
⁵ When there are no correct repetitions, leading to a divide-by-zero problem, the participant's data for that trial must be discarded from the analysis. In our study, only one participant's data was removed from the analysis for this reason.

We use a General Linear Model (GLM) repeated measures to analyze our performance data. We first run a 2 (task complexity) by 2 (delivery time) by 5 (intervention type) GLM repeated measures to investigate the effects of our experimental factors alone. Next, we analyze the effects of each of our five co-variates separately (perceptual speed, visual WM, verbal WM, locus of control, and expertise-complex), by running a GLM with the experimental factors and only that co-variate. Due to the high number of co-variates in our study, this approach ensures that we do not overfit our models by including all co-variates at once. Each co-variate was discretized into three levels via a three-way split: low represents the bottom quartile of the values distribution (i.e., lower 25%), average represents the values within the interquartile range (i.e., middle 50%), and high represents the upper quartile (top 25%). In the next sections, effect sizes (partial eta-squared) are reported as small for .01, medium for .09, and large for .25 [55]. All reported pairwise comparisons are corrected with the Bonferroni adjustment.

2.4.1 Results on Task Performance

2.4.1.1 Main Effects

We found main effects of task type, delivery time, intervention type, Perceptual Speed, and Verbal WM, as shown in Table 2.2.

Figure 2.3: Performance score (time on task) for each intervention. All bar graphs are shown with 95% confidence intervals.

Main Effect      | F-Ratio           | Effect Size | Sig. Value
Task Type        | F(1,59) = 543.47  | ηρ² = .902  | p < .001
Delivery Time    | F(1,59) = 1277.94 | ηρ² = .956  | p < .001
Intervention     | F(4,236) = 44.44  | ηρ² = .430  | p < .001
Perceptual Speed | F(1,58) = 10.02   | ηρ² = .147  | p < .01
VerbalWM         | F(3,56) = 5.42    | ηρ² = .225  | p < .01

Table 2.2: Significant main effects on task performance (time on task).

The main effect of task type confirms the difference in complexity between the two task types in the study, with Compute Derived Value having longer task performance values (M=25.2s, SD=7.5) than the simpler Retrieve Value tasks (M=13.8s, SD=5.3). The main effect of delivery time is to be expected because of the delay in answering the question generated by the TX condition, as discussed earlier. The average performance for TX was 22.7s (SD=7.8), as opposed to 16.3s (SD=8.2) for T0.

There was also a main effect of intervention type. As shown in Figure 2.3, performance was best for De-Emphasis, and worst for None (i.e., no intervention provided). Pairwise comparisons show that interventions are significantly different from one another except for Bolding and Connected Arrows, and for None and Avg. Ref. Lines. This result indicates that all interventions, except for Avg. Ref. Lines, were helping users solve the selected tasks more efficiently than when they received no intervention. These results will be further qualified by interactions with task type and delivery time described in the next section.
The main effect and related pairwise comparisons for Perceptual Speed indicate that performance was similar for users with low Perceptual Speed (M=20.6s, SD=8.6) and average perceptual speed (M=20.3s, SD=9.3), whereas users with high perceptual speed were significantly better at completing tasks (M=17s, SD=6.7), hence confirming previous work [157]. The results for Verbal WM show similar directionality, except that the performance of users with low Verbal WM (M=24.1s, SD=12.0) was significantly worse than the scores of users in both the average group (M=19.4s, SD=8.4) and the high group (M=18.0s, SD=7.2). While [157] previously uncovered a link between Verbal WM and user preferences for different visualizations, and [158] showed that low Verbal WM increases a user's gaze fixations on textual elements, our current result on Verbal WM is, to the best of our knowledge, the first to directly link this cognitive ability to task performance with information visualizations. The results for Perceptual Speed and Verbal Working Memory will be further qualified by interactions with task type.

2.4.1.2 Interaction Effects

Table 2.3 shows a summary of the interaction effects.

Interaction Effect        | F-Ratio          | Effect Size | Sig. Value
Intervention*Task Type    | F(4,236) = 8.65  | ηρ² = .128  | p < .001
Perceptual Speed*TaskType | F(1,58) = 8.64   | ηρ² = .130  | p < .01
VerbalWM*TaskType         | F(3,56) = 5.79   | ηρ² = .237  | p < .01
VisualWM*TaskType         | F(1,58) = 3.81   | ηρ² = .062  | p < .05
DeliveryTime*Intervention | F(4,236) = 7.56  | ηρ² = .114  | p < .01

Table 2.3: Significant interaction effects for task performance (time on task).

Intervention*TaskType: Figure 2.4 shows that, for both task types, None is the intervention with the worst performance and De-Emphasis the one with the best. Pairwise comparisons, however, show that for simple tasks (RV), all interventions are significantly different from one another and are better than None, whereas for complex tasks (CDV), Avg. Ref. Lines is no longer significantly better than None. A possible explanation is that Avg. Ref. Lines helps the comparisons with the average bar, but it does not highlight the elements to be compared as well as the other interventions do, except when they are contiguous to the average bar and to each other. This may become a greater disadvantage with the more complex comparisons involved in our CDV tasks. Additionally, there are no longer significant differences between Bolding and De-Emphasis, nor among Bolding, Connected Arrows, and Avg. Ref. Lines, indicating that for more complex tasks, the relative performance differences between the interventions are less pronounced. For instance, feedback we gathered from participants indicates that De-Emphasis can make it hard to see bar groupings. Even though Bolding and De-Emphasis can be considered conceptually similar (emphasizing relevant bars vs. de-emphasizing non-relevant bars), it is possible that for complex tasks, the fading of 'irrelevant bars' removes some contextual cues for sample grouping, which would help solve the tasks (e.g., when many bars in the middle of a group are faded out, the outermost bars of a group may look disconnected).

Figure 2.4: Interaction between Interventions and Task Type on time on task.

Perceptual Speed*TaskType, VerbalWM*TaskType, VisualWM*TaskType: There was no significant difference in performance with RV tasks for users with different values of Perceptual Speed, VerbalWM, and VisualWM. For CDV tasks, in contrast, users with higher values of these cognitive measures perform better.
Figure 2.5 shows the interactions for Perceptual Speed and Verbal Working Memory.

Figure 2.5: Interaction between: Perceptual Speed and Task Type on time on task (left); VerbalWM and Task Type on time on task (right).

The trends for Visual Working Memory are very similar to those in Figure 2.5. These results are likely due to the fact that CDV tasks require processing and remembering more of the visual elements (bars, interventions, etc.) and more of the verbal information on the graph (i.e., legend items, labels, etc.). Thus, for CDV tasks, higher values of the cognitive measures may have a stronger impact compared to RV tasks. The result for perceptual speed aligns with results in previous work [157], where it was also found that users with lower perceptual speed require more time to complete a complex task relative to their high perceptual speed counterparts. For visual and verbal working memory, this study is the first to connect these two cognitive traits to task performance (as opposed to user preferences) with a visualization, possibly because previous studies relied on tasks that were not complex enough to detect these effects.

Figure 2.6: Interaction between Delivery Time and Interventions on task performance (time on task).

Delivery Time*Intervention: This interaction effect is shown in Figure 2.6 and indicates that for T0, None and De-Emphasis are, respectively, significantly worse and better than all other interventions (with no other significant differences). For TX, the differences between interventions are much smaller, with Avg. Ref. Lines no longer being significantly better than None, and Bolding and Connected Arrows no longer being worse than De-Emphasis. This result is important, because it suggests that when interventions are delivered dynamically, they may lose some of their value due to possible intrusiveness, and thus it is crucial to evaluate them in the right context of usage. On the other hand, even in the potentially intrusive TX condition, some interventions are still better than none, indicating that it is possible to provide dynamic adaptive interventions that can help improve effectiveness.

2.5 Analysis of Subjective Measures

As we did for performance measures, we first ran a 2 (task type) x 5 (intervention) General Linear Model repeated measures on the usefulness ratings in order to investigate the effects of our experimental factors alone, followed by additional analyses on each of our five co-variates with the experimental factors. These ratings were corrected using the Aligned Rank Transformation (ART) tool [173] to make them suitable for parametric analysis. Results from this analysis are shown in Table 2.4. A similar set of analyses on the preference rankings yielded no significant results.

There was a significant main effect of intervention on usefulness ratings, shown in Figure 2.7. Pairwise comparisons reveal that all of the intervention ratings were significantly different from one another (except for Connected Arrows and Bolding), and that all interventions were better than None.

Results               | F-Ratio           | Effect Size | Sig. Value
Intervention          | F(4,236) = 100.23 | ηρ² = .629  | p < .001
VisualWM*Intervention | F(4,224) = 2.34   | ηρ² = .040  | p < .05
This main effect and the trends of the relative ratings between interventions correspond exactly to those for intervention on task performance found in the previous section (see Figure 2.3), showing a strong connection between objective and subjective effectiveness of the tested interventions. It is also worth noting that users found all the interventions more useful than no intervention, regardless of task type. This was not the case for task performance. There was, however, an interaction between intervention type and visual WM, as shown in Figure 2.8.   Figure 2.8: Reported usefulness ratings (5-point Likert) by VisualWM levels. 42  This is in line with previous work linking Visual WM to user subjective (preference) scores [158]. Pairwise comparisons show that users with either low or average Visual WM rated the usefulness of Avg. Ref Lines significantly lower than users with high Visual WM. A possible explanation for this result is that the added reference lines may have been ‘visual distractors’ for lower Visual WM users, given that the lines do not run only through the relevant bars, but also through any other bars between the average and the last relevant bar. We also find that users with average Visual WM rate Bolding significantly higher than users with either low or high Visual WM. While this finding further confirms the influence of Visual WM on subjective ratings, we currently do not have an intuition as to the directionality of the result. 2.6 Discussion and Conclusions The goal of our study was to investigate the relative effectiveness of four visual prompts designed to support users in visualization processing by highlighting visualization elements relevant to performing target tasks. As we discussed in the introduction, this functionality can be extremely useful for scenarios in which users need to make a variety of inferences on a visualized dataset, and may benefit from having the most relevant subsets of graph elements emphasized in turn. Although in our study, to keep the number of conditions manageable, we only considered one type of information visualization, i.e., bar graphs, there are at least three different arguments that support the potential generality of our results to other visualizations (but of course, generalizations should be eventually corroborated by empirical evidence). First, bar graphs are one of the most popular visualizations because they rely on length and 2-D position, the only two pre-attentive attributes that can be perceived quantitatively with a high degree of 43  precision [54]. Thus, results on bar graphs can arguably generalize to other popular visualizations that rely on the same pre-attentive attributes, like line-graphs and scatter-plots. Second, since bar graphs are so effective and popular they have been used as building blocks of more complex visualizations. For instance, [67] recently presented LineUp, an interactive visualization supporting the very common and critical task of ranking items based on multiple heterogeneous. As another example, ValueCharts [33] is a visualization that has been applied to elicit user preferences in decision making in different domains, as well as a component of a sophisticated interface to query event sequences. We argue that our results may well generalize to these more complex and extremely useful visualizations based on bar graphs. Third, most of the interventions considered in the chapter can be applied to other visualizations besides bar graphs. 
Thus, our results may well generalize also to these visualizations. For example, [100] demonstrated several example applications of reference line, bolding, and de-emphasis in pie charts and line charts in addition to bar graphs. Average reference lines have been used to visually compare individual marks to a predetermined value in various charts [100,124]. Bolding and de-emphasis form a perceptual group based on the Gestalt principle of similarity [100] and thus have been applied in various visualizations to relate items (e.g., TreeMap, Scatter Plot Matrix, Arc diagram [76]).  We now discuss the user study results with respect to our original research questions. In the study we wanted to ascertain (1) if highlighting interventions can improve visualization processing; (2) if there is a highlighting intervention that is the most effective, and (3) if questions 1 & 2 are affected by user characteristics, task complexity, and intervention timing. We investigated these questions in the context of performing visual tasks with bar graphs.  44  As for question 1, our results show that all the highlighting interventions we tested, except for Avg. Ref Lines, can improve visualization processing compared to receiving no interventions, both in terms of task performance and a user’s perceived usefulness. Thus, these interventions should be further investigated as means of providing users with dynamic support during visualization tasks.  As for question 2, results show that no single highlighting intervention is the most effective in general. De-Emphasis always performed at the top, in terms of both performance and rated usefulness, but it was absolute best only with the simpler RV tasks, and when it was present from the beginning of the task (delivery condition T0). Hence, we did find significant effects of task complexity and delivery time on intervention effectiveness (question 3). When considering task performance, there was no longer a significant difference between De-Emphasis and Bolding during complex tasks, or among De-Emphasis, Bolding and Connected Arrows when the interventions were delivered dynamically. For the long-term goal of providing adaptive highlighting interventions, this latter result suggests that future studies should focus on further investigating the effectiveness of De-Emphasis, Bolding and Connected Arrows in dynamic delivery conditions, and in particular in conjunction with delivery criteria based on actual user needs (e.g., at the onset of confusion as suggested in [40]). It is already a very encouraging result however, that delivering the interventions dynamically did not neutralize their effectiveness compared to no intervention, suggesting that their benefits outweigh their potential intrusiveness.  Still in relation to question 3, we also found an impact of user characteristics, in terms of an effect of Visual WM on ratings for perceived intervention usefulness, namely high Visual WM users rated Avg. Ref Lines interventions as more useful, and average Visual WM users rated 45  Bolding interventions as more useful. This result is in line with previous findings that Visual WM affects subjective ratings for visualizations [157]. Our results suggest that, if information on a user's Visual WM is available, higher perceived usefulness may be achieved by using Bolding as a highlighting intervention for users with average Visual WM. Interesting effects of individual differences were also found when analyzing the interaction with task complexity for task performance. 
In particular, for each of the three cognitive abilities tested in the study, we found no significant difference in performance among participants with different levels of these abilities for simple tasks (RV). In contrast, for complex tasks (CDV) participants with high measures performed significantly better, indicating that complexity can significantly impact user performance (i.e., time on task) depending on cognitive abilities. Similar results were found in previous work for Perceptual Speed [157], but this is the first study that extends them to Visual and Verbal WM, likely because of the increased complexity of our tasks. The implication for user-adaptive visualizations is that participants with low-medium cognitive measures would benefit the most from help such as adaptive interventions. The fact that there were no interaction effects between cognitive abilities and the different highlighting interventions targeted in the study suggests that perhaps other types of interventions should be explored to help users with low-medium cognitive measures. For instance, previous work linking gaze patterns to performance when processing bar graphs [157], suggests that users with low Perceptual Speed may benefit from help in processing a graph’s legend, whereas users with low Verbal WM may benefit from interventions that facilitate processing of the verbal elements of a graph. 46  Also to note, we did not find any significant results for the personality trait Locus of Control. A likely explanation is that most findings for this user characteristic were found when comparing list-like visualizations and visualizations with a strong containment metaphor [71], which were not the target of our interventions. We also did not find any significant results for the visualization expertise measures we collected from users. This could be due to the fact that some users were possibly biased when self-reporting their expertise, or that previous expertise was not a relevant factor with regard to a user's performance/preference with the visualization tasks administered in our study. Our next step involves an analysis of user eye gaze behavior in order to verify and better qualify our findings, and to suggest further interventions for adaptive help. We also plan to run similar experiments on more complex visualizations and on a broader set of interventions. 47  Chapter 3: Analyzing Eye Tracking to Understand User Characteristics During Visualization Processing with Highlighting Interventions Preface    ̶ This chapter6 presents an analysis of eye tracking data collected from the intervention study presented in Chapter 2, to understand if and how user characteristics impact visual processing of bar charts. We then link this processing to task performance in order to provide insights on the question how to adapt. Our results identify specific visualization regions that cause poor task performance in users with low values of certain cognitive measures, and should therefore be the target of personalized visualization support. In particular, our findings show that users with low Verbal Working Memory might benefit from interventions that facilitate processing of the legend in bar charts. 3.1 Introduction Information visualization (Infoviz) systems are widely used across many domains and applications in order to explore, manage, and better understand data. Despite their increasing                                                   6 The content of this chapter was published as [154]: Toker and Conati. 
(2014) Eye tracking to understand user differences in visualization processing with highlighting interventions. User Modeling, Adaptation, and Personalization. Lecture Notes in Computer Science (UMAP ‘14). 48  frequency of use and the rise of big data, these systems have typically continued to follow a one-size-fits-all approach in terms of how they account for their users. An ever increasing body of research however, has shown that individual user differences can play a role in user performance or preference for a given infoviz system [32,42,70,157,167]. These findings suggest that visualization effectiveness may be improved by having Infoviz systems that can detect relevant user differences during visualization processing, and adapt accordingly. Researchers have already started looking at adaptation approaches that recommend alternative visualizations based on detected user needs (e.g., [63,68]). By contrast, in this chapter we focus on exploring the potential of adaptive interventions aimed at improving the effectiveness of the visualization currently used. In particular, we use eye tracking to evaluate the impact on visualization processing of four highlighting interventions which could eventually be used to provide adaptive support by dynamically redirecting the user's attention to different subsets of the visualized data as needed (e.g., when the visualization is used together with a verbal description that discusses different aspects of a dataset [32]). Previous work has already looked at the impact of these interventions on user task performance [32]. In this chapter however, we analyze user gaze behavior based on eye tracking data collected during the study in [32], in order to gain a more fine-grained understanding of how the study factors (e.g., interventions, user differences, task complexity) impact visualization processing. For gaze data analysis, we employ the same methodology proposed in [158]7, consisting of several stages of data preprocessing and statistical modeling. The work in [158] looked at a simple gaze data set to understand how a set of individual differences affect visualization processing while performing                                                   7 A reminder that this work was conducted as part of my Master’s thesis (cf. Table 1.1 in Chapter 1). 49  a variety of tasks with two different visualizations (bar graphs and radar graphs). In this chapter, we focus on bar graphs only, and extend the work in [158] by looking at (i) a larger set of individual differences; (ii) more complex data sets, and (iii) if/how the related visual processing is impacted by different highlighting interventions. We also include a new region of visualization processing (answer input area of interest), to track gaze behaviors within the region where users input their answers to the study tasks. The research questions we investigate here in this chapter are as follows:  Q1. How do the tested sets of user characteristics, highlighting interventions, and task complexity impact gaze behavior during bar graph visualization processing? Q2. How do results in Q1 relate to results on the impact of these factors on task performance reported in [32], and what are the implications for adaptive visualizations?  In answering these research questions, our objective is to inform the next stages of design for a real-time user-adaptive information visualization system. Our results do in fact show significant impacts of user characteristics, task type, and interventions on gaze behaviors. 
These results are then used to shed light on why significant performance differences occurred during visualization processing as reported in [32]. Based on these outcomes, we offer design recommendations for providing adaptive visualization support for bar graph processing using highlighting interventions. 3.2 Related Work Recent work has begun to evaluate the benefits of user adaptation for information visualization systems. Both Grawemeyer [68], and Gotz & Wen [63] found positive results when evaluating systems that provide recommendations on a set of available visualizations based on a user's 50  tasks, prior knowledge, and performance. While these systems adapt only to user features such as domain knowledge or performance tracked via interface-related behaviors, several studies have shown that other user characteristics can impact visualization performance. Various cognitive abilities such as perceptual speed, verbal working memory, and visual working memory have been shown to impact user performance and/or user subjective experience with visualization tasks [32,42,157,167]. Researchers have also shown that personality traits (e.g., locus of control) can have similar impacts on performance [70]. Given this increasing evidence on the impact of user differences in visualization performance, researchers have been investigating ways to capture the relevant user traits in real-time so as to inform adaptive information visualization systems, with substantial attention being devoted to approaches leveraging gaze data. For example, Gridinger et al. [72] used group-wise similarity of gaze patterns to predict domain expertise in processing visualizations of weather patterns. Steichen et al. [149] and Toker et al. [162] predict, respectively, user characteristics and skill acquisition based solely on tracking a large set of aggregate gaze features collected during visualization tasks. Eye tracking has also been investigated as a promising source of information for understanding how to adapt to specific user traits for supporting effective visualization processing. For instance, several studies have shown significant differences in gaze patterns of experts and novices during visualization tasks in a variety of domains, including chemistry (e.g., [151,152]) and general information search [105]. It should be noted however, that little work has been done to formally connect differences in gaze behaviors due to user characteristics, to objective measures of task performance. Building this connection is key in order to understand how to improve visualization performance by tailoring support to specific user traits. Toker et al. [158] have begun to address this gap by running a formal analysis of 51  eye gaze behaviors with bar and radar graph visualizations. In a previous study with these visualizations, users with low values for perceptual speed had been found to perform poorly compared to users with high perceptual speed [157]. By then analyzing the gaze data, [158] explains this performance difference in terms of the higher processing time that low perceptual speed users need to devote to the visualization's legend. Based on these findings, [158] recommended that low perceptual speed users ought to be supported by designing interventions that target the legend region. 
In this chapter, we apply the same methodology as [158] to the performance results from the study reported in [32], in order to gain a better understanding of how user differences impact visualization processing when highlighting interventions are available.

Figure 3.1: Sample bar graph visualization and task administered in the study.

3.3 User Study

The study that generated the data used in this chapter investigated the effectiveness of four highlighting interventions designed to help the processing of bar graphs, as well as how this effectiveness is impacted by both task complexity and different user traits. The study was a single-session, within-subjects design, lasting at most 90 minutes. 62 participants performed tasks using bar graphs (Figure 3.1) with a fully-automated interface. Gaze was tracked using a Tobii T120 eye tracker, and calibration was performed twice: once at the start and once at the mid-point of the study. Bar graphs were chosen because they are a common visualization for which there is already evidence of the impact of individual differences and the need for adaptive support [157].

Task complexity was varied by having subjects perform 2 different types of tasks, chosen from a standard set of primitive data analysis tasks in Amar et al. [5]. The first task type was Retrieve Value (RV), one of the simplest task types in [5], which in the study consisted of retrieving the value for a specific individual in the dataset and comparing it against the group average (e.g., "Is Michael's grade in Chemistry above the class average?"). The second, more complex task type was Compute Derived Value (CDV), which in the study required users to first perform a set of comparisons, and then compute an aggregate of the comparison outcomes (e.g., "In how many cities is the movie Vampire Attack above the average revenue and the movie How to Date Your Friends below it?"). All tasks involved the same number of data points (6) and series elements (8). It should be noted that these datasets were more complex than those used in a previous study on the impact of individual differences on bar graph processing [158], which involved at most three data points per series.

Figure 3.2: The four highlighting interventions evaluated in the study.

Each intervention evaluated in the study (shown in Figure 3.2) was designed to highlight graph bars that were relevant to answer the current question, to guide a user's focus to a specific subset of the visualized data while still retaining the overall context of the data as a whole [54]. The Bolding intervention draws a thickened box around the relevant bars; De-Emphasis fades all non-relevant bars; Average Reference Lines draws a horizontal line from the top of the left-most bar (representing the average) to the last relevant bar; Connected Arrows involves a series of connected arrows pointing downwards to the relevant bars.
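To make these interventions more concrete, the following is a minimal sketch of how the Bolding and De-Emphasis styles could be realized on a generic grouped bar chart with matplotlib. The chart data, the set of "relevant" bars, and the styling values are illustrative assumptions; the study used its own fully-automated interface rather than this code.

```python
# Minimal sketch (not the study's interface): applying Bolding- and
# De-Emphasis-style highlighting to a matplotlib grouped bar chart.
# The dataset, the choice of "relevant" bars, and styling values are illustrative.
import matplotlib.pyplot as plt
import numpy as np

cities = ["City A", "City B", "City C", "City D"]
revenue = {"Movie 1": [3.1, 2.4, 4.0, 3.6], "Movie 2": [2.2, 2.9, 3.3, 1.8]}

fig, ax = plt.subplots()
x = np.arange(len(cities))
width = 0.35
bars = []
for i, (series, values) in enumerate(revenue.items()):
    bars.extend(ax.bar(x + i * width, values, width, label=series))
ax.set_xticks(x + width / 2)
ax.set_xticklabels(cities)
ax.legend()

relevant = {bars[0], bars[2]}  # bars the current question refers to (illustrative)

def bolding(relevant_bars):
    """Draw a thickened outline around the relevant bars."""
    for b in relevant_bars:
        b.set_edgecolor("black")
        b.set_linewidth(3)

def de_emphasis(all_bars, relevant_bars):
    """Fade every bar that is not relevant to the current question."""
    for b in all_bars:
        if b not in relevant_bars:
            b.set_alpha(0.25)

bolding(relevant)
de_emphasis(bars, relevant)
plt.show()
```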
Participants began by completing a set of tests that measured the 5 user characteristics evaluated in the study which included: (1) Perceptual speed, a measure of speed when performing simple perceptual tasks [49]; (2) Visual Working Memory, a measure of storage and manipulation capacity of visual and spatial information [59]; (3) Verbal Working Memory, a measure of storage and manipulation capacity of verbal information [165]; (4) Bar Graph Expertise, a self-reported measure of a user's experience with using bar graphs; (5) Locus of Control, a personality trait measuring whether individuals tend to take responsibility for their circumstances or blame them on external factors. These measures were selected because they had been previously shown to influence user performance or satisfaction in bar graph studies [32,42,157,167] or other visualizations [70]. Next, each participant performed each of the two task types (RV & CDV) with each of the 4 interventions as well as No Intervention as a baseline for comparison, in a fully randomized manner, yielding a total of 80 trials per participant. 3.4 Eye Tracking Pre-processing & Analysis Following the same approach in [158], the eye tracking data is processed in three stages. First, we generate a set of gaze features from the raw data. Next, principal component analysis (PCA) is performed on these features to obtain a set of factors which will act as the dependent measures for statistical analysis. Lastly, linear mixed-effect models (mixed models) are used to 54  evaluate the impact of the study factors and user characteristics on the eye tracking components. 3.4.1 Generate Low-Level Eye Tracking Features Eye tracking data consists of fixations (i.e., gaze points on the screen), and saccades (i.e., paths between fixations). We processed the raw gaze data from the study using EMDAT, an open-source toolkit8 which computes gaze features including sums, averages, and standard deviations of a variety of gaze measures, such as fixation rate and duration, saccade length, and absolute/relative saccade angles. These features can be computed with respect to the overall screen, using no information on the displayed content (e.g., mean fixation duration, sum lengths of saccades, average angles of saccades), and there are 14 such features, called High-level features, from now on. Features can also be computed for specific areas of interest (AOI) in the interface (AOI-level features). These include both proportionate measures indicating relative attention to each AOI (e.g., proportion of time/fixations spent looking at an AOI), as well as transition measures indicating how a user’s attention shifts between two AOIs (e.g., transition from AOI x to AOI y). This ensemble of features constitute the building blocks for comprehensive gaze processing [62]. The set of AOIs for the bar graph used in the study consists of: (1) 'High' AOI, a rectangular area that covers the top half of the vertical bars; (2) 'Low' AOI covers the lower half of the vertical bars, (3) 'Labels' AOI: covers the series elements labels, (4) 'Legend' AOI: covers the legend, (5) 'Question' AOI: covers the text describing the                                                   8 Eye Movement Data Analysis Toolkit, available at:  http://www.cs.ubc.ca/~skardan/EMDAT/ 55  task to be performed, and (6) 'Input' AOI: covers the radio buttons and submit button, (refer to Figure 3.1). 
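To illustrate the kind of features involved, here is a minimal sketch of computing a few high-level and AOI-proportionate gaze measures from a list of fixations. The fixation format and the AOI rectangles are hypothetical stand-ins; EMDAT computes a much richer feature set, including saccade- and transition-based measures.

```python
# Minimal sketch of gaze-feature computation; the fixation format and AOI
# rectangles below are hypothetical stand-ins for what EMDAT produces.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float         # screen coordinates in pixels
    y: float
    duration: float  # milliseconds

# AOIs as named rectangles: (x_min, y_min, x_max, y_max) -- illustrative values
AOIS = {
    "legend":   (900, 100, 1100, 300),
    "question": (100, 650, 1100, 750),
    "input":    (100, 760, 1100, 820),
}

def in_aoi(fix, rect):
    x0, y0, x1, y1 = rect
    return x0 <= fix.x <= x1 and y0 <= fix.y <= y1

def gaze_features(fixations):
    """Compute a few high-level and AOI-proportionate features."""
    total_dur = sum(f.duration for f in fixations)
    feats = {
        "total_num_fixations": len(fixations),
        "sum_fixation_durations": total_dur,
        "mean_fixation_duration": total_dur / len(fixations),
    }
    for name, rect in AOIS.items():
        aoi_dur = sum(f.duration for f in fixations if in_aoi(f, rect))
        feats[f"{name}_prop_total_duration"] = aoi_dur / total_dur
    return feats

fixations = [Fixation(950, 150, 210), Fixation(400, 700, 180), Fixation(300, 400, 250)]
print(gaze_features(fixations))
```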
3.4.2 Generate Components using Dimensional Reduction

The goal of this step is to use principal component analysis (PCA) in order to identify and combine groups of inter-related gaze features into components more suitable for data analysis [56]. We first group the gaze features into three non-overlapping families according to how the measures are intuitively related: High-level family, AOI-proportionate family, and AOI-transitions family. We then conduct a separate PCA on each family; the results are described next. In the subsequent tables, '**' indicates features that are negatively correlated with the component to which they belong. Since [158] used the same families of gaze features for their PCAs, we will comment on the similarities and differences with our results to show where consistencies exist across different visualization contexts.

Performing PCA on the 14 high-level gaze features generated five components (χ² = 22035.01, df = 91, p < .001, explained variance 88.31%), shown in Table 3.1. The names for the components are based on commonalities among their features. These 5 components are identical to those found in [158], even though the underlying gaze features were generated from two different studies (one using radar graphs and bar graphs, and one using only bar graphs and interventions). This is initial yet strong evidence that the relationships between the 14 High-level gaze features may be consistent regardless of the visualization context.

Table 3.1: PCA results for the high-level family (component name: member gaze features).
Sum-Measures: Total-num-fixations, Sum-rel.-saccade-angles, Sum-abs-saccade-angles, Sum-saccade-length, Sum-fixation-durations
Fixation-Measures: Mean-fixation-durations, Std-dev-fixation-durations, Fixation-rate**
Saccade-Length: Mean-saccade-length, Std-dev-saccade-length
Saccade-Angles: Mean-rel.-saccade-angles, Std-dev-rel.-saccade-angles, Std-dev-abs-saccade-angles
Mean-Abs-Saccade-Angles: Mean-abs-saccade-angles

Performing PCA on the 12 features in the AOI-proportionate family produced five components (χ² = 15271.10, df = 66, p < .001, explained variance 93.71%), shown in Table 3.2. Although the 'Input' AOI was not examined in [158], there are still strong similarities between their PCA results and ours. In both PCAs, proportionate measures of total-duration and total-fixations for any AOI always appear together in some component, indicating that these features are strongly correlated. Furthermore, the components related to proportionate attention to the 'Label', 'Low', and 'Legend' AOIs are identical to those in [158]. One obvious difference from [158] is that here we included an additional AOI, whose proportionate features were grouped by PCA into the same component (prop-Input in Table 3.2). A second difference is that in [158] the 'Question' and 'High' AOI-proportionate gaze features produced separate components, whereas here they were combined into one component (prop-Question/High in Table 3.2). This is an indication that, unlike High-level gaze features, certain AOI-related gaze behaviors are likely dependent on interaction contexts (e.g., visualization type, task complexity).

Table 3.2: PCA results for the AOI-proportionate family (component name: member gaze features).
prop-Question/High: Question-prop-total-duration, Question-prop-total-fixations, High-prop-total-duration**, High-prop-total-fixations**
prop-Low: Low-prop-total-duration, Low-prop-total-fixations
prop-Labels: Labels-prop-total-duration, Labels-prop-total-fixations
prop-Input: Input-prop-total-duration, Input-prop-total-fixations
prop-Legend: Legend-prop-total-duration, Legend-prop-total-fixations
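The dimensionality-reduction step itself is standard; below is a minimal sketch of running a PCA on one family of gaze features with scikit-learn, using hypothetical data and column names. It does not reproduce the exact procedure behind the tables in this section (for example, the reported χ² statistics and the treatment of component loadings are omitted).

```python
# Minimal sketch of the dimensionality-reduction step on one family of gaze
# features; column names and data are hypothetical, and the exact procedure
# used in the study (e.g., handling of loadings) is not reproduced here.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
high_level = pd.DataFrame(
    rng.normal(size=(200, 5)),
    columns=["total_num_fixations", "sum_fixation_durations",
             "mean_fixation_duration", "mean_saccade_length",
             "mean_rel_saccade_angles"],
)

# Standardize, then extract components; the loadings indicate which features
# group together into a component (e.g., a "Sum-Measures"-like component).
scaled = StandardScaler().fit_transform(high_level)
pca = PCA(n_components=3)
scores = pca.fit_transform(scaled)          # per-trial component scores
loadings = pd.DataFrame(pca.components_.T,
                        index=high_level.columns,
                        columns=[f"C{i+1}" for i in range(3)])
print(loadings.round(2))
print("explained variance:", pca.explained_variance_ratio_.round(2))
```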
Performing PCA on the 36 gaze features in the AOI-transitions family generated five components (χ² = 22755.8, df = 630, p < .001, explained variance 48.2%), shown in Table 3.3. Unlike [158], where each transition component included features related mostly to one specific AOI, here the transition components are considerably noisier, meaning that there is more overlap in which AOI(s) primarily comprise a given component. These findings indicate that, of the 3 families of gaze features examined, transition features are the least similar across interaction contexts, which is likely due to the finer granularity of interaction with the visualization that they capture.

Table 3.3: PCA results for the AOI-transitions family (component name: member gaze features).
trans-Label/Low: Low→label, Label→low, Label→labels, Question→label, Label→question, Label→legend, Legend→label, Legend→low, Low→low, Low→legend, Question→low, Question→question, Low→question
trans-High/Legend/Question: High→legend, Legend→high, Legend→question, Question→legend, High→question, High→high, Question→question, Question→high, Legend→legend
trans-Input: Legend→input, Input→legend, Input→input, Question→input, Input→question, Input→low, Low→input
trans-Low: Low→high, High→low, Low→low, Question→low, Low→question
trans-Label/Question: Input→label, Label→high, Label→input, High→high, Label→question, Question→label, Input→question

3.4.3 Mixed Model Analysis

The final step of our analysis involves running a formal statistical model (mixed model) to evaluate the impact of our study parameters (task complexity, interventions) and user characteristics on the gaze components. For each of the three families of gaze features described in the previous section, we run a set of mixed models on each component (for a total of 15 sets of mixed models). Each mixed model is a 2 (task type) by 5 (intervention) design with the respective component as the dependent measure. Additionally, as was done in [32], each of the five covariates (perceptual speed, verbalWM, visualWM, expertise, locus of control) is analyzed separately by running an additional mixed model for that covariate and the experimental factors. Given the high number of covariates, this approach ensures that we do not over-fit the models. To account for multiple comparisons within each family of gaze features, each mixed model is adjusted using a Bonferroni correction with value equal to the number of components in each family (i.e., 5), resulting in an overall total of 15 corrections. Statistical significance is thus reported post-correction at the .05 level.

3.5 Results

In this section, we report a selection of results from the gaze analysis, organized into three parts: results on effects relating to user characteristics; results relating to highlighting interventions; and results relating to task type (i.e., Compute Derived Value & Retrieve Value) that do not directly involve user characteristics.
All reported results are statistically significant (p < .05); however, due to space limitations only the effect sizes (R²) are shown.

3.5.1 Impact of User Characteristics on Gaze Patterns

The user differences for which we found significant effects on gaze data are perceptual speed (PS), visual working memory (VisualWM), and verbal working memory (VerbalWM). These are also the user characteristics that were found to significantly impact user performance in [32]. In particular, users with low measures of PS and VisualWM were significantly slower when completing harder tasks (CDV) than users with high measures of these abilities. Users with low VerbalWM were significantly slower than high VerbalWM users regardless of task type. In the following sections, we link differences in task performance (previous results presented in [32]) to gaze behaviors (new results in this chapter), which together offer explanations as to where/how poor performance is occurring within a task, as well as how this knowledge can inform the design of user-adaptive support. Results for user characteristics are presented based on a median split of users along these measures (e.g., low vs. high perceptual speed).

Figure 3.3: Interaction effect between Perceptual Speed (High/Low) and TaskType (Retrieve Value/Compute Derived Value) on the prop_Labels component.

Interaction Effect: PerceptualSpeed*TaskType. We found an interaction effect between PS and TaskType on prop-Labels (R² = .009), shown in Figure 3.3. This effect indicates that, for harder tasks (CDV), users with low PS are spending more of their time looking at the labels of the bar graph. Similar results were also reported in [158], where they found that users with low PS transitioned more often to the labels when working on harder tasks. Given that low PS users showed poorer performance in harder tasks [32], these results reinforce the need to consider offering adaptive interventions that can help low PS users to process graph labels. For instance, we may want to extend our set of highlighting interventions to apply to labels.

Interaction Effect: VisualWM*TaskType. We found interaction effects for visualWM*TaskType on features in both the AOI-proportionate and AOI-transitions families: prop-Input (R² = .014) and trans-Input (R² = .016), shown in Figure 3.4.

Figure 3.4: Interaction between Visual Working Memory (High/Low) and TaskType (Retrieve Value/Compute Derived Value) for two 'Input' AOI related components.

These effects indicate that for harder tasks, users with low visualWM spend more of their time looking at the 'Input' AOI and are also transitioning more frequently to it, compared to users with high visualWM. The latter finding on transition frequency specifically suggests that low visualWM users likely have difficulty connecting the answer options in the input area with the information in the graph, which causes them to go back and forth between the input and the other graph areas more often than high visualWM users do. This behavior can explain why in [32] low visualWM users were found to be slower at solving the tasks than their high visualWM counterparts. This combination of findings suggests that we may want to experiment with designing adaptive support for low visualWM users that focuses on facilitating processing of the input options in relation to the task (e.g., experiment with different input methods or visual representations of radio buttons).
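As an illustration of the analyses behind these results, the following is a minimal sketch of one such mixed model: task type, a median-split user characteristic, and their interaction predicting a single gaze component, with a per-participant random intercept (using statsmodels). The data frame, variable names, and random-effects structure are illustrative assumptions rather than the exact models run for the results above.

```python
# Minimal sketch of one mixed-model analysis: TaskType x (median-split)
# VisualWM on a PCA gaze component, with a random intercept per participant.
# The simulated data and the model specification are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_users, n_trials = 62, 80
df = pd.DataFrame({
    "user": np.repeat(np.arange(n_users), n_trials),
    "task_type": np.tile(["RV", "CDV"], n_users * n_trials // 2),
    "visual_wm": np.repeat(rng.normal(size=n_users), n_trials),
    "prop_input": rng.normal(size=n_users * n_trials),  # PCA component score
})

# Median split of the covariate, as done when reporting the results.
df["visual_wm_level"] = np.where(df["visual_wm"] >= df["visual_wm"].median(),
                                 "high", "low")

model = smf.mixedlm("prop_input ~ task_type * visual_wm_level",
                    data=df, groups=df["user"])
result = model.fit()
print(result.summary())
```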
We also found an interaction effect between visualWM and TaskType on the Saccade-Length component (R² = .008) indicating that, for harder tasks, users with low visualWM had longer saccades and a greater standard deviation of saccade lengths. This is akin to these users taking 'broader strokes' as they look about the screen, as well as having less consistently sized saccades. This finding may be an additional manifestation of the difficulty these users experience with harder tasks, further explaining why they were slower at completing them. Interestingly, no links between visualWM and gaze behaviors were found in [158]. One explanation is that the more complex datasets used for the visualizations targeted in this chapter provided an increase in visual complexity which drew out the impact of visualWM capacity. Main Effect: VerbalWM. We found a main effect of verbalWM on the AOI-transitions family, specifically on the trans-High/Legend/Question component (R² = .005). This effect indicates that users with low verbalWM transitioned over the 'High', 'Legend', and 'Question' AOIs more often than users with high verbalWM. Both legend and question are textual elements, thus this finding is consistent with the fact that users with lower verbal capacity may need to review these textual elements more often. Similarly, [158] reported a main effect of verbalWM on the proportion of time users spent looking at the main textual elements of the visualization. They were, however, unable to establish whether these behaviors affected performance and may warrant adaptive interventions. In contrast, we can link the main effect discussed here to the increase in task completion time for low VisualWM reported in [32], 62  indicating that it is worthwhile to investigate adaptive interventions that aid the processing of a visualization’s textual component for these users. 3.5.2 Impact of Interventions on Visualization Processing Previous results in [32] show that three of the four highlighting interventions described in Section 3.3 led to better task performance compared to having no interventions, whereas the Avg.Ref.Line intervention did not. The eye tracking results in this subsection may help shed some light on this finding.   Figure 3.5: Main effect of intervention type on three different gaze components. We found main effects of intervention type on three different gaze components: Sum-Measures (a component of the High-level family consisting of sums over measures for overall fixations and saccade angles, R² = .102), as well as two components of the AOI-transitions family: trans-Label/Low (R² = .056) and trans-High/Legend/Question (R² = .049). Pairwise comparisons of the interventions indicated that for all three gaze components, Avg.Ref.Line has significantly higher values than ConnectedArrow and DeEmphasis (see Figure 3.5). In [32], Avg.Ref.Line was suggested to be a visual distractor that interferes with visualization processing because of its poor performance. Our results seem to confirm this suggestion, by showing that this intervention generated significant additional visual work (i.e., increased sum 63  measures and gaze transitions). It is interesting to note that, even though in [32] Avg.Ref.Line is comparable to No Intervention in terms of task performance, pairwise comparisons also indicated that the three gaze components values for No Intervention are significantly lower than Avg.Ref.Line, and are in fact more comparable to the other 3 interventions. 
Thus, it appears that for No Interventions, users still perform poorly, but not because of visual distraction. Since no other significant results were found based on the interventions, this eye-gaze analysis cannot account for why [32] found that three of the interventions were better than No Intervention. 3.5.3 Impact of TaskType on AOI Processing In this subsection, we report the most compelling results relating exclusively to main effects of TaskType. These results are interesting because under some conditions, an adaptive system may not have reliable information on its user’s cognitive abilities. Our results show that gaze behavior may help an adaptive system ascertain the complexity of the task at hand (e.g., easier vs. harder task), which by itself can be a valuable basis for providing adaptive support.  Figure 3.6: Main effect of TaskType (Retrieve Value/Compute Derived Value) on four of the five AOI-proportionate family components. 64  There are significant main effects of TaskType on four of the five components from the AOI-proportionate family (Figure 3.6). For three of these components: prop-Question/High (R² = .133), prop-Labels (R² = .305), and prop-Legend (R² = .149); values are higher for easier (RV) than for harder (CDV) tasks. Recall that the prop-Question/High component includes 'High' AOI features with a negative correlation (see Table 3.2) implying that the less time a user spends in the 'Question' AOI, the more time they spent in the 'High' AOI. Thus in terms of attention to the corresponding AOIs, these effects indicate that when performing harder tasks, users spend less time (in proportion) in the 'Legend', 'Label', and 'Question' AOIs, and more time in the 'High' AOI. This result is quite intuitive considering that this is the region were the actual data values are displayed, and thus users may need more time to process this information for more complex tasks. Adaptive interventions like the ones targeted in this chapter may help alleviate this problem. For the fourth component: prop-Input (R² = .114), values increase during harder tasks, indicating that for these tasks users also devote a higher proportion of their attention to the 'Input' AOI, as they do for the 'High' AOI. These findings offer further evidence that the response input region may play an important role in supporting optimal user performance, thus making it worthwhile to investigate forms of adaptations that target not only user differences (as discussed in a previous section), but also task complexity. 3.6 Conclusions and Future Work We presented an analysis of user gaze data to understand if and how user characteristics impact visual processing of bar charts in the presence of different highlighting interventions designed to facilitate visualization usage. We then linked these results to task performance, 65  obtained from a previous study, in order to provide insights on how to design user-adaptive information visualization systems. Our first research question (Q1) asked if and how our tested sets of user differences, highlighting interventions, and task complexity impact gaze behavior during bar graph visualization processing. We found several positive answers. For instance, with harder tasks, users with low perceptual speed (PS) spent more time processing the 'Label' AOI, whereas users with low visualWM spent more time looking at the 'Input' AOI and transitioning between that AOI and other parts of the screen. 
Similarly, users with low verbalWM spend more time processing some of the textual elements of the graph. Similar results for PS were obtained in [158], however, the findings related to verbalWM and visualWM are unique of our work. All users, regardless of cognitive abilities, spent more time processing the 'High' AOI as well as the 'Input' AOI when dealing with harder tasks. As for the highlighting interventions, Avg.Ref.Line caused significantly more transitions as well as an increase in fixations and saccades. Our second research question (Q2) asked how the above findings can be related to user performance results reported in [32], and the implications for adaptive visualizations. We found that most of our significant effects on gaze behaviors mirrored effects found on task performance in [32], allowing us to explain poor performance in terms of both specific gaze patterns, as well as the user differences that caused them. These connections indicate several new avenues of investigation for adaptive interventions, in addition to those discussed, for instance, in [158]. In particular, adaptive support may benefit users with low visualWM on harder tasks by targeting the input regions of bar graphs. Low verbalWM users may benefit from interventions that facilitate processing the textual information related to the task 66  questions and legend. We also discussed evidence as to why the Avg.Ref.Line intervention was distracting and did not improve performance, which provides preliminary abstract guidelines on what constitutes a distraction (e.g., increased Sum-Measures and AOI-transitions). In future work, we will evaluate pupil dilation data from the same study to understand how the study factors and user differences affect cognitive load9. We will also design and evaluate adaptive interventions based on the results in this chapter (e.g., various types of support for the input AOI and labels AOI).                                                   9 An evaluation of Intervention Type and cognitive load is in fact the subject of the next chapter (Chapter 4). 67  Chapter 4: Leveraging Pupil Measures for Understanding Users’ Cognitive Load During Visualization Processing with Highlighting Interventions Preface  ̶ In this chapter10, we describe a preliminary investigation of pupil dilation measurements collected from the intervention study in the previous chapter, to better understand user visualization processing. In particular, we address the question of how to adapt by looking at how a selection of pupil dilation measurements are affected when applying highlighting interventions designed to aid visualization processing of bar graphs. We provide preliminary evidence that monitoring pupil size as an estimate of cognitive load could be beneficial towards designing, testing, or validating highlighting interventions, since indications of high cognitive load could be used to filter out unsuitable interventions (as opposed to relying on task performance).                                                   10 The content of this chapter was published as [155]: Toker, Lallé, and Conati. (2017) Leveraging Pupil Dilation Measures for Understanding Users' Cognitive Load During Visualization Processing. Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP ‘17). 68  4.1 Introduction The primary purpose of information visualizations is to assist users in exploring, managing, and understanding data. 
To date, most visualizations follow a one-size-fits-all approach and do not take into account user differences. Several studies have shown, however, that individual differences such as perceptual speed and verbal/visual working memory can significantly impact performance and preferences during visualization processing [32,42,157,167]. The long-term goal of our research is to design user-adaptive visualizations that can support users based on their individual needs. As a first step toward this goal, Toker et al. [154,158] analyzed users' gaze behaviour during visualization processing using eye tracking and identified several significant differences in attention patterns. In this chapter, we extend this work by looking at pupil dilation measures. In particular, we analyze users' pupil dilation collected from a study involving visualization tasks with a bar graph and several alternative highlighting interventions designed to aid visualization processing. Results from this study on the effects of a variety of study factors on performance (completion time) were already reported in [32]. Our aim is to combine these results with an analysis of pupil dilation to shed light on how the study factors (e.g., task type, interventions, and a variety of cognitive measures) impact visualization processing in terms of cognitive workload. Thus, in this chapter we present a preliminary analysis focusing on the effect of interventions on two measures of pupil dilation (mean and standard deviation of pupil size). We also outline our pupil dilation calibration methodology, which examines alternative calibration options when measuring baseline pupil size.

4.2 Related Work

The use of eye tracking in information visualization systems has already been shown to be a strong candidate for predicting characteristics about the user in real time. Both Steichen et al. [148] and Gingerich et al. [60] showed that a large set of aggregate eye-gaze features is a viable source for predicting user differences (e.g., perceptual speed, visual working memory). In addition, Lallé et al. [108] and Toker et al. [160] have shown that including pupil dilation measures along with the set of aggregate eye-gaze features can lead to significantly better predictions of user differences (e.g., confusion, skill acquisition).

Eye tracking has also been used to identify and understand differences in terms of how the visualization is processed by the user. Toker et al. [154,158] found several differences in visualization processing based on users' cognitive traits. For instance, users with low perceptual speed generated significantly more fixations and transitioned more often to the legend component of the visualization when compared to users with high perceptual speed. A similar result was found linking a user's visual working memory to the task answer input component of the visualization (e.g., radio buttons). These findings are important instances of how eye tracking can be leveraged for designing adaptive support, since they identify the specific elements where users are having difficulty. Our aim is to extend this work with a similar analysis using pupil dilation data, because it has been reliably shown that pupil dilation relates to changes in cognitive load [12,78].

Other research has also investigated pupil dilation within the context of user-adaptive systems. For instance, Iqbal et al. [84] evaluated cognitive workload during route planning and document editing tasks in order to identify opportune moments for interrupting the user.
Prendinger et al. [141] monitor pupil dilation in order to predict user preferences when 70  confronted with a choice of visually presented objects. Martínez-Gómez & Aizawa [119] tracked pupil dilation measures in order to infer reading comprehension, which can be used to model individual users' topic familiarity. In this chapter, we examine pupil dilation in the context of information visualizations, to inform the design of adaptive interventions based on differences in cognitive workload.  Figure 4.1: Sample bar graph visualization and task administered in the user study. 4.3 User Study The dataset used in this chapter comes from a study that investigated the effectiveness of four highlighting interventions designed to help the processing of bar graphs, as well as how this effectiveness is impacted by both task complexity and different user traits. The long-term goal of this study was to understand if/which of these interventions would be suitable for providing adaptive support under specific circumstances, although in the study they were not presented adaptively. The study was a single session, within-subjects design, lasting at most 90 minutes. 62 participants performed 80 tasks using bar graphs (Figure 4.1) with a fully automated interface while their gaze was tracked via a Tobii T120 eye tracker. Users performed two different types of tasks (40 of each), chosen from a standard set of primitive data analysis tasks 71  in Amar et al. [5]. The first task was Retrieve Value, one of the simplest task types in [5], which in the study consisted of retrieving the value for a specific individual in the dataset and comparing it against the group average (e.g., "Is Michael's grade in Chemistry above the class average?"). The second, more complex task type, was Compute Derived Value, which in the study required users to perform a set of comparisons, and then compute an aggregate of the comparison outcomes (e.g., "In how many cities is the movie Shark Swamp above the average revenue and the movie Love Letter below it?"). All tasks involved visualizations with the same number of data points (48) and same number of bar groups (8). Each intervention evaluated in the study (shown in Figure 4.2) was designed to highlight graph bars that were relevant to answer the current question by guiding a user's focus to a specific subset of the visualized data while still retaining the overall context of the data as a whole [54]. The Bolding intervention draws a thickened box around the relevant bars; De-Emphasis fades all non-relevant bars; Average Reference Lines draws a horizontal line from the top of the left-most bar (representing the average) to the last relevant bar; Connected Arrows involves a series of connected arrows pointing downwards to the relevant bars. Each participant performed each of the two task types with each of the 4 interventions as well as No Intervention as a baseline for comparison, in a fully randomized manner. 72   Figure 4.2: The four different highlighting interventions evaluated in the user study. 4.4 Processing Pupil Data As mentioned in the previous section, user gaze during the study was tracked using a Tobii T120 eye tracker. In addition to sampling information on gaze fixations and transitions, the eye tracker also records users' pupil diameter. In order to avoid possible confounds on pupil size due to lighting changes, the study was administered in a windowless room with uniform lighting. 
Because there are typically physiological differences in pupil size among individual users, it is also customary to collect a baseline pupil size from each user that can be used to later normalize the pupil measures.  In most work, the baseline is obtained by measuring a user's rest pupil size, obtained at the beginning of a study under relaxed conditions where there is little or low cognitive load. In our study, we considered two different ways to create these conditions. One, following a standard approach found in the literature, involves having participants stare at a blank screen for several seconds. However, we were concerned about potential issues with luminosity differences between a blank screen and what is shown on the screen during an actual task. Therefore we measured an alternative calibration baseline by displaying a mock bar graph 73  visualization in order to produce similar lighting conditions to a real study task. We also removed the textual elements of the mock graph in order to minimize any added cognitive load. We distinguish between these two calibration measurements as: Blank/Graph11. Additionally, because the study was quite long and intensive (on average 90 min.), all participants were required to take a break halfway into the study. We took this opportunity to calibrate for pupil baselines twice in order to account for possible changes over the course of the study. Calibration measurements were therefore taken at the start of the session and again after the break, which are distinguished by: Start/Break. In terms of calibration methodology, we are interested in knowing how similar/dissimilar the baseline pupil size measurements are in terms of luminosity differences between a blank screen versus a screen with a mock bar graph (Blank/Graph), as well as differences over time during the study (Start/Break). A Pearson correlation of the baseline pupil values for Blank vs. Graph produced an extremely strong correlation that was statistically significant (r = .921, n = 122, p < .001), indicating that these measures are almost identical. A Pearson correlation of the baseline pupil values for Start vs. Break also yielded a strong correlation that was statistically significant (r = .902, n = 122, p < .001), indicating that calibration across time intervals is also very consistent. In light of these findings, we selected the baseline pupil measurement obtained under the Blank/Start calibration condition for adjusting pupil measurements during the first half of the study, and the Blank/Break baseline for adjusting pupil values in the second half of the study.                                                   11 Relative luminance of the Graph calibration screen was calculated to be 16% darker than the Blank calibration screen. 74  4.5 Results Several measures related to pupil dilation have been used in the literature which include: mean pupil size, minimum pupil size, maximum pupil size, standard deviation of pupil size, as well as measures that track the speed and acceleration changes in pupil diameter (see [119] for a summary). For this chapter’s preliminary investigation, we focus on two of these pupil measures for analysis. First we select mean_pupilsize since it is a well-established measurement that appears in almost all work that investigates pupil size. 
Second, we select a somewhat less common measure, std.dev_pupilsize, because previous work looking at gaze-fixation-related measures [154,158] has found significant results relating to standard deviations, which were computed based on gaze fixation angles. We then use the relevant baseline calibrations (see previous section) to normalize the pupil measures of each user by applying the percentage change in pupil size (PCPS) [84], which is defined as:

PCPS = (measured_pupilsize − baseline_pupilsize) / baseline_pupilsize

For both of the pupil measures, mean_pupilsize and std.dev_pupilsize, we run a 5 (Intervention-Type) x 2 (Task-Type) ANOVA with Task-Order as a between-subjects factor. Since we run two models, a Bonferroni correction of 2 is applied and p-values are reported post-correction at the .05 level.

4.5.1 Effects of Intervention-Type

There was a main effect of Intervention-Type on both mean_pupilsize (p < .001, R² = .942) and std.dev_pupilsize (p < .001, R² = .022). Refer to Figure 4.3 and Figure 4.4 for the directionality of these findings.

Figure 4.3: Main effect of Intervention-Type on users' mean pupil size.

Bonferroni-adjusted pairwise comparisons on mean_pupilsize (Figure 4.3) indicate that pupil size was significantly larger with the Average Reference Lines intervention than with all the other interventions. This is interesting because [32] reported, for the same study, that Average Reference Lines was the only intervention that did not significantly improve completion time when compared to tasks where No Intervention was provided. This suggests that the lack of performance improvement from Average Reference Lines could be explained in terms of increased cognitive load due to possible intrusiveness of this graphical object. It is also interesting to note that, whereas in [32] conditions with no interventions had similar performance to conditions with Average Reference Lines, No Intervention has a significantly lower mean_pupilsize than Average Reference Lines. This suggests that slower completion time with no intervention is not the result of increased cognitive load, but rather it may be due to the lack of guidance provided by the more successful interventions.

Figure 4.4: Main effect of Intervention-Type on standard deviation of users' pupil size.

As for std.dev_pupilsize (see Figure 4.4), pairwise comparisons indicate that std.dev_pupilsize during Average Reference Lines is significantly lower than with all other interventions except for Bolding. Given that Average Reference Lines also has the highest mean_pupilsize, the low std.dev_pupilsize tells us that users are likely maintaining a consistently high cognitive load throughout the whole task when they receive this intervention. In contrast, with other interventions there are only selected points with higher values of std.dev_pupilsize. Because in [32] these interventions were associated with improved performance, these higher values may be associated with some notion of productive cognitive load (i.e., greater variability in pupil size is possibly an indicator of useful cognitive activity).

4.6 Conclusions & Future Work

The long-term goal of our work is to build user-adaptive visualizations that can support the user based on their individual needs and states. In this chapter, we examined how pupil dilation measurements can be leveraged to better understand information visualization processing.
77  We started by providing methodology towards controlling for possible confounds that can interfere with measuring rest pupil size in a user study, which is needed to correct for physiological differences between users. We evaluated our methodology by comparing two different calibration methods for obtaining a user’s rest pupil size. First, we compared rest pupil sizes obtained on a blank screen versus a screen displaying a mock visualization since the screen brightness of our study tasks did not match the brightness of a blank calibration screen. We found a very strong significant correlation between both measurements, indicating that differences in rest pupil size between the two screens of differing brightness was consistent across users. Next, we compared rest pupil sizes obtained at the beginning of the study and during the middle of the study because the duration of the study was over an hour long. We also found a very strong significant correlation between both measurements, indicating that little difference exists between the two calibration times. Thus for our study, taking only one measurement of rest pupil size at the beginning of the study would have likely been adequate. Still, other researchers thinking of using pupil measurements in their studies ought to consider using the full set of calibration methods we presented here, in order to see if our findings will hold under other study conditions. Next, we examined the effect that several highlighting interventions had on pupil dilation. In particular, we found that Average Reference Lines was the only intervention for which mean pupil size was significantly larger. Average Reference Lines was also the only intervention that did not improve user performance, suggesting that the lack of improvement in performance is due to the high cognitive load induced by this intervention. We offer two possible implications for user-adaptive visualizations based on this finding. First, monitoring pupil size could be beneficial towards designing, testing, or validating interventions, since instances of high 78  cognitive load alone could be used to filter out unsuitable interventions (as opposed to relying on task performance). Second, pupil size could be leveraged as a real-time indicator of cognitive load to detect if users are having difficulty12. Adaptations could then be triggered to support instances of high cognitive load during visualization processing. In fact, similar approaches have already been used in other areas of HCI, where cognitive load is tracked to determine suitable times to interrupt the user (e.g., [84]). Lastly, more work will be needed to see if our findings will transfer to other visualizations, tasks, or interventions. Our hope is that members of the user modeling community interested in using pupil dilation methods in their research can help further corroborate our results.                                                   12 While it is true that higher cognitive load can indicate difficulty/challenge it can also indicate increased interest/engagement. However, we argue that it is more likely the former because our results showed that users only had significantly higher pupil-related measures (our indicator of cognitive load) when working with the Average Reference Lines intervention, and unlike the other interventions we administered, Average Reference Lines yielded no improvement in task performance (i.e., time on task). Nevertheless, further studies are still needed to better explain this finding. 
79  Chapter 5: Predicting Skill Acquisition from Eye Tracking Data During Visualization Processing Preface   ̶ Using data collected from the intervention study in Chapter 2, we investigate in this chapter13 if using a variety of behavioral measures collectible with an eye tracker can predict a user’s skill acquisition phase while performing visualization tasks with bar charts. We first address the question of what to adapt to by offering evidence that a user’s level of Evolving Skill with a visualization (referred to as Skill Acquisition in the rest of this chapter) has a significant impact on task performance, even during the usage of simple information visualizations. We then address the question of when to adapt by providing evidence that machine learning models trained on data collected from an eye tracker, can identify users’ Skill Acquisition state during tasks with bar charts. 5.1 Introduction There is increasing evidence that users’ abilities, personality, and preferences influence their performance and satisfaction during information visualization tasks, e.g., [32,38,39,70]. These findings have prompted researchers to investigate user-adaptive information visualizations,                                                   13 The content of this chapter was published as [160]: Toker, Lallé, and Conati. (2017) Pupillometry and Head Distance to the Screen to Predict Skill Acquisition During Information Visualization Tasks. Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI ‘17). 80  i.e., visualizations that recognize and adapt to each user’s specific needs. For instance, work has been done on predicting various human factors for adaptation such as: cognitive measures including perceptual speed, visual working memory, and verbal working memory [32,149]; user knowledge of the content to be visualized [28]; user task performance [63], and user confusion with the visualization interface [108]. This chapter focuses on the long-term goal of devising visualizations that provide personalized support to ease a user’s learning curve by supporting the transition from unskilled to being skilled at working with visualization-based tasks that are unfamiliar to the user. In order to achieve this goal, in this chapter we discuss how to track users as they acquire the set of skills necessary to efficiently perform a new activity, i.e., processing and performing tasks with a target visualization in our specific case. We model skill acquisition based on the presence of a learning curve which is a standard concept in cognitive psychology used to represent the relationship between practice and the associated changes in behavior [147] (i.e., changes in skill, expertise, speed). While learning curves have been extensively investigated to study and adapt to skill acquisition in educational settings (e.g., [10,110]), their usage for personalization in HCI and visualization has so far been limited. Still, detecting and adapting to skill acquisition is important because customized support could be offered to users if it is inferred that they are in a state of skill acquisition when working with a system, in order to improve both their short-term task performance as well as their acquisition of proficiency. For example, support could be offered by preventing access to more advanced interface features for novice users until the necessary skills are acquired, or specific functionalities that novice users might otherwise overlook could be highlighted [29,81]. 
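Since skill acquisition is modeled here via a learning curve, the following is a minimal sketch of fitting the classic power law of practice to per-trial completion times. The data are simulated, and the simple way the sketch flags where the fitted curve flattens is only an illustration; it is not the criterion used in this thesis to define the during- vs. after-learning stages.

```python
# Minimal sketch: fitting a power-law learning curve T(n) = a * n**(-b) to
# per-trial completion times. Data are simulated; this is not the procedure
# used in the thesis to define the during/after-learning split.
import numpy as np
from scipy.optimize import curve_fit

def power_law(trial, a, b):
    return a * trial ** (-b)

rng = np.random.default_rng(2)
trials = np.arange(1, 81)                                # 80 tasks per user
times = power_law(trials, a=30.0, b=0.25) + rng.normal(0, 1.5, trials.size)

(a_hat, b_hat), _ = curve_fit(power_law, trials, times, p0=(20.0, 0.1))
print(f"fitted curve: T(n) = {a_hat:.1f} * n^(-{b_hat:.2f})")

# One simple (illustrative) way to mark where the fitted curve flattens out:
slope = np.abs(np.diff(power_law(trials, a_hat, b_hat)))
plateau_trial = int(np.argmax(slope < 0.05)) + 2  # first trial where the drop is small
print("curve flattens around trial", plateau_trial)
```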
In the context of information visualization research, Toker et al. [162]14 previously showed the presence of a learning curve in a study where users performed basic visualization tasks with bar graphs. [Footnote 14: The publication [162] from IUI’14 constitutes work carried out during my PhD; however, it is not included as a thesis chapter since the work presented in this chapter (Chapter 5) overrides the work in [162]. Details about this are provided in Section 1.5 of the thesis Introduction.] That work was the first to explore the feasibility of detecting skill acquisition in real time from gaze data collected with a non-intrusive eye tracker. Skill acquisition was modeled as two stages: during skill learning vs. after skill learning. In their work, Toker et al. [162] reported a gaze-based prediction model capable of beating a 50% baseline for a binary prediction over these two states during any given study task. In this chapter, we build upon and extend the work in [162] by investigating the benefit of using two additional measurements of user behavior detectable by an eye tracker during visualization processing: pupil dilation and distance of the head to the screen (head distance, for short). We make the following hypothesis: Adding features related to a user’s pupil dilation and head distance during visualization processing will improve prediction accuracy of a user’s skill acquisition stage (i.e., during vs. after learning) as compared to solely relying on gaze data features. Our hypothesis is based on the fact that these two data sources have been shown to be potential predictors of user states related to learning during interaction with educational software. For instance, pupil dilation has been consistently linked to cognitive load (e.g., [12,78,83]), which in turn has been shown to impact how much users can learn from e-learning environments [92]. Furthermore, [111] showed that pupil dilation can be used to detect improvement in performance over time with a visual tool for decision making. Head distance can be seen as an indicator of body postures (i.e., leaning toward or away from the screen) that have been linked both to engagement and boredom [44,85] and to how well users learn with educational systems [8]. Furthermore, [89] has shown that head distance can predict boredom during student interaction with a computer-based tutor for biology. Here we leverage information about a user’s head distance and pupil dilation for predicting two different learning states with a visualization system. The main contribution of this chapter is showing that our hypothesis holds. Using the existing dataset collected from the study in [162], we show that adding pupil and head distance information to previously evaluated gaze features can significantly improve binary prediction accuracy of users’ skill acquisition state by as much as 5% in terms of peak accuracy, compared to solely using eye gaze features. A second contribution relates to the feasibility of a simpler content-independent model that can predict skill acquisition when information regarding the layout of the visualization is unknown or too challenging to model, making it impossible to track the many gaze features that are specific to the visualization.
We show that a model using only pupil dilation and head distance features (which do not require knowledge of the visualization layout) is still capable of reaching predictive accuracies of 60% in 13 seconds (a bit more than halfway through the duration of a single task), outperforming a majority class baseline. Making predictions using solely content-independent features in this way provides evidence toward the potential generalizability of our approach to other types of visualizations.  In the rest of the chapter, we first discuss related work. Next, we describe the visualization and dataset utilized. We then show the presence of a learning curve and how the binary skill 83  acquisition states are defined, similar to [162]. After that, we summarize the new approaches we use to build our predictive models. We conclude with results and a discussion of main findings and work to come. 5.2 Related Work A typical method used in cognitive psychology for modeling how user performance improves with practice is by using a learning curve [147]. Learning curves are also frequently used in HCI for off-line comparison and evaluation of information visualization systems, (e.g., [137,144,175,176]). In contrast, we leverage the concept of a learning curve for building predictive models that can identify in real-time during task interaction two broad stages of a user’s skill acquisition while working with an information visualization system.  Similar work has been extensively conducted in the field of Intelligent Tutoring Systems (ITS), using learning curves to track and adapt to a student’s evolving domain knowledge (as opposed to level of skill in using the system itself) while working with educational software. Hidden Markov Models [9] or logistic regressions [138] were used to infer students’ mastery in a variety of domain skills (e.g., performing one and two digit subtraction for a math tutor) based on students’ past performance and interaction logs [9], or based on speech output [13]. In visualization research, Item Response Theory has been used to assess a user’s visualization literacy, i.e., the user’s skill in using visualizations to handle information in an effective/efficient manner [27]. In contrast, we use eye tracking data, namely gaze movements, pupil dilation, and head distance to the screen, to dynamically detect a user’s evolving proficiency in working with a visualization, in terms of two overall skill acquisition phases (during learning and after learning). 84  Eye gaze data has been extensively used to detect different kinds of user’s states during interaction with an ITS, such as boredom, curiosity, disengagement [46,89], mind-wandering [21], as well as domain learning [24,98]. In addition, [14] used gaze data to predict users’ problem-solving strategies as well as user performance while solving a visual puzzle. In visualization research, gaze data has previously been used to carry out off-line analysis to understand how users with different expertise or abilities process visualizations. For instance, offline analysis of gaze data was used to explain why performance differences occurred between users while working with bar and radar graph visualizations (e.g., users were having difficulty processing the visualization’s legend)[158]. Offline analysis was also used to understand processing differences with highlighting interventions provided on bar graphs [154], and to understand how users with different domain expertise processed visualizations (e.g., [37,131]). 
Gaze data has also been used online to predict users’ problem-solving strategies performance while solving a visual puzzle [14]. In visualization research, online analysis of gaze data has also been investigated to predict in real time long-term user’s cognitive abilities/traits (e.g., perceptual speed, visual working memory, verbal working memory, locus of control), as well as task type, task completion time, and user confusion [60,108,149]. Pupil dilation has been investigated as a source of information for user-adaptive systems because it has been shown to relate to changes in cognitive load (e.g., [12,78,83]). Iqbal et al. [84] evaluated cognitive workload via pupillary measures during route planning and document editing tasks in order to identify opportune moments for interrupting the user without causing excessive interference with their primary tasks. Prendinger et al. [141] monitored pupil dilation in order to predict user preferences when confronted with a choice of objects presented on the screen. Martinez-Gomez & Aizawa [119] tracked pupil dilation to infer a user’s reading 85  comprehension, and consequent topic familiarity. Lallé et al. [108] showed that including pupil dilation measures, in addition to eye gaze measures, improved the capability of predicting user confusion with an interactive visualization to support decision making. Head distance and body postures have been identified as reliable indicators of users’ affective state. For instance, D’Mello et al., [44,86] found that leaning backward, as tracked by a posture chair fitted with multiple sensors sensitive to pressure, can be a good predictor of boredom or disinterest in an educational context. Jaques [89] found similar results with a simpler indicator of posture, namely the viewing distance of the user’s head from the screen, measured by a Tobii T60 eye tracker. Specifically, the results in [89] show that a model solely based on head distance significantly outperforms a majority baseline to predict boredom during interaction with an ITS for biology, and confirmed that a greater head distance was correlated to feeling bored. Since boredom has been related to learning [8], in this chapter we investigate whether head distance can also be used as a useful predictor of users’ skill acquisition state while working with an information visualization. An alternative approach to predict skill acquisition during visualization tasks is described in [111], which requires gathering information over multiple interface usages for each user. A learning curve was fit for each individual participant using a power law function, which captures the user’s initial level of expertise with a given visualization, as well as the rate of learning with the visualization. However, due to how the learning curves are modeled, the approach in [111] requires access to the history of a given user in terms of past exposition to the visualization, as predictions were made across a series of consecutive tasks completed by each user. Therefore, adaptive support could only be provided for subsequent tasks since the prediction of users’ learning curves are made either at the end of or between tasks. In this 86  chapter, we adopt a within-task oriented approach where user skill is predicted during the task. Specifically, information is collected from the beginning of a task without looking at previous performance data from earlier tasks (if they even exist). 
This approach is thus more suitable in situations where users interact with a visualization system only once, or when the user history in terms of the amount of practice with a visualization is not available. For instance, these conditions may occur with public kiosks or web-based visual tools, which are typically designed for broad general audiences. Our approach can also allow for the swift delivery of adaptive support to users since predictions are possible after only a few seconds of observed interaction with a task. 5.3 Dataset, Features, & Labels In this chapter, we employ an existing corpus of data generated from a prior study. We leverage the data from this study in order to investigate users’ skill acquisition while they perform a series of 80 basic visualization tasks using bar graphs. The dataset consists of task performance and eye tracking data for 62 participants. Over the course of 90 minutes, each participant completed 80 randomized tasks, covering several combinations of task type and experimental conditions (Figure 5.1 shows an example task used in the study). The study tasks involved comparing individuals against a group average (data points in the bar graph) on a set of dimensions (data series in the bar graph). For variety, the task questions were drawn from four different domains. All tasks involved the same number of data points (six, including the average) and data series (eight). There were two task types, chosen from a set of primitive data analysis tasks that [5] identifies as “largely capturing people’s activities while employing information visualization”. The first task type was Retrieve Value (a relatively simple task), which consisted of retrieving a specific individual in the target domain and comparing it against the group average (e.g., “Is Christopher’s grade in English below the class average for that course?”). The second task type was Compute Derived Value (a more complex task type), which required users to first perform a set of comparisons, and then compute an aggregate of the comparison outcomes (e.g., “In how many cities is the movie The Lost Explorer above the average revenue and the movie An Unfinished Life below it?”). User gaze was tracked with a Tobii T120 eye tracker, which served as the study’s main display. Baseline pupil width was collected from each participant at the beginning of the study, with lighting conditions strictly controlled and remaining constant during the study. For a complete description of the study see [32]. Figure 5.1: Sample bar graph visualization and task administered to users during the study. 5.3.1 Eye Tracking Feature Sets Here we describe the three different feature sets generated from eye tracking data. All participants were required to have a visual acuity of 20/20, either uncorrected or corrected with glasses.
1. GAZE Features (86 total):
a) AOI-Independent Features (14 total):
- Sum, Mean, & Stddev of fixation durations
- Sum, Mean, & Stddev of saccade distance
- Sum, Mean, & Stddev of relative saccade angles
- Sum, Mean, & Stddev of absolute saccade angles
- Fixation rate
- Count of fixations
b) AOI-Specific Features (72 total):
- Fixation rate on AOI
- Longest fixation on AOI
- Time of first & last fixation on AOI
- Sum of fixation durations on AOI
- Count of fixations on AOI
- Count of transitions from this AOI to each AOI
2. PUPIL Features (10 total):
- Mean, Stddev, Min, & Max of pupil width
- Mean, Stddev, Min, & Max of pupil dilation velocity
- Pupil width at the first & last fixation in a given task
3. HEAD DISTANCE Features (6 total):
- Mean, Stddev, Min, & Max of head distance to screen
- Head distance at the first & last fixation in a given task
Table 5.1: Set of features generated using the Tobii T-120 eye tracking setup and EMDAT processing. Gaze Features. Raw gaze data consists of fixations (points of gaze on the screen) and saccades (quick movements between fixations). Raw gaze data is collected from the Tobii T120 eye tracker using the ClearView fixation filter, and is then processed with EMDAT (www.github.com/ATUAV/EMDAT) to generate a battery of aggregate gaze-based features. Some of these features capture overall gaze activity on the screen (see 1a. in Table 5.1) while others do so for specific Areas of Interest (AOI) in the visualization (see 1b. in Table 5.1). Six areas of interest corresponding to various conceptually distinct regions of the visualization layout are utilized (see Figure 5.2). Figure 5.2: Areas of Interest (AOI) defined over the interface. Pupil Dilation Features. The Tobii T120 eye tracker records the user’s pupil diameter (the horizontal width of each pupil) at each sample (120 Hz). Similar to gaze, we used EMDAT to compute a variety of features that describe the pupil diameter over the span of a task, for a total of 10 features (see 2. in Table 5.1). The features mean, stddev, min, and max pupil width are included since other work has used these measures to capture the range of a user’s cognitive load during tasks [119]. Additionally, we include the start and end pupil width, because research has shown that there can be local peaks and troughs of users’ cognitive load at boundaries between sub-tasks [84]. As for pupil velocity, we also generated the mean, stddev, min, and max. Previous work has used pupil velocity to infer users’ search intentions in video retrieval tasks [169], as well as reading comprehension [119]. To account for potential physiological differences in pupil size among individual users, measured pupil dilation values for each user are adjusted with respect to their baseline using the percentage change in pupil size (PCPS), reported in µm, which [84] defines as: PCPS = (measured pupil size − baseline pupil size) / baseline pupil size. Pupil dilation features are generated without any knowledge of the visualization layout, and are thus considered content-independent. While including other more complex features such as the Index of Cognitive Activity [117] and maximum pupil power [20] may lead to even better prediction results, we investigate only basic standard pupil features given that our main goal is to determine the general usefulness of including pupil features for predicting users’ skill acquisition phase. Head Distance to Screen Features. The Tobii T120 eye tracker measures head distance by recording the viewing distance from both the user’s eyes to the screen at each sample (120 Hz). In order to estimate head distance to the screen, EMDAT averages the viewing distance of the left and right eye, measured in cm. As with pupil width measures, we used EMDAT to compute a similar set of features that describe user head distance to the screen over the span of each task (see 3. in Table 5.1). Since head distance features are computed independent of the visualization layout, they are also considered content-independent.
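To illustrate how content-independent features of this kind could be computed, the sketch below derives PCPS-adjusted pupil features and head distance features from raw per-sample values for a single task. It is not EMDAT itself; the data layout, function names, and the use of the first/last sample (rather than the first/last fixation) are simplifying assumptions on our part.

    import numpy as np

    def pcps(pupil_samples, baseline_pupil_size):
        # Percentage change in pupil size (PCPS) relative to the user's rest pupil size.
        samples = np.asarray(pupil_samples, dtype=float)
        return (samples - baseline_pupil_size) / baseline_pupil_size

    def content_independent_features(pupil_samples, head_samples, baseline_pupil_size, hz=120):
        # Sketch of pupil and head-distance features for one task, in the spirit of
        # items 2 and 3 of Table 5.1. head_samples are eye-to-screen distances in cm,
        # already averaged over the left and right eye; hz is the assumed sampling rate.
        p = pcps(pupil_samples, baseline_pupil_size)
        v = np.diff(p) * hz          # pupil dilation velocity (change per second)
        h = np.asarray(head_samples, dtype=float)
        return {
            "pupil_mean": p.mean(), "pupil_stddev": p.std(),
            "pupil_min": p.min(), "pupil_max": p.max(),
            "pupil_vel_mean": v.mean(), "pupil_vel_stddev": v.std(),
            "pupil_vel_min": v.min(), "pupil_vel_max": v.max(),
            "pupil_start": p[0], "pupil_end": p[-1],
            "head_mean": h.mean(), "head_stddev": h.std(),
            "head_min": h.min(), "head_max": h.max(),
            "head_start": h[0], "head_end": h[-1],
        }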
5.3.2 Labeling Skill Acquisition A previous analysis of this dataset detected the presence of a learning curve [162] shown in Figure 5.3, where the average task completion time across all users is plotted over the 80 study tasks in ascending order of completion. For the first 40 trials task performance continues to improve while users become more practiced as they perform additional tasks (left of blue dashed line in Figure 5.3). For the successive 40 trials (right of blue dashed line in Figure 5.3), 91  performance stabilizes as indicated by both reduced variance across trials and a lower bound on performance (dotted green horizontal line). Therefore, the first 40 trials that each user performs are labeled as during skill acquisition and the last 40 trials as after skill acquisition.  Figure 5.3: Improvement in average trial completion time across the 80 tasks in the dataset (randomly administered for each user). The blue line separates trials into two general stages of skill acquisition:   during - skill with the visualization is in the state of being acquired since performance is still improving; and after - skill with the visualization has been acquired since performance change has stabilized. 5.4 Machine Learning Setup The aim of this work is to use eye tracking data as input in order to predict the correct skill acquisition label (i.e., during vs. after skill acquisition) on any given trial without knowing which trial a user is currently doing. In order to simulate real-time predictions of a user’s skill level while engaged with a given task, we generate features over consecutively increasing time slices corresponding to partial observations of eye tracking data during a task. These time 92  slices range from 2 to 20 seconds (20 seconds is the mean time to complete a task), over 1 second intervals for each task. For example, features generated at a 6 second time slice would model the real-world scenario where an adaptive visualization has observed only the first 6 seconds of a user’s behavior from the beginning of the current task. At each of the 19 time-slices, we evaluate 5 different feature set combinations derived from eye tracking data (i.e., Gaze, Pupil, HeadDistance). We also include a baseline model, for a total of 6 models executed at each time slice.  To predict users’ skill acquisition phase, we built five different binary machine learning classifiers using the Caret machine learning package in R [104], and reported classifier performance as predictive accuracy, i.e., the total number of correct predictions divided by the total number of correct and incorrect predictions. First we tried linear regression, since it has been used previously for making predictions using similar data (e.g., [149]). Next, we tried four standard machine learning algorithms (Naive Bayes, SVM, Neural net, and Random forest), to see if it was possible to achieve better performance given that this chapter includes additional types of attributes (pupil & head distance) compared to previous work. Overall we saw better predictive accuracy from the Random Forest algorithm (which was also found to be the case for data collected from a different study reported in [60]), and we thus opted to report results for Random Forest only. In order to simulate real-world settings where data regarding a new user is unknown, classifiers were evaluated using 10-fold cross validation over users (i.e., at each fold of the cross validation, users in the test set do not appear in the training set). 
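As a rough illustration of this evaluation scheme (the repeated runs and the adjusted baseline are described next), the sketch below shows how a single time slice of eye tracking features could be classified with a random forest under user-grouped 10-fold cross-validation. The tabular data layout and the scikit-learn implementation are our assumptions; this is not the original R/Caret code.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold

    def label_skill_phase(trial_order):
        # First 40 trials a user performs are labeled 'during' skill acquisition,
        # the remaining 40 'after' (see Section 5.3.2).
        return "during" if trial_order < 40 else "after"

    def evaluate_time_slice(X, y, user_ids, n_splits=10):
        # X: one row per (user, task) with features computed from only the first
        # t seconds of that task; y: skill acquisition labels; user_ids: used so
        # that users in a test fold never appear in the corresponding training fold.
        accuracies = []
        for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=user_ids):
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            clf.fit(X[train_idx], y[train_idx])
            accuracies.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
        return float(np.mean(accuracies))

    # This would be repeated for every time slice t = 2, 3, ..., 20 seconds, each
    # time dropping tasks that were already completed before t.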
Then, we repeat this process 5 times (runs) to strengthen the stability and reproducibility of the results, and the performance of each algorithm is averaged over the 10 folds and the 5 runs. 93  5.4.1 Model Baseline Since classifications are done using consecutively increasing partial observations of eye tracking data within a given task (e.g., 2s, 3s, 4s, ... up to 20s), cases arise where some users complete the task in under 20 seconds, resulting in time slices in which several users are already done with the task. To generate a rigorous baseline, we remove such users from our dataset at those time slices before classifying each new time slice within a task. Retaining these users may bias our eye tracking features since several of them are correlated with time (e.g., sum fixation durations). Thus the majority class baseline is recalculated accordingly as time elapses within a task. In our dataset, not surprisingly, users who finish earlier within a given trial are more likely to be skilled users (i.e., users in the after skill acquisition state), which results in a rising proportion of unskilled users (i.e., users in the during skill acquisition state) as time lapses over a given trial. The dashed red line in Figure 5.4 shows indeed that this strict baseline becomes more weighted as time unfolds, with a starting baseline accuracy of 51% which rises over time to 64% baseline accuracy. 5.5 Results We first compare the performance accuracies of the various combinations of Gaze, Pupil and HeadDistance models, with the specific goal of ascertaining the added predictive value when including the Pupil and HeadDistance features along with Gaze. We then report the most predictive features of the best performing model and discuss how these features relate to skill acquisition in terms of directionality of the underlying features themselves. 94   Figure 5.4: Predictive accuracy across time slices based on feature set combination. GAZE is shown using a dotted blue line and corresponds to the best model previously published in [162]. 5.5.1 Predicting Skill Acquisition Figure 5.4 reports the accuracy over consecutive time slices (i.e., over the 19 time windows of increasing length described earlier) of the 5 combinations of tested feature sets, as well as the accuracy over time of the baseline (dashed red line). Note that the model that previously obtained the highest accuracy in [162] (i.e., Gaze) is represented by dotted blue line15. The trends shown in Figure 5.4 provide an initial assessment of how much interaction data a real-time classifier of skill acquisition would need in order to generate reliable predictions. Ultimately, depending on how early within the task adaptive support is required, Figure 5.4                                                   15 Note that previous work in [162] did not perform user-independent prediction, explaining the slightly higher accuracies reported in [162]. 95  illustrates the tradeoff in accuracy when predicting skill acquisition early on versus delaying the prediction as time elapses. To formally compare the accuracies of the 6 classifiers (i.e., 5 feature set combinations + 1 baseline), we run a linear mixed-effects model [56] with feature set (6 levels) and time-slice (19 levels) as the two independent variables, and predictive accuracy as the dependent measure. 
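The sketch below indicates how such a comparison could be set up in Python with statsmodels; the analysis itself was not necessarily run this way. It assumes a long-format table with one accuracy value per feature set, time slice, and cross-validation fold, and treating the fold as the random grouping factor is our assumption, since the exact random-effects structure is not detailed here.

    # Illustrative sketch of the over-time accuracy comparison, assuming a
    # DataFrame 'results' with columns: accuracy, feature_set, time_slice, fold.
    import statsmodels.formula.api as smf

    def compare_feature_sets(results):
        # Fixed effects: feature set (6 levels) and time slice (19 levels);
        # dependent measure: predictive accuracy; random intercept per fold (assumed).
        model = smf.mixedlm("accuracy ~ C(feature_set) + C(time_slice)",
                            data=results, groups=results["fold"])
        return model.fit()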
The analysis revealed a significant main effect of feature set on classification accuracy (F(5, 470) = 136.59, p < .001), which indicates that overall significant differences exist between feature sets regardless of the amount of eye tracking data available for classification as a task unfolds. Follow-up Bonferroni-adjusted pairwise comparisons of the over-time accuracy for each of the 6 levels, shown in Table 5.2, revealed that:
- The HeadDistance+Pupil+Gaze model is better than all other models.
- Neither the Pupil model nor the HeadDistance model beats the baseline.
- The Gaze model and the HeadDistance+Pupil model beat the baseline, but are not significantly different from each other.
We can see in Table 5.2 that all models that either include gaze features or combine pupil and head distance features together outperform the baseline. The Gaze-only model (investigated in [162]) outperforms Pupil only and HeadDistance only, but it is in turn outperformed by HeadDistance+Pupil+Gaze, indicating that combining all three feature sets significantly improves prediction accuracy of a user’s skill acquisition.
Models | Average over-time accuracy
HeadDistance+Pupil+Gaze | 62.5%
Gaze | 59.7%
HeadDistance+Pupil | 59.1%
HeadDistance | 56.5%
Pupil | 56.3%
Baseline | 55.9%
Table 5.2: Effect of feature set combination on overall model performance averaged across all time-slices. Rows are arranged in descending order of classifier accuracy. Dashed lines separate models that are not statistically different from one another. In terms of peak accuracies, Figure 5.4 shows the best accuracy for the Gaze-only model at 62.5%, whereas HeadDistance+Pupil+Gaze has a peak accuracy of 67%. Additionally, in terms of early prediction capabilities, HeadDistance+Pupil+Gaze achieves 57.5% after having seen only 2 seconds of a user interacting with the system (from a 51% baseline) and gets to 64% halfway through the duration of the interaction. Even though Figure 5.4 shows that the Gaze-only model also performs relatively well during the first 10s, pairwise comparisons of the over-time accuracy for only the first 10s indicate that HeadDistance+Pupil+Gaze is in fact still significantly better than Gaze only (p < .001) for early prediction. Interestingly, the upward trend of the Gaze model ceases after 10 seconds. Although we don’t have a clear explanation for this finding, it is worth considering that many Gaze features are sensitive to accumulation (e.g., sum, count, total time spent, etc.), and thus might become less informative as time elapses. Also worth noting is the fact that the model combining HeadDistance+Pupil still beats the baseline, with an over-time accuracy of 59.1%, which is only 3% behind the model using HeadDistance+Pupil+Gaze. This result is important because it illustrates the potential of utilizing only a few eye tracker features (in this case 16 features, as opposed to 102 features when including Gaze information), and leaner feature sets are generally known to be less likely to overfit unseen data [3]. Furthermore, HeadDistance and Pupil do not require knowledge of what is displayed on the screen, namely, they are content-independent. Although HeadDistance+Pupil reaches 60% accuracy in about 13 sec. (around two thirds of the interaction), Figure 5.4 shows that the accuracy of this model is not as good as Gaze during the first 12 seconds, and increases considerably afterward. Interestingly, HeadDistance+Pupil exhibits similar accuracy to the best model in the first 2 seconds of the task.
Although we can’t clearly explain these findings, investigation of the most important features (see next Subsection) can provide more details about these trends. Overall, from a practical point of view, these results suggest that content-independent features alone (pupil and head distance) are promising toward generalization, but may require slightly delaying the adaptation or customization offered to users. In terms of the added value of the HeadDistance and Pupil feature sets as predictive sources, the fact that the combined HeadDistance+Pupil significantly outperforms either Pupil only or HeadDistance only indicates that these two feature sets do not capture overlapping information, and thus both feature sets ought to be utilized if possible. 5.5.2 Most Predictive Features We report the top features from the best performing classifier identified in the previous subsection, namely HeadDistance+Pupil+Gaze. The purpose of reporting the features with the highest impact on classification accuracy is to shed light on which specific features within Gaze, Pupil, and HeadDistance contribute to the model and thus to what extent these features may relate to skill acquisition. Once trained, the random forest algorithm we used provides importance scores based on how much each feature contributes to making successful predictions. Since classifiers are constructed at each time-slice from 2s to 20s, we determine the features with the highest importance by averaging their scores across all time slices. The resulting averages are normalized so that the most important feature has a score of 100. Features with the 10 highest scores are shown in Table 5.3. Next, to gain insight into the underlying directionality of the features, we compute a difference in values of each feature between the two states of skill acquisition (last column in Table 5.3), by subtracting a feature’s mean value for all after tasks from the mean value of during tasks. For instance, since the difference in mean values for starting head distance is negative, -17.06 cm, it indicates that the values for this feature are typically lower in the during state of skill acquisition (i.e., closer to the screen).
Feature | Importance | Unit | during − after*
pupil_width · max | 100 | µm | 27.84
head_distance · start | 91 | cm | -17.06
pupil_width · mean | 84 | µm | 32.02
question AOI · duration | 82 | ms | 51.32
head_distance · min | 79 | cm | -17.17
pupil_width · start | 64 | µm | 29.14
pupil_width · end | 59 | µm | 29.44
pupil_velocity · max | 56 | µm/ms | 27.02
question AOI · longest fix. | 52 | ms | 97.17
pupil_velocity · stddev | 51 | µm/ms | 111.2
Table 5.3: Top 10 most predictive features across all time slices, along with directionality of the feature. *Negative values indicate the feature is lower during skill acquisition. Head Distance: As Table 5.3 shows, two head distance features are in the top 10 (start and min), with head distance at the start of a task being the more important feature. Start head distance captures how close a user is to the screen at the very beginning of the task. In terms of directionality, starting head distance to the screen is closer during skill acquisition, meaning that users lean in more at the beginning of the task while they are still learning the system. It is worth mentioning that the value for the starting head distance feature does not change as a task unfolds.
Thus it makes sense that if starting head distance is the second most predictive feature, then it would offer similar predictive value whether 2 seconds or 20 seconds have elapsed in the task. This is a very promising feature in terms of early predictions because it can be obtained at the very beginning of a task with little knowledge about the user. As users become more accustomed to the study tasks/visualization in the latter half of the study, they lean back more at the beginning of each task and are likely more relaxed and confident with the system. Minimum head distance is the next most important head distance feature. In particular, this feature tracks the closest recorded head distance to the screen as a task unfolds. Unlike starting distance, this value could change during the course of a task (e.g., a user may lean in close partway through the task as opposed to at the start). Thus, minimum head distance likely captures engagement in the same way as starting distance, but this measure is sensitive to engagement/difficulty that occurs at later moments in the task. Pupil Dilation: Six pupil dilation features are among the top 10 in Table 5.3: max, mean, start, and end pupil width, along with max and stddev pupil velocity. Max pupil width is also the most important feature overall. All of these pupil features are larger during skill acquisition. Larger pupil width [12,78,83] and faster repeated changes in pupil dilation [117] have reliably been linked to higher cognitive load. Thus these results suggest that cognitive load was both greater and less consistent during skill acquisition. Conversely, for after tasks, users required less cognitive load (and consistently so) once the necessary skills to work with the tasks/visualization were obtained. Similar to starting head distance, the starting pupil width is obtained at the very beginning of a task, and thus is promising in terms of early predictions as well. Gaze: Two AOI (Area of Interest) features are among the 10 most important in Table 5.3, and both relate to the question AOI, which covers the region of the visualization where the study tasks were displayed to the user (see Figure 5.2). These two features track the total fixation duration (sum_fixation_durations) and the duration of the longest fixation in the question AOI. The directionality indicates that users spent more time fixated within the question AOI of the visualization during skill acquisition, and also had larger maximum fixation durations. This finding indicates that features relating to the question AOI (as opposed to the other AOIs) are most useful for predicting skill acquisition, likely because, as time passes, users become more familiar and proficient with how the task questions are posed and structured, and thus come to need less time to read/process them. 5.6 Discussion and Conclusions In this chapter, we presented work on classifying skill acquisition using various eye tracking data sources, with the long-term goal of using this research to design user-adaptive visualizations that can personalize the interaction to a user’s current skill state. Specifically, we investigated if and how using added feature sets based on pupil dilation and head distance to the screen can improve prediction of skill acquisition compared to solely using gaze movements, as was done in [162]. We show that using the pupil dilation and head distance feature sets together beats the baseline, which neither feature set achieves on its own.
Furthermore, combining pupil, head distance, and gaze features not only performs significantly better than using gaze only, but also achieve accuracies that are promising toward guiding real-time interventions. This better performing classifier achieves an over-time accuracy of 62.5% on unseen users, compared to 59.7% using solely gaze behavior. Even after seeing only 10 seconds of observed data, this classifier can predict a new user’s skill acquisition phase with 64% accuracy halfway through the duration of the interaction (from a 55% baseline), providing encouraging evidence on the feasibility of early prediction of users’ skill acquisition phase based on the various information sources available through an eye tracker. Early prediction is of prime importance for our long-term goal of adapting a visualization to the current skill acquisition phase of the users. We also show that when using only content-independent eye tracking features together (pupil and head distance), skill acquisition can be predicted with an over-time accuracy of 60% 102  after having seen about two thirds of the duration of the interaction. Although this result indicates that adaptation or customization driven by only pupil and head distance features may require a slightly delayed prediction, our findings are still promising for the possible generalization to other visualizations or interfaces since pupil and head distance features are computed independent of the visualization/interface layout. By investigating the most predictive features in our best performing classifier, we identified both increased pupil dilation related measures and leaning closer to the screen as key behaviors present while users are becoming familiar with the visualization system. Increased pupil dilation measures are most likely an indication that users have a higher cognitive load while learning the skills necessary to work with the visual tool. For head distance, leaning forward to the screen might indicate that users pay more attention to the components of visualization or are trying to concentrate more while they are less familiar with the tasks and visualization. Conversely, leaning back from the screen might reveal that users feel more at ease after skill acquisition has occurred. One caveat of our findings is that it can be difficult to reliably track pupil dilation in real-world settings, because of its well-known sensitivity to changes in environment lighting (e.g., [78]). Nevertheless there is already work showing that changes in lighting can be mitigated using advanced techniques based on wavelet decomposition [118], thus as part of our future work we plan to conduct studies to evaluate the effectiveness of these techniques for our user-modeling purposes. In sum, our work has provided initial evidence on the added value of using pupil dilation and head distance to predict skill acquisition during interactions with bar graphs, with the long-term goal of creating visualizations that can support users detected to be in the skill 103  acquisition phase. Having such visualizations is especially useful in single-serving or walk-up-and-go contexts, where users need to interact with a visualization for a limited time and would benefit from having support that helps them accomplish their desired tasks if they are not familiar with the interface.  
To illustrate with a real-world example, multi-modal documents containing text that describe different aspects of accompanying graphs are extensively used in publications directed toward a broad audience (e.g., articles from the Economist) [40]. Typically, documents of this type are viewed only once, thus detecting skill acquisition quickly within a single-serving scenario could be of great value. There is already work on generating corpora of multimodal documents with explicit links between elements in the visualization and related sentences [101]. We are planning to leverage these corpora to generate an adaptive system that can track which reference to the visualization a user is reading in the text, and whether the user is unskilled with the visualization. Users’ attention can then be adaptively cued to relevant elements of the visualization using techniques such as highlighting (see [32] for examples of visual prompts evaluated on bar graphs). A second example of where we envision user-adaptive visualizations based on user skill is with MetroQuest [73]. MetroQuest is a commercialized decision-support tool deployed to engage and educate communities about urban plans, as well as to collect informed input to help policy makers understand the expectations of their target audiences. This tool aims to increase community awareness by providing users with several visualizations like deviation charts and interactive maps. Designing MetroQuest interfaces is challenging as this tool is often used in public kiosks by users with very heterogeneous backgrounds. For instance, while complex visualizations conveying rich information would satisfy some users, they may 104  overwhelm others who abandon their task as a result. The challenge is exacerbated since MetroQuest is typically used as a walk-up-and-use system (e.g., in public kiosks) that, in order to avoid attrition, must be self-explanatory and engaging to first-time users. Having the ability to provide adaptive support or customization based on a user’s skill acquisition phase would allow MetroQuest to potentially increase user engagement, and reduce attrition. Adaptive support could involve, for instance, displaying only one visualization for which the system detects that the user has sufficient skill for comprehension. Alternatively, the system could provide visual cues to facilitate the processing of the available visualizations, as discussed above. As future work, we plan to run studies to establish if/how the results we have presented on predicting skill acquisition generalize to other visualizations beyond bar graphs, especially in settings relating to the two real-world applications described above. We will also investigate further improvements to our classifiers for skill acquisition. For example, we plan to expand our set of eye tracking features to include more complex pupil measures such as maximum pupil power [20], as well as features based on the rate of change of our eye tracking measures (e.g., pupil and saccade acceleration) given that evidence has shown that kinematic features have the potential to further improve prediction accuracies of other user states [22,119]. We will also explore integrating eye tracking data with complementary input features such as mouse movements [126], or interface actions when available as suggested by [98], that could also serve to improve prediction accuracies of skill acquisition. 
105  Chapter 6: Impact of User Characteristics on Performance and Gaze with Magazine Style Narrative Visualizations Preface   ̶  In this chapter16, we broaden the investigation of which user characteristics to adapt to from stand-alone visualizations to visualizations embedded in narrative text as they are commonly found in magazines, blogs, text-books, technical reports, etc. These types of documents are commonly referred to as Magazine Style Narrative Visualization, or MSNV for short [146]. Similar to what we did in Chapter 2 and Chapter 3, we analyze task performance and eye tracking data collected from a user study we conducted with MSNVs to uncover processing behaviors that are negatively impacting user experience (i.e., time on task) for users with low user characteristic abilities. First, we address the question of what to adapt to by providing evidence of several user characteristics that can impact task performance with MSNVs. We then address the question of how to adapt by presenting results from analyzing users’ gaze data, showing that adaptive support may benefit users with low English Reading Ability (referred to as Reading Proficiency in the rest of this chapter) while they are processing                                                   16 The content of this chapter is accepted for publication: Toker, Conati, and Carenini. (2019) Gaze Analysis of User Characteristics in Magazine Style Narrative Visualizations. The Journal of Personalization Research: User Modeling and User-Adapted Interaction (UMUAI ‘19), to appear. 106  MSNVs, specifically by helping them locate relevant information in the visualization. Note that Reading Proficiency in this chapter is referred to in the next Chapter (Chapter 7) as X_LEX, and another user characteristic related to reading ability discussed in this chapter Verbal IQ is referred to in Chapter 7 as NAART. 6.1 Introduction As digital information continues to accumulate in our lives, information visualizations have become increasingly relevant for discovering trends and shaping stories from this overabundance of data [80]. However, visualizations are typically designed and evaluated following a one size-fits-all approach, meaning they do not take into account the specific needs of individual users. This is problematic because there is mounting evidence that user characteristics such as cognitive abilities, personality traits, and learning abilities, can significantly influence user experience (e.g., performance and satisfaction) during information visualization tasks [111,133,157,167]. These findings have prompted researchers to investigate user-adaptive information visualizations, i.e., visualizations that aim to recognize and adapt to each user’s specific needs. Whereas existing work has been mostly limited to tasks involving just visualizations, the aim of our research is to broaden this work to include scenarios where users interact with visualizations embedded in narrative text, known as Magazine Style Narrative Visualization [146], or MSNV for short (e.g., Figure 6.1).  107   Figure 6.1: An example of two references in a MSNV document, each consisting of a sentence in the body of narrative text and corresponding data points within the visualization. Source: The Economist - Dec 22, 2012 Combining text and graphical modalities is a widespread and well-established approach to convey complex information, e.g., [52,112,120,145]. In a narrative visualization, graphics and text play complementary roles. 
While graphics can convey large amounts of data compactly and support discovery of trends and relationships, text is much more effective at pointing out and explaining key points about the data, in particular by focusing on specific temporal, causal and evaluative aspects [164]. As a result, in MSNVs often there is more than one visual task specified throughout the narrative text. Multiple visual tasks in MSNVs are captured by references, namely segments of text that specifies a visual task on an accompanying visualization. Typically, references are used to support arguments or statements being made in the document text by providing added details or interpretations on a subset of data shown in the accompanying visualization. Figure 6.1 provides an example of two references in a MSNV. 108  One reference is the sentence “India and China will have further strong rises”, and it refers to the bars marked by the solid red arrows in the accompanying bar chart. The second reference is the sentence “Brazil and Britain will suffer reverses”, and it refers to the bars pointed to by the dashed green arrows. As a user reads through a MSNV, they will often encounter a variety of references in the text, each soliciting attention to different aspects of the accompanying visualization. Visualizations in a MSNV cannot be designed to favor the visual task of any particular reference, because favoring one task may hinder the others, thus Carenini et al. [31] proposed to facilitate MSNV processing by interactively highlighting relevant aspects of the visualization depending on what part of the text the user is reading and possibly on the user’s characteristics that may impact MSNV processing. This guidance is a form of cuing, which has been investigated to support learning from multi-modal material in instructional settings; see [61] for an overview.  The long term goal of our work is to design and implement such user-adaptive support to MSNV processing, based on the following methodology:  Conduct exploratory user studies in order to identify which user characteristics can impact MSNV processing and therefore may warrant adaptive support.  Leverage eye tracking data to investigate where users who are low on the relevant abilities identified are struggling during MSNV processing.  Design adaptive support mechanisms to alleviate these difficulties. This chapter presents results related to the first two steps of this methodology, as well as guidelines on how to address the third step, grounded in these results. In previous work 109  [156]17, we reported a preliminary analysis on a user study we conducted with MSNVs. In that study, we measured a battery of nine different user characteristics in order to identify which ones play a significant role during MSNV processing, and found indications of user characteristics specifically impacting performance. In this chapter, we expand that analysis and include eye tracking data that was collected during the study. Here, we first identify 4 user characteristics (Verbal Working Memory, Reading Proficiency, Need for Cognition, and Verbal IQ) for which users low in either of these abilities are at a disadvantage, in terms of either longer time on task or low accuracy. Next, we perform an analysis of gaze data aimed at identifying where significant differences in MSNV performance are occurring for each of these four user characteristics in terms of how the documents are visually processed. 
To accomplish this task, we generate numerous gaze metrics over distinct regions (i.e., Areas of Interest, or AOIs for short) of the MSNV documents, and then leverage Linear Mixed-Effects Models to identify significant relationships between user characteristics and gaze metrics that relate to low task performance. The overall methodology adopted in this chapter is inspired by previous work on designing user-adaptive support for visualization processing [154,158], and allows us to clearly identify sub-optimal gaze processing behaviors, exhibited by users with low measures of the relevant user characteristics, that contribute to their decreased task performance. Specifically, we identified several sub-optimal gaze processing behaviors shown by users with low measures of Reading Proficiency when they process the MSNV visualizations. [Footnote 17: The publication [156] from IUI’18 constitutes work carried out during my PhD; however, it is not included as a thesis chapter since the work presented in this chapter (Chapter 6) overrides the work in [156]. Further details on this are provided in Section 1.5 of the thesis Introduction and also in Appendix A.] These behaviors are captured by different elements of the visualizations (e.g., transitions between relevant and non-relevant bars), suggesting that low Reading Proficiency users could benefit from guidance specific to the multimodal nature of the MSNV, as proposed in [58]. The remainder of this chapter is structured as follows: first, we discuss related work, followed by a description of the user study. Next, we conduct an analysis of user experience with MSNVs to identify relevant user characteristics. We then describe gaze metrics that we computed from eye tracking data collected during the study, followed by an analysis of gaze metrics, relevant user characteristics, and MSNV performance. Lastly, we wrap up with a discussion and conclusion. 6.2 Related Work 6.2.1 Relevant Findings from Psychology There has been extensive work in psychology on investigating how people process combinations of textual and graphical information, mainly related to instructional text, with several findings supporting the intuitions that two media are better than one and that user characteristics should impact MSNV processing. For instance, [77] showed that students who studied instructional material on pulley systems that contained both text and diagrams scored better on kinematic comprehension questions than those who studied an informationally equivalent version with only text. More tellingly for our work, the study also looked at the impact of two student abilities (aptitude for reasoning with mechanical principles and reading ability) on both learning and gaze patterns when studying with the text and diagram material. Interestingly, no effect was found for reading ability, possibly because the participants were students at one of the top American universities and thus all had high reading abilities. Mechanical aptitude had no effect on learning outcomes, but a marginal effect on time taken to study the material (higher for low-ability students), which could be explained by the significant differences found in two specific gaze patterns: low mechanical ability students re-read more clauses in the text and inspected the diagram more often.
More recently, [172] present evidence that working memory capacity (WMC - a trait of individuals in relation to their ability to use their working memory system) can predict learning from illustrated text. They argue that lower WMC reduces a reader’s ability to select specific information and integrate it to develop overall understanding, and they suggest various forms of personalized support for learners with low WMC. In this chapter, we also consider two user characteristics related to working memory and study their impact on MSNV processing. Focusing on a different user trait, [96] argue that whether delivering instructional material by integrating two modalities increases comprehension or creates overload depends on the viewer’s expertise. For instance, in a much earlier seminal work [95] found that inexperienced electrical trainees learned better from textual explanations integrated into the diagrams of electrical circuits, whereas more experienced trainees performed better with the diagram only. In our study, we do not look at the user characteristic of domain expertise, but this could be a venue for future work. 112  6.2.2 User Characteristics in Visualization Research An accumulating amount of work has linked several user characteristics18 to performance and preference with various types of information visualizations. For instance, the cognitive ability perceptual speed has been shown to correlate negatively with time on task while working with static grouped bar charts [32], three-dimensional representations [167], as well as interactive stacked bar charts [38], and it can also influence visualization suitability among available alternatives [4,42]. For the cognitive ability visual working memory, users with high levels of this ability were found to have a stronger preference for radar charts over bar charts [157], and were shown to prefer deviation charts over maps [109]. Findings linking other cognitive traits to visualization performance include: disembedding on task accuracy [167], verbal working memory on response time [32,38], spatial memory on both task performance [42,167] and visualization usability [109], and need for cognition on task accuracy [42]. Even some personality traits, such as locus of control, have been shown to play a significant role in determining which layout of tree visualizations a user performs best with [71,133,177]. All of these findings provide strong motivation for developing visualizations that are user-adaptive, i.e., visualizations that can support individual users by tailoring the interaction according to their relevant user characteristics. Generally speaking, the work presented in this chapter is essentially broadening all this previous work on visualizations only to scenarios where users interact with visualizations embedded in narrative text.                                                   18 Definitions of the user characteristics discussed in this subsection are provided in Table 6.3 (Section 6.3.4).  113  6.2.3 User Adaptation Cuing, namely adding visual prompts that guide learners’ attention to relevant elements in multimodal material, has been extensively investigated as a means to provide support (see [61] for an overview) and has generated positive results for written text with graphics. For instance, [58] and [134] show that color coding matching parts of the text and the graphics can increase comprehension. 
Yet, this approach can raise the issue of not having a sufficient number of easily distinguishable colors for color matching. [93] sidestepped this problem by color matching corresponding parts of text and graphics dynamically. They gave to novice learners instructional material on an electric circuit, including both a diagram and a textual description. Attentional guidance was provided dynamically when student clicked on a specific paragraph by color coding all the electrical elements mentioned, both in the text as well as in the diagram. Results showed that novices who received this guidance learned significantly more than those who studied the same material without it. Similarly, [31] proposes the concept of dynamic cuing for helping users process MSNVs, by guiding user attention to relevant parts of a graph as users read the corresponding textual reference (as detected via eye tracking), but they did not consider the impact to user characteristics as we do in this chapter.  Carenini et al. [32] evaluated several forms of dynamic highlighting to guide attention to relevant data points within grouped bar charts (stand-alone, i.e., not included in MSNVs) and showed a significant improvement in task performance compared to using no interventions, paving the way to the idea of effective cuing in MSNVs. As discussed in the introduction, [156] conducted a preliminary investigation on whether user experience while processing MSNVs (comprehension, time on task, and subjective measures of satisfaction) depends on specific user 114  cognitive abilities or traits. Here, we extend that work by further investigating the impact of user characteristics on MSNV processing, including an in-depth analysis of gaze patterns. Several works have also shown the value of providing dynamic personalized guidance in processing visualizations systems. Guidance is provided either by proposing different visualization based on detected user needs such as suboptimal behaviors [63] and evolving knowledge [68], or by changing aspects on the current visualization [129]. There is also initial research on providing dynamic guidance to reading. For instance, [113] leveraged eye tracking to ascertain the feasibility of inferring word relevance during reading tasks, to assess the informational needs of users and provided personalized content. Our work can be seen as building the foundations for extending personalized guidance to MSNVs reading. In narrative visualization, previous work has looked at automating the generation of new text and graphical presentations [69], as well as identifying sentences in the narrative text to corresponding datapoints in the accompanying visualization(s) of existing documents via either crowdsourcing or natural language processing techniques [123]. For now, in our work, we are assuming that the MSNVs are given with all the references annotated. Developing robust methods for generating novel MSNVs, or automatically extracting references from existing MSNVs, for adaptation is left as future work.  Other researchers have looked at supporting users while reading instructional texts by detecting instances of mind wandering and intervening to refocus user attention [45]. However, to the best of our knowledge, no one has focused on the next step of designing user-adaptive support to help users process them. 
6.2.4 Eye Tracking in User Modeling for Information Visualizations
Existing research has leveraged eye tracking data to perform a variety of user modeling tasks to facilitate the development of user-adaptive interfaces. Here, we focus on research examining users' gaze in order to understand the relationship between user characteristics and information visualization processing. Several studies have shown significant differences in the gaze patterns of experts and novices during visualization tasks in a variety of domains, including chemistry, e.g., [151,152], general information search [105], and geography [37,131]. However, little work has been done to formally connect significant differences in gaze behaviors due to user characteristics to objective measures of task performance. Building this connection is key in order to understand if differences in users' gaze behaviors even have an impact on performance with the visualization (otherwise there is little guidance on how to provide meaningful adaptive support), and if they do, which ones help or hinder performance (so that the gaze behaviors can be encouraged or discouraged accordingly). To the best of our knowledge, there are only three recent works that have begun to address this research gap. Firstly, [132] examined performance differences between experts and novices in cartography, for search tasks with map visualizations. Using basic fixation data, they identified that experts had shorter fixations and a higher fixation rate than novices, suggesting that experts' shorter response times were due to their ability both to interpret individual elements within the maps more efficiently and to scan the maps as a whole more efficiently. Secondly, [158] carried out an analysis of gaze data to explore why performance differences occurred between users while carrying out low-level analysis tasks on bar and radar graph visualizations. In this work, they identified that users with low perceptual speed, who were slower on task, spent significantly more time looking at the legend and transitioned to it more frequently, indicating that these users were having difficulty processing and/or remembering the visualization's legend. They also found that users with low verbal working memory, who were slower on task, generated more fixations and spent significantly more time reading the textual description of each visualization task to be performed. These findings thus offer preliminary guidance on where user-adaptive support could be provided, namely, by devising ways to help users with low perceptual speed process the legend, and similarly helping users with low verbal working memory process the textual description of each task. In the third recent work, [154] collected eye tracking data from a study using bar graphs and two types of low-level analysis tasks (simple and complex). In that work, they found that for complex tasks, users with low perceptual speed (who were slower with these tasks) spent significantly more of their time looking at the bar labels along the x-axis. They also identified that for complex tasks, users with low visual working memory (who were slower with these tasks) spent significantly more of their time looking at the list of possible answers for each task (multiple choice radio buttons), and also transitioned more frequently to them. These findings demonstrate the value of using eye tracking data to identify where potential adaptive support could be provided within the visualization interface.
With the same goal in mind, the aim of the work we present in this chapter is to utilize eye tracking data to carry out a similar investigation on how user characteristics that impact task performance are influencing MSNV processing. 117  6.3 MSNV User Study We have conducted an exploratory user study to collect data on how users process MSNVs. First, we present the study procedure, followed by a description of the MSNV documents that were generated for the study. Next, we explain the dependent variables measured in the study, and after we present details on the set of user characteristics that were collected. 6.3.1 Study Procedure The experiment was a within-subjects repeated measures design, lasting at most 115 minutes. 56 subjects (32 female) ranging in age from 19 to 69, participated in the study. 60% of participants were university students, and the others were from a variety of backgrounds (e.g., retail manager, restaurant server, retired). Raw gaze data was captured during our study using a Tobii T-120 eye tracker with the IV-T fixation filter [130], and was calibrated at the beginning of the study to each user. The computer screen display was 1280 x 1024 pixels. Participants were given the task of reading over a MSNV document on the computer screen, and would signal they were done by clicking ‘next’. They were then presented with a set of questions on the screen designed to elicit their opinion of the document and to test their comprehension of relevant concepts discussed in it (see Section 6.3.3). Participants were required to carry out this task for 15 different MSNVs (described in Section 6.3.2). The ordering of the 15 MSNVs was randomized for each participant. Users were not given a time limit to read the MSNVs. However, to ensure that participants dedicated adequate effort to the task, they were told that there would be a $50 bonus for the three participants with the best performance, evaluated in terms of both speed and accuracy. The bonus was given in addition to the $45 we paid participants as compensation for the study. 118  Standard tests were used to assess the target battery of nine user characteristics (described in Section 6.3.4). The tests were split up so as to not fatigue users with too many tests all at the same time. Three of these tests (Visualization Literacy, Need for Cognition, Verbal Working Memory), which are computer-based and do not require an invigilator, were done at home prior to the experiment. A simple web-server was used to administer and record the test results accordingly. The other six user characteristic tests were administered in the lab: two before and four after the set of 15 MSNV tasks. The first two (Visual Working Memory, Verbal IQ) consisted of a computer test and a spoken test that both required specialized software. The last four (Perceptual Speed, Reading Proficiency, Spatial Memory, Disembedding) were all paper-based tests, and were completed consecutively at the end. The order of administration of tests was identical for all users. 6.3.2 MSNVs Used in the Study As we mentioned in the introduction, salient processing points in a MSNV are solicited by references, namely segments of text that specify a visual task on an accompanying visualization. 
The MSNVs we used for the study tasks were derived from an existing dataset of 40 magazine style documents extracted from real-world sources (e.g., Pew Research, The Guardian, and The Economist) where the references in each document had been previously identified via a rigorous coding process, indicating which data points in each visualization corresponds to each reference sentence [101]. Despite the obvious value of this dataset for our research, there were some issues with the format of the documents that we had to address. Each document in this dataset consisted of “snippets” of larger source documents whereby each snippet included exactly one paragraph of text and one accompanying visualization. This 119  simple document format was required to support the research purposes of [101] to automate the extraction of references in each document utilizing crowdsourcing and clustering. Regrettably for our purposes, many of these document snippets were fragmented, i.e., the document or individual sentences within the document were difficult to comprehend because some of the required details were expressed in sentences from prior paragraphs in the original source material that were not included. We solved this problem by retrieving and adding the missing text from the source articles, to which we have access. In cases where fragmentation issues could not be resolved, the document snippet was removed. To provide more realism, we also added the original date and title to each document. We also identified several document snippets that had been derived from the same source article, and merged them into a single MSNV respectively. Lastly, among the documents remaining after applying all of the above-mentioned changes, we selected a subset so as to have a varied number of words and references, to account for the potential influence that these factors of complexity might have on MSNV processing. We also selected documents to include a balanced variety of three bar chart types (i.e., simple, stacked, grouped [128]). We focused only on one class of visualizations to keep the complexity of the study manageable, and we chose bar charts because they are one of the most popular and effective visualizations for the common tasks of looking up and comparing values in simple tabular data [128]. The end result of our work yielded a set of 15 self-contained MSNV documents, consisting of one visualization each, and one body of narrative text (see Figure 6.2). Summary statistics on the composition of the 15 MSNV documents is provided in Table 6.1. 120   Figure 6.2: One of 15 MSNVs administered in the user study. *Note: Red highlighting is shown to illustrate the concept of a reference. Highlighting was not provided to users in the study.   MSNV Property Min Max Median Mean SD words: Total number of words in the body of narrative text. 43 228 75 90.8 49.7 sentences: Total number of sentences in the body of narrative text. 2 14 4 4.9 3.0 reference sentences: Total number of sentences in the body of narrative text that specify a visual task on the visualization.  1 7 2 2.6 1.8 datapoints in viz: Total number of data points in the visualization. 4 63 14 22.1 19.7 reference targets in viz: Total number of data points in the visualization mentioned by any reference sentences. 2 24 6 10.1 7.8       Table 6.1: Summary statistics illustrating the variety of document characteristics across the 15 MSNVs administered in the user study. 
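To make the composition statistics in Table 6.1 concrete, the following is a minimal R sketch (not the actual script used for this thesis) showing how such summaries could be computed; the data frame msnv_docs and its column names (words, sentences, ref_sentences, datapoints_in_viz, ref_targets_in_viz) are illustrative assumptions.

    # Illustrative only: 'msnv_docs' is an assumed data frame with one row per
    # MSNV document and one column per property listed in Table 6.1.
    summarize_property <- function(x) {
      c(Min = min(x), Max = max(x), Median = median(x), Mean = mean(x), SD = sd(x))
    }
    properties <- c("words", "sentences", "ref_sentences",
                    "datapoints_in_viz", "ref_targets_in_viz")
    round(t(sapply(msnv_docs[properties], summarize_property)), 1)  # cf. Table 6.1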
121  6.3.3 Dependent Measures The aim of our study was to evaluate the impact of user characteristics on users’ experience with MSNVs, where experience comprises of objective performance (time on task and comprehension) as well as subjective measures (MSNV ease-of-understanding and interest). Comprehension and subjective measures were assessed for each MSNV via a set of questions we designed (see Figure 6.3), which were shown to the user after they read each document. Given that the MSNV documents are fairly short in length, we wanted to ensure that the number and types of questions we asked were not too long and would not be more difficult to process than the MSNVs themselves. First, we asked two subjective questions using a 5-point Likert scale to measure, respectively, perceived ease-of-understanding and interest (top two questions in Figure 6.3), based on work by [170], where they used a similar question format to capture users’ subjective attitudes towards End User License Agreements. Next, we asked objective questions to measure document comprehension, based on the work by [48], where they employed five different types of multiple choice questions for evaluating users’ comprehension of National Geographic articles. We designed questions based on two of their question types, chosen because both types of questions could be asked for all the MSNVs in our dataset. The two question types we selected were:  One title question which asks to select a suitable alternative title for the MSNV (see question 5, bottom of Figure 6.3), and provides a simple way to ensure that the user had a grasp of the general document narrative.  One or two (depending on document length) recognition questions asking to recall specific information from the MSNV: identifying a named entity discussed in the text (e.g., question 3 in Figure 6.3), or identifying the magnitude/directionality of a 122  named entity discussed in the text (e.g., question 4 in Figure 6.3). For most documents, two recognition questions were asked. When the document was too short to provide enough content for generating two questions, only one recognition question was asked.  Figure 6.3: Subjective and comprehension questions presented to users after reading each MSNV document. Note: users were not allowed to proceed without answering all of the questions. In total, we generated four dependent measures (two subjective and two objective) that capture user experience with each MSNV document. The first three dependent measures are calculated from the set of questions described above, and include:  MSNV Ease-of-Understanding: subjective rating on a 5-point Likert Scale.  MSNV Interest: subjective rating on a 5-point Likert Scale. 123   MSNV Accuracy: accuracy (% correct) of the comprehension questions. The fourth dependent measure was logged during MSNV processing, and consists of:  MSNV Time on Task: time (seconds) spent on the MSNV document. Table 6.2 provides summary statistics on each of the four measures of user experience collected during the study. Measure Min Max Median Mean SD Time on Task (sec) 7.8 296.5 49.75 57.91 33.2 Comprehension Accuracy (%) 0 1 0.67 0.69 0.30 Ease-of-Understanding (1-5) 1 5 4 3.9 1.03 Document Interest (1-5) 1 5 3 3.7 1.26       Table 6.2: Summary statistics of the four measures of MSNV user experience obtained from the study. 6.3.4 User Characteristics We measured nine different user characteristics in the study using standard tests in psychology, defined in Table 6.3. 
The first seven characteristics consist of cognitive abilities and traits that we selected because previous research has shown that they play a significant role in user experience with visualizations. For instance, Perceptual Speed, Visual Working Memory, Verbal Working Memory, Visualization Literacy, and Spatial Memory were chosen because previous studies have shown that they can impact visualization preference and task performance with bar chart visualizations [32,38,109,157], which are also the types of visualizations in our MSNVs. We also included the user characteristic Need for Cognition because previous work has shown that it can influence accuracy with visualization search tasks [42], and also because we hypothesized that it may play a role in how much effort  users were willing to invest in reading the MSNV documents given that a minimum and maximum time 124  User Characteristic Definition Instrument NEED FOR COGNITION Extent to which individuals are inclined towards effortful cognitive activities [30]. Need for Cognition Scale [30], a questionnaire asking users to rate their agreement (5-point Likert-scale) with 18 statements about the satisfaction they gain from thinking in various scenarios. Final scores range from -36 to 36. VISUALIZATION LITERACY Ability to confidently use a visualization to translate questions specified in the data domain into visual queries in the visual domain, as well as interpreting visual patterns in the visual domain as properties in the data domain [27]. Visualization Literacy 101 – Bar Chart Test [27], a computer-based test where users perform a series of standard benchmark visualization tasks (e.g., finding min/max, estimating average, detecting trends) with bar chart visualizations. Final scores range from range from -2.0 to 1.0, and are computed based on accuracy and time taken. VISUAL WORKING MEMORY Measures the quantity of visual information (e.g., shapes and colors) that can be temporarily maintained or manipulated in working memory [114]. Colored Squares Sequential Comparison Task (uncued) [168], a computer-based test where users are briefly shown a sample array of n colored squares, then after a short blink delay, a single colored square appears and participants indicate (yes/no) if its color matches one in the sample array. This task repeats 120 times over three different array sizes (n = 4, 6, 8). Final scores range from 0 to 6 by averaging the scores obtained from each array size. SPATIAL MEMORY Ability to remember the configuration, location, and orientation of figural material [49]. MV-1 Shape Memory Test [49], a timed paper-based test that requires users to first study a page filled with abstract shapes, and afterwards recall the relative positions of specific subsets of shapes. Final scores range from 0 to 16. VERBAL WORKING MEMORY Measures the quantity of verbal information (e.g., words) that can be temporarily maintained and manipulated in working memory [7]. OSPAN (Operation-word span) [165], a short computer-based test where users are briefly shown a list of 1-6 words, then respond to a basic arithmetic operation, and afterwards are asked to recall the list of words. Final scores range from 0 to 6. PERCEPTUAL SPEED Speed in scanning/comparing figures or symbols, or carrying out other very simple tasks involving visual perception [49]. P-3 Identical Pictures Test [49], a timed paper-based test that measures how quickly users can locate matching objects amidst a set of distractors. Final scores range from 0 to 72. 
DISEMBEDDING Ability to hold a given visual percept or configuration in mind so as to disembed it from other well defined perceptual material [49]. CF-2 Hidden Patterns Test [49], a timed paper-based test that requires users to identify (i.e., disembed) if a given figure is hidden among other lines. Final scores range from 0 to 300. READING PROFICIENCY Vocabulary size and reading comprehension ability in English [121]. X_Lex Vocabulary Test [121], an untimed paper-based test. Users indicate on a vocabulary list (yes/no) if they know the meaning of each word. Some words are fake, and users are not told this. Final scores range from 0 to 100, based on the # of hits (word exists and the user indicates they know it) and false alarms (user indicates they know the meaning of a fake word). VERBAL IQ Overall verbal intellectual abilities that measures acquired knowledge, verbal reasoning, and attention to verbal materials [23]. North American Adult Reading Test (NAART) [150], an untimed spoken test where users are asked to read aloud a series of increasingly difficult words in English. The total number of incorrectly pronounced words are then used to compute the user’s Verbal IQ, with possible scores ranging from 74.41 to 128.7.    Table 6.3: The set of nine user characteristics measured in the study.125  limit was not enforced in the study. Although previous research examining the link between Disembedding and visualization performance is limited (e.g., [167]), we opted to include this user characteristic in our study because of the references contained in the MSNV documents. Specifically, we hypothesized that the act of processing any of the reference sentences may require some level of disembedding, namely, identifying groups of bars of interest amidst the full set of datapoints contained in the visualization. In addition, we included two user characteristics relating to reading comprehension ability, to account for potential performance differences arising due to reading the body of narrative text contained in each MSNV. Unfortunately, assessing reading comprehension ability can be a very time consuming endeavor. For instance, standard tests such as the ESOL, IELTS, and TOEFL iBT require more than an hour to administer, which was not feasible for our user study. Hence, we selected two tests that could each be administered in under 5 minutes and have been shown to reliably approximate two different constructs relating to reading comprehension ability: Reading Proficiency [122], and Verbal IQ [23]. User Characteristic Min Max Median Mean SD Need for Cognition -20 26 12.5 10.6 10.2 Visualization Literacy -2.1 1.0 0.47 0.30 0.71 Visual Working Memory 0 5 2.5 3.8 1.0 Spatial Memory 1 14 8 7.6 3.4 Verbal Working Memory 2 6 5 5.0 1.1 Perceptual Speed 25 66 45 45.2 8.9 Disembedding  12 84 61.5 57.6 15.3 Reading Proficiency 54.7 96.3 84.9 83.4 9.7 Verbal IQ 84.2 122.5 101.1 101.6 8.9       Table 6.4: Summary statistics showing the range of scores obtained for the user characteristics we measured in the study. 126  Lastly, we report in Table 6.4 summary statistics on the nine user characteristics test results, collected from the 56 users in our study. We also report in Table 6.5 pairwise correlation scores among the user characteristics to provide a sense of how well they are each capturing complementary or non-overlapping dimensions of user abilities. 
Since Shapiro-Wilk normality tests revealed that each of the nine user characteristics was not normally distributed (p < .001), we used a non-parametric correlation test, Kendall's tau (τ), as opposed to a standard Pearson's r. In general, we found the user characteristics to have low or medium association with each other (i.e., τ ~0.19 or smaller)19. There are three exceptions. Two involve perceptual abilities, namely Visual Working Memory and Visualization Literacy (τ = 0.37) and Perceptual Speed and Disembedding (τ = 0.38). These higher correlations are likely due to the fact that some parts of each test for Visualization Literacy and Disembedding rely on lower-level perceptual abilities. In particular, the test for Visualization Literacy requires processing different colored bars to create mappings to their corresponding entities, a sub-task that leverages users' Visual Working Memory, which measures the quantity of colors that can be temporarily maintained or manipulated in working memory; similarly, the test for Disembedding requires users to repeatedly match embedded shapes, a task that leverages users' Perceptual Speed, which measures how quickly users can scan figures or symbols. We opted to retain all of the above user characteristics because, despite the partial overlaps, none can be removed without losing information relating to the specific scope of the perceptual abilities each test is designed to capture. The third exception relates to the two characteristics that we used to measure users' reading comprehension abilities: Reading Proficiency and Verbal IQ (τ = 0.27). Since reading comprehension ability comprises, and can be assessed according to, several different measurable factors [64] (in our case vocabulary size and pronunciation ability, respectively), it is not surprising that there is some overlap between these two measures, but the correlation is only partial, indicating they are each still capturing distinct information. As with the previous two correlations involving perceptual abilities, we opt to keep both reading ability measures to retain as much information as possible relating to the specific factors each measure captures20.
Each row below lists correlations with the preceding characteristics, in the order: Need for Cognition, Visualization Literacy, Visual Working Memory, Spatial Memory, Verbal Working Memory, Perceptual Speed, Disembedding, Reading Proficiency.
Need for Cognition: —
Visualization Literacy: 0.16
Visual Working Memory: 0.21, 0.37
Spatial Memory: -0.01, 0.14, 0.09
Verbal Working Memory: 0.11, 0.02, 0.04, 0.10
Perceptual Speed: 0.08, 0.03, 0.18, 0.26, 0.13
Disembedding: 0.13, 0.12, 0.14, 0.21, 0.15, 0.38
Reading Proficiency: -0.07, 0.02, 0.05, -0.18, 0.06, -0.03, -0.13
Verbal IQ: -0.11, 0.06, 0.04, 0.01, -0.02, 0.10, 0.10, 0.27
Table 6.5: Kendall's tau correlation scores between all of the user characteristics.
19 Using the guidelines from [55] that r = 0.10 is a small correlation, r = 0.30 is medium, and r = 0.50 is large, we computed the Kendall's τ equivalent according to [171], yielding: τ = 0.06 small association, τ = 0.19 medium, and τ = 0.33 large.
20 This idea is further examined in the next chapter (Chapter 7) in Section 7.4. Recall that Reading Proficiency is called X_Lex in Chapter 7 and Verbal IQ is called NAART in Chapter 7.
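As a concrete illustration of the correlation analysis above, the following minimal R sketch reproduces the two steps: checking normality with Shapiro-Wilk tests and then computing pairwise Kendall's τ values as in Table 6.5. The data frame uc (one row per participant, one numeric column per user characteristic score) and its column names are assumptions made for illustration.

    # 'uc' is an assumed data frame: one row per participant, one numeric column
    # per user characteristic test score.
    sapply(uc, function(x) shapiro.test(x)$p.value)  # all p < .001 in our data
    tau <- cor(uc, method = "kendall")               # pairwise Kendall's tau
    round(tau, 2)                                    # lower triangle mirrors Table 6.5
    # For a single pair, cor.test() additionally returns a p-value, e.g.:
    cor.test(uc$reading_proficiency, uc$verbal_iq, method = "kendall")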
128  6.4 Effects of User Characteristics on MSNV User Experience In this section, the goal is to perform the first step towards designing user-adaptive support for MSNV processing, by identifying which user characteristics are impacting which measures of MSNV user experience and therefore may warrant further investigation for providing adaptive support. We first describe the statistical analysis used and then summarize the obtained results. After, we explain based on the results, which user characteristics we select for further investigation. 6.4.1 Analysis & Results To carry out the analysis we use Linear Mixed-Effects Models; an alternative to using a traditional repeated measures ANCOVA. We opted for Mixed Models, since they can model multiple random effects at once. For our purposes, the specification of two random effects is required since our study was a repeated measures design where all users were exposed to the same set of 15 documents. The first random effect user_id accounts for a within-subject correlation (i.e., due to non-independence) since multiple measurements are collected from the same user. The second random effect MSNV_id accounts for a within-document correlation (i.e., due to non-independence) since repeated measurements are collected from the same MSNV document. We used the lmerTest software package in R [106] and constructed one Mixed Model for each measure of MSNV user experience (described in Section 6.3.3) as the dependent measure, along with the nine user characteristics as covariates (described in Section 6.3.4), and user_id and MSNV_id as random effects. For each model, we run a bi-directional stepwise algorithm for model selection defined by Akaike Information Criterion (AIC) [2]. The 129  two subjective dependent measures (Ease-of-Understanding and Document Interest) were collected using a standard 5 point Likert scale. Shapiro-Wilk normality tests revealed that these two measures were not normally distributed (p < .001), therefore we applied the Aligned Rank Transformation (ART) using the ART-Tool [97] to convert them to a normal distribution. Significant results obtained from these four models are reported in Table 6.6. Main Effect of User Characteristic Time on Task Comprehension Accuracy Ease-of-Understanding Document Interest Verbal Working Memory p < .05, b = -0.08 not sig. not sig. not sig. Reading Proficiency p < .05, b = -0.09 not sig. not sig. not sig. Visualization Literacy p < .01, b = 0.13 p < .01, b = 0.13 not sig. not sig. Need for Cognition not sig. p < .05, b = 0.08 not sig. not sig. Verbal IQ not sig. p < .05, b = 0.07 not sig. not sig. Visual Working Memory not sig. not sig. not sig. not sig. Spatial Memory not sig. not sig. not sig. not sig. Perceptual Speed not sig. not sig. not sig. not sig. Disembedding  not sig. not sig. not sig. not sig.      Table 6.6: Results indicating which user characteristics have a significant effect on measures of MSNV user experience. The normalized model coefficient b indicates the size and directionality of the relationship. We identified main effects for five user characteristics on objective measures of MSNV performance, but no main effects of user characteristics were found on the two subjective measures of MSNV experience (see Table 6.6). 
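For concreteness, the following is a minimal sketch of the model specification described above, using lmerTest in R. The data frame trials (one row per user × MSNV document, holding the dependent measures and standardized user characteristic scores) and its variable names are illustrative assumptions, and the bi-directional stepwise AIC selection is approximated here by explicit AIC comparisons of candidate models.

    library(lmerTest)  # lmer() with p-values for fixed effects

    # One model per dependent measure; shown for time on task. 'trials' is an
    # assumed long-format data frame (one row per user x MSNV document).
    m_time <- lmer(time_on_task ~ need_for_cognition + vis_literacy + visual_wm +
                     spatial_memory + verbal_wm + perceptual_speed + disembedding +
                     reading_proficiency + verbal_iq +
                     (1 | user_id) + (1 | MSNV_id),
                   data = trials, REML = FALSE)
    summary(m_time)   # fixed-effect estimates (the b coefficients in Table 6.6)

    # Candidate models can be compared on AIC to mimic stepwise selection,
    # e.g., dropping one covariate at a time:
    m_drop <- update(m_time, . ~ . - spatial_memory)
    AIC(m_time, m_drop)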
Based on these results, we group our findings into three different categories of users: 130   The first group includes two user characteristics (Verbal Working Memory and Reading Proficiency) that display a negative directionality with Time on Task (as shown by the negative slope of the model coefficients b). Since no significant results were found for these two user characteristics on Comprehension Accuracy, it indicates that users low in these abilities spend more time looking at the MSNV to achieve comparable accuracy as their counterparts. The most straightforward explanation as to why these users are struggling is precisely because of their low abilities in either of these two user characteristics.  The second group of main effects includes user characteristics (Need for Cognition and Verbal IQ) that display a positive directionality with Comprehension Accuracy (as shown by the positive slope of the model coefficients b). Since no significant results were found for these two user characteristics on Time on Task, it indicates that users low in these abilities are spending comparable time as their counterparts on the task, but end up achieving lower accuracy. Once again, the most straightforward explanation as to why these users are struggling is precisely because of their low abilities in either of these two user characteristics.  The third group only includes Visualization Literacy, for which there is a positive directionality with both Time on Task and Accuracy (as shown by the positive slopes of the model coefficients b). This result indicates that users with high Visualization Literacy are slower on task, but the accuracy is improved, which is a standard tradeoff. In the next section, we present an analysis of gaze data aimed at explaining our findings in terms of how the documents are visually processed. We focus on the first and second categories, which include (Verbal Working Memory and Reading Proficiency) and (Need for Cognition and Verbal IQ). For these four characteristics, there was a clear indication that users 131  with low abilities are not using their time effectively (i.e., they either needed more time to achieve comparable accuracy as the high ability users, or given the same time they achieved lower accuracy). For such users, eye tracking data could arguably explain the sources of these inefficiencies, revealing sub-optimal behaviors and then providing ideas for adaptation. The analysis of the findings about Visualization Literacy, for which a similar relation to eye tracking data appears to be less straightforward, is left as future work. 6.5 Eye Tracking Analysis and Results In this section, we leverage eye tracking data to investigate where users with low user characteristics are struggling during MSNV processing. First, we present in Section 6.5.1 details on how the eye tracking data collected during the user study was utilized to generate numerous gaze metrics that capture various MSNV processing behaviors. Next, in Section 6.5.2 we carry out an exploratory analysis on the set of generated gaze metrics, to identify which among them are relevant to performance with the MSNVs (i.e., time on task followed by comprehension accuracy). After, we conduct a further exploratory analysis in Section 6.5.3 to identify which of the gaze metrics relevant to MSNV performance are significantly influenced by user characteristics. 
Based on these results, we further refine our analysis by looking at finer-grained gaze metrics (Section 6.5.4) and specific performance metrics and user characteristics (Sections 6.5.5 and 6.5.6). 6.5.1 Generating Gaze Metrics Raw gaze data comprises of fixations (points of gaze on the screen) and saccades (quick movements between fixations). In order to capture a more detailed understanding of users’ MSNV processing, we compute from the raw gaze data a set of summary statistics describing 132  numerous aspects of their gaze behaviors following a standard approach adopted in many other works [46,119,154,158]. We processed users’ raw gaze data using EMDAT (www.github.com/ATUAV/EMDAT), an open source library written in Python and developed in our research laboratory. EMDAT produces a comprehensive set of gaze-metrics specified over the entire display, and over specific Areas of Interests (AOIs). For our analysis, we selected only AOI-based gaze metrics because those specified over the entire display do not capture any information relating to the content of the MSNVs and thus are not useful for our research goal. The complete set of gaze metrics we selected are listed in Table 6.7. These metrics are defined over four AOIs that capture users’ gaze activity within different regions of the MSNV documents (see Figure 6.4). These AOIs were defined to gain a general sense of MSNV document processing according to the two primary forms of information contained in the MSNV documents, namely, two AOIs for the textual information (block of text on the left), and to two AOIs for the visual information (the area including the visualization on the right). The four AOIs are defined as:  Refs AOI: Combined areas of reference phrases in the text (purple-shaded boxes).  Text AOI: The rest of the MSNV document text (orange box minus purple boxes).  Referenced Bars (R-Bars) AOI: The combined area of all the bars in the visualization that are mentioned by any of the references (green boxes).  Viz AOI: The rest of the visualization region (pink box minus green boxes). 133   Figure 6.4: The four AOIs we defined to capture MSNV processing, shown here for one of the documents administered in the user study. No. Metric Description 1 • fixation_rate Fixation rate in AOI 2 • number_of_fixations Total number of fixations in AOI 3 • longest_fixation Longest fixation in AOI 4-6 • sum_fix_durations   • mean_fix_durations, • stddev_fix_durations Sum, Mean, and Std. Deviation of fixation durations in AOI 7-8 • time_to_first_fix  • time_to_last_fix Time to first and last fixation in AOI 9-12 • transitions_to_Text   • transitions_to_Viz • transitions_to_Refs   • transitions_to_R-Bars Number of gaze transitions from this AOI to every AOI 13-16 • prop_ trans_to_Text   • prop _trans_to_Viz • prop _trans_to_Refs  • prop _trans_to_R-Bars Proportion of gaze transitions from this AOI to every AOI (according to total gaze transitions in all AOIs) 17 • prop_num_fixations Proportion of fixations in AOI (according to total fixations in all AOIs)    Table 6.7: Set of 17 Gaze metrics generated for each of the 4 AOIs. These metrics are generated by EMDAT for each user and each task. 134  6.5.2 Identifying Gaze Metrics Relevant to MSNV Performance Our goal here is to identify which gaze metrics have a significant relationship with MSNV time on task and MSNV comprehension accuracy. 
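Before turning to that analysis, the following minimal sketch illustrates what the AOI-based metrics in Table 6.7 amount to. It is not EMDAT's implementation (EMDAT is a Python library); the fixations data frame, with columns user_id, MSNV_id, aoi, duration_ms, and timestamp_ms, is an assumed, simplified representation of the raw gaze data.

    # Illustrative only (EMDAT computes the full metric set in Python).
    # 'fixations': one row per fixation, time-ordered within each user x MSNV,
    # with an 'aoi' column taking values Text / Refs / Viz / R-Bars.
    library(dplyr)

    aoi_metrics <- fixations |>
      group_by(user_id, MSNV_id, aoi) |>
      summarise(
        number_of_fixations = n(),
        sum_fix_durations   = sum(duration_ms),
        mean_fix_durations  = mean(duration_ms),
        longest_fixation    = max(duration_ms),
        time_to_first_fix   = min(timestamp_ms),
        time_to_last_fix    = max(timestamp_ms),
        .groups = "drop"
      )

    # Transition counts between AOIs: pairs of consecutive fixations, keyed by
    # the AOI they start from and the AOI they move to (includes transitions to self).
    transitions <- fixations |>
      group_by(user_id, MSNV_id) |>
      mutate(next_aoi = dplyr::lead(aoi)) |>
      filter(!is.na(next_aoi)) |>
      count(aoi, next_aoi, name = "n_transitions")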
For the purposes of our research, gaze metrics that do not have any significant relationship to task performance are non-relevant and do not warrant further consideration. Non-relevant gaze metrics offer no concrete indication on how the captured processing behavior translates to MSNV performance, and thus provide little guidance towards designing meaningful adaptive support. First, we checked for correlations among gaze metrics within each AOI. Recall there are a total of 68 gaze metrics (17 gaze metrics x 4 AOIs, c.f. Table 6.7). Shapiro-Wilk normality tests revealed that time_to_first_fix on the Text AOI was not normally distributed (p < .001) and therefore we removed it. This measure was skewed heavily to the right and captured very little variability likely because the Text AOI was usually the first place users looked at the outset of each MSNV task. Pearson correlations on gaze metrics within each of the 4 AOI groups revealed very high correlations (r > 0.9) in all four AOIs among: sum_fix_durations, number_of_fixations, and transitions_to_self; as well as longest_fixation and stddev_fix_durations. Based on these high correlations, we removed three gaze metrics for each AOI: number_of_fixations, transitions_to_self, and also stddev_fix_durations from further analysis. Correlations of gaze metrics were not checked between AOIs because our goal is to investigate how different areas of the MSNV documents are processed, and we wanted to preserve the ability to report and discuss results at the granularity of each AOI.  Therefore, the total number of gaze metrics we retain for further investigation is 55: (14 metrics x 4 AOIs) – 1 metric in the Text AOI. 135  6.5.2.1 Gaze Metrics Relevant to Time on Task In order to identify gaze metrics that have a significant relationship to time on task, we conduct an analysis using Mixed Models (see description in Section 6.4.1). For each of the 55 gaze metrics, we construct one Mixed Model, using gaze metric as the independent measure, with time on task as the dependent measure, and user_id and MSNV_id as random effects (i.e., repeated measures). Due to the exploratory nature of our analysis, we account for multiple comparisons using the Benjamini–Hochberg procedure to control for the false discovery rate [15]. The obtained p-values from our models are ordered from smallest to largest, such that the smallest p-value has a rank of i = 1, the next smallest has i = 2, etc. Then we compare each individual p-value to its Benjamini-Hochberg critical threshold of q = (i/m)α, where i is the rank, m is the total number of models, and α is set to 0.05. Next we find the largest p-value that has p < q given its rank r, and then all p-values at rank i ≤ r are also considered significant. Applying the Benjamini–Hochberg procedure to our results yielded a critical threshold of q = .0273, obtained at rank r = 30. Thus, our analysis revealed 30 relevant gaze metrics that have a significant relationship with time on task, listed in Table 6.8, and in all cases (as indicated by slope of the model coefficient b) have positive correlation with time on task, except for one metric: fixation_rate in the Text AOI. Even though it is not surprising that many of these gaze metrics are highly correlated to time on task (i.e., they get bigger as more time is spent on task), we report them here for completeness. 
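To make this screening procedure concrete, the following minimal R sketch fits one mixed model per gaze metric and applies the Benjamini–Hochberg procedure to the resulting p-values; trials is the assumed data frame introduced earlier, now also holding the gaze metric columns named in the character vector gaze_metrics (both names are illustrative).

    library(lmerTest)

    # One model per gaze metric; collect the p-value of the metric's fixed effect.
    p_vals <- sapply(gaze_metrics, function(g) {
      f   <- reformulate(c(g, "(1 | user_id)", "(1 | MSNV_id)"),
                         response = "time_on_task")
      fit <- lmer(f, data = trials)
      coef(summary(fit))[g, "Pr(>|t|)"]
    })

    # Benjamini-Hochberg: order the p-values, compare p_(i) to q_i = (i/m) * alpha,
    # and keep everything up to the largest rank r with p_(r) <= q_r.
    alpha   <- 0.05
    n_tests <- length(p_vals)
    p_ord   <- sort(p_vals)
    crit    <- (seq_len(n_tests) / n_tests) * alpha
    r       <- max(which(p_ord <= crit), 0)
    relevant <- names(p_ord)[seq_len(r)]   # gaze metrics retained as relevant

    # Equivalent shortcut: which(p.adjust(p_vals, method = "BH") <= alpha)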
The interesting part of these identified gaze metrics will surface in the next part of our analysis, when they are examined to see to what extent any of these relationships are qualified by the user characteristics Verbal Working Memory and Reading Proficiency.  136  Gaze Metric Text AOI Refs AOI Viz AOI R-Bars AOI sum_fix_durations p < .001 b = 0.76 p < .001 b = 0.67 p < .001 b = 0.55 p < .001 b = 0.30 longest_fixation p < .001 b = 0.24 p < .001 b = 0.23 p < .001 b = 0.19 p < .001 b = 0.19 time_to_first_fix  p < .001 b = 0.20 p = .016 b = 0.06 p < .001 b = 0.20 time_to_last_fix p < .001 b = 0.86 p < .001 b = 0.76 p < .001 b = 0.79 p < .001 b = 0.58 mean_fix_durations p = .001 b = 0.18 p < .001 b = 0.14 not sig. p < .001 b = 0.18 fixation_rate p < .001 b = -0.22 not sig. not sig. not sig. prop_num_fixations not sig. not sig. not sig. not sig. transitions_to_Text  p < .001 b = 0.87 p < .001 b = 0.31 p < .001 b = 0.11 transitions_to_Refs p < .001 b = 0.86  p < .001 b = 0.15 not sig. transitions_to_Viz p < .001 b = 0.29 p < .001 b = 0.11  p < .001 b = 0.30 transitions_to_R-Bars p < .001 b = 0.15 p = .001 b = 0.07 p < .001 b = 0.29       Table 6.8: For each AOI, we report gaze metrics that were found to be significant with time on task. The normalized model coefficient b indicates the size and directionality of the relationship. Cells shaded in gray indicate metrics excluded from the analysis due to either lack of normality: i.e., time_to_first_fix for Text AOI; or high correlation, i.e., transitions to self (the same AOI). 6.5.2.2 Gaze Metrics Relevant to Comprehension Accuracy For each of the 55 relevant gaze metrics, we construct one Mixed Model, using gaze metric as the independent measure, with comprehension accuracy as the dependent measure, and user_id and MSNV_id as random effects. Our analysis revealed only 3 gaze metrics with p < .05 on task accuracy, listed in Table 6.9. However, after applying the Benjamini–Hochberg procedure to adjust for multiple comparisons, neither of these 3 results were found to be significant (at best 137  they could be considered marginally significant). It is surprising to see that unlike time on task (reported in the previous sub-section), the collection of gaze metrics we evaluated has very little or no relationship with comprehension accuracy, with only a marginal indication that some processing behaviors captured in the R-Bars AOI of the MSNVs may play a role towards users’ comprehension. Ultimately, since we were unable to identify statistically significant relationships for any of the gaze metrics with comprehension accuracy, no subsequent analysis of gaze metrics will be carried out for Need for Cognition and Verbal IQ, since these two user characteristics were only found to impact comprehension accuracy. Gaze Metric Text AOI Refs AOI Viz AOI R-Bars AOI sum_fix_durations not sig. not sig. not sig. not sig. longest_fixation not sig. not sig. not sig. not sig. time_to_first_fix  not sig. not sig. not sig. time_to_last_fix not sig. not sig. not sig. p = .026 b = 0.15 mean_fix_durations not sig. not sig. not sig. p = .036 b = 0.20 fixation_rate not sig. not sig. not sig. p = .039 b = 0.14 prop_num_fixations not sig. not sig. not sig. not sig. transitions_to_Text  not sig. not sig. not sig. transitions_to_Refs not sig.  not sig. not sig. transitions_to_Viz not sig. not sig.  not sig. transitions_to_R-Bars not sig. not sig. not sig.       Table 6.9: No gaze metrics had a significant relationship with comprehension accuracy. 
Three metrics yielded p-values < .05, however none remained significant after correcting for multiple comparisons. Gray cells indicate metrics excluded from the analysis due to lack of normality or high correlation. 138  6.5.3 Impact of User Characteristics on Gaze Metrics Relevant to Time on Task As discussed in Section 6.4, we found 2 user characteristics (VerbalWM and ReadingP) which impact performance with MSNVs in a manner that may call for personalized support: namely, users with low values of either of these two characteristics were spending significantly more time on task to achieve comparable accuracy compared to users with higher values. Here, our goal is to see if any of these user characteristics (UC) impacts any of the 30 gaze metrics relevant to time on task (identified in Section 6.5.2.1), so as to detect possible sub-optimal gaze processing behaviors of users with low abilities in these user characteristics. We construct one Mixed Model for each of the 30 relevant gaze metrics as the dependent measure, with both UCs as covariates, and user_id and MSNV_id as random effects. We apply the Benjamini–Hochberg procedure to our results, yielding a critical threshold of q = .0166, obtained at rank r = 10. Significant results are reported in Table 6.10. The structure of Table 6.10 is designed to facilitate understanding which user characteristics have a significant effect on gaze metrics that belong to the same AOI (looking by column), as well as which user characteristics have a significant effect on the same type of gaze metrics across all four AOIs (looking by row). Thus, the rows in Table 6.10 list the type of gaze metric, the columns list the four AOIs on which these metrics are generated, and a cell (i,j) lists all (if any) UCs mediated by the gaze metric in row i generated over the AOI in column j. The model coefficient b listed under each UC indicates the directionality of the effect that this UC has on the corresponding gaze metric. For instance, the negative b in the first cell of Table 6.10 indicates a negative directionality, namely that users with low VerbalWM spend more time looking at the Text AOI compared to users with high VerbalWM. 139  Gaze Metric Text AOI Refs AOI Viz AOI R-Bars AOI sum_fix_durations VerbalWM p = .012 b = -0.11 not sig. ReadingP p = .002 b = -0.15 ReadingP p = .003 b = -0.12 longest_fixation not sig. not sig. ReadingP p = .01 b = -0.11 ReadingP p = .003 b = -0.15 time_to_first_fix  VerbalWM p = .003 b = -0.07 not sig. not sig. time_to_last_fix not sig. not sig. ReadingP p = .01 b = -0.13 ReadingP p = .005 b = -0.16 mean_fix_durations not sig. not sig.  not sig. fixation_rate not sig.    prop_num_fixations     transitions_to_Text  not sig. not sig. not sig. transitions_to_Refs not sig.  not sig.  transitions_to_Viz not sig. not sig.  ReadingP p = .007 b = -0.13 transitions_to_R-Bars not sig. not sig. ReadingP p = .006 b = -0.14       Table 6.10: Results showing in which AOIs a significant effect of user characteristics were found on the corresponding gaze metric. The normalized model coefficient b indicates the size and directionality of the relationship. Grey cells indicate gaze metrics non-relevant to time on task, and were thus not evaluated. Table 6.10 shows several results for both VerbalWM and ReadingP. It is interesting to see the distinct roles that VerbalWM and ReadingP each play when examining the table by column (recall too that these two UCs are virtually uncorrelated, c.f. Table 6.5 in Section 6.3.4). 
First, there are no main effects of VerbalWM on visualization processing; it only appears for textual AOIs (i.e., Text AOI and Refs AOI, first two columns of Table 6.10). Specifically, users with low 140  VerbalWM are spending significantly more time (sum_fix_durations) processing the Text AOI and their first encounter with the textual references (time_to_first_fix) on Refs AOI are significantly later in the task compared to users with high VerbalWM, likely because they are having issues reading the text. Our findings regarding this connection between VerbalWM and textual processing mirror results found in previous work [154,158] where textual elements consisted of text in the visualization’s legend, as well as sentences below the visualization eliciting the task to be carried out. Our results thus extend these previous findings on VerbalWM to include accompanying bodies of narrative text. In contrast, results for ReadingP are entirely related to visualization processing (i.e., Viz AOI and Bars AOI, shown in the last two columns of Table 6.10). For instance, users with low ReadingP are spending significantly more time (sum_fix_durations) processing the visualization and relevant bars, have higher values for their longest fixations (longest_fixation), and their last fixations (time_to_last_fix) are significantly later in the task compared to users with high ReadingP. Increased processing of the visualization by users with low ReadingP is also captured by additional back and forth transitions between the visualization and relevant bars (last two rows of Table 6.10). These results provide strong evidence that users with low ReadingP are having difficulty with visualization processing. As far as we are aware, our results are the first to show a significant impact of reading proficiency on visualization processing. As reported in Section 6.4, we found that users with low VerbalWM and low ReadingP spend significantly more time on task to achieve comparable comprehension accuracy as their counterparts. Our findings here shed light on exactly where these users are likely having difficulty, thus providing insights on how they could be helped in processing the MSNVs more efficiently. Our results show that, for users with low VerbalWM this help should target the 141  textual region of the MSNV, whereas users with low ReadingP would likely benefit from help in processing the visualization. Based on these findings, we choose to carry out an additional analysis of gaze metrics for users with low ReadingP, to ascertain whether we can identify specific aspects of the visualizations they need help with. Specifically, we will further examine the role that ReadingP has on visualization processing by defining a new set of finer-grained AOIs within the MSVN visualization only. We opt to focus only on ReadingP as a first step, because the primary goal of our current work is to target user-adaptive support on the visualization, and because previous research has indicated there are many candidate elements that could come into play during visualization processing (e.g., legend, labels, and bars relevant to the task). 6.5.4 Specifying Finer-Grained AOIs on the Visualization In the previous sub-section, we identified that users with low ReadingP were having difficulty processing the visualization part of the MSNVs, and these behaviors contribute to longer overall task completion time. 
In order to see if we can identify where within the visualization these users are having difficulty, we specify a new set of finer-grained AOIs defined explicitly over key features of the visualizations. The new AOI definitions are as follows (an example is shown in Figure 6.5):  Legend AOI: Area surrounding the visualization legend.  Labels AOI: Region along x or y-axis (depending on orientation of the visualization) where textual bar labels are shown.  Referenced Bars (R-Bars) AOI: Area covering the set of all bars mentioned by any reference (this AOI is identical to the R-Bars AOI in the previous section). 142   Non-Referenced Bars (NR-Bars) AOI: Area covering all the other bars in the visualization, not mentioned by any reference. We then re-compute the same collection of gaze metrics as before (see Table 6.7 in Section 6.5.1) using these four finer-grained AOIs, yielding a total of 68 new gaze metrics (i.e., 17 gaze metrics x 4 AOIs) to be used in the next analysis.  Figure 6.5: A visualization in our MSNV dataset illustrating the four finer-grained AOIs we defined: Legend AOI (purple box), Labels AOI (orange region), R-Bars AOI (green boxes), and NR-Bars AOI (blue boxes). 6.5.5 Finer-Grained AOI Gaze Metrics Relevant to Time on Task As done before (see Section 6.5.2), we first check for correlations among the 68 gaze metrics (i.e., 17 gaze metrics x 4 AOIs, c.f. Table 6.7) within each AOI. Shapiro-Wilk normality tests 143  revealed that all of the gaze metrics were normally distributed (p > .05). Pearson correlations revealed very high correlations (r > 0.9) in all four finer-grained AOIs among: sum_fixation_durations, number_of_fixations, and transitions_to_self; and longest_fixation and stddev_fixation_durations. Based on these high correlations, we removed for each AOI: number_of_fixations, transitions_to_self, and stddev_fixation_durations from further analysis. Therefore, the total number of finer-grained AOI gaze metrics we retain for further investigation is 56: (14 metrics x 4 AOIs). Next, to identify finer-grained AOI gaze metrics that have a significant relationship to time on task, we construct one Mixed Model for each of the 56 gaze metric as the independent measure, with time on task as the dependent measure, and user_id and MSNV_id as random effects. We apply the Benjamini–Hochberg procedure to our results, yielding a critical threshold of q = .0286, obtained at rank r = 32. Thus, our analysis revealed 32 finer-grained AOI gaze metrics that have a significant relationship with time on task, listed in Table 6.11, and in all cases (as indicated by b) have positive directionality with time on task (i.e., higher values of these gaze metrics indicate longer times on task).         144  Gaze Metric Legend AOI Labels AOI R-Bars AOI NR-Bars AOI sum_fix_durations p < .001 b = 0.25 p < .001 b = 0.25 p < .001 b = 0.30 p < .001 b = 0.27 longest_fixation p < .001 b = 0.16 p < .001 b = 0.13 p < .001 b = 0.16 p < .001 b = 0.15 time_to_first_fix p < .001 b = 0.17 p < .001 b = 0.17 p < .001 b = 0.20 p < .001 b = 0.15 time_to_last_fix p < .001 b = 0.38 p < .001 b = 0.63 p < .001 b = 0.58 p < .001 b = 0.62 mean_fix_durations p = .002 b = 0.13 not sig. p < .001 b = 0.18 p < .001 b = 0.11 fixation_rate not sig. not sig. not sig. not sig. prop_num_fixations not sig. not sig. not sig. 
p = .008 b = 0.07 transitions_to_Legend  p < .001 b = 0.11 p < .001 b = 0.09 p < .001 b = 0.22 transitions_to_Labels p = .005 b = 0.11  p < .001 b = 0.16 p < .001 b = 0.17 transitions_to_R-Bars p < .001 b = 0.10 p < .001 b = 0.17  p < .001 b = 0.22 transitions_to_NR-Bars p < .001 b = 0.18 p < .001 b = 0.15 p < .001 b = 0.25       Table 6.11: Results indicating which gaze metrics using finer-grained AOIs were found significant with time on task. Cells shaded in gray indicate metrics that were excluded from the analysis due to high correlation: i.e., transitions to self (the same AOI). The normalized model coefficient b indicates the size and directionality of the relationship. 6.5.6 Effects of Reading Proficiency on Finer-Grained AOI Gaze Metrics Here, our goal is to see where there is an effect of ReadingP on any of the finer-grained AOI gaze metrics identified in the previous sub-section. Using the same methodology as before, we construct one Mixed Model for each of the 32 relevant gaze metrics as the dependent measure, with ReadingP as a covariate, and user_id and MSNV_id as random effects. We apply the Benjamini–Hochberg procedure to our results, yielding a critical threshold of q = .0297, obtained at rank r = 19. Significant results from this analysis are reported in Table 6.12. 145  Gaze Metric Legend AOI Labels AOI R-Bars AOI NR-Bars AOI sum_fix_durations ReadingP p = .007 b = -0.10 ReadingP p = .003 b = -0.14 ReadingP p = .003 b = -0.12 ReadingP p = .002 b = -0.14 longest_fixation ReadingP p = .003 b = -0.12 ReadingP p = .018 b = -0.08 ReadingP p = .003 b = -0.15 ReadingP p = .01 b = -0.12 time_to_first_fix not sig. not sig. not sig. not sig. time_to_last_fix ReadingP p = .009 b = -0.14 ReadingP p = .01 b = -0.14 ReadingP p = .006 b = -0.16 ReadingP p = .013 b = -0.13 mean_fix_durations ReadingP p = .025 b = -0.06  not sig. not sig. fixation_rate     prop_num_fixations    not sig. transitions_to_Legend  ReadingP p = .004 b = -0.07 not sig. not sig. transitions_to_Labels not sig.  ReadingP p = .019 b = -0.09 ReadingP p = .007 b = -0.14 transitions_to_R-Bars not sig. not sig.  ReadingP p = .002 b = -0.13 transitions_to_NR-Bars not sig. ReadingP p = .001 b = -0.15 ReadingP p = .004 b = -0.10       Table 6.12: Results showing significant effects of ReadingP found on finer-grained AOI gaze metrics. The normalized model coefficient b indicates the size and directionality of the relationship. Grey cells indicate gaze metrics that are non-relevant to time on task, and were not evaluated. Starting with an examination of the first, second, and fourth rows in Table 6.12 (i.e., sum_fix_durations, longest_fixation, and time_to_last_fix), we can see that for all three of these gaze metrics, ReadingP appears with a negative directionality across all four of the AOI regions we examined (i.e., users with low ReadingP are generating higher values of these gaze metrics). 146  As such, there is no clear indication on where to begin providing support to users with low ReadingP, since all of the visualization regions are possible candidates. However, for the gaze metric mean_fix_durations (fifth row in Table 6.12), ReadingP appears for the Legend AOI only with a negative directionality, indicating that users with low ReadingP were generating longer fixations on average while processing the visualization’s Legend. Prior gaze research has shown that longer fixations are an indication that users are having difficulty extracting information, or at best are capturing some form of increased engagement [91]. 
Therefore, our findings suggest that users with low ReadingP require extra time and effort to process the bar-group mappings elicited by the legend. We also found significant main effects of ReadingP on several transition-based gaze metrics (see last 4 rows of Table 6.12). First, users with low ReadingP transitioned more often between the R-Bars AOI (i.e., bars in the visualization mentioned by references) and the NR-Bars AOI (i.e., bars in the visualization not referenced and not relevant for comprehension), indicating that they might have problems identifying the referenced bars. Figure 6.6 (left) illustrates the observed differences in these two transitions between users with low and high ReadingP, reported via a median split. As mentioned above, this extra transitioning might be a consequence of the fact that these users have difficulty establishing the mapping encoded by the legend compared to their high ReadingP counterparts. Thus, in order to alleviate the time that users with low ReadingP are wasting scanning for the referenced bars, highlighting could be provided in real-time to guide their attention there, for instance by using the effective bar chart highlighting techniques (e.g., bolding, de-emphasizing) for guiding users' attention presented in [32]. Second, still considering the last 4 rows of Table 6.12, we identified increased transitioning relating to the Labels AOI. Users with low ReadingP transitioned more often from the R-Bars and NR-Bars AOIs to the Labels AOI, and transitioned more often from the Labels AOI to the Legend and NR-Bars AOIs. Figure 6.6 (center and right) illustrates the observed differences in these four transitions between users with low and high ReadingP, reported via a median split. This extra processing implies these users are likely having difficulty with the mappings between the bars and their textual labels, and/or may be spending this extra time double-checking to ensure they are looking at the right bars once identified. As such, further guidance could be useful to help emphasize these mappings for these users, e.g., by providing highlighting on the labels, along with the highlighting of the corresponding relevant bars as discussed above.
Figure 6.6: Differences in transition behaviors within the visualization between users with low vs high ReadingP (median split) that result in slower performance. Bars are shown with 95% confidence intervals.
6.6 Conclusion & Future Work
In this chapter, we conducted several analyses using Linear Mixed-Effects Models to uncover processing behaviors that negatively impact user experience (i.e., time on task, comprehension accuracy) with Magazine Style Narrative Visualizations (MSNVs) for users with low abilities in several user characteristics. First, we identified two groups of users whose low abilities in four user characteristics put them at a disadvantage and who could thus potentially benefit from adaptive support to aid them in processing MSNVs: users with low Need for Cognition or Verbal IQ achieved worse comprehension accuracy despite spending comparable time on task; and users with low Verbal Working Memory or Reading Proficiency required more time on task to achieve accuracy comparable to that of their counterparts. Next, we performed an analysis of gaze data aimed at identifying where significant differences in MSNV performance are occurring for these four user characteristics, in terms of how the documents are visually processed.
Our analysis did not uncover any significant relationships between MSNV processing and comprehension accuracy, and as a result, we were not able to provide any insights as to why users low in Need for Cognition or Verbal IQ were less accurate. However, our analysis did reveal numerous MSNV processing behaviors that related to time on task, and as a result, we were able to identify main effects of Verbal Working Memory and Reading Proficiency on several of them. First, we found that users with low Verbal Working Memory spent more time processing the main body of text contained in the MSNVs, and took longer to locate for the first time the textual references that discuss specific bars/datapoints within the visualization. Second, we found that the processing behaviors indicative of where users with low Reading Proficiency were struggling were all exclusively related to the visualizations contained in each of the MSNV documents, which included difficulty processing relevant bars elicited by the references. Therefore, as a first step toward better understanding how meaningful adaptive support within the visualization could be devised for users with low Reading Proficiency, we conducted a follow-up analysis of finer-grained gaze processing behaviors within the visualization only. This follow-up analysis revealed several MSNV processing behaviors that capture where users with low Reading Proficiency were struggling. Here, we summarize our findings and include preliminary suggestions on ways to provide help during MSNV processing:
• Users with low Reading Proficiency transition back-and-forth significantly more often between the relevant bars and the non-relevant bars in the visualization. This extra transitioning is likely a consequence of difficulty establishing the mappings encoded by the Legend, given we also found that users with low Reading Proficiency spend significantly more time looking at the legend with longer average fixation durations. To alleviate the time these users are wasting scanning for the referenced bars, highlighting could be provided in real-time to guide their attention there (e.g., bolding, de-emphasizing) as was effectively demonstrated on bar chart visualizations in [32].
• For several gaze transitions relating to the labels, users with low Reading Proficiency had significantly more transitions. Specifically, users with low Reading Proficiency transitioned more often from the relevant bars and non-relevant bars to the labels, and they transitioned more often from the labels to the legend and non-relevant bars. These users are likely having difficulty processing the mappings between the bars and their textual labels, and as such further highlighting on the labels, along with the highlighting of corresponding relevant bars as discussed in the previous bullet, could be provided concurrently to help reinforce these mappings.
The research presented in this chapter extends previous work on user-adaptive information visualizations to MSNVs, which are arguably more complex and challenging for the reader. Because of this additional complexity, and because this was a first attempt, both the user study and the data analysis were limited along several dimensions. In terms of visualizations embedded in the MSNVs, we have only considered bar charts. In future work, we will investigate whether our findings generalize to MSNVs containing other, possibly more complex, visualizations.
Presumably, users will experience even more serious difficulties with such MSNVs, as references in the accompanying text will likely be longer and more complicated; and the same will be the case for the visualizations' legends and labels. Still, considering the visualizations embedded within the MSNVs, the bar charts we examined were not all the same. Our MSNVs contained in equal proportion simple, stacked, and grouped bar charts. So an interesting question is whether bar chart styles influence user gaze behavior and ultimately user experience. Since the focus of this chapter is exclusively on the impact of user characteristics, answering this question is also left as future work. With respect to user traits, to keep the study manageable, we had to exclude some promising candidates, like locus of control and domain expertise. Further studies could explore the impact of these and other traits. Admittedly, the number of subjects in our user study was rather small, and several insignificant or marginally significant findings could turn into strong statistically significant results just by collecting more data. Based on this observation, future user studies should involve many more participants. However, because such studies are time-consuming and resource-intensive, they would be extremely challenging for a single research group. In principle, one way forward could be to leverage resources in multiple institutions. Notice that more data could also support exploring more specific research questions. For instance, as noted above, since our MSNVs contained in equal proportion simple, stacked, and grouped bar charts, it would be quite interesting to verify if the influence of user characteristics on user experience is mediated by the type of bar chart. However, this would require at least three times the amount of data we have collected so far. The ultimate objective of our work is to provide real-time, user-adaptive support to MSNV processing. In this direction, we are currently designing and implementing two important functionalities. First, capturing users' fixations in real-time by interfacing with the eye tracking hardware, so that adaptations can be triggered for users with low levels of the relevant characteristics based on when and where they are looking (e.g., triggering an intervention when a user looks at the visualization). Second, implementing the functionality to dynamically highlight specified regions of the visualization (e.g., highlighting the labels or legend), including the ability to control various properties of the highlighting (e.g., duration, fade-in time, color, shape, etc.). Once implemented, we are planning to conduct a follow-up user study to design and test the effectiveness of the adaptive strategies identified above for users with low Reading Proficiency. An interesting question to explore in such a study is whether users with high Reading Proficiency would also benefit from these strategies and to what extent. Lastly, we are also planning to implement and evaluate detecting relevant user characteristics (e.g., Reading Proficiency) without the need to administer tests prior to the study. We plan to do this non-invasively and in real-time while users process the MSNVs, by feeding their gaze data into machine learning models to generate predictions of the desired user characteristics.
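As a rough illustration of this direction, the sketch below shows what such a gaze-based prediction could look like using logistic regression in R; the data frame, feature names, and binary target are entirely hypothetical and are only meant to convey the shape of the approach, not our actual implementation.

```r
# Minimal sketch (hypothetical data): predicting a binary indicator of low Reading
# Proficiency from a few AOI-based gaze features. 'gaze' is assumed to hold one row
# per user/MSNV pair, with a label low_readingP (1 = below-median Reading Proficiency).
set.seed(1)
train_idx <- sample(nrow(gaze), size = round(0.8 * nrow(gaze)))
train <- gaze[train_idx, ]
test  <- gaze[-train_idx, ]

# Logistic regression over a few illustrative gaze features
fit <- glm(low_readingP ~ sum_fix_durations_Text + transitions_Text_to_Viz +
             time_to_first_fix_RBars,
           data = train, family = binomial)

# Predicted probabilities on held-out data; the 0.5 threshold is arbitrary here
pred <- predict(fit, newdata = test, type = "response")
mean((pred > 0.5) == test$low_readingP)  # naive accuracy estimate
```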
The feasibility of this approach has been previously demonstrated with several information visualizations and tasks [41,60,108,111,149,160] via logistic regression and random forests, and we plan to leverage similar techniques and extend them to MSNVs.
Chapter 7: Impact of English Reading Comprehension Abilities on MSNV Processing with Users from a Non-English Speaking Country
Preface – In this chapter21, we present a second user study with MSNVs, which is a replication of the study in the previous chapter, but conducted in a non-English speaking country. We conduct an analysis similar to the previous chapter on performance and gaze data from both studies combined in order to address the questions of what to adapt to and how to adapt for MSNVs when including a pool of users with less frequent exposure to the English language. Our results show that a similar form of adaptive support, as advocated in the previous chapter, could also benefit users with low English reading ability in a non-English speaking country, namely by helping them locate relevant information in the visualizations contained in MSNVs. We model users' English reading ability in this chapter using a combined measure capturing several metrics related to reading comprehension ability, and refer to it as Combined English Reading Proficiency, or CERP for short. Included in these metrics are the X_LEX and NAART reading ability tests, which are also examined in the previous chapter (Chapter 6) but are referred to there as Reading Proficiency and Verbal IQ respectively.
21 The content of this chapter was published as [161]: Toker, Moro, Simko, Bielikova, and Conati. (2019) Impact of English Reading Comprehension Abilities on Processing Magazine Style Narrative Visualizations and Implications for Personalization. Proceedings of the 27th Conference on User Modeling, Adaptation and Personalization (UMAP '19).
7.1 Introduction
There is increasing evidence that individual differences such as cognitive abilities [38,42,109,157,167], expertise [111], and personality traits [133] impact how users process specific information visualizations. These findings have prompted research on how to personalize visualization-based interactions to the specific abilities and traits of each individual user. Most of the research so far has focused on tasks involving just visualizations. However, there has also been initial work [156]22 on providing personalized support for processing Magazine Style Narrative Visualizations (MSNV for short), i.e., visualizations embedded in narrative text [146] (e.g., Figure 7.1) as they are frequently found in news articles, magazines, reports and instructional material. Combining text and graphical modalities is a widespread and well-established approach to convey complex information (e.g., [52,112,120,145]), but it can be prone to, e.g., the split-attention effect, where the user's attention is split between two information sources with a possible increase in cognitive load and negative impact on comprehension [6]. This can be exacerbated in MSNVs where the text can make multiple references to the accompanying
22 The publication [156] from IUI'18 constitutes work carried out during my PhD; however, it is not included as a thesis chapter since the work presented in the previous chapter (Chapter 6) overrides the work in [156].
Also, since the work in Chapter 6 was not yet published at the time of the UMAP’19 [161] publication (the basis for this chapter, Chapter 7), the references here to IUI’18 [156] were used instead. Further details about this are provided in Section 1.5 of the thesis Introduction. 154  visualizations (e.g., two references in Figure 7.1), each soliciting attention to different aspects of the data being visualized and requiring the user to perform different visual tasks. To reduce possible negative effects on comprehension that might be generated by repeated transitions between text and graphics, Carenini et al. [31] proposed to provide dynamic attention guidance to visualization processing as users read the corresponding textual reference. This guidance is a form of cuing, which has been mostly investigated to support learning from multi-modal material in instructional settings [61].  Toker et al. [156] followed up on the idea of dynamic guidance for MSNVs by investigating whether it should be personalized to specific user abilities. They conducted a study to test whether lower levels of specific abilities related to visual processing and reading generated difficulties when processing MSNVs. A measure related to English reading comprehension ability was one of those identified to have a significant impact on MSNV processing, and thus to be a suitable target for personalization.  Figure 7.1: Example MSNV with two references to the embedded visualization (one underlined in green and one in red, for illustration purposes). 155  The impact of English comprehension reading ability (simply reading ability from now on) found in [156] was notable considering that the study was conducted in an English-speaking country. The majority of participants were educated adults; either native English speakers or using mainly English on a daily basis. These results gave us the idea to further explore the impact of reading ability when considering users in non-English speaking countries. Although recent years have seen a rise in non-English content on the Web, a substantial amount of content containing MSNVs are still in English (e.g., textbooks, scientific publications, blogs and other Web content) [66]. Thus, when investigating how to provide personalized support to the consumption of this content, we argue that it is crucial to also study users in non-English speaking countries. These users can consume content in English but do not use English as their primary means of communication on a daily basis. We refer to these users as NESC (Non-English speaking country) users. As a first step in this direction, this chapter broadens the work in [156] in two ways:  We replicate the study in Slovakia and perform an analysis specific to ascertaining the impact of measures related to English reading comprehension ability.  We analyze gaze data collected with non-intrusive eye trackers, to ascertain if there are specific attention patterns that negatively affect MSNV processing for users with lower reading ability, and who can be the target for adaptive support. One challenge in this work was to find a measure of English reading ability suitable to our Slovak users, considering there are many different factors that can influence a user’s reading ability, and time constraints can limit the type and number of tests that are feasible to administer. Here, we propose a measure that combines standard tests normally used in English speaking countries along with self-reported data. 
We show that lower levels of this combined measure negatively impact MSNV processing, specifically by affecting document processing speed. Furthermore, by analyzing users' gaze patterns we then identify several behaviors that explain this negative impact. Notably, these behaviors are related not only to text reading but also to the joint processing of textual and graphical information. Namely, users with lower reading ability need to transition more from text to visualization, and take longer to locate the relevant bars within the visualization. This suggests that these users would benefit from attention guidance that would help them establish the mapping between textual references and corresponding visualization elements. Our findings are significant because: (i) they contribute to the so far limited understanding of the impact of reading ability on the processing of material that combines both text and graphics; and (ii) they provide evidence toward the need and design of adaptive interventions that specifically target the multimodal nature of the MSNVs for users with low reading ability. Furthermore, our focus on general purpose MSNVs extends existing research on processing multimodal information that has mostly been related to educational material. In the rest of the chapter, we first present related work. Next, we describe the MSNV user study conducted in the non-English speaking country (Slovakia). We then evaluate the impact of measures related to English reading ability on MSNV performance. After, we provide an analysis of eye tracking data to identify patterns that negatively affect MSNV processing for users with low English reading abilities. Lastly, we discuss our results, contributions, and implications toward the design of user-adaptive MSNVs.
7.2 Related Work
There has been extensive research in psychology on investigating how people process combinations of textual and graphical information, mainly related to instructional material. Some of this research focused on the impact of individual differences, but there are very limited findings specific to reading abilities. Kalyuga et al. [94,95] investigated viewers' expertise as a factor that influences whether instructional material consisting of two modalities increases comprehension or creates overload. They found, for instance, that inexperienced electrical trainees learned better from diagrams of electrical circuits integrated with textual explanations, whereas more experienced trainees performed better with the diagram only [95]. Wiley et al. [172] presented evidence that working memory capacity (WMC – a measure of one's ability to use one's working memory system) can predict learning from illustrated text. Specifically, lower WMC reduced a reader's ability to select specific information in each modality and integrate it to develop overall understanding. Based on these findings, the authors discuss various forms of personalized support for learners with low WMC. Hegarty and Just [77] looked at the impact of both students' Reading Proficiency and Aptitude for Reasoning with Mechanical Principles when studying material on pulley systems that contained both text and diagrams. There was a marginal effect of mechanical aptitude on learning time (higher for low ability students), explained by significant differences found from the analysis of the learners' gaze patterns: low mechanical ability students re-read more clauses in the text and inspected the diagram more often.
No effect, however, was found for reading ability, possibly because participants were students at one of the top U.S. universities and thus likely had high reading abilities. There has been some work on investigating how to facilitate the processing of educational text with graphics via cuing, i.e., adding visual prompts that guide learners' attention to relevant elements in multimodal material (see [61] for an overview). For instance, Folker et al. [58] and Ozcelik et al. [134] showed that color-coding matching parts of the text and the graphics can increase comprehension. However, this approach is limited by possibly not having enough easily distinguishable colors for color matching. Kalyuga [93] addressed this limitation by color matching corresponding parts of text and graphics dynamically. Novices studying electric circuits explained by diagrams and text received guidance when they clicked on a specific paragraph: all the electrical elements mentioned in that paragraph were colored in both the text and the diagram. Novices who received this guidance learned significantly more than those who did not. The results from our work support the idea of using a similar approach for users processing MSNVs, but tailored to a user's reading ability. Although we are not aware of other work specifically targeting personalized support for processing text with graphics, there is initial research on providing personalized guidance to reading only, and to processing stand-alone visualizations. For reading, in Loboda et al. [113] eye tracking was used to infer word relevance and user information needs during reading tasks, useful to provide personalized content. D'Mello et al. [45] looked at supporting reading by detecting instances of users' mind wandering, and then intervening to refocus their attention. For visualization processing, examples of personalized guidance include suggesting a different visualization based on detected user needs such as suboptimal behaviors [63] and evolving knowledge [68] or changing aspects of the current visualization [129]. Carenini et al. [32] evaluated several forms of dynamic highlighting (e.g., bolding, arrows) to guide attention to relevant data points within grouped bar charts, and showed a significant improvement in task performance compared to using no interventions.
7.3 User Study
The study that we conducted to collect data on how users in a non-English speaking country (NESC users) process MSNVs in English was held at the Slovak University of Technology in Bratislava, Slovakia23. The study included 52 participants (15 females), ranging in age from 20 to 35 (avg. = 23.1). Participants were recruited among university students and via the university Facebook page. 80% of the participants were students from computer science or engineering, 10% were students from other fields (e.g., chemistry or finance), and 10% had a variety of other occupations (e.g., veterinary surgeon, civil engineer, marketing assistant). In comparison, the original study in [156], which was carried out in an English speaking country (ESC users), included 56 participants (32 female) ranging in age from 19 to 69 (avg. = 28.02). 60% of participants in the original study were university students, and the others were from a variety of backgrounds (e.g., retail manager, restaurant server, retired). The study was conducted in a room with 20 computer stations [18] arranged as shown in Figure 7.2.
Each station was equipped with a Tobii Pro X2-60, a non-intrusive camera-based eye tracker mounted below a 1920x1200 pixel screen. During the study, at most 12 stations were used at once; the participants were seated so that they would not disturb each other.                                                   23 http://uxi.sk, User Experience and Interaction Research Centre 160   Figure 7.2: Eye tracker lab setup used for conducting the study in the non-English speaking country. 7.3.1 Study Procedure The study procedure was directly taken from [156], with the difference that in [156], the participants were run one at a time (due to eye tracker availability). The group set-up in the new study resulted in minor changes to the procedure, mainly related to synchronizing the various stages of the study and avoiding disturbance in the group setting, as described later. The experiment was a within-subjects repeated measures design, lasting at most 120 minutes. Four to five experimenters were in the lab for the duration of each session. One of them would start the session by providing a scripted summary of the study objectives and structure, which included the administration of a variety of well-established standardized tests for cognitive abilities/traits that were tested in [156]. Although here in this work we focus only on the impact of measures relating to reading ability, we chose to test for all the user characteristics measured in [156] for consistency and possible further analysis. Four of these 161  tests (not the focus of this chapter) were administered at the start of the session with an average duration of 14.5 minutes24. Next, participants underwent eye tracker calibration, lasting a few seconds. They then answered a short pre-study questionnaire to obtain demographic information and to self-report on their preference and ability with English. After, participants were tasked to read 15 different MSNVs displayed one at a time on their screen in randomized order. For each MSNV, participants signaled that they were done reading by clicking a ‘next’ button at the bottom left of the screen. After this, they received a screen with questions eliciting their opinion and testing their comprehension of the recent document. Participants were not given a time limit to read the MSNVs. However, to increase their motivation to put effort in the tasks, they were told that their performance would be evaluated in terms of speed and accuracy and the top three participants would receive a 50€ reward. After reading the 15 MSNVs, a participant would leave the room (to avoid disturbing those still working) and would take a short break. Next, the participant proceeded to take three remaining tests from the set administered in [156]. First, they moved to a new room where they took a short personality test (not used in this chapter) followed by the X-Lex vocabulary test. This part lasted about 6 minutes. Then they proceeded to another room where they took the NAART test, lasting 5 minutes on average. Because NAART requires reading aloud, it was administered one participant at a time while others waited their turn outside. Since the X-Lex                                                   24 These tests measured: perceptual speed, spatial memory, disembedding, and visual working memory. For test sources and definitions see [156]. 162  and NAART tests are the focus of this chapter, they are described in more detail in the next section. 7.3.2 Materials All study materials were the same as those used in [156]. 
7.3.2.1 Study Tasks
The 15 MSNVs used for the study were derived from an existing dataset of magazine-style documents where the textual references to the accompanying visualizations in each document had been manually identified via crowdsourcing to indicate which data points in each visualization correspond to each reference [101]. All 15 MSNVs were self-contained excerpts from longer articles extracted from real-world sources including Pew Research, The Guardian, and The Economist. They were selected to each include one visualization (a simple, stacked, or grouped bar chart [128]), and a body of narrative text ranging between 42 and 228 words (avg. = 91) containing 1 to 7 references (avg. = 2.6). The number of words and references were varied to account for their potential influence on MSNV processing.
7.3.2.2 Dependent Measures
For each MSNV, we collected:
• Two measures of performance: Time on Task, as the time to read the MSNV; Task Accuracy, as the percentage of correct questions answered for the MSNV.
• Two subjective measures related to perceived Ease-of-Understanding and document Interest.
Accuracy and subjective measures were assessed via a set of questions shown to the users on one screen after they read each MSNV document (see Figure 7.3). The first two questions in Figure 7.3 measured perceived ease-of-understanding and interest on a 5-point Likert scale, based on the work of [170]. The remaining questions measured document comprehension, based on the work of [48], consisting of:
• One or two (depending on document length) recognition questions asking to recall specific information from the MSNV: either a named entity discussed in the text (e.g., question 3 in Figure 7.3); or the magnitude/directionality of two named entities (e.g., question 4 in Figure 7.3).
• One title question, which asks to select a suitable alternative title for the MSNV (see question 5, bottom of Figure 7.3) and provides a simple way to ensure that the user had a grasp of the general document narrative.
The comprehension questions were designed so they could not be answered relying solely on general knowledge (i.e., measured comprehension actually reflects the content of the MSNVs)25.
25 This was tested in a pilot study where users were given these questions without having seen the MSNVs, to ensure that the overall mean accuracy achieved on each question was at most 50%.
Figure 7.3: Questions presented to users after reading each MSNV document.
7.3.2.3 Measures to Assess Reading Ability
Many standard tests for reading ability tend to be quite long. For instance, the ESOL, IELTS, and TOEFL iBT all require more than an hour to administer, which was not feasible for this user study. Instead, following [156], we selected two well-established tests to measure constructs relating to English reading comprehension ability26 that are designed for time-constrained settings:
• X_Lex Vocabulary Test [121]. This paper-based test consists of reading through a list of 180 words of which some are non-existent (i.e., fake). The users are instructed to indicate all the words for which they know the meaning and are scored based on the number of selected words that are real as well as the number of selected words that are non-existent. This test provides a quick method for profiling the English vocabulary of users, and scores from this test correlate well with English reading comprehension [122].
• North American Adult Reading Test (NAART) [150].
This is a spoken test, for which users are recorded as they read aloud a series of 61 increasingly difficult English words and they are scored based on the number of correctly pronounced words. The recordings from our study were scored by a native English speaker following the NAART scoring manual. NAART is a good predictor of several constructs related to reading ability, i.e., verbal reasoning, verbal comprehension, or semantic knowledge [23]. In addition, users self-reported on 3 different questions relating to their English reading ability:   “What is your native language/first language learned?” Scored 1 for English, 0 otherwise.                                                   26 X_LEX in this chapter is the same as Reading Proficiency in the previous chapter (Chapter 6), and NAART in this chapter is the same as Verbal IQ in Chapter 6. 166   “Currently in your everyday life, is English your preferred language to speak/read?” Scored 1 for English, 0 otherwise.  “Please self-report your English language proficiency.” Measured on a 4-point scale; from beginner to expert. Scored from 1 to 4. 7.4 Impact of Reading Ability on Task Performance As we discussed in the introduction, our goal with this work is two-fold: (i) ascertain if we can reproduce with users in a non-English speaking country (NESC) the impact of measures related to English reading comprehension abilities on MSNV performance found in [156]; (ii) see if we can explain this effect in terms of suboptimal attention patterns. This section describes the analysis we conducted toward (i).   For our analysis, we selected Linear Mixed-Effects Models since they can handle multiple random effects at once. Because our study was a repeated measures design where all users were exposed to the same set of 15 documents, there are two random effects to handle. The first random effect user_id accounts for within-subject correlations due to the fact that multiple measurements were collected from the same user. The second random effect MSNV_id accounts for within-document correlation due to repeated measurements being collected from the same MSNV document. We used the lmerTest software package in R [106] to construct a separate mixed model for each dependent measure to be analysed, along with NAART and X-Lex as covariates, and user_id and MSNV_id as random effects.  We started by testing whether we could replicate with our new pool of NESC users, the effects of measures related to reading abilities (i.e., NAART and X-Lex) on task performance that were found in [156], but we found no effects. One possible explanation for this outcome is that the NESC users were all comparable in terms of task performance. Given these users reside 167  in a non-English speaking country, it’s also possible they were equally slower compared to users from an English speaking country. A comparison of time on task indeed shows that NESC users were slower reading the MSNV documents (Mean=69.2 sec, SD=32.7) vs. the ESC users in [156] (Mean=57.2 sec, SD=33.1). Therefore, we chose to combine our pool of NESC users with the pool of ESC study users to increase the range of performance in our dataset and thus improve our ability to detect potential differences due to English reading ability. The pool of users we added from the ESC consisted of 56 subjects (see Section 7.3 for their demographics); yielding a combined dataset of 108 users. 
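For concreteness, a minimal sketch of the modelling setup described at the start of this section is given below, in R with lmerTest as named in the text; the data frame and column names are hypothetical placeholders, but the structure (the two reading-related covariates plus the two crossed random intercepts) mirrors the description above.

```r
library(lmerTest)  # provides lmer() with p-values for the fixed effects

# One model is fit per dependent measure; shown here for Time on Task.
# 'msnv_data' and its column names are hypothetical.
m_time <- lmer(time_on_task ~ NAART + X_Lex +
                 (1 | user_id) + (1 | MSNV_id),
               data = msnv_data)
summary(m_time)  # fixed-effect estimates (b) with Satterthwaite-based p-values
```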
7.4.1 Combining Users’ Reading Ability Scores There are many factors in various combinations that can explain the reading abilities of users [65]. In addition to the X_Lex and NAART tests that were administered, we also collected three self-assessed measures related to English reading ability (described in the previous section). Among these self-assessed measures, noteworthy differences were reported between the two populations. For instance, in the NESC only 3.9% of users’ native language was English vs. 50.0% in ESC; and for preferred language only 5.5% of users reported English in the NESC vs. 76.8% in the ESC). Given the potential impact that these distinguishing measures could have on English reading ability (in addition to the NAART and X_LEX tests), we wanted to ensure that we leveraged as much information from these different sources as possible to characterize users’ reading ability across the pooled dataset of ESC and NESC users. To this end, we performed a dimensional reduction, so that users’ English reading ability could be modeled in as few variables as necessary. We opted to use Principal Component Analysis (PCA), which facilitates the identification and combination of groups of inter-related variables into 168  components more suitable for data analysis [56].  A PCA on the five different measures resulted in one output component. Bartlett's test of sphericity x² (10) = 3103.37, p < .001, indicated that the PCA was appropriate. Kaiser's sampling adequacy was good at 0.79 [82], and all variables showed a communality > 0.65 which is above the acceptable limit of 0.51 [56]. The component we generated had an eigenvalue over Kaiser's criterion of 1 and explained 59.4% of the variance, providing us with one suitable measure that alleviates the need to include multiple measures related to reading ability in our subsequent analyses. In the rest of the chapter we refer to the component we generated as the Combined English Reading Proficiency score (or CERP for short). Figure 7.4 shows the distribution of CERP scores in our pooled dataset based on users’ study location (ESC vs. NESC). In general, users from the NESC population have lower CERP scores (not a surprise), and the range of scores, when compared to the ESC population, is much narrower. In contrast, the scores in the ESC population are overall much higher (to be expected), but the spread of scores is much wider. This larger spread in the ESC population is likely attributed to the fact that 50% of them are non-native English speakers. However, unlike users in the NESC population, non-native English speakers in the ESC receive more regular exposure to English and are likely more comfortable working in English (as evidenced 76.8% of ESC users who prefer to speak/write in English in their everyday lives). In other words, there appears to be an important distinction between non-native English speakers living in a non-English speaking country vs. non-native English speakers living in an English speaking country, and what we observe in Figure 7.4 appears to be consistent with this idea. 169   Figure 7.4: Distribution of users from the English Speaking Country and Non-English Speaking Country (NESC) populations according to their Combined English Reading Proficiency (CERP) scores. 7.4.2 Analysis of CERP on MSNV Performance We constructed one Mixed Model for each measure of task performance as the dependent measure, along with CERP (continuous variable) as a covariate, and user_id and MSNV_id as random effects. 
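To make the CERP construction in Section 7.4.1 concrete, the following sketch shows how such a component could be derived with prcomp in R; the variable names are hypothetical, the sign of the resulting component is arbitrary (it may need flipping so that higher values mean higher proficiency), and the analysis reported above additionally checked Bartlett's test and sampling adequacy before accepting the solution.

```r
# Hypothetical data frame 'reading' with one row per user and the five measures:
# NAART, X_Lex, native_english (0/1), prefers_english (0/1), self_rated_proficiency (1-4).
pca <- prcomp(reading[, c("NAART", "X_Lex", "native_english",
                          "prefers_english", "self_rated_proficiency")],
              center = TRUE, scale. = TRUE)

summary(pca)               # proportion of variance explained by each component
reading$CERP <- pca$x[, 1] # first principal component used as the CERP score

# CERP then enters the mixed models as the single reading-related covariate, e.g.:
# lmer(time_on_task ~ CERP + (1 | user_id) + (1 | MSNV_id), data = msnv_data)
```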
Results are as follows: For the two objective measures, a significant effect, b = −5.93, t(−3.73), p < .001, was found for CERP on Time on Task. The negative slope of b indicates that users with lower CERP spent more time on task. Using a median split on users' CERP scores, we found that it took the low CERP group 70.7 seconds on average to read the excerpts, while only 55.7 seconds were needed for high CERP users. No effect, however, was found for CERP on Task Accuracy, b = −0.003, t(−0.46), p = .64. Taken together, these first two results provide evidence that all users were similarly accurate regardless of CERP score, such that low CERP users likely needed extra time to achieve comparable accuracy as their counterparts. The most straightforward explanation for why these users are struggling is their low English reading comprehension ability, as captured by CERP. For the two subjective measures, no significant effect was found for CERP on document Ease-of-Understanding, b = 0.10, t(1.77), p = .08, indicating that users rated the documents similarly regardless of their CERP score. Lastly, a main effect was found for CERP on document Interest, b = 0.29, t(3.91), p < .001. The positive slope of b indicates that users with low CERP rated the documents as less interesting. Given our results, Time on Task offered the strongest indication that users with low CERP were objectively struggling on task, and thus could likely benefit the most from some form of adaptive support. Therefore, we select Time on Task as the primary measure of performance for further investigation in the next section.
7.5 Gaze Analysis of MSNV Processing
In this section, we leverage eye tracking data to address our second research goal, namely, to see if we can explain the effect of low English reading ability (as captured by users' CERP score) in terms of suboptimal attention patterns during MSNV processing. First, Section 7.5.1 explains how the eye tracking data collected during the user study was utilized to generate numerous gaze metrics that capture various MSNV processing behaviors. Next, in Section 7.5.2 we analyze the set of generated gaze metrics to identify which among them are relevant to performance with the MSNVs (i.e., time on task). Lastly, in Section 7.5.3 we conduct an analysis to see if any of the gaze metrics relevant to MSNV performance are significantly influenced by users' CERP score.
7.5.1 Computing Gaze Metrics
Raw gaze data consists of fixations (points of gaze on the screen) and saccades (quick movements between fixations). In order to capture a more detailed understanding of users' MSNV processing, we compute from the raw gaze data a set of summary statistics describing numerous aspects of their gaze behaviors, following a standard approach adopted in many other works [46,119,154,158]. Users' raw gaze data was processed using EMDAT, an open source library (github.com/ATUAV/EMDAT) that produces a comprehensive set of gaze metrics specified over the entire display, and over specific areas of interest (AOIs). For our analysis, we selected only AOI-based gaze metrics because those specified over the entire display do not capture any information relating to the content of the MSNVs and thus are not useful for our research goal. The complete set of gaze metrics we selected is listed in Table 7.1.
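Although the actual gaze processing was done with EMDAT, the short R sketch below illustrates, on a made-up fixation sequence, how two of the AOI-based metrics in Table 7.1 (sum of fixation durations in an AOI, and transitions from one AOI to another) are derived; the data and AOI labels are purely illustrative.

```r
# Tiny hypothetical fixation sequence: one row per fixation, in temporal order,
# with fixation duration (ms) and the AOI that the fixation falls into.
fix <- data.frame(
  duration = c(210, 340, 180, 150, 260, 300),
  aoi      = c("Text", "Refs", "Viz", "Text", "Viz", "R-Bars")
)

# Sum of fixation durations inside the Refs AOI
sum_fix_durations_refs <- sum(fix$duration[fix$aoi == "Refs"])  # 340 ms

# Gaze transitions from the Text AOI to the Viz AOI: count consecutive fixation
# pairs whose AOI changes from Text to Viz.
from_aoi <- head(fix$aoi, -1)
to_aoi   <- tail(fix$aoi, -1)
transitions_text_to_viz <- sum(from_aoi == "Text" & to_aoi == "Viz")  # 1
```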
These metrics are defined over four AOIs chosen to gain a general sense of MSNV document processing with respect to the two sources of information, text and visualization.  The four AOIs (see Figure 7.5) are defined as:  Refs AOI: The combined areas of all the reference phrases contained in the MSNV document (purple-shaded boxes).  Text AOI: The rest of the MSNV document text (orange box minus purple boxes).  Referenced Bars (R-Bars) AOI: The combined area of all the bars in the visualization that are mentioned by any of the references (green boxes).  Viz AOI: The rest of the visualization region (pink box minus green boxes). 172   Figure 7.5: The four AOIs we defined to capture MSNV processing, shown here for one of the MSNV documents administered in the user studies. No.  Gaze Metric Description 1 • fixation_rate Fixation rate in AOI (number of fixations ÷ total time spent in AOI) 2 • number_of_fixations Total number of fixations in AOI 3 • longest_fixation Longest fixation in AOI 4-6 • sum_fix_durations   • mean_fix_durations • stddev_fix_durations Sum, Mean, and Std. Deviation of fixation durations in AOI 7-8 • time_to_first_fix   • time_to_last_fix Time to first and last fixation in AOI 9-12 • transitions_to_Text   • transitions_to_Viz • transitions_to_Refs    • transitions_to_R-Bars Number of gaze transitions from this AOI to every AOI 13-16 • prop_ trans_to_Text    • prop _trans_to_Viz • prop _trans_to_Refs   • prop _trans_to_R-Bars Proportion of gaze transitions from this AOI to every AOI (according to total gaze transitions in all AOIs) 17 • prop_num_fixations Proportion of fixations in AOI (according to total fix. in all AOIs)    Table 7.1: Set of 17 gaze metrics generated for each of the 4 AOIs shown in Figure 7.5. 173  7.5.2 Identifying Relevant Gaze Metrics Here, our goal is to identify which gaze metrics have a significant relationship with MSNV time on task. For the purposes of our research, gaze metrics found with no significant relationship to time on task are non-relevant and do not warrant further consideration. Non-relevant gaze metrics offer no concrete indication on how the captured processing behavior translates to MSNV performance, and thus provide little guidance towards designing meaningful adaptive support. We construct one Mixed Model for each of the 56 gaze metrics (14 gaze metrics x 4 AOIs)27, using gaze metric as the independent measure, with time on task as the dependent measure, and user_id and MSNV_id as random effects. Given the relatively high number of models to check, we account for multiple comparison error by adjusting the obtained p-values using a Bonferroni correction [56] equal to the total number of gaze metrics within each family of AOIs (in this case 14). All significant results we found are reported in Table 7.2.                                                        27 17 gaze metrics were generated per AOI, however 3 gaze metrics were removed in each AOI from further analysis due to very strong positive correlations (r > 0.9). In each AOI: number_of_fixations and transitions_to_self were removed due to high correlation with sum_fixation_durations; and stddev_fix_durations was removed due to high correlation with longest_fixation. 174   Gaze Metric Text AOI Refs AOI Viz AOI R-Bars AOI sum_fix_durations p < .001 b = + p < .001 b = + p < .001 b = + p < .001 b = + longest_fixation p < .001 b = + p < .001 b = + p < .001 b = + p < .001 b = + time_to_first_fix not sig. 
p < .001 b = + p < .01 b = + p < .001 b = + time_to_last_fix p < .001 b = + p < .001 b = + p < .001 b = + p < .001 b = + mean_fix_durations p < .001 b = + p < .001 b = + p < .001 b = + p < .001 b = + fixation_rate p < .001,   b =   ̶not sig. not sig. not sig. transitions_to_Text  p < .001 b = + p < .001 b = + p < .001 b = + transitions_to_Refs p < .001 b = +  p < .001 b = + p < .001 b = + transitions_to_Viz p < .001 b = + p < .001 b = +  p < .001 b = + transitions_to_R-Bars p < .001 b = + p < .001 b = + p < .001 b = +       Table 7.2: Gaze metrics in each AOI found to be significant with time on task. The coefficient b indicates the directionality of the relationship (+/−). Grey cells indicate metrics excluded due to high correlations (i.e., transitions within the same AOI are highly correlated to sum_fix_durations). In all cases but one, the relationship between gaze metric and time on task has positive directionality indicated by the estimated coefficient b (i.e., higher values of the corresponding gaze metric relate to longer times on task). The exception is fixation_rate on the Text AOI, which has negative directionality (i.e., higher fixation rates in the Text related to lower time on task). We found no significant results for any of the 20 proportion-based gaze metrics (cf. metrics 13-17 in Table 7.1), and are thus not shown in Table 7.2. Overall, we identified a total of 32 relevant gaze metrics that have a significant effect on MSNV time on task. Even though it is 175  not surprising that many of these gaze metrics are highly correlated to time on task (i.e., they get bigger as more time is spent on task), we report them here for completeness. The interesting part of these identified gaze metrics will surface in the next part of our analysis, when they are examined to see to what extent any of these relationships are qualified by users’ reading ability. 7.5.3 Impact of CERP on Relevant Gaze Metrics As identified in Section 7.4.2, we found that the users’ CERP (Combined English Reading Proficiency) impacts task performance with MSNVs in a manner that may call for personalized support: namely users with low CERP spend significantly more time on task to achieve comparable accuracy compared to users with higher CERP scores. Here, our goal is to see if CERP impacts any of the 32 relevant gaze metrics identified in the previous sub-section, so as to identify possible gaze processing behaviors exuded by users with low CERP scores which are causing lower MSNV task performance. We construct one Mixed Model for each of the 32 relevant gaze metrics as the dependent measure, with CERP as a continuous covariate, and user_id and MSNV_id as random effects. As before, we apply a Bonferroni correction equal to the total number of gaze metrics within each family of AOIs (in this case 8). Results revealed a significant effect of CERP on seven of the tested gaze metrics (reported in Table 7.3). For all seven cases, the slope b is negative, indicating that lower CERP users produced higher values of the corresponding gaze metric.    176   Gaze Metric Text AOI Refs AOI Viz AOI R-Bars AOI sum_fix_durations CERP p < .05 b =  ̶ not sig. not sig. not sig. time_to_first_fix  not sig. not sig. CERP p < .05 b =  ̶ time_to_last_fix CERP p < .001 b =  ̶ CERP p < .05 b =  ̶ CERP p < .01 b =  ̶ CERP p < .05 b =  ̶ transitions_to_Viz CERP p < .05 b =  ̶ not sig.  not sig.      Table 7.3: Results showing in which AOIs a significant effect of CERP was found on the corresponding gaze metric. 
Metrics in grey cells were not relevant to time on task. First, we found that low CERP users have significantly higher times to last fixation in all of the AOIs (see third row of Table 7.3). Since we observe it in all AOIs, little guidance is offered by this result and may just be a direct consequence of users with low CERP spending overall more time processing MSNVs. Similarly, we would expect by nature of CERP the low CERP users to spend more time processing the narrative parts of MSNVs, i.e., reading the text, which is confirmed by a significant effect of CERP on sum_fixation_durations in the Text AOI (first row of Table 7.3). In addition, the lack of effect on sum_fixation_durations for the visual information (Viz and R-Bars AOIs) suggests that users spent the same amount of time processing the visualization regardless of their CERP score. Next, we found that users with low CERP are taking significantly longer to fixate (i.e., time_to_first_fixation) on the relevant bars in the visualization (see second row of Table 7.3). On average, we found that users with low CERP fixated on the R-Bars 32.1 seconds into the task compared to 23.0 seconds for users with high CERP. Interestingly, no significant effect of 177  CERP was detected on time_to_first_fixation in the Viz AOI, meaning that users are looking at the visualization for the first time at comparable times regardless of their CERP.  Therefore, it is likely not the case that users with low CERP are failing to look at the visualization soon enough, but rather they require more time to find the relevant information within the visualization; a behavior which is negatively impacting task performance and could be helped with an adaptive intervention. We also found a significant effect of CERP on one transition-based gaze metric (last row of Table 7.3) indicating that users with low CERP transitioned more often from the Text AOI to the Viz AOI. This result provides further evidence that users with low CERP are struggling with the mappings between textual and visual information (i.e., references), and makes the case for adaptive support even stronger. 7.6 Discussion and Conclusion In this chapter, we conducted an exploratory user study with Magazine Style Narrative Visualizations (MSNVs) in a non-English speaking country (NESC), thus broadening the work in [156]. Our aim was to ascertain (i) if the impact of measures related to English reading comprehension abilities on MSNV processing found in [156] can be reproduced in a NESC; and (ii) if we can explain this effect in terms of suboptimal attention patterns.  Regarding (i), we found no effect of measures related to reading abilities using solely NESC users (likely due to low variance in their task performance which was overall significantly lower than those of ESC users from [156]). We then proceeded to pool the two datasets together from both countries, thus producing a sample of users with a substantially wider range of task performance and consequently English reading abilities. Furthermore, in order to 178  leverage all of the information  we collected relating to users’ English reading abilities, we introduced a combined measure of English reading proficiency (CERP score) generated using PCA.  Our results from the pooled data confirmed the original findings in [156]. Namely, users’ English reading comprehension ability (as expressed by their CERP score) significantly impacted task performance with the MSNV documents. 
Specifically, we found that users with lower reading abilities (CERP score) required significantly more time to reach the same degree of comprehension as higher ability users (about 12 seconds on average per MSNV). Even though 12 seconds may seem a rather short amount of time to warrant adaptive interventions, bear in mind that the MSNV documents we administered consisted of very short excerpts of much longer documents. Therefore, in a real-world setting where MSNVs are typically much longer (i.e., many paragraphs and pages) and contain many visualizations, it is very likely that the effect we found of reading ability on task time would be greatly exacerbated for lower reading ability users.  Concerning (ii), our analysis of eye tracking data identified several MSNV processing behaviors of users with low English reading abilities (CERP) from the ESC and NESC that significantly contributed to lower task performance. Users with low reading ability were primarily struggling (i.e., behaviors that negatively impact performance) in two ways. First, they spent more time processing the narrative parts of MSNVs (i.e., reading the text); a result to be expected given the text is where the bulk of the reading occurs in the MSNV. Second, they struggled to locate the relevant bars in the visualization. Namely, it took them significantly longer to fixate on them for the first time, and they transitioned significantly more often from the narrative text to the visualization.  179  Our findings contribute to the future design of MSNVs with user-adaptive support in two major ways. First, we identified a significant user trait that such a system can leverage to determine which users to provide adaptations to; namely, to help users with low English reading abilities who are slower on task. Second, we provided significant evidence regarding how these users are struggling while processing MSNVs, which directly supports the implementation of highlighting interventions (e.g., bolding, arrows) to help guide users’ attention to relevant information in the visualization, similar to what was shown in [6] for tasks involving a single visualization only. In that work however, highlighting was provided only once per task, and only for a single subset of bars in the visualization. Since MSNVs typically contain many different references, each eliciting a different subset of data in the visualization, we propose using eye tracking to detect which part of the narrative text users with low reading ability are currently processing, so that highlighting can be provided in real-time to the corresponding set of bars in the visualization. Our results are of interest because they contribute to existing work on how to enhance the value of multimodal presentation of information based on written text and graphics. Whereas most of the existing works focus on learning with instructional texts, here we investigated a more open-ended task of processing MSNVs. Furthermore, our work adds English reading comprehension abilities to the list of user characteristics identified to be important in multimodal processing, which thus far has consisted mainly of domain expertise.   As future work, we are planning to conduct a study with MSNVs to test the effectiveness of the adaptive strategy that our findings advocated. 
Namely, we will use eye tracking to detect which part of the narrative text a user is currently reading and highlight the corresponding set of bars in the visualization to help users with low English reading abilities find them sooner.
Chapter 8: Conclusion
In this thesis, we presented research that offers eight distinct contributions toward the design of user-adaptive visualizations that support visualization processing by detecting relevant user characteristics and providing personalized interventions on the visualization. In this chapter, we summarize our eight contributions according to the three key questions for designing user-adaptive interaction we presented in Chapter 1, namely: what to adapt to (three contributions), how to adapt (four contributions), and when to adapt (one contribution). After, we discuss limitations of our work, and conclude with current and future work.
8.1 What to Adapt To?
This thesis makes three contributions toward understanding what user characteristics should be considered in user models for user-adaptive visualizations. First, in Chapter 2, we showed that the impact of user characteristics on visualization processing may depend on the complexity of the task. Specifically, for the cognitive abilities Perceptual Speed, Verbal Working Memory, and Visual Working Memory, we found that when users scored low on these measures they were significantly slower on task compared to high-scoring users only during more complex tasks. The design implication of our finding is that some user-adaptive visualizations may need to track task complexity, given that certain user characteristics may only warrant adaptation as task complexity increases. Second, in Chapter 5, we provided results showing that a user's level of Evolving Skill with a visualization can impact task performance, even during the usage of very common visualizations such as bar charts. Our finding shows that unskilled users may benefit from adaptive support while they are acquiring the skills necessary to become proficient with a visualization system. Third, we broadened the investigation of what to adapt to in Chapter 6 from stand-alone visualizations to visualizations embedded in narrative text, commonly referred to as Magazine Style Narrative Visualization (MSNV). Our results identified that Verbal Working Memory and English Reading Ability can both impact users' ability to effectively process MSNVs. Namely, users with low ability in either of these two user characteristics required more time to achieve comparable accuracy as their higher ability counterparts, demonstrating they might benefit from adaptive support to help them process MSNVs more efficiently.
8.2 How to Adapt?
Our research makes four contributions toward how to design adaptive interventions that can support the processing of specific visualizations. First, Chapter 2 provided evidence that adapting by highlighting relevant information in the visualization is beneficial to bar chart processing for both simple and more complex tasks. In particular, we identified three different types of highlighting that helped users achieve better task performance: Bold thickened the border around the relevant bars; Arrows pointed downward to relevant bars; and De-emphasis faded all other non-relevant bars.
Another contribution we made consists of several insights toward the general question of how to adapt for user-adaptive visualization by applying a methodology that leverages eye tracking data to get a detailed understanding of where users are struggling during visualization processing based on their relevant user characteristics. Using this methodology, we showed in Chapter 3 that users low in the cognitive ability Verbal Working Memory could benefit from adaptations that facilitate processing of the legend in bar 182  chart visualizations. Our third contribution in Chapter 4  investigates how a selection of pupil dilation measurements are affected by the highlighting interventions we tested in Chapter 2. We provided preliminary evidence that monitoring users’ pupil dilation measurements as an estimate of cognitive load could be beneficial towards designing, testing, or validating highlighting interventions. Specifically, we opened an avenue of investigation that using estimates of high cognitive load could be a way to filter out unsuitable interventions, as opposed to relying on task performance. As our fourth contribution, in Chapters 6 and 7, we also used this methodology to show that users with low English Reading Ability could benefit from adaptations that help them locate relevant information in the visualization while they are processing MSNVs. 8.3 When to Adapt? Our research has contributed to enabling adaptive interventions that can be delivered during visualization tasks. Specifically, we investigated eye tracking as a non-invasive data source for predicting user characteristics in real-time, so that adaptations can be delivered as soon as these predictions are provided to the user model. We presented evidence in Chapter 5 that eye tracking and machine learning can predict a user’s level of Evolving Skill with a visualization as they perform tasks with bar chart visualizations. Recall that our findings on what to adapt to showed that a user’s skill level with a visualization is prone to change as they obtain more practice with a given visualization. Therefore using eye tracking to predict ‘non-static’ user characteristics such as this one is especially valuable since it offers an unobtrusive way to track its evolution over repeated interaction with a visualization, enabling adaptation that can be tailored according to detected changes in users’ skill level. 183  8.4 Limitations Here, we describe several limitations of the work presented in this thesis. First, all of the visualizations we examined, both for the highlighting intervention work and the MSNV work, were limited to various types of static bar charts, i.e., simple, grouped, and stacked. Therefore, further studies are needed to assess the extent that our findings may generalize to both interactive visualizations and other visualization types. In terms of what to adapt to, the generalizability of our findings is promising. Several of the user characteristics we found relevant with bar charts have also been shown by other researchers to affect task performance while working with other visualizations. For instance, with interactive stacked bar charts, users with low Perceptual Speed, Verbal Working Memory, or Visual Working Memory were shown to have slower task completion time [38]. In addition, users with low Perceptual Speed were also shown to be slower on task with radar charts [157], word maps and text charts [4], and 2D visualizations of 3D objects [167]. 
With regard to the questions of how to adapt and when to adapt, several avenues of investigation remain open as additional visualizations are studied. These include testing the applicability and effectiveness of our highlighting interventions with other visualizations; assessing from gaze data whether users low in relevant characteristics struggle in the same way while processing other visualizations; and testing the feasibility of predicting user characteristics from eye tracking data during tasks with other visualizations. Nonetheless, we argue that our contributions toward designing user-adaptive visualizations are both impactful and pertinent, because static bar charts are one of the most ubiquitous and effective information visualization techniques found in the real world [25], and because they are also used as building blocks for more complex and interactive visualizations [34,67]. Our research was also the first to look at the impact of user characteristics on the processing of MSNVs. The documents administered in our studies were limited to short excerpts from longer real-world MSNVs and included a single visualization in each. We therefore do not know how our results would generalize to documents that are longer and/or contain multiple visualizations. Using short excerpts of MSNVs also limited the type of comprehension questions we could ask in our user studies, since the documents did not contain enough information to devise more complex types of questions, e.g., inference or synthesis. However, we argue that our work provides valuable contributions, because it established the initial building blocks for user-adaptive MSNVs and allows future research to build incrementally upon our results.

8.5 Current & Future Work

Currently, we are working on 'closing the adaptive loop' by leveraging many of the research findings presented in this thesis in order to build a user-adaptive visualization system for Magazine Style Narrative Visualizations (MSNVs). Since MSNVs typically contain many different references in the narrative text, each discussing different subsets of data in the visualization, we are using eye tracking to detect which parts of the text a user is currently reading, so that the corresponding bars in the visualization can be highlighted to help users locate and keep track of relevant information in the visualization more easily. The type of highlighting we are using leverages our findings in Chapter 2, namely those shown to be effective at providing guidance to relevant bars in the visualization: bolding, arrows, and de-emphasis. Two main technical challenges we had to solve in order to provide adaptive highlighting on the visualizations in MSNVs were: i) interfacing with the eye tracking hardware in order to capture and process users' fixations in real time, so that we can detect when and where users are looking; and ii) implementing the functionality to dynamically highlight specified bars in the visualization, including the ability to control various properties of the highlighting (e.g., location, duration, fade-in time, color, shape, etc.). At present, we are conducting a user study to test the effectiveness of our highlighting strategy during MSNV processing.
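As a purely illustrative sketch of the detection logic underlying this adaptive highlighting (the actual system is not implemented in R, and all names, coordinates, and data below are invented), the mapping from a detected fixation to the set of bars to highlight could look roughly as follows:

  # Bounding boxes of each textual reference on screen, together with the bars
  # it discusses (hypothetical values; in a real system these would come from
  # the MSNV's markup and layout).
  refs <- data.frame(ref_id = c("ref_1", "ref_2"),
                     x_min = c(100, 100), x_max = c(420, 380),
                     y_min = c(200, 260), y_max = c(230, 290))
  refs$bars <- list(c("bar_3", "bar_4"), c("bar_7"))

  # Given the screen coordinates of the user's current fixation, return the ids
  # of the bars to highlight (empty if the fixation is not on any reference).
  bars_to_highlight <- function(fix_x, fix_y, refs) {
    hit <- fix_x >= refs$x_min & fix_x <= refs$x_max &
           fix_y >= refs$y_min & fix_y <= refs$y_max
    if (!any(hit)) return(character(0))
    unique(unlist(refs$bars[hit]))
  }

  bars_to_highlight(150, 215, refs)   # returns "bar_3" "bar_4"

In the running system, the fixation coordinates would be supplied in real time by the eye tracker, and the returned bar ids would be passed to the highlighting functionality described above.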
We are also testing for several user characteristics in our study, including those considered in Chapters 6 and 7 that showed users with low abilities were at a disadvantage during MSNV processing (i.e., Verbal Working Memory & English Reading Ability), to see to what extent the highlighting our system provides is more or less beneficial to these users. As future work, we plan to investigate if using eye tracking to predict user characteristics during bar chart processing, as shown in Chapter 5, can also work for MSNVs. Specifically, we plan to conduct machine learning experiments using eye tracking data collected from the MSNV studies in Chapters 6 and 7, to see if we can predict users' Verbal Working Memory, English Reading Ability, and their level of Evolving Skill during MSNV processing. We are interested in generating such predictions because they can potentially eliminate the need to administer user characteristics tests, and can allow for adaptations that rely on tracking non-static user characteristics like users' level of Evolving Skill with MSNVs. As further future work, we plan to address several of the limitations with MSNVs we described in the previous subsection. Namely, we plan to conduct user studies to evaluate the impact of user characteristics on MSNV processing with documents containing other types of visualizations, and with longer documents that contain multiple pages of narrative text and multiple visualizations within the same document.

Bibliography

[1] Charu Aggarwal. 2016. Recommender systems: the textbook (1st edition ed.). Springer Science+Business Media, New York, NY. [2] H. Akaike. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 6 (1974), 716–723. DOI:https://doi.org/10.1109/TAC.1974.1100705 [3] Hirotugu Akaike. 1987. Factor analysis and AIC. Psychometrika 52, 3 (1987), 317–332. [4] Bryce Allen. 2000. Individual differences and the conundrums of user-centered design: Two experiments. Journal of the American Society for Information Science 51, 6 (2000), 508–520. [5] Robert Amar, James Eagan, and John Stasko. 2005. Low-Level Components of Analytic Activity in Information Visualization. In Proceedings of the 2005 IEEE Symposium on Information Visualization (InfoVis), 15–21. DOI:https://doi.org/10.1109/INFOVIS.2005.24 [6] P Ayres and G Cierniak. 2012. Split-Attention Effect. In Encyclopedia of the Sciences of Learning, ed. Norbet M. Seel. Springer, 3172–3175. [7] Alan D. Baddeley. 1986. Working memory. Clarendon Press ; Oxford University Press. [8] Ryan Baker, Sidney D'Mello, Ma.Mercedes Rodrigo, and Arthur C. Graesser. 2010. Better to be frustrated than bored: The incidence, persistence, and impact of learners' cognitive–affective states during interactions with three different computer-based learning environments. International Journal of Human-Computer Studies 68, 4 (2010), 223–241. DOI:https://doi.org/10.1016/j.ijhcs.2009.12.003 [9] Ryan S. J. d. Baker, Albert T. Corbett, and Vincent Aleven. 2008. More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing. In Proceedings of the 9th international conference on Intelligent Tutoring Systems (ITS), 406–415. DOI:https://doi.org/10.1007/978-3-540-69132-7_44 [10] Tiffany Barnes and John Stamper. 2008. Toward Automatic Hint Generation for Logic Proof Tutoring Using Historical Student Data. In 9th International Conference Intelligent Tutoring Systems (ITS), 373–382.
DOI:https://doi.org/10.1007/978-3-540-69132-7_41 [11] Lyn Bartram, Colin Ware, and Tom Calvert. 2003. Moticons: detection, distraction and task. International Journal of Human-Computer Studies 58, 5 (2003), 515–545. DOI:https://doi.org/10.1016/S1071-5819(03)00021-1 [12] Jackson Beatty. 1982. Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological Bulletin 91, 2 (1982), 276–292. DOI:https://doi.org/10.1037/0033-2909.91.2.276 [13] Joseph E. Beck and June Sison. 2006. Using knowledge tracing in a noisy environment to measure student reading proficiencies. International Journal of Artificial Intelligence in Education 16, 2 (2006), 129–143. [14] Roman Bednarik, Shahram Eivazi, and Hana Vrzakova. 2013. A Computational Approach for Prediction of Problem-Solving Behavior Using Support Vector Machines and Eye-Tracking Data. In Eye Gaze in Intelligent User Interfaces, Yukiko I. Nakano, Cristina Conati and Thomas Bader (eds.). Springer, London, 111–134. Retrieved from http://www.springerlink.com/index/10.1007/978-1-4471-4784-8_7 [15] Yoav Benjamini and Yosef Hochberg. 1995. Controlling The False Discovery Rate - A Practical And Powerful Approach To Multiple Testing. Journal of the Royal Statistical Society 57, 1 (1995), 289–300. DOI:https://doi.org/10.2307/2346101 [16] Jacques Bertin. 1983. Semiology of graphics. University of Wisconsin Press, Madison, Wis. [17] Katherine Bessière, A. Fleming Seay, and Sara Kiesler. 2007. The Ideal Elf: Identity Exploration in World of Warcraft. CyberPsychology & Behavior 10, 4 (2007), 530–535. DOI:https://doi.org/10.1089/cpb.2007.9994 [18] Maria Bielikova, Martin Konopka, Jakub Simko, Jozef Tvarozek, Patrik Hlavac, and Eduard Kuric. 2018. Eye-tracking en masse: Group user studies, lab infrastructure, and practices. Journal of Eye Movement Research 11, (3) (2018), 6. DOI:https://doi.org/10.16910/jemr.11.3.6 188  [19] Max V. Birk, Dereck Toker, Regan L. Mandryk, and Cristina Conati. 2015. Modeling Motivation in a Social Network Game Using Player-Centric Traits and Personality Traits. In 23rd International Conference User Modeling, Adaptation and Personalization (UMAP), 18–30. DOI:https://doi.org/10.1007/978-3-319-20267-9_2 [20] Pradipta Biswas, Varun Dutt, and Pat Langdon. 2016. Comparing Ocular Parameters for Cognitive Load Measurement in Eye-Gaze-Controlled Interfaces for Automotive and Desktop Computing Environments. International Journal of Human-Computer Interaction 32, 1 (2016), 23–38. DOI:https://doi.org/10.1080/10447318.2015.1084112 [21] R. Bixler, K. Kopp, and S. D’Mello. 2014. Evaluation of a Personalized Method for Proactive Mind Wandering Reduction. In Proceedings of the 4th Workshop on Personalization Approaches for Learning Environments, 22nd conference on User Modeling, Adaptation, and Personalization, 33–41. [22] Robert Bixler and Sidney D’Mello. 2015. Automatic Gaze-Based Detection of Mind Wandering with Metacognitive Awareness. In Proceedings of the 23rd International Conference on User Modeling, Adaptation and Personalization (UMAP), 31–43. DOI:https://doi.org/10.1007/978-3-319-20267-9_3 [23] Jennifer R. Blair and Otfried Spreen. 1989. Predicting premorbid IQ: A revision of the national adult reading test. Clinical Neuropsychologist 3, 2 (1989), 129–136. DOI:https://doi.org/10.1080/13854048908403285 [24] Daria Bondareva, Cristina Conati, Reza Feyzi-Behnagh, Jason M. Harley, Roger Azevedo, and François Bouchet. 2013. 
Inferring Learning from Gaze Data during Interaction with an Environment to Support Self-Regulated Learning. In 16th International Conference Artificial Intelligence in Education (AIED), 229–238. DOI:https://doi.org/10.1007/978-3-642-39112-5_24 [25] Michelle A. Borkin, Azalea A. Vo, Zoya Bylinskii, Phillip Isola, Shashank Sunkavalli, Aude Oliva, and Hanspeter Pfister. 2013. What Makes a Visualization Memorable? IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 2306–2315. DOI:https://doi.org/10.1109/TVCG.2013.234 [26] M. Bostock, V. Ogievetsky, and J. Heer. 2011. D3 Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 2301–2309. DOI:https://doi.org/10.1109/TVCG.2011.185 189  [27] Jeremy Boy, Ronald A. Rensink, Enrico Bertini, and Jean-Daniel Fekete. 2014. A Principled Way of Assessing Visualization Literacy. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 1963–1972. DOI:https://doi.org/10.1109/TVCG.2014.2346984 [28] Peter Brusilovsky, Jae-wook Ahn, Tibor Dumitriu, and Michael Yudelson. 2006. Adaptive Knowledge-Based Visualization for Accessing Educational Examples. In Proceedings of the 10th International Conference on Information Visualization (IV), 142–150. DOI:https://doi.org/10.1109/IV.2006.16 [29] Andrea Bunt, Cristina Conati, and Joanna McGrenere. 2007. Supporting interface customization using a mixed-initiative approach. In Proceedings of the 12th International Conference on Intelligent User Interfaces (IUI), 92–101. DOI:https://doi.org/10.1145/1216295.1216317 [30] John T. Cacioppo, Richard E. Petty, and Chuan Feng Kao. 1984. The Efficient Assessment of Need for Cognition. Journal of Personality Assessment 48, 3 (1984), 306–307. DOI:https://doi.org/10.1207/s15327752jpa4803_13 [31] Giuseppe Carenini, Cristina Conati, Enamul Hoque, and Ben Steichen. 2013. User Task Adaptation in Multimedia Presentations. In Proceedings of the 1st International Workshop on User-Adaptive Information Visualization (WUAV 2013), in conjunction with the 21st conference on User Modeling, Adaptation and Personalization (UMAP). [32] Giuseppe Carenini, Cristina Conati, Enamul Hoque, Ben Steichen, Dereck Toker, and James T. Enns. 2014. Highlighting Interventions and User Differences: Informing Adaptive Information Visualization Support. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 1835–1844. DOI:https://doi.org/10.1145/2556288.2557141 [33] Giuseppe Carenini and John Loyd. 2004. ValueCharts: analyzing linear models expressing preferences and evaluations. In Proceedings of the working conference on Advanced visual interfaces (AVI), 150. DOI:https://doi.org/10.1145/989863.989885 [34] Giuseppe Carenini and John Loyd. 2004. ValueCharts: analyzing linear models expressing preferences and evaluations. In Proceedings of the Working Conference on Advanced Visual Interfaces, 150–157. DOI:https://doi.org/10.1145/989863.989885 190  [35] Giuseppe Carenini and Lucas Rizoli. 2008. A multimedia interface for facilitating comparisons of opinions. In Proceedings of the 13th international Conference on Intelligent User Interfaces (IUI), 325. DOI:https://doi.org/10.1145/1502650.1502696 [36] Chaomei Chen. 2000. Individual differences in a spatial-semantic virtual environment. Journal of the American Society for Information Science 51, 6 (2000), 529–542. [37] A. Çöltekin, S.I. Fabrikant, and Martin Lacayo. 2010. 
Exploring the efficiency of users’ visual analytics strategies based on sequence analysis of eye movement recordings. International Journal of Geographical Information Science 24, 10 (2010), 1559–1575. DOI:https://doi.org/10.1080/13658816.2010.511718 [38] Cristina Conati, Giuseppe Carenini, Ben Steichen, and Dereck Toker. 2014. Evaluating the Impact of User Characteristics and Different Layouts on an Interactive Visualization for Decision Making. In Proceedings of the 16th Eurographics Conference on Visualization, 371–380. [39] Cristina Conati, Giuseppe Carenini, Dereck Toker, and Sébastien Lallé. 2015. Towards User-Adaptive Information Visualization. In Twenty-Ninth AAAI Conference on Artificial Intelligence. Retrieved April 22, 2015 from http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9933 [40] Cristina Conati, Enamul Hoque, Toker, Dereck, and Steichen, Ben. 2013. When to Adapt: Detecting User’s Confusion During Visualization Processing. In Proc. of the 1st International Workshop on User-Adaptive Information Visualization (WUAV), in conjunction with the 21st Conference on User Modeling, Adaptation and Personalization (UMAP). [41] Cristina Conati, Sébastien Lallé, Md. Abed Rahman, and Dereck Toker. 2017. Further Results on Predicting Cognitive Abilities for Adaptive Visualizations. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 1568–1574. DOI:https://doi.org/10.24963/ijcai.2017/217 [42] Cristina Conati and Heather Maclaren. 2008. Exploring the role of individual differences in information visualization. In Proceedings of the working conference on Advanced visual interfaces, 199–206. DOI:https://doi.org/10.1145/1385569.1385602 [43] Leana Copeland, Tom Gedeon, and Sabrina Caldwell. 2015. Effects of text difficulty and readers on predicting reading comprehension from eye movements. In 6th IEEE 191  International Conference on Cognitive Infocommunications (CogInfoCom), 407–412. DOI:https://doi.org/10.1109/CogInfoCom.2015.7390628 [44] Sidney K. D’Mello, Scotty D. Craig, Jeremiah Sullins, and Arthur C. Graesser. 2006. Predicting affective states expressed through an emote-aloud procedure from AutoTutor’s mixed-initiative dialogue. International Journal of Artificial Intelligence in Education 16, 1 (2006), 3–28. [45] Sidney D’Mello, Catlin Mills, Robert Bixler, and Nigel Bosch. 2017. Zone out no more: Mitigating mind wandering during computerized reading. In Proceedings of the 10th International Conference on Educational Data Mining, 8–15. [46] Sidney D’Mello, Andrew Olney, Claire Williams, and Patrick Hays. 2012. Gaze tutor: A gaze-reactive intelligent tutoring system. International Journal of Human-Computer Studies 70, 5 (2012), 377–398. DOI:https://doi.org/10.1016/j.ijhcs.2012.01.004 [47] Sue Duchesne and Anne McMaugh. 2019. Educational psychology: for learning and teaching. Cengage Learning Australia. [48] Mary C. Dyson and Mark Haselgrove. 2001. The influence of reading speed and line length on the effectiveness of reading from screen. International Journal of Human-Computer Studies 54, 4 (April 2001), 585–612. DOI:https://doi.org/10.1006/ijhc.2001.0458 [49] Ruth B. Ekstrom, John W. French, Harry H. Harman, and Diran Dermen. 1976. Manual for kit of factor referenced cognitive tests. Educational Testing Service Princeton, NJ. [50] Stephanie Elzer, Sandra Carberry, and Ingrid Zukerman. 2011. The automated understanding of simple bar charts. Artif. Intell. 175, 2 (2011), 526–555. 
DOI:https://doi.org/10.1016/j.artint.2010.10.003 [51] Edgar Erdfelder, Franz Faul, and Axel Buchner. 1996. GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers 28, 1 (1996), 1–11. DOI:https://doi.org/10.3758/BF03203630 [52] ETSI. 2002. Human Factors (HF); Guidelines on the multimodality of icons, symbols and pictograms. European Telecommunications Standards Institute. [53] Michael W. Eysenck. 2004. Principles of cognitive psychology. Psychology Press, Hove. 192  [54] Stephen Few. 2009. Now you see it: simple visualization techniques for quantitative analysis. Analytics Press, Oakland, Calif. [55] Andy P. Field. 2003. How to design and report experiments. Sage publications Ltd. [56] Andy P Field. 2009. Discovering statistics using SPSS. SAGE. [57] David Flatla and Carl Gutwin. 2012. SSMRecolor: improving recoloring tools with situation-specific models of color differentiation. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems (CHI), 2297–2306. DOI:https://doi.org/10.1145/2207676.2208388 [58] Sonja Folker, Helge Ritter, and Lorenz Sichelschmidt. 2005. Processing and Integrating Multimodal Material — The Influence of Color-Coding. Proceedings of the Annual Meeting of the Cognitive Science Society 27, 27 (2005), 690–695. [59] Keisuke Fukuda and Edward K. Vogel. 2009. Human Variation in Overriding Attentional Capture. The Journal of Neuroscience 29, 27 (2009), 8726–8733. DOI:https://doi.org/10.1523/JNEUROSCI.2145-09.2009 [60] Matthew Gingerich and Cristina Conati. 2015. Constructing Models of User and Task Characteristics from Eye Gaze Data for User-Adaptive Information Highlighting. In Proceedings of the 29th Conference on Artificial Intelligence (AAAI), 1728–1734. Retrieved from https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9946 [61] Tamara van Gog. 2014. The Signaling (or Cueing) Principle in Multimedia Learning. In The Cambridge Handbook of Multimedia Learning (2nd ed.), Richard Mayer (ed.). Cambridge University Press, Cambridge, 263–278. DOI:https://doi.org/10.1017/CBO9781139547369.014 [62] Joseph H. Goldberg and Jonathan I. Helfman. 2010. Comparing information graphics: a critical look at eye tracking. In Proceedings of the 3rd BELIV Workshop: Beyond time and errors: novel evaluation methods for Information Visualization, 71–78. DOI:https://doi.org/10.1145/2110192.2110203 [63] David Gotz and Zhen Wen. 2009. Behavior-driven visualization recommendation. In Proceedings of the 14th International Conference on Intelligent User Interfaces (IUI), 315–324. DOI:https://doi.org/10.1145/1502650.1502695 193  [64] William Grabe and Xiangying Jiang. 2013. Assessing Reading. In The Companion to Language Assessment, Antony John Kunnan (ed.). John Wiley & Sons, Inc., Hoboken, NJ, USA, 185–200. DOI:https://doi.org/10.1002/9781118411360.wbcla060 [65] William Grabe and Xiangying Jiang. 2013. Assessing Reading. In The Companion to Language Assessment, Antony John Kunnan (ed.). John Wiley & Sons, Inc., Hoboken, NJ, USA, 185–200. DOI:https://doi.org/10.1002/9781118411360.wbcla060 [66] David Graddol. 2006. English next: why global English may mean the end of “English as a foreign language.” British Council, London. [67] Samuel Gratzl, Alexander Lex, Nils Gehlenborg, Hanspeter Pfister, and Marc Streit. 2013. LineUp: Visual Analysis of Multi-Attribute Rankings. IEEE Transactions on Visualization and Computer Graphics 19, 12 (December 2013), 2277–2286. DOI:https://doi.org/10.1109/TVCG.2013.173 [68] Beate Grawemeyer. 2006. 
Evaluation of ERST: an external representation selection tutor. In Proceedings of the 4th International Conference on Diagrammatic Representation and Inference, 154–167. DOI:https://doi.org/10.1007/11783183_21 [69] Nancy L Green, Giuseppe Carenini, Stephan Kerpedjiev, Joe Mattis, Johanna D Moore, and Steven F Roth. 2004. AutoBrief: an experimental system for the automatic generation of briefings in integrated text and information graphics. International Journal of Human-Computer Studies 61, 1 (2004), 32–70. DOI:https://doi.org/10.1016/j.ijhcs.2003.10.007 [70] Tera M. Green and Brian Fisher. 2012. Impact of personality factors on interface interaction and the development of user profiles: Next steps in the personal equation of interaction. Information Visualization 11, 3 (2012), 205–221. DOI:https://doi.org/10.1177/1473871612441542 [71] T.M. Green and Brian Fisher. 2010. Towards the Personal Equation of Interaction: The impact of personality factors on visual analytics interface interaction. In Proceedings of the 2010 IEEE Symposium on Visual Analytics Science and Technology, 203–210. DOI:https://doi.org/10.1109/VAST.2010.5653587 [72] Thomas Grindinger, Andrew T. Duchowski, and Michael Sawyer. 2010. Group-wise similarity and classification of aggregate scanpaths. In Proceedings of the 2010 194  Symposium on Eye-Tracking Research & Applications (ETRA), 101–104. DOI:https://doi.org/10.1145/1743666.1743691 [73] Susanna Haas Lyons, Mike Walsh, Erin Aleman, and John Robinson. 2014. Exploring regional futures: Lessons from Metropolitan Chicago’s online MetroQuest. Technological Forecasting and Social Change 82, (February 2014), 23–33. DOI:https://doi.org/10.1016/j.techfore.2013.05.009 [74] Mona Haraty and Joanna McGrenere. 2016. Designing for Advanced Personalization in Personal Task Management. In Proceedings of the ACM Conference on Designing Interactive Systems (DIS), 239–250. DOI:https://doi.org/10.1145/2901790.2901805 [75] Mark Harrower and Cynthia A. Brewer. 2011. ColorBrewer.org: An Online Tool for Selecting Colour Schemes for Maps. In The Map Reader, Martin Dodge, Rob Kitchin and Chris Perkins (eds.). John Wiley & Sons, Ltd, Chichester, UK, 261–268. DOI:https://doi.org/10.1002/9780470979587.ch34 [76] Jeffrey Heer, Michael Bostock, and Vadim Ogievetsky. 2010. A tour through the visualization zoo. Communications of the ACM 53, 6 (June 2010), 59. DOI:https://doi.org/10.1145/1743546.1743567 [77] M. Hegarty and M.A. Just. 1993. Constructing Mental Models of Machines from Text and Diagrams. Journal of Memory and Language 32, 6 (December 1993), 717–742. DOI:https://doi.org/10.1006/jmla.1993.1036 [78] E. H. Hess and J. M. Polt. 1964. Pupil Size in Relation to Mental Activity during Simple Problem-Solving. Science 143, 3611 (1964), 1190–1192. DOI:https://doi.org/10.1126/science.143.3611.1190 [79] K. Höök. 2000. Steps to take before intelligent user interfaces become real. Interacting with Computers 12, 4 (2000), 409–426. DOI:https://doi.org/10.1016/S0953-5438(99)00006-5 [80] Dandan Huang, Melanie Tory, Bon Adriel Aseniero, Lyn Bartram, Scott Bateman, Sheelagh Carpendale, Anthony Tang, and Robert Woodbury. 2015. Personal visualization and personal visual analytics. IEEE Transactions on Visualization and Computer Graphics 21, 3 (2015), 420–433. DOI:https://doi.org/10.1109/TVCG.2014.2359887 195  [81] Amy Hurst, Scott E. Hudson, and Jennifer Mankoff. 2007. Dynamic Detection of Novice vs. Skilled Use Without a Task Model. 
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 271–280. DOI:https://doi.org/10.1145/1240624.1240669 [82] Graeme Hutcheson. 1999. The Multivariate Social Scientist. SAGE Publications, Ltd. DOI:https://doi.org/10.4135/9780857028075 [83] Jukka Hyönä, Jorma Tommola, and Anna-Mari Alaja. 1995. Pupil dilation as a measure of processing load in simultaneous interpretation and other language tasks. The Quarterly Journal of Experimental Psychology 48, 3 (1995), 598–612. DOI:https://doi.org/10.1080/14640749508401407 [84] Shamsi T. Iqbal, Piotr D. Adamczyk, Xianjun Sam Zheng, and Brian P. Bailey. 2005. Towards an index of opportunity: understanding changes in mental workload during task execution. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 311–320. DOI:https://doi.org/10.1145/1054972.1055016 [85] Allison M. Jacobs, Benjamin Fransen, J. Malcolm McCurry, Frederick W.P. Heckel, Alan R. Wagner, and J. Gregory Trafton. 2009. A preliminary system for recognizing boredom. In Proceedings of the 4th ACM/IEEE international conference on Human robot interaction (HRI), 299. DOI:https://doi.org/10.1145/1514095.1514185 [86] Allison M. Jacobs, Benjamin Fransen, J. Malcolm McCurry, Frederick W.P. Heckel, Alan R. Wagner, and J. Gregory Trafton. 2009. A Preliminary System for Recognizing Boredom. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction (HRI ’09), 299–300. DOI:https://doi.org/10.1145/1514095.1514185 [87] Anthony Jameson. 2003. The human-computer interaction handbook. In Julie A. Jacko and Andrew Sears (eds.). L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 305–330. Retrieved June 24, 2013 from http://dl.acm.org/citation.cfm?id=772072.772094 [88] Izabelle Janzen, Francesco Vitale, and Joanna McGrenere. 2018. Control and Personalization:Younger versus Older Users’ Experience of Notifications. Proceedings of Graphics Interface Toronto, (2018), 129–136. DOI:https://doi.org/10.20380/gi2018.19 [89] Natasha Jaques, Cristina Conati, Jason M. Harley, and Roger Azevedo. 2014. Predicting Affect from Gaze Data during Interaction with an Intelligent Tutoring 196  System. In Proceedings of the 12th International Conference on Intelligent Tutoring Systems (ITS), 29–38. DOI:https://doi.org/10.1007/978-3-319-07221-0_4 [90] Michael Johnston, John Chen, Patrick Ehlen, Hyuckchul Jung, Jay Lieske, Aarthi Reddy, Ethan Selfridge, Svetlana Stoyanchev, Brant Vasilieff, and Jay Wilpon. 2014. MVA: The Multimodal Virtual Assistant. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 257–259. DOI:https://doi.org/10.3115/v1/W14-4335 [91] Marcel Adam Just and Patricia A Carpenter. 1976. Eye fixations and cognitive processes. Cognitive Psychology 8, 4 (1976), 441–480. DOI:https://doi.org/10.1016/0010-0285(76)90015-3 [92] Slava Kalyuga. 2007. Enhancing Instructional Efficiency of Interactive E-learning Environments: A Cognitive Load Perspective. Educ Psychol Rev 19, 3 (August 2007), 387–399. DOI:https://doi.org/10.1007/s10648-007-9051-6 [93] Slava Kalyuga. 2009. Managing Cognitive Load in Adaptive Multimedia Learning: IGI Global. DOI:https://doi.org/10.4018/978-1-60566-048-6 [94] Slava Kalyuga, Paul Ayres, Paul Chandler, and John Sweller. 2003. The Expertise Reversal Effect. Educational Psychologist 38, 1 (2003), 23–31. DOI:https://doi.org/10.1207/S15326985EP3801_4 [95] Slava Kalyuga, Paul Chandler, and John Sweller. 1998. Levels of Expertise and Instructional Design. 
Human Factors: The Journal of the Human Factors and Ergonomics Society 40, 1 (1998), 1–17. DOI:https://doi.org/10.1518/001872098779480587 [96] Slava Kalyuga, Yin Kum Law, and Chee Ha Lee. 2013. Expertise reversal effect in reading Chinese texts with added causal words. Instructional Science 41, 3 (May 2013), 481–497. DOI:https://doi.org/10.1007/s11251-012-9239-0 [97] Maurits Clemens Kaptein, Clifford Nass, and Panos Markopoulos. 2010. Powerful and consistent analysis of likert-type ratingscales. In Proceedings of the 28th international conference on Human factors in computing systems (CHI), 2391. DOI:https://doi.org/10.1145/1753326.1753686 [98] Samad Kardan and Cristina Conati. 2013. Comparing and Combining Eye Gaze and Interface Actions for Determining User Learning with an Interactive Simulation. In 197  Proceedings of the 21st International Conference on User Modeling, Adaptation and Personalization (UMAP), 215–227. DOI:https://doi.org/10.1007/978-3-642-38844-6_18 [99] Rex B. Kline. 2016. Principles and practice of structural equation modeling (Fourth edition ed.). The Guilford Press. [100] N. Kong and M. Agrawala. 2012. Graphical Overlays: Using Layered Elements to Aid Chart Reading. IEEE Transactions on Visualization and Computer Graphics 18, 12 (December 2012), 2631–2638. DOI:https://doi.org/10.1109/TVCG.2012.229 [101] Nicholas Kong, Marti A. Hearst, and Maneesh Agrawala. 2014. Extracting references between text and charts via crowdsourcing. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems (CHI), 31–40. DOI:https://doi.org/10.1145/2556288.2557241 [102] Robert Kosara and Steve Haroz. 2018. Skipping the Replication Crisis in Visualization: Threats to Study Validity and How to Address Them : Position Paper. In 2018 IEEE Evaluation and Beyond - Methodological Approaches for Visualization (BELIV), 102–107. DOI:https://doi.org/10.1109/BELIV.2018.8634392 [103] Stephen Michael Kosslyn. 1994. Elements of graph design. W.H. Freeman, New York. [104] M. Kuhn. 2008. Building predictive models in R using the caret package. Journal of Statistical Software 28, 5 (2008), 1–26. [105] Bill Kules and Robert Capra. 2012. Influence of training and stage of search on gaze behavior in a library catalog faceted search interface. Journal of the American Society for Information Science and Technology 63, 1 (2012), 114–138. DOI:https://doi.org/10.1002/asi.21647 [106] Alexandra Kuznetsova, Per B. Brockhoff, and Rune H. B. Christensen. 2017. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software 82, 13 (2017). DOI:https://doi.org/10.18637/jss.v082.i13 [107] Sébastien Lallé and Cristina Conati. 2019. The role of user differences in customization: a case study in personalization for infovis-based content. In Proceedings of the 24th International Conference on Intelligent User Interfaces (IUI), 329–339. DOI:https://doi.org/10.1145/3301275.3302283 198  [108] Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. 2016. Predicting confusion in information visualization from eye tracking and interaction data. In Proceedings on the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2529–2535. Retrieved from https://www.ijcai.org/Proceedings/16/Papers/360.pdf [109] Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. 2017. Impact of Individual Differences on User Experience with a Visualization Interface for Public Engagement. 
In Proceedings of the 2nd International Workshop on Human Aspects in Adaptive and Personalized Interactive Environments (in conjunction with UMAP), 247–252. DOI:https://doi.org/10.1145/3099023.3099055 [110] Sébastien Lallé, Jack Mostow, Vanda Luengo, and Nathalie Guin. 2013. Comparing Student Models in Different Formalisms by Predicting Their Impact on Help Success. In 16th International Conference Artificial Intelligence in Education (AIED), 161–170. DOI:https://doi.org/10.1007/978-3-642-39112-5_17 [111] Sébastien Lallé, Dereck Toker, Cristina Conati, and Giuseppe Carenini. 2015. Prediction of Users’ Learning Curves for Adaptation while Using an Information Visualization. In Proceedings of the 20th International Conference on Intelligent User Interfaces (IUI), 357–368. DOI:https://doi.org/10.1145/2678025.2701376 [112] Jason Lankow, Josh Ritchie, and Ross Crooks. 2012. Infographics: the power of visual storytelling. John Wiley & Sons, Inc, Hoboken, N.J. [113] Tomasz D. Loboda, Peter Brusilovsky, and Jöerg Brunstein. 2011. Inferring word relevance from eye-movements of readers. In Proceedings of the 16th International Conference on Intelligent User Interfaces (IUI), 175–184. DOI:https://doi.org/10.1145/1943403.1943431 [114] Robert H. Logie. 2009. Visuo-spatial working memory (Nachdr. ed.). Psychology Press, Hove. [115] Wendy E. Mackay. 1991. Triggers and barriers to customizing software. In Proceedings of the SIGCHI conference on Human factors in computing systems Reaching through technology (CHI), 153–160. DOI:https://doi.org/10.1145/108844.108867 [116] Mehdi Malekzadeh, Mumtaz Mustafa, and Adel Lahsasna. A Review of Emotion Regulation in Intelligent Tutoring Systems. EDUC TECHNOL SOC 18, 4 , 435–445. DOI:https://www.jstor.org/stable/10.2307/jeductechsoci.18.4.435 199  [117] Sandra P. Marshall. 2002. The index of cognitive activity: Measuring cognitive workload. In Proceedings of the 7th  IEEE Human Factors Meeting, 5–9. Retrieved March 3, 2016 from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1042860 [118] Sandra P Marshall. 2007. Identifying cognitive state from eye metrics. Aviation, space, and environmental medicine 78, Supplement 1 (2007), B165--B175. [119] Pascual Martínez-Gómez and Akiko Aizawa. 2014. Recognition of understanding level and language skill using measurements of reading behavior. In Proceedings of the 19th International Conference on Intelligent User Interfaces (IUI), 95–104. DOI:https://doi.org/10.1145/2557500.2557546 [120] Richard E. Mayer. 2009. Multimedia learning (2nd ed ed.). Cambridge University Press, Cambridge ; New York. [121] Paul Meara. 2010. EFL Vocabulary Tests (second edition ed.). Lognostics, Swansea: Wales. [122] Paul Meara and Glyn Jones. 1990. Eurocentres Vocabulary Size Test 10KA. Eurocentres Learning Service, Zurich. [123] Ronald Metoyer, Qiyu Zhi, Bart Janczuk, and Walter Scheirer. 2018. Coupling Story to Visualization: Using Textual Analysis as a Bridge Between Data and Interpretation. In 23rd International Conference on Intelligent User Interfaces (IUI), 503–507. DOI:https://doi.org/10.1145/3172944.3173007 [124] Vibhu O. Mittal. 1997. Visual Prompts and Graphical Design: A Framework for Exploring the Design Space of 2-D Charts and Graphs. In Proceedings of the Fourteenth National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence (AAAI’97/IAAI’97), 57–63. Retrieved December 3, 2014 from http://dl.acm.org/citation.cfm?id=1867406.1867415 [125] Mohamed Mouine and Guy Lapalme. 
2012. Using Clustering to Personalize Visualization. In Proceedings of the 16th International Conference on Information Visualisation (IV), 258–263. DOI:https://doi.org/10.1109/IV.2012.51 [126] Florian Mueller and Andrea Lockerd. 2001. Cheese: tracking mouse movement activity on websites, a tool for user modeling. In Extended Abstracts on Human Factors in Computing Systems (CHI), 279–280. DOI:https://doi.org/10.1145/634067.634233 200  [127] Mary Muir and Cristina Conati. 2012. An Analysis of Attention to Student – Adaptive Hints in an Educational Game. In 11th International Conference Intelligent Tutoring Systems (ITS), 112–122. DOI:https://doi.org/10.1007/978-3-642-30950-2_15 [128] Tamara Munzner. 2014. Visualization analysis and design. CRC Press, Taylor & Francis Group. [129] Kawa Nazemi, Reimond Retz, Jürgen Bernard, Jörn Kohlhammer, and Dieter Fellner. 2013. Adaptive Semantic Visualization for Bibliographic Entries. In Advances in Visual Computing (Lecture Notes in Computer Science), 13–24. DOI:https://doi.org/10.1007/978-3-642-41939-3_2 [130] Anneli Olsen. 2012. The tobii i-vt fixation filter. Tobii Technology (2012). Retrieved September 13, 2015 from http://www.tobii.com/global/analysis/training/whitepapers/tobii_whitepaper_tobiiivtfixationfilter.pdf [131] Kristien Ooms, Philippe De Maeyer, and Veerle Fack. 2014. Study of the attentive behavior of novice and expert map users using eye tracking. Cartography and Geographic Information Science 41, 1 (2014), 37–54. [132] Kristien Ooms, Philippe De Maeyer, Veerle Fack, Eva Van Assche, and Frank Witlox. 2012. Interpreting maps through the eyes of expert and novice users. International Journal of Geographical Information Science 26, 10 (2012), 1773–1788. [133] Alvitta Ottley, Huahai Yang, and Remco Chang. 2015. Personality as a Predictor of User Strategy: How Locus of Control Affects Search Strategies on Tree Visualizations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3251–3254. [134] Erol Ozcelik, Ismahan Arslan-Ari, and Kursat Cagiltay. 2010. Why does signaling enhance multimedia learning? Evidence from eye movements. Computers in Human Behavior 26, 1 (2010), 110–117. DOI:https://doi.org/10.1016/j.chb.2009.09.001 [135] Evan M. Palmer, Todd S. Horowitz, Antonio Torralba, and Jeremy M. Wolfe. 2011. What are the shapes of response time distributions in visual search? Journal of Experimental Psychology: Human Perception and Performance 37, 1 (2011), 58–71. DOI:https://doi.org/10.1037/a0020747 201  [136] Prateek Panwar and Christopher M. Collins. 2018. Detecting Negative Emotion for Mixed Initiative Visual Analytics. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (CHI), 1–6. DOI:https://doi.org/10.1145/3170427.3188664 [137] V. Pascual-Cid, L. Vigentini, and M. Quixal. 2010. Visualising Virtual Learning Environments: Case Studies of the Website Exploration Tool. 149–155. DOI:https://doi.org/10.1109/IV.2010.31 [138] P.I. Pavlik, H. Cen, and K.R. Koedinger. 2009. Performance Factors Analysis–A New Alternative to Knowledge Tracing. In Proceedings of the International Conference on Artificial Intelligence in Education, 531–538. [139] Sintija Petrovica, Alla Anohina-Naumeca, and Hazım Kemal Ekenel. 2017. Emotion Recognition in Affective Tutoring Systems: Collection of Ground-truth Data. Procedia Computer Science 104, (2017), 437–444. DOI:https://doi.org/10.1016/j.procs.2017.01.157 [140] Martha C. Polson. 2013. Foundations of Intelligent Tutoring Systems (1st ed.). 
Psychology Press. DOI:https://doi.org/10.4324/9780203761557 [141] Helmut Prendinger, Aulikki Hyrskykari, Minoru Nakayama, Howell Istance, Nikolaus Bee, and Yosiyuki Takahasi. 2009. Attentive interfaces for users with disabilities: eye gaze for intention and uncertainty estimation. Universal Access in the Information Society 8, 4 (2009), 339–354. DOI:https://doi.org/10.1007/s10209-009-0144-5 [142] Yves Rosseel. 2012. lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software 48, 2 (2012). DOI:https://doi.org/10.18637/jss.v048.i02 [143] Julian B. Rotter. 1966. Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs: General and Applied 80, 1 (1966), 1–28. DOI:https://doi.org/10.1037/h0092976 [144] P. Saraiya, C. North, and K. Duca. 2005. An Insight-Based Methodology for Evaluating Bioinformatics Visualizations. IEEE Transactions on Visualization and Computer Graphics 11, 4 (2005), 443–456. DOI:https://doi.org/10.1109/TVCG.2005.53 [145] Katharina Scheiter, Eric Wiebe, and Jana Holsanova. 2011. Theoretical and Instructional Aspects of Learning with Visualizations. Instructional Design: Concepts, 202  Methodologies, Tools and Applications. IGI Global. DOI:https://doi.org/10.4018/978-1-60960-503-2 [146] E Segel and J Heer. 2010. Narrative Visualization: Telling Stories with Data. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1139–1148. DOI:https://doi.org/10.1109/TVCG.2010.179 [147] Craig Speelman and Kim Kirsner. 2005. Beyond the Learning Curve. Oxford University Press. Retrieved October 1, 2013 from http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780198570417.001.0001/acprof-9780198570417 [148] Ben Steichen, Giuseppe Carenini, and Cristina Conati. 2013. User-adaptive information visualization: using eye gaze data to infer visualization tasks and user cognitive abilities. In Proceedings of the 2013 international conference on Intelligent user interfaces (IUI ’13), 317–328. DOI:https://doi.org/10.1145/2449396.2449439 [149] Ben Steichen, Cristina Conati, and Giuseppe Carenini. 2014. Inferring Visualization Task Properties, User Performance, and User Cognitive Abilities from Eye Gaze Data. TIIS 4, 2 (2014), 11. DOI:https://doi.org/10.1145/2633043 [150] Esther Strauss, Elisabeth M. S. Sherman, Otfried Spreen, and Otfried Spreen. 2006. A compendium of neuropsychological tests: administration, norms, and commentary (3rd ed ed.). Oxford University Press, Oxford ; New York. [151] Robert H. Tai, John F. Loehr, and Frederick J. Brigham. 2006. An exploration of the use of eye-gaze tracking to study problem-solving on standardized science assessments. International Journal of Research & Method in Education 29, 2 (October 2006), 185–208. DOI:https://doi.org/10.1080/17437270600891614 [152] Hui Tang, Joseph J. Topczewski, Anna M. Topczewski, and Norbert J. Pienta. 2012. Permutation test for groups of scanpaths using normalized Levenshtein distances and application in NMR questions. In Proceedings of the Symposium on Eye Tracking Research and Applications, 169–172. DOI:https://doi.org/10.1145/2168556.2168584 [153] David Thue, Vadim Bulitko, Marcia Spetch, and Eric Wasylishen. 2007. Interactive Storytelling: A Player Modelling Approach. In Proceedings of the Third AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 43–48. 203  [154] Dereck Toker and Cristina Conati. 2014. 
Eye tracking to understand user differences in visualization processing with highlighting interventions. In Proceedings of the 22nd international conference on User Modeling, Adaptation, and Personalization (UMAP). DOI:https://doi.org/10.1007/978-3-319-08786-3_19 [155] Dereck Toker and Cristina Conati. 2017. Leveraging Pupil Dilation Measures for Understanding Users’ Cognitive Load During Visualization Processing. In Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization (UMAP), 267–270. DOI:https://doi.org/10.1145/3099023.3099059 [156] Dereck Toker, Cristina Conati, and Giuseppe Carenini. 2018. User-adaptive Support for Processing Magazine Style Narrative Visualizations: Identifying User Characteristics that Matter. In Proceedings of the 23rd International Conference on Intelligent User Interfaces (IUI), 199–204. DOI:https://doi.org/10.1145/3172944.3173009 [157] Dereck Toker, Cristina Conati, Giuseppe Carenini, and Mona Haraty. 2012. Towards adaptive information visualization: on the influence of user characteristics. In Proceedings of the 20th International Conference on User Modeling, Adaptation, and Personalization (UMAP), 274–285. DOI:https://doi.org/10.1007/978-3-642-31454-4_23 [158] Dereck Toker, Cristina Conati, Ben Steichen, and Giuseppe Carenini. 2013. Individual user characteristics and information visualization: connecting the dots through eye tracking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 295–304. DOI:https://doi.org/10.1145/2470654.2470696 [159] Dereck J Toker. 2013. The Impact of Individual Differences on Visualization Effectiveness and Gaze Behaviour: Informing the Design of User Adaptive Interventions (MSc Thesis). University of British Columbia, Canada. DOI:https://doi.org/10.14288/1.0052188 [160] Dereck Toker, Sébastien Lallé, and Cristina Conati. 2017. Pupillometry and Head Distance to the Screen to Predict Skill Acquisition During Information Visualization Tasks. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI), 221–231. DOI:https://doi.org/10.1145/3025171.3025187 [161] Dereck Toker, Robert Moro, Jakub Simko, Maria Bielikova, and Cristina Conati. 2019. Impact of English Reading Comprehension Abilities on Processing Magazine Style Narrative Visualizations and Implications for Personalization. In Proceedings of the 204  27th ACM Conference on User Modeling, Adaptation and Personalization (UMAP), 309–317. DOI:https://doi.org/10.1145/3320435.3320447 [162] Dereck Toker, Ben Steichen, Matthew Gingerich, Cristina Conati, and Giuseppe Carenini. 2014. Towards Facilitating User Skill Acquisition - Identifying Untrained Visualization Users through Eye Tracking. In Proceedings of the 2014 international conference on Intelligent user interfaces (IUI). DOI:https://doi.org/10.1145/2557500.2557524 [163] James T. Townsend and F. Gregory Ashby. 1983. The stochastic modeling of elementary psychological processes. Cambridge University Press, Cambridge [Cambridgeshire] ; New York. [164] Edward R Tufte. 1997. Visual explanations: images and quantities, evidence and narrative. Graphics Press. [165] Marilyn L Turner and Randall W Engle. 1989. Is working memory capacity task dependent? Journal of Memory and Language 28, 2 (April 1989), 127–154. DOI:https://doi.org/10.1016/0749-596X(89)90040-5 [166] Trisha Van Zandt. 2000. How to fit a response time distribution. Psychonomic Bulletin & Review 7, 3 (September 2000), 424–465. 
DOI:https://doi.org/10.3758/BF03214357 [167] Maria C. Velez, Deborah Silver, and Marilyn Tremaine. 2005. Understanding visualization through spatial ability differences. In Proceedings of the IEEE Conference on Visualization, 511–518. DOI:https://doi.org/10.1109/VISUAL.2005.1532836 [168] Edward K. Vogel, Geoffrey F. Woodman, and Steven J. Luck. 2001. Storage of features, conjunctions, and objects in visual working memory. Journal of Experimental Psychology: Human Perception and Performance 27, 1 (2001), 92–114. DOI:https://doi.org/10.1037//0096-1523.27.1.92 [169] Stefanos Vrochidis, Ioannis Patras, and Ioannis Kompatsiaris. 2011. An Eye-tracking-based Approach to Facilitate Interactive Video Search. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR ’11), 43:1–43:8. DOI:https://doi.org/10.1145/1991996.1992039 [170] T. Franklin Waddell, Joshua R. Auriemma, and S. Shyam Sundar. 2016. Make it Simple, or Force Users to Read?: Paraphrased Design Improves Comprehension of End 205  User License Agreements. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 5252–5256. DOI:https://doi.org/10.1145/2858036.2858149 [171] David A. Walker. 2003. Converting Kendall’s Tau For Correlational Or Meta-Analytic Analyses. Journal of Modern Applied Statistical Methods 2, 2 (2003), 525–530. DOI:https://doi.org/10.22237/jmasm/1067646360 [172] Jennifer Wiley, Christopher A. Sanchez, and Allison J. Jaeger. 2014. The Individual Differences in Working Memory Capacity Principle in Multimedia Learning. In The Cambridge Handbook of Multimedia Learning (2nd ed.), Richard Mayer (ed.). Cambridge University Press, Cambridge, 598–620. DOI:https://doi.org/10.1017/CBO9781139547369.029 [173] Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI conference on human factors in computing systems (CHI), 143–146. DOI:https://doi.org/10.1145/1978942.1978963 [174] Beverly Park Woolf. 2010. Building intelligent interactive tutors: student-centered strategies for revolutionizing e-learning. Morgan Kaufmann. [175] Ying Zhu. 2007. Measuring Effective Data Visualization. In Proceedings of the 3rd International Symposium on Advances in Visual Computing (ISVC), 652–661. DOI:https://doi.org/DOI https://doi.org/10.1007/978-3-540-76856-2_64 [176] C. Ziemkiewicz and R. Kosara. 2008. The Shaping of Information by Visual Metaphors. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1269–1276. DOI:https://doi.org/10.1109/TVCG.2008.171 [177] Caroline Ziemkiewicz, R. Jordan Crouser, Ashley Rye Yauilla, Sara L. Su, William Ribarsky, and Remco Chang. 2011. How locus of control influences compatibility with visualization style. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology (VAST), 81–90. DOI:https://doi.org/10.1109/VAST.2011.6102445 [178] 2013. Skytree Adviser. Retrieved July 26, 2019 from https://web.archive.org/web/20131129110924/http://www.skytree.net/products-services/adviser-beta  206  Appendix A This appendix explains a statistical flaw in IUI’18 [156] and how it was corrected. The new updated results are now contained within Chapter 6, in Section 6.4.1.  
Our typical analysis pipeline for our user study data (also discussed in Section 1.5) consisted of running a General Linear Model in SPSS on study performance data (e.g., time on task, subjective measures), and then using Mixed Models in SPSS on eye tracking data. In an attempt to reconcile both steps, we explored the feasibility of using Structural Equation Models (SEM) with the lavaan package in R [142], and selected this method to perform an initial analysis of performance and subjective data from MSNV Study 1, which was published as a short paper at IUI'18 [156]. However, we later discovered a crucial error in our model specification that compromised some of the reported results, and further learned that SEM would not be suitable for our purposes. Specifically, we discovered that our model specification in [156] was missing the two random effects that were needed to properly model our repeated-measures study data, in which all users were exposed to the same set of 15 documents. The first random effect, user_id, was needed to account for within-subject correlation, since multiple measurements were collected from the same user. The second random effect, document_id, was needed to account for within-document correlation, since repeated measurements were collected from the same MSNV document across users. Not specifying random effects runs the risk of producing spurious significant results, because the random effects tell the model where there is non-independence among repeated samples; otherwise, each observation is treated as independent. For example, in MSNV Study 1, there were 56 users each performing tasks with 15 different documents, yielding 840 observations. Without specifying the random effect for user_id (i.e., that different sets of 15 observations came from the same user), the model would treat the 840 observations as if they came from 840 different users, which erroneously increases model power and can lead to significant results that are not true. At first, we attempted to correct this by specifying the two random effects using SEM. However, SEM is only able to model one random effect at a time (known as Multilevel SEM), and would require running two separate models (one for each random effect) and then attempting to reconcile both sets of results [99], which is a non-trivial and advanced task. Instead, we devised a solution with the assistance of several paid consultations offered by the UBC Department of Statistics consulting group (SCARL). Using the lmerTest package in R [106], we devised the necessary model specifications and methodology to correctly analyze both performance and gaze data from our user studies. As mentioned above, the corrected results from IUI'18 [156] are now reported in Section 6.4.1. Below, Table A.1 shows which results remained, and which ones were lost.

Main Effect of User Characteristic   Time on Task   Comprehension Accuracy   Ease-of-Understanding   Document Interest
Verbal Working Memory                Significant
Reading Proficiency                  Significant
Visualization Literacy               Significant
Need for Cognition                   Significant                             Significant             Significant
Spatial Memory                       Significant                             Significant

Table A.1: There were eight significant main effects (p < .05) reported in IUI'18 [156] of user characteristics on performance/subjective measures from MSNV Study 1. Only three main effects remained significant after the analysis was corrected.
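To make the corrected model specification concrete, the following is a minimal sketch of the kind of mixed model we fit with lmerTest (the variable and data frame names are illustrative, not the actual ones from our analysis scripts):

  library(lmerTest)

  # Mixed model with the two crossed random intercepts described above:
  # (1 | user_id) accounts for within-subject correlation, and
  # (1 | document_id) accounts for within-document correlation.
  model <- lmer(time_on_task ~ verbal_wm + reading_ability +
                  (1 | user_id) + (1 | document_id),
                data = msnv_data)
  summary(model)   # lmerTest adds p-values for the fixed effects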
During this time, I also became deeply concerned that we had potentially made the same error of random effect omission during my master's work with the Bar/Radar study (UMAP'12 [157]) and in my early PhD work on the Intervention study (CHI'14 [32]). In both cases, the analyses had been conducted using SPSS's graphical user interface for General Linear Model repeated measures, and it was not clear to me whether the random effects had been properly included, since SPSS had never made explicit use of the term 'random effect'. To this end, I decided to re-run both analyses using the new Mixed Model approach we had devised, since I could be fully confident that the random effects would be specified correctly. Fortunately, all of the results held. In fact, the results were identical down to the least significant digit of the p-values, indicating to me not only that the mathematics underlying both approaches are likely identical or very similar, but more importantly, that SPSS's GLM repeated measures interface had properly accounted for both random effects (i.e., within-subject and within-task correlation).

Appendix B

This appendix contains supplementary material for Section 2.3.4 in Chapter 3. The full set of correlation scores among the user characteristics measured in the Intervention Study is shown below in Table B.1.

                        Perceptual Speed   Verbal Working Memory   Visual Working Memory   Locus of Control   Expertise – Simple
Verbal Working Memory   0.01
Visual Working Memory   0.07               0.15
Locus of Control        -0.27*             0.12                    -0.06
Expertise – Simple      -0.06              0.05                    0.19                    0.11
Expertise – Complex     -0.20              -0.03                   -0.03                   0.05               0.47*

Table B.1: Pearson correlations between all of the user characteristics investigated in Chapter 2. None of the correlations are significant except for a strong positive correlation (r = 0.47, p < .01) between Expertise–Simple and Expertise–Complex, and a weak negative correlation (r = -0.27, p < .01) between Perceptual Speed and Locus of Control.
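For reference, pairwise Pearson correlations of this kind, together with their significance levels, can be computed in R along the following lines (a generic sketch using invented data, not our actual analysis script):

  library(Hmisc)

  # Toy stand-in for the study data: one column per user characteristic measure.
  set.seed(1)
  uc <- data.frame(perceptual_speed = rnorm(35),
                   verbal_wm        = rnorm(35),
                   visual_wm        = rnorm(35),
                   locus_of_control = rnorm(35))

  res <- rcorr(as.matrix(uc), type = "pearson")
  round(res$r, 2)   # correlation matrix (the quantity reported in Table B.1)
  round(res$P, 3)   # corresponding p-values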
