A NEURAL ARCHITECTURE FOR DETECTING USER CONFUSION IN EYE-TRACKING DATA

by

Shane Sims

B.Sc., The University of Calgary, 2018
B.A., The University of Calgary, 2011

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2020

© Shane Sims, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled:

A Neural Architecture for Detecting User Confusion in Eye-tracking Data

submitted by Shane Sims in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Examining Committee:
Cristina Conati, Department of Computer Science, UBC — Supervisor
Robert Xiao, Department of Computer Science, UBC — Supervisory Committee Member

Abstract

Encouraged by the success of deep learning in a variety of domains, we investigate the effectiveness of a novel application of such methods for detecting user confusion with eye-tracking data. We introduce an architecture that uses RNN and CNN sub-models in parallel to take advantage of the temporal and visuospatial aspects of our data. Experiments with a dataset of user interactions with the ValueChart visualization tool show that our model outperforms an existing model based on a Random Forest classifier, resulting in a 22% improvement in combined sensitivity & specificity. This is a larger improvement in performance than that achieved by either the CNN or RNN when considered alone, though all three deep learning models outperform the Random Forest baseline. To investigate this effect and understand the performance increase achieved by the deep learning models, we carried out preliminary investigations using explainable AI methods, from which we derive future directions for exploring performance gains from combining deep learning models.

Lay Summary

Artificial Intelligence (AI) is the field of study concerned with creating thinking machines out of computers. What distinguishes AI from other software is its ability to engage in tasks typically requiring human intelligence [14], like recognizing affect in others. In this work we apply AI methods to detect the affective state of confusion from eye-tracking data: a record of a user's pupil size, gaze path, and head distance from the screen, over a period of time. By combining AI methods for processing sequences (eye-tracking data) and images (an image of the user's gaze path across the computer screen), we obtain higher performance than either method considered alone, as well as than methods used by others in the past. This work forms a step towards an intelligent system that can detect and adapt to user confusion in real time, by providing a model with higher predictive capability than was previously available.

Preface

This Master's thesis is the outcome of a research project done at the University of British Columbia, with M.Sc. student Vanessa Putnam as a collaborator for early parts of the work and advised by Professor Cristina Conati throughout. Shane Sims wrote all parts of the work presented in this thesis, while Vanessa Putnam contributed significantly to the initial work concerned with using RNNs (Chapter 4.1), by researching possible RNN variants and the pros and cons of each, as well as processing the initial dataset for use with the RNNs (Chapter 3).
The following papers were published collaboratively during the course of this research project, a version of which forms the bulk of this thesis, except for Chapters 6 and 7:

• Sims, S.D., Putnam, V., and Conati, C., 2019. Predicting Confusion from Eye-Tracking Data with Recurrent Neural Networks. International Joint Conferences on Artificial Intelligence (IJCAI) 2019, Humanizing AI (HAI) Workshop. (best paper award)
• Sims, S.D., Conati, C., 2020. A Neural Architecture for Detecting User Confusion in Eye-tracking Data. To appear in Proceedings of the International Conference on Multimodal Interaction (ICMI 2020).¹ (best paper award nominee)

¹ This paper has been nominated for the best paper award by all three reviewers, who gave it a 5/5 rating. The reviews are included in Appendix A.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication
Chapter 1: Introduction
Chapter 2: Related Work
Chapter 3: Dataset
  3.1 Data Collection
  3.2 Data Pre-processing
Chapter 4: Models and Approach
  4.1 Recurrent Neural Networks
  4.2 Convolutional Neural Network
  4.3 Visuospatial Temporal Network
  4.4 Implementation
Chapter 5: Evaluation
  5.1 Experiment Setup
  5.2 Results Comparing GRU and RF
  5.3 Results Comparing CNN, GRU, and VTNet
Chapter 6: Investigating how to Interpret VTNet Performance
  6.1 Determining Signal Importance
  6.2 Understanding the Contents of the Learned Representation
  6.3 Understanding the Decision Process
Chapter 7: Additional Attempts to Increase Performance
  7.1 Add noise to inputs
  7.2 SMOTE Hidden States
Chapter 8: Conclusion & Future Work
Bibliography
Appendices
  Appendix A ICMI 2020 Reviews
  Appendix B

List of Tables

Table 1 Test set performance of GRU and RF in Chapter 5.2
Table 2 Test set performance of neural models in Chapter 5.3
Table 3 RNN performance with each combination of eye-tracking signals in Chapter 6.1
Table 4 Effect of feature combination on model performance
Table 5 Comparison of RF and LSTM with raw and fixation sequences and SMOTE augmented data in Chapter 6.2
Table 6 Comparison of sequence length between fixation-based and raw cyclically split sequences in Chapter 6.2

List of Figures

Figure 1 An example of ValueChart in Chapter 3.1
Figure 2 A datapoint of raw ET samples extracted from a study trial in Chapter 3.1
Figure 3 Demonstrating cyclical split of a raw data item in Chapter 3.2.1
Figure 4 The RNN architecture used in this paper in Chapter 4.1
Figure 5 An example of a scanpath in Chapter 4.2
Figure 6 The CNN architecture used in this paper in Chapter 4.2
Figure 7 The VTNet architecture in Chapter 4.3
Figure 8 ECG signal landmarks and example of a learned input mask
Figure 9 Visualization of the input mask learned for an example correctly classified as 'confused' in Chapter 6.3
Figure 10 Visualization of the input mask learned for an example incorrectly classified as 'not confused' in Chapter 6.3
Figure 11 Visualization of the input mask learned for an example correctly classified as 'not confused' in Chapter 6.3
Figure 12 Visualization of the input mask learned for an example incorrectly classified as 'confused' in Chapter 6.3
Figure 13 x (blue) and y pixel (green) location of confused and not confused class item mask components over time in Chapter 6.3
Figure 14 x (left) and y (right) coordinates of confused and not confused mask components over time in Chapter 6.3
Figure 15 Pixel coordinates of confused and not confused mask components over time in Chapter 6.3

List of Abbreviations

AI – Artificial Intelligence
ANOVA – Analysis of Variance
CNN – Convolutional Neural Network
CV – Cross-validation
ECG – Electrocardiogram
EEG – Electroencephalogram
ET – Eye-tracking
G – Gaze
GRU – Gated Recurrent Unit
HD – Head Distance
InfoVis – Information Visualization
ITS – Intelligent Tutoring System
LSTM – Long Short-Term Memory
NLP – Natural Language Processing
P – Pupil
RF – Random Forest
RNN – Recurrent Neural Network
SMOTE – Synthetic Minority Oversampling Technique
VTNet – Visuospatial Temporal Network
XAI – Explainable Artificial Intelligence

Acknowledgements

I would like to thank all of the faculty in the Department of Computer Science at the University of British Columbia for providing me with an excellent graduate-level computer science education and the opportunity to explore so many areas of AI in the form of course projects. I particularly thank my supervisor, Professor Cristina Conati, for her support and guidance in developing the work found in this thesis.

I would also like to thank my lab mate Vanessa Putnam, not only for her friendship but also for her help in the early stages of this project. In addition, thanks are owed to my lab mates Oswald Barral and Mateo Rizzo, who both helped review this work and offered thoughtful suggestions to improve it. I owe special thanks to Sébastien Lallé for his constant support and for always being available to answer my many questions.

Finally, I thank my long-time friend Kevin Malenfant.
Without your help all of those years ago, getting to where I am now would simply not have been possible.

Dedication

I dedicate this work to my wife, Brittany Sims. You have stuck by my side through so many changes and every instance of adversity I have faced. You believe in me when I don't believe in myself and you took care of me throughout this entire process. This work would truly not be possible without you. For all of these reasons and so many more, thank you. I love you.

Chapter 1: Introduction

There is increasing interest in creating AI agents that can predict their user's needs, states, and abilities, and then personalize the interaction with the user accordingly. This includes understanding and reacting to a user's affective state. One such state is confusion, which is particularly relevant to user experience while interacting with complex interfaces because when a user is confused, they can experience a decrease in satisfaction and performance (e.g., [36]). A system that can detect its user's confusion gains an awareness that can be leveraged to provide appropriate interventions to resolve such confusion. Detecting and resolving confusion is becoming especially relevant in supporting users interacting with Information Visualizations (InfoVis) because such visualizations are now widespread in our daily lives and confusion has been found to hinder their usage, especially when they increase in complexity (e.g., [33]).

Prior work [31] showed that confusion during visualization processing can be detected using a Random Forest (RF) classifier and features based on summative statistics of eye-tracking (ET) data (user gaze, pupil size, and head distance from the screen) computed as the interaction unfolds. This classifier achieved 57% and 91% accuracy in predicting confusion and lack thereof, respectively. In this paper, we investigate whether we can improve upon the results of [31] by employing a deep learning model to detect confusion from the same ET data set.

The use of deep learning is generally limited in research on modeling and adapting an interaction to user affect, partially due to the difficulty in collecting and labelling large amounts of relevant data. A large dataset is often required to train deep learning models because these models contain far more learnable parameters than traditional machine learning methods. Corpora of data are available for sentiment analysis [53], i.e. detecting positive vs. negative affect (valence) from text, because it is relatively easy to label valence, at least as compared to generating labels for finer-grained emotional states. There has also been work in using deep learning to detect affect from acted emotions in a video (e.g., [12]) where the affective labels are known a priori. By comparison, collecting datasets for specific unscripted user affective states in interactive tasks is very laborious, and thus such datasets are usually small compared to those in domains where deep learning has been most successful (e.g., [25]). For this reason, approaches to predicting user affect mostly use classical machine learning methods similar to those used in [31]. There are two groups of notable exceptions. Works such as [6, 19, 23] seek to predict multiple emotions (including confusion) in students interacting with educational software.
They leverage Recurrent Neural Networks (RNNs) to learn from sequences of student interface actions but do so with engineered features based on knowledge of what is important while interacting with each system, thus not fully leveraging the RNN's ability to learn representations from low-level data. The second exception relates to work that used deep learning on EEG signals to predict emotional valence and arousal in users watching short videos (1 minute), designed to elicit specific emotional reactions [49, 35]. Thus, such work is geared toward providing proof of concept on the suitability of deep learning to capture affective signals from EEG data; it does not pertain to modeling and possibly responding to affect as users engage with an interactive system.

The scarcity of affective interaction data is exacerbated with ET data because collecting reliable data currently requires specialized equipment and collection in a lab setting. The dataset used in this paper is no exception, containing data from only 136 users. To address this issue, we propose a deep learning architecture purposefully designed to process eye-tracking data while being as lightweight as possible (Chapter 4), which achieves a 49% improvement in detecting confusion compared to [31], with no loss in detecting an absence of confusion.

Therefore, the first contribution of this work is that, to the best of our knowledge, we are the first to show the suitability of a deep learning approach for the task of classifying user affect from ET data. This result may have wider implications for the use of ET data in user modelling as a whole, where such data has been shown to have great potential for modelling not only affect (e.g., [5, 22, 30]) but also user cognitive abilities (e.g., [25]) and long-term traits (e.g., [45]). By demonstrating the effectiveness of using deep learning based methods with a relatively small eye-tracking dataset, we hope to provide an impetus for further research in this direction.

Our second contribution is the architecture we designed to achieve our results, which combines a Recurrent Neural Network (RNN) and a Convolutional Neural Network (CNN) to learn from sequential and visuospatial information in the ET data. Previous work that combined CNNs and RNNs dealt with temporal data (videos and EEG signals) suitable for having CNNs process input at each time step, as the RNN does [10, 44, 49, 35, 52]. This approach is not suitable for ET data (see Chapter 2). However, ET data has the property that a temporal sequence of the data can be represented in a single frame in a meaningful way, namely with a scanpath image that records spatial information about the aggregate eye-movements in the sequence. To leverage this property of ET data, our proposed architecture uses an RNN sub-model that takes sequential raw eye-tracking samples as input while a CNN sub-model processes the corresponding scanpath image in parallel. The sub-models are jointly trained in an end-to-end fashion as one unit. A formal evaluation of the model shows that it achieves better performance than either of its components do alone and significantly improves over previous work using non-deep learning methods [31]. We see our results as promising evidence that our proposed approach is worthy of further investigation as a general architecture, as interest in detecting user states from eye-tracking data continues to increase and more datasets of this type become available.

The rest of this thesis is structured as follows.
Chapter 2 discusses related work. Chapter 3 presents the dataset of user confusion leveraged in this paper. Chapter 4 presents the deep learning model we propose, including an RNN (Chapter 4.1), a CNN (Chapter 4.2), and the VTNet architecture that combines the two (Chapter 4.3). Chapter 5 describes the evaluation of our proposed approach. Chapter 6 describes preliminary work we have done to understand the outcomes of the deep learning models, i.e. to address interpretability, and Chapter 7 describes further attempts to enhance the performance of the models. Chapter 8 concludes the thesis and outlines possible directions for future work.

Chapter 2: Related Work

The body of work in predicting user affect with deep learning methods is relatively small (compared to tasks like image classification) and occurs mostly in computer vision and natural language processing (NLP), where established methods can be adapted for classifying emotion from images, video, and speech (e.g., [1, 12, 11]). Exceptions pertain to classifying the emotions of students interacting with an intelligent tutoring system (ITS) [6, 19, 23], and affective states from EEG signals [49, 35]. The ITS related works use RNNs to classify emotion from sequences of high-level interaction events (e.g., viewing a video lecture or textbook material, taking a quiz), which does not take full advantage of the RNN's ability to learn a representation from low-level data (e.g., mouse movements). Like our work, the works leveraging EEG data [49, 35] use raw signals but are concerned with predicting emotional valence and arousal in users viewing music videos explicitly designed to elicit specific emotional responses [29], as opposed to affective events spontaneously occurring during an interactive task.

Eye-tracking (ET) data has been shown to contain good predictors of both affective and cognitive states, such as mind-wandering [5], boredom and curiosity [22], affective valence [30] and learning [25] while interacting with educational software, user intention while playing a strategy game [20], reader difficulty with texts in foreign languages [38], and user confusion while interacting with a visualization-based decision support tool [31]. This latter work predicted confusion by combining features derived from eye-tracking and interaction data as input to a Random Forest (RF) classifier. The classifier learns from engineered features based on summative statistics (e.g., mean and standard deviation) of measures related to the user's gaze, pupil size, and distance of the head to the screen. These measures include, for instance, rate and duration of fixations (gaze maintained at a point), and length and angles of saccades (paths between fixations). We compare our deep learning based approach directly to this work.

The only work we identified that uses a deep learning approach to make predictions from ET data is one that aims to diagnose patient developmental disorders [40]. An RNN is used to learn patterns pertaining to the disorder from how patients look at a trained practitioner who is conducting a diagnostic interview. The model takes as input a temporal sequence that indicates if a patient was looking at certain regions (nose, jaw, etc.) of an interviewer's face or not, at each time step. Deep learning has also been used for gaze estimation (i.e. predicting the (x, y) coordinates of a person's gaze on a 2D plane) from images of the viewer's face (e.g., [56]).
Note that gaze estimation is what eye-tracking hardware does, and this is distinct from using the estimated gaze data for a predictive task, as done in this work.

There are several works that (like our own) combine the particular strengths of RNNs and CNNs. Most of these works (e.g., [10, 44, 51]) relate specifically to processing videos, using a model class known as Recurrent Convolutional Networks (RCNs). RCNs typically operate on an input of image sequences (i.e. the frames of a video), where at each step a CNN extracts visual features from the given frame and feeds them to the RNN, which models the temporal dynamics of the sequence. In addition, the aforementioned work [49, 35] on detecting affective valence and arousal from EEG signals uses an RNN and CNN on a sequence of multichannel EEG signals, where at each time step the RNN is fed the vector of channel values, while a CNN is given a matrix representing the same values, but arranged in a way that reflects the spatial relationship among the sensors placed on the user's head. This approach leverages the strength of CNNs in detecting patterns from spatial information, but the information must be provided at each time step to reflect the changing signal values. This approach is also used by [52] to predict user intentions from EEG signals.

All of these approaches combine a CNN and RNN at every time step and therefore do not decouple the temporal from the spatial aspects of the data completely. Providing a temporally developed scanpath as input to a CNN at every time step (analogous to the above approach) would be less meaningful in our context because the purpose of providing the scanpath in the first place is to give a high-level picture of the user's activity throughout an entire interaction. By providing such a single scanpath to the CNN, our approach allows for processing this high-level spatial representation of the user's overall activity prior to an episode of confusion, which complements more local temporal information about potential episodes of confusion generated by the RNN from raw sequences. Combining the CNN and RNN in this way is also beneficial computationally: while in previous work the CNN had to operate on a datapoint at each time step, our method requires the CNN to operate only once per datapoint, an important consideration for deploying a model to a system that intends to detect and address a user's confusion in real-time.

Chapter 3: Dataset

3.1 Data Collection

The dataset used in this thesis is the same one used in [31]. It was generated via a study designed to collect labelled data for episodes of confusion from users interacting with ValueChart [7], an interactive visualization-based tool for supporting decision making.

Figure 1 shows an example of ValueChart configured for selecting rental properties from a set of alternatives (represented by the rows in the chart), based on a set of relevant attributes represented as columns (e.g., rent, location). The width of each column indicates the importance (weight) of the corresponding attribute. The amount of filled color in each cell specifies how the corresponding alternative fares with respect to the related attribute. The stacked bars to the right group all values for each alternative, displaying its overall value (e.g., home4 in Figure 1 has the best overall value).
Users can inspect the value of each attribute (e.g., the rent of home1) by left-clicking on the related alternative, they can sort the alternatives based on a specific attribute by double-clicking on its name, they can swap attribute positions, and they can change an attribute's importance by resizing the width of its column.

Figure 1: An example of ValueChart to choose a house from available options (rows) based on their attributes (columns).

Although extensively evaluated for usability [48], the complexity of the decision tasks means that users can still experience confusion while interacting with ValueChart.

In the study that generated the dataset, 136 participants performed tasks with ValueChart, relevant to exploring available options for a home rental decision problem. There were 5 task types (e.g., retrieve the cheapest home, select the best home based on size and location), each repeated 8 times, resulting in 5440 tasks (mean duration = 22.3 s, standard deviation = 18.4 s). The user's eyes were tracked with a Tobii T120 eye-tracker embedded in the study computer's monitor. In addition to the gaze position, this eye-tracker also collects information on user pupil size and head distance from the screen. To collect ground truth labels for confusion, users self-reported their confusion during a task by clicking on a button labelled "I am confused" (top right in Figure 1). The confusion reports were verified at the end of the study session by asking users to confirm them after seeing replays of relevant interaction segments. This process resulted in 112 (2%) tasks with reported confusion (there was never more than one report per task) and 5328 without. This highly imbalanced dataset confirms that, overall, ValueChart has good usability but user confusion can still happen. In fact, 60% of users reported confusion at least once, indicating that confusion is worth capturing as a signal that the user needs help.

Each datapoint in the dataset is a task segment that ends when a confusion self-report occurs, or at a randomly selected pivot point for tasks where no confusion was reported (see Figure 2), with an average duration of 13.7 s (standard deviation 11.3 s). As Figure 2 shows, the last second of data before a confusion report is removed to exclude signs of the intention to push the "I am confused" button. Before using this dataset, we apply general preprocessing steps aimed at removing invalid gaze data (similar to those performed in [31]), which include removing elements of the dataset with less than 2 seconds worth of data.²

² See Appendix B for a description of all of the steps used to prepare the dataset.

Figure 2: A datapoint of raw ET samples extracted from a study trial.
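The segmentation just described is straightforward to implement. The sketch below illustrates it under stated assumptions: a `time` column in seconds alongside the raw ET channels, a hypothetical `extract_segment` helper, and a uniform random choice of pivot point (the thesis does not specify how pivots were drawn). It is an illustration, not the original study code.

```python
import numpy as np
import pandas as pd

SAMPLE_HZ = 120          # Tobii T120 sampling rate
EXCLUDE_S = 1.0          # second dropped just before a confusion report
MIN_LEN_S = 2.0          # segments shorter than this are discarded

def extract_segment(trial: pd.DataFrame, report_time, rng: np.random.Generator):
    """Cut one labelled datapoint out of a trial's raw ET samples.

    `trial` is assumed to hold one task's samples with a 'time' column in
    seconds; `report_time` is the confusion self-report time, or None for
    tasks without a report.
    """
    if report_time is not None:
        # keep everything up to 1 s before the "I am confused" click
        end = report_time - EXCLUDE_S
    else:
        # choose a random pivot point inside the trial (assumed uniform)
        end = rng.uniform(trial["time"].min() + MIN_LEN_S, trial["time"].max())
    segment = trial[trial["time"] <= end]
    if len(segment) < MIN_LEN_S * SAMPLE_HZ:
        return None          # too little valid data to use
    return segment
```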
3.2 Data Pre-processing

The Tobii T120 eye-tracker collects raw eye-tracking samples at a rate of 120 Hz. This raw data is usually processed with proprietary software into sequences of fixations, identified by clustering raw data to distinguish small eye-movements from real attention shifts. Leveraging fixations and saccades (the gaze paths between fixations) is the standard way to analyze ET data. In fact, the results on detecting confusion by Lallé et al. (2016) [31], the gold standard to which we compare our work, showed that summary statistics around fixations and saccades are strong features for classifying confusion. In contrast, we leverage deep learning to learn from the raw ET samples, the lowest level of data available from the eye-tracker, to ascertain whether this provides any further discriminators useful for classifying confusion. Any patterns that could be lost in going to a higher level of data abstraction are necessarily maintained at this level, where the model has the opportunity to discover these patterns, as well as any interactions among them [3].

Figure 3 (left) shows an example of a datapoint consisting of a sequence of raw ET samples, namely a 2D array with the number of rows corresponding to the number of samples captured in one of the confused/not confused datapoints described in Chapter 3.1 (and shown in Figure 2). Each ET sample (a row in Figure 3, left) includes 4 measures for each eye: the x and y gaze coordinates (Gx, Gy in Figure 3, left) on the study screen, the size of the pupil (P), and the distance of that eye (as a proxy for the head) from the screen (HD).

An advantage of learning from the raw ET samples is that they can support ad hoc data augmentation. Data augmentation is commonly used to deal with limited data availability and, in its simplest incarnation, involves duplicating datapoints exactly (random over-sampling) [15]. Because of the nature of our data, we can do something better. We observe that in our datapoints (i.e., sequences of raw ET samples), values change only by a small amount from one sample to the next, because of the high sampling rate. This can be seen by looking at adjacent rows on the left of Figure 3. Given this observation, we split the sequence of ET samples in each datapoint into four separate datapoints with the same label of confusion or lack thereof. We do so by performing a cyclic split (e.g., as when dealing a deck of cards), which preserves the temporal structure of the time series data. Figure 3 demonstrates this splitting process: samples (rows) that are four steps apart in the 2D array to the left (coded with the same color in the figure) are assigned to the same split datapoint to the right. Thus, a datapoint with n samples is cyclically split into four datapoints, each containing n/4 samples.

Figure 3: The 2D array to the left is an example of a datapoint consisting of ET samples (rows). This datapoint is cyclically split to create four separate datapoints (right): rows that are four steps apart in the left table (coded with the same color) are assigned to the same split datapoint to the right.

This cyclical split provides our deep learning models with multiple opportunities to learn from the same datapoint in a more intelligent way than by simply duplicating it. The difference between the resulting items provides intra-class variance, while the cyclic partition ensures the preservation of the data's sequential pattern. A different approach to data augmentation that has been used with signal data is to create multiple datapoints by slicing each datapoint using a sliding window, as in [49]. This approach is suitable when class discriminators are present in similar forms throughout the entire sequence (e.g., an EEG signal that captures a lingering emotion, as in [49]) because the sliding window breaks up the data sequence into segments that are essentially equivalent in terms of predictive power. We do not use this method because of the nature of confusion, which makes it unwarranted to assume that indicators of confusion are present to the same degree throughout the entire signal. A difficulty in using raw ET data collected at a high sampling rate is the length of the resulting sequences (as discussed in the next chapter). The cyclical split also helps with this issue because it reduces the length of each datapoint by a factor of four.
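Because the split is a simple stride over the sample index, it can be written in a few lines. The sketch below is a minimal NumPy illustration of the cyclic split described above (a hypothetical `cyclic_split` helper, not the thesis code); each resulting array keeps every fourth sample, offset by 0–3, and inherits the label of the original datapoint.

```python
import numpy as np

def cyclic_split(datapoint: np.ndarray, n_splits: int = 4):
    """Split a (n_samples, n_channels) array of raw ET samples into
    n_splits interleaved sub-sequences, like dealing a deck of cards.

    Row i of the original goes to sub-sequence i % n_splits, so each
    sub-sequence preserves the temporal order of its own samples.
    """
    return [datapoint[offset::n_splits] for offset in range(n_splits)]

# Example: one datapoint of 1200 samples x 8 channels (Gx, Gy, P, HD per eye)
raw = np.random.rand(1200, 8)
splits = cyclic_split(raw)            # four arrays of shape (300, 8)
augmented = [(s, 1) for s in splits]  # all four inherit the 'confused' label
```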
Chapter 4: Models and Approach

This chapter describes the intuition behind using an RNN and a CNN on ET data and combining them in a way that is appropriate for the data. Due to the relatively small size of our dataset, in each case it was important to minimize model complexity. Thus, reducing the number of learnable parameters to avoid overfitting was the driving force behind the various design choices described in this chapter.

4.1 Recurrent Neural Networks

RNNs are a neural network variant especially suited for sequential data, such as ET data. We chose to investigate RNNs because of the nature of confusion itself. As an affective state, confusion doesn't occur instantly. Rather, it develops over a period of time as the brain uncovers discrepancies between its existing knowledge and what it observes, and continues with subsequent attempts to resolve these discrepancies until the person either resolves their confusion or gives up [11]. Confusion may develop based on events further back in time, in a strictly local sequence, or as a combination of both. RNNs are able to handle such varied temporal dependencies, which is why they were chosen for this investigation.

Two variations of RNN have become popular for modelling temporal data: Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU). LSTMs are gated RNNs that use self-loops to facilitate the learning of long-term dependencies while also ensuring long-term gradient flow [17]. A GRU is essentially a simplified LSTM that reduces the number of gates and thus the number of learnable parameters [9]. Because of this reduction in parameters, we chose to use the GRU as the RNN sub-model in our architecture.

Based on evidence that for RNNs, neural network depth in the traditional sense (i.e. the number of layers) is often not as important as recurrent depth for classification tasks [54], we limit our model to a single layer, thus limiting complexity. Figure 4 visualizes the GRU architecture we use. We chose a hidden layer of 256 units during hyperparameter tuning using common heuristics [16]. The GRU's hidden layer is fully connected to each of the input elements, namely the values of an ET sample for a given time step. At each time step, the GRU produces an output value interpreted as a probability for the confused and not confused classes, using the softmax equation. This output at the end of a datapoint is the prediction of confusion or not for the corresponding trial (see Figure 4, right).

While there is no fixed length on which RNNs must operate, in practice sequences should be shorter than 400 steps (and often much shorter) [37]. Even after the cyclical split described in Chapter 3.2, 50% of our datapoints have a length longer than 600 ET samples. We address this issue by considering only 5 seconds of relevant ET samples before a confusion self-report (or placeholder for no confusion) in each data item, since Lallé et al. (2016) found this interval to perform as well as when considering the full length of data back to the start of the trial.³

³ These 5 seconds exclude the one second just before the report, as discussed in Chapter 3.1 and Figure 2.

Figure 4: The RNN architecture used in this paper.
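A minimal PyTorch sketch of this sub-model, under the hyperparameters stated above (single-layer GRU, 256 hidden units, two-class softmax read out from the final hidden state), is given below. The eight input channels follow the per-eye measures described in Chapter 3.2; the layer names, the exact read-out, and the ~150-step example length are assumptions for illustration, with Figure 4 as the authoritative description.

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    """Single-layer GRU over raw ET samples; classifies from the last hidden state."""

    def __init__(self, n_channels: int = 8, hidden_size: int = 256, n_classes: int = 2):
        super().__init__()
        self.gru = nn.GRU(input_size=n_channels, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, n_channels)
        _, h_n = self.gru(x)                   # h_n: (1, batch, hidden)
        logits = self.out(h_n.squeeze(0))      # (batch, 2)
        return torch.log_softmax(logits, dim=1)  # log-probabilities, suits NLL loss

# Example batch: ~150 steps per item (5 s at 120 Hz, after the cyclic split)
log_probs = GRUClassifier()(torch.randn(32, 150, 8))   # shape (32, 2)
```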
4.2 Convolutional Neural Network

Another way in which a sequence of raw ET samples can be represented is as a scanpath image. Given the coordinates in the raw eye-tracking samples, these images are created to contain the path made by the user's gaze over the sequence, where dots represent individual samples and connecting lines represent the transitions between two samples (shown in Figure 5).⁴ The temporal information of the gaze sequence is lost, but visuospatial information comes to the forefront. We leverage a CNN architecture to predict confusion in our datapoints from the scanpath images of the corresponding sequences of raw ET samples.

⁴ Such images are commonly available via the eye-tracker's software, based on fixations. Ours are based on raw samples.

Figure 5: Example scanpath image.

Because the sequence length does not change the size of the corresponding scanpath image, we use full sequences as input to the CNN, as opposed to the 5-second segments used for the RNN. This allows us to leverage the full information of the user's gaze activity over the trial, regardless of how long it lasts. Although this might seem unwarranted given that Lallé et al. [31] found no added benefit when considering full sequences vs. 5-second ones, their comparison was based on a uniform data representation consisting of summary statistics of gaze, pupil size, and head distance. Here we combine temporal information on 5 seconds of data with a different representation focusing on the user's complete attention patterns prior to the confusion report (or pivot point).
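The sketch below shows one way such a scanpath image could be rendered from raw gaze coordinates with Pillow. The screen size, dot radius, and grayscale down-sizing factor are parameters I have chosen or taken from the design choices discussed below; this is an illustrative assumption, not the exact rendering pipeline used in the thesis.

```python
from PIL import Image, ImageDraw

def render_scanpath(gaze_xy, screen=(1280, 1024), downscale=6, radius=3):
    """Draw a grayscale scanpath image from a sequence of (x, y) gaze samples.

    Dots mark individual samples, lines mark transitions between consecutive
    samples; the image is then shrunk by `downscale`.
    """
    img = Image.new("L", screen, color=255)                    # white, single channel
    draw = ImageDraw.Draw(img)
    draw.line([tuple(p) for p in gaze_xy], fill=0, width=1)    # transitions
    for x, y in gaze_xy:                                       # individual samples
        draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=0)
    return img.resize((screen[0] // downscale, screen[1] // downscale))

# Example: render the full-trial gaze path of one datapoint
scanpath = render_scanpath([(100, 200), (140, 230), (600, 410), (620, 400)])
```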
Scanpaths are rather different from the images that CNNs are typically used for. For instance, CNNs have been successfully used with natural images containing a hierarchy of parts (e.g., a car's wheels and their subcomponents) as well as properties such as colour and texture, which CNNs model in their various layers [14]. No such hierarchies nor properties appear in a scanpath image. Instead, scanpaths capture a strictly visual and spatial (visuospatial) representation of gaze data where dots visualize where given gaze samples are located in relation to the others, the density of dots indicates the amount of user attention to a specific area, and connecting lines indicate the relative length and frequency of the saccades to and from that area. The image as a whole provides information about the user's overall attention over the interface. A CNN can capture these relevant scanpath characteristics.

We chose some of the hyperparameters of the CNN architecture (shown in Figure 6) with knowledge of our data and the CNN model class in mind, while balancing the competing goal of minimizing learnable model parameters to prevent overfitting to our small dataset. The choices made to balance these competing goals are as follows:

1. As scanpaths consist of dots and lines, the deep hierarchies associated with natural images are not required. As such, our CNN consists of two convolutional layers (see Figure 6) of 16 and 6 channels, respectively. We determined these hyperparameters by increasing each from one until the validation set performance decreased. Having two layers makes sense, as this is enough to extract simple visual features while avoiding the additional parameters that come from unnecessary layers and overfitting to patterns unique to the training data.

2. Although having only two convolutional layers is advantageous for the reasons described above, it prevents the model from building a large receptive field (important for capturing local information) via depth. To balance this, we use a slightly larger kernel size than is common (5x5 versus the more common 3x3) in order to increase the receptive field's width directly (the kernel is the dark red square shown over the input image and in a subsequent layer in Figure 6). Though a larger kernel size requires more weights, the increase is much less than would come from additional convolutional layers, thus satisfying our goal of building a small model.

3. We make two changes related to the input. First, as colour has no meaning in a scanpath image, we use a single grayscale input channel, to further reduce the number of parameters. Second, as our images do not contain fine or nuanced textures (like the hair of an animal, for instance), high resolution is not important. Thus, we downsize the images by a factor of 6, to reduce the dimensions and parameters of each convolutional layer. This single-channel, low-resolution input image (and its dimensions) is denoted as the input layer in Figure 6.

Finally, the CNN contains a 50-unit hidden layer connecting the output of the convolutions with the class predictions in the output layer (right of Figure 6). The size of this hidden layer was chosen as a reasonable progression between the numerous neurons resulting from the convolutions and the two-unit output layer.

Figure 6: The CNN architecture used in this paper.

4.3 Visuospatial Temporal Network

Having developed the intuition behind using each of the RNN and CNN on eye-tracking data, here we describe an architecture to leverage the strengths of both models together. In our approach, each of the CNN and RNN takes a different representation of the same data sequence and processes it independently. This model (visuospatial-temporal network, or VTNet from now on) is shown in Figure 7. The GRU's 256-unit hidden state that results from processing a datapoint is concatenated with the 50-element vector output of the CNN resulting from processing the corresponding scanpath, creating a single vector of size 306. This combined output vector is fully connected to a simple neural network with one hidden layer (to create a differentiable classifier with minimal additional parameters), which classifies the input as either confused or not confused. The entire model is then learned end-to-end as a single unit.

Figure 7: The VTNet architecture.

Our hypothesis in creating the VTNet was that having a model that can process a multimodal representation of ET data will enhance its predictive abilities by having access to sequential information close to any confusion report as well as spatial information from earlier parts of the trial. This may be beneficial to predicting confusion if there are signals that occur earlier in the trial than the last 5 seconds available to the GRU.

Previous architectures that combine CNNs and RNNs (see Chapter 2) do so by feeding input to both sub-models at each time step and are thus not suitable for our learning task. This is because processing a scanpath image as it develops over time through a CNN, to extract features for RNN input, gives no more information than that already available in the raw sequence. Instead, ET data has the property that a given temporal sequence of data can be represented in a single frame in a meaningful way. That is, a given image of a user's entire scanpath contains information about the aggregate spatial eye-movement.
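A compact PyTorch sketch of how the two sub-models could be fused is shown below, assuming the dimensions given above (256-unit GRU state, 50-unit CNN feature vector, 306-dimensional concatenation, one small hidden layer before the two-class output). The pooling layers, the size of the fusion hidden layer, and the exact shapes of the convolutional stack are illustrative assumptions; Figure 7 remains the authoritative specification.

```python
import torch
import torch.nn as nn

class VTNetSketch(nn.Module):
    """GRU over raw ET sequences + CNN over the scanpath image, fused before a small classifier."""

    def __init__(self, n_channels: int = 8, hidden_size: int = 256,
                 cnn_features: int = 50, fusion_hidden: int = 64, n_classes: int = 2):
        super().__init__()
        # Temporal sub-model (Chapter 4.1)
        self.gru = nn.GRU(n_channels, hidden_size, batch_first=True)
        # Visuospatial sub-model (Chapter 4.2): two conv layers of 16 and 6 channels, 5x5 kernels
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(cnn_features), nn.ReLU(),   # 50-unit hidden layer
        )
        # Fusion classifier over the 256 + 50 = 306-dimensional joint vector
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + cnn_features, fusion_hidden), nn.ReLU(),
            nn.Linear(fusion_hidden, n_classes),
        )

    def forward(self, seq, scanpath):           # seq: (B, T, 8); scanpath: (B, 1, H, W)
        _, h_n = self.gru(seq)
        joint = torch.cat([h_n.squeeze(0), self.cnn(scanpath)], dim=1)   # (B, 306)
        return torch.log_softmax(self.classifier(joint), dim=1)

# Example forward pass with a downsized grayscale scanpath (e.g., 1280x1024 / 6 ≈ 213x170)
out = VTNetSketch()(torch.randn(4, 150, 8), torch.randn(4, 1, 170, 213))
```

Because the sub-models feed one classifier, a single loss propagates gradients through both of them, which matches the end-to-end training described above.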
4.4 Implementation

All of our neural network-based models are implemented and trained using PyTorch. We use negative log-likelihood as our loss function, with the Adam optimizer [28]. We limit training to 100 epochs, employing linear learning rate decay and early stopping to end training when validation performance stops improving. We train our models using a single Nvidia GTX 1080 GPU.

Chapter 5: Evaluation

We first determine how the GRU model (the RNN component of VTNet) performs compared to the RF approach in [31] (Chapter 5.2). We begin with this comparison because the RNN is the most intuitive neural model to use with raw ET data. Next, in Chapter 5.3, we evaluate the performance of the CNN architecture described in Chapter 4.2 on scanpath images and determine whether combining it with the GRU in the VTNet architecture is more effective than its constituent parts are alone, as hypothesized in Chapter 4.3.

5.1 Experiment Setup

Model performance is evaluated with sensitivity and specificity, which are the proportion of confused and not confused tasks correctly identified as such, respectively. Because of the dataset's class imbalance, both metrics together are more meaningful than accuracy alone. For instance, a 98% accuracy could be achieved by simply classifying everything as not confused, but not capturing any instance of confusion, thus preventing the real-time provision of support when confusion does arise. We also report the mean of the sensitivity and specificity scores as combined accuracy, a unified measure of performance.

All models are evaluated using 10 runs of 10-fold cross-validation (giving 100 iterations of CV in total) to reduce fluctuations in the results due to the random selection of folds. All results reported in the next chapter are the average of the 10 runs of 10-fold CV. Further, cross-validation is done across users so that no user contributes datapoints to both the training and test sets of a given fold, thus measuring model performance on unseen users. Cross-validation is also stratified so that the distribution of confusion datapoints in each fold is kept similar to that of the dataset as a whole.

For the RF model, nested CV (i.e., further cross-validation on each training set) was used for feature selection, hyperparameter tuning, and to choose the decision threshold that maximizes sensitivity and specificity.⁵ For the deep learning models, using nested CV would be computationally onerous. Instead, for each of the 100 iterations of CV, we randomly select 20% of the data as a validation set for hyperparameter tuning and decision threshold setting. Note that contrary to the nested CV, the validation set is holdout data that is not re-added to the training set for a final round of training prior to evaluation on the test set. This effectively results in the DL models being trained directly on 20% less data than the RF model.

⁵ This is done by choosing the threshold closest to the (0, 1) point on the Receiver Operating Characteristic (ROC) curve.
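A sketch of this evaluation protocol with scikit-learn utilities is shown below. The variable names (X, y, user_ids) are assumptions, and StratifiedGroupKFold only approximates the across-user, stratified folds described above; the threshold rule follows footnote 5. This is not a reproduction of the original scripts.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import confusion_matrix, roc_curve

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = recall on 'confused' (1); specificity = recall on 'not confused' (0)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return sens, spec, (sens + spec) / 2          # combined accuracy

def pick_threshold(y_true, y_score):
    """Decision threshold closest to the (0, 1) corner of the ROC curve (footnote 5)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmin(fpr ** 2 + (1 - tpr) ** 2)]

# Across-user, stratified folds: no user contributes to both train and test of a fold
cv = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=0)
# for train_idx, test_idx in cv.split(X, y, groups=user_ids):
#     ...train on train_idx (minus a 20% validation split), evaluate on test_idx...
```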
To address the imbalance between confused and not confused datapoints in the dataset, Lallé et al. (2016) [31] used the Synthetic Minority Oversampling Technique (SMOTE) [8] for their RF model, but recall that their model was not learning from ET data sequences. SMOTE is not generally suitable to augment sequences, because it measures similarity between samples by Euclidean distance, which is a bad match for long and temporally misaligned pairs [15]. However, preliminary experiments showed that SMOTE increased GRU performance with our data, possibly because we limit sequence length to 5 seconds worth of samples, and because confusion self-reports provide an anchor that may maintain a degree of temporal alignment in our sequences. Thus, for evaluating the performance of the GRU when used on its own (Chapter 5.2) and for the RF model, classes in the training sets are balanced by first using SMOTE to increase the size of the minority class (confused) by 200% and then randomly down-sampling the majority class (as was done in [31]), resulting in approximately 1350 confused and 1350 not confused datapoints.

We cannot use SMOTE when evaluating the CNN, nor with the VTNet that includes it (Chapter 5.3), because we use the full ET sequences to produce the scanpaths, and as mentioned, SMOTE does not work well with substantially longer sequences [15]. Thus, for these models, we just down-sample the majority class to achieve class balance, which reduces the number of non-confused items to approximately 450 (matching the number of confused items).

Validation and/or test sets are left unbalanced in all models, so as to evaluate the models on data reflecting the realistic class distribution of the original dataset.

5.2 Results Comparing GRU and RF

Comparing the performance of the GRU and RF models (Table 1) shows that the GRU outperforms the RF classifier in both sensitivity and combined accuracy, with no change in specificity.

Table 1: Test set performance of GRU and RF.

  Model   Sens.   Spec.   Combined
  RF      0.53    0.80    0.67
  GRU     0.75    0.80    0.78

The GRU achieves a combined accuracy of 0.78, compared to the 0.67 achieved by the RF. We test this result with an independent samples t-test, which shows that the difference is statistically significant⁶ (t = 6.28, p < .001). The difference in sensitivity is also significant (t = 6.22, p < .001), with a substantial 41.5% improvement over the sensitivity of the RF model. These results allow us to conclude that the GRU outperforms the RF in classifying confusion with this dataset, where the impact of the GRU is specifically in improving sensitivity, namely detecting confusion when it occurs, with no loss in the accuracy of predicting when a user is not confused.

⁶ Significance is defined at p < .05 throughout the paper.

In [31], the authors experimented with combining ET data and interaction data based on the interface actions available in ValueChart (see Chapter 3.1) to train their model. This combination gave them their best results, namely 0.61 sensitivity and 0.926 specificity, for a combined accuracy of 0.768. With this additional data modality, the RF still doesn't perform better than the GRU trained only on ET data. This result is especially encouraging when we consider that the GRU is trained on 20% less data (the portion held out as the validation set). It should be noted that we also experimented with including interaction data in our approach, by adding information about mouse clicks to the vectors of sequential data fed to the GRU. However, adding this interaction data generated no significant improvement, likely because these events are sparse in comparison to the number of samples in a given sequence. A more suitable way to include interaction data would be to include the mouse coordinates each time a sample is collected. This would give a fine-grained stream of interaction data at a level of granularity similar to that found in the raw eye-tracking samples. However, tracking of mouse coordinates was not available for the dataset used in this investigation.
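The run-level comparison reported above follows a standard independent-samples t-test; the sketch below shows the mechanics, assuming two hypothetical arrays of per-run combined-accuracy scores rather than the thesis's actual result files.

```python
from scipy import stats

def compare_runs(scores_a, scores_b, alpha=0.05):
    """Independent-samples t-test on per-run combined-accuracy scores
    (one score per run of 10-fold CV, so 10 values per model here)."""
    t_stat, p_value = stats.ttest_ind(scores_a, scores_b)
    return t_stat, p_value, p_value < alpha

# Usage (hypothetical arrays of 10 run-level scores per model):
# t, p, significant = compare_runs(gru_combined, rf_combined)
```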
5.3 Results Comparing CNN, GRU, and VTNet

After establishing the superiority of the GRU over the RF model in classifying confusion, we evaluate the CNN as an independent model and then the performance of the VTNet model that combines the two. The result of this comparison is summarized in Table 2. The VTNet has been trained with the same hyperparameter configuration as its corresponding sub-models. We see that for all three measures (sensitivity, specificity, and combined accuracy) the VTNet outperforms both the GRU and the CNN.

Table 2: Test set performance of neural models.

  Model   Sens.   Spec.   Combined
  GRU     0.75    0.80    0.78
  CNN     0.73    0.80    0.77
  VTNet   0.79    0.84    0.82

One-way ANOVA with classifier type (VTNet, GRU, and CNN) as the factor shows a significant effect on all three measures (combined: F(3,36) = 47.59, p < .001, ηp² = .27; sensitivity: F(3,36) = 39.74, p < .001, ηp² = .76; specificity: F(3,36) = 9.25, p < .001, ηp² = .33). Post hoc testing via Tukey HSD (which adjusts for multiple comparisons) shows that for all three measures, the difference is statistically significant between VTNet and both GRU and CNN, with no significant difference between the latter two. With this, we conclude that VTNet surpasses the performance of both of its constituent parts and is thus a more effective model for classifying confusion from our ET data.

VTNet achieves a 79% sensitivity, which represents a 49% increase over the original RF model. It is also the only one of the three deep learning models to increase specificity (reaching 84%), suggesting that combining temporal and visuospatial information from ET data manages to capture patterns pertaining to the absence of confusion that otherwise go undetected. The fact that the VTNet does not have SMOTE-augmented data, yet still outperforms the GRU with augmented data, shows that there is a strong signal for confusion in the scanpath images, which complements the temporal information captured by the GRU. This suggests that additional confusion signal is present further back in the trial than the 5 seconds processed by the GRU, contrary to what was found in [31].

The performance of the VTNet model is also higher than other published approaches to predicting confusion using RNNs in a different context, namely leveraging the interaction data of users while they study with ITSs [6, 23]. Neither of these previous works reports sensitivity or specificity, but both report the Area Under the Curve (AUC) for the model's ROC, namely an AUC of 0.57 for [6] and an AUC of 0.72 for [23]. By comparison, we achieve an AUC of 0.84 with VTNet and eye-tracking data.
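The analysis above follows a standard pattern (one-way ANOVA over run-level scores, followed by Tukey HSD post hoc comparisons); a sketch with SciPy and statsmodels is shown below, again assuming hypothetical arrays of run-level scores rather than the thesis's actual result files.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_classifiers(scores):
    """One-way ANOVA over run-level scores keyed by classifier name,
    followed by Tukey HSD post hoc pairwise comparisons."""
    f_stat, p_value = f_oneway(*scores.values())
    labels = np.concatenate([[name] * len(v) for name, v in scores.items()])
    values = np.concatenate(list(scores.values()))
    tukey = pairwise_tukeyhsd(values, labels, alpha=0.05)
    return f_stat, p_value, tukey

# Usage (hypothetical run-level score arrays):
# compare_classifiers({"VTNet": vtnet_runs, "GRU": gru_runs, "CNN": cnn_runs})
```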
Chapter 6: Investigating how to Interpret VTNet Performance

A common refrain about deep learning is that it only works for tasks with large amounts of data [18]. This makes sense given the nature of deep learning models and the fact that the domains in which deep learning has been most successful are those with lots of data, such as image classification (e.g., [42]) and image caption generation (e.g., [27, 47]). This immediately led us to the question of how our method was able to achieve the results it did, given the relatively small size of our dataset. This question of how a model makes its decisions is a common one, with research on how to answer such questions falling into the broader field of explainable artificial intelligence (XAI), with a specific focus here on the explainability of machine learning models.

This line of our work was undertaken early on in this project and is thus limited to the RNN portion of VTNet, which was the first deep learning model we investigated for classifying confusion. To inform our discussion and organize the questions we had, we utilized the overview presented by Hohman et al. [18] in their survey of visual analytics research for the explainability of deep learning models. In this framework, research is organized by considering six related questions: Why we want an explanation, Who wants the explanation, What is the model learning, How this can be visualized, When during model development do we want answers to our questions, and Where will such explanations be used. As researchers with the goal of making a state-of-the-art classifier with an approach that is known to be difficult to interpret, we are concerned primarily with the What question. That is, we wish to understand exactly what our deep learning models are learning.

Using [18] as guidance, we identified three areas related to What the RNN component in the VTNet is learning. The first of these involves signal importance. Here we wanted to know which of the eye-tracking signals (pupil size, head distance, gaze location) are most important for classifying confusion. This task of determining signal importance was also undertaken by Lallé et al. [31], which allows for a nice model-to-model comparison. The second area is related to understanding what information is contained in the representation learned by the RNN. Specifically, we want to know what the contents of the RNN's internal state are and whether they correspond to interpretable aspects of the underlying process (for example, is the model learning to look at fixations to classify confusion?). Finally, we look at the decision process used by the RNN in making a classification. This would answer the question as to why a given example is classified one way over the other. The following subsections present our results in each of these areas.

6.1 Determining Signal Importance

The approach taken here was to simply train an RNN (described in Chapter 4.1) with each possible combination of the pupil (P), gaze (G), and head distance (HD) signals, as was done in [31]. We trained and evaluated each model using 10-fold across-user cross-validation, using the same setup described in Chapter 5.1. The difference here is that, in the interest of time, we performed only one iteration of 10-fold CV. These results are presented in Table 3, which shows each metric as an average of that achieved on the test sets.

Table 3: RNN performance when trained with each combination of eye-tracking signals.

  Signals     Sensitivity   Specificity   Combined
  P, G, HD    0.63          0.86          0.75
  P           0.43          0.64          0.54
  HD          0.65          0.64          0.65
  G           0.67          0.83          0.75
  G, P        0.60          0.83          0.72
  G, HD       0.65          0.85          0.75
  P, HD       0.41          0.67          0.54

The first thing that we see coming out of these results is that gaze (G) alone does as well (in terms of combined accuracy) as when all three eye-tracking signals are combined (P+G+HD). Looking at sensitivity and specificity, we see that there is a small trade-off, in that when G is used alone, sensitivity increases at the expense of specificity. When all signals are combined, the trade-off happens in the opposite direction. In analyzing these results, we did not test for statistical significance (having performed only one iteration of 10-fold CV), but the trend indicates that using G alone is more beneficial for identifying instances of confusion (the rarely occurring minority class).
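The combinations in Table 3 amount to selecting subsets of input channels and retraining the same RNN on each subset; a sketch of how such an ablation loop could be organized is below. The channel-to-column mapping and the train_and_evaluate helper are hypothetical names introduced for illustration.

```python
from itertools import chain, combinations

# Hypothetical mapping from signal group to column indices in the raw ET sample
CHANNELS = {"G": [0, 1, 4, 5],   # x/y gaze for each eye
            "P": [2, 6],         # pupil size for each eye
            "HD": [3, 7]}        # head distance for each eye

def signal_subsets(groups=("P", "G", "HD")):
    """All non-empty combinations of signal groups (7 in total)."""
    return chain.from_iterable(combinations(groups, r) for r in range(1, len(groups) + 1))

# for subset in signal_subsets():
#     cols = sorted(sum((CHANNELS[g] for g in subset), []))
#     results[subset] = train_and_evaluate(X[:, :, cols], y)   # hypothetical helper
```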
This result (gaze alone doing as well as all three signals combined) was surprising for two reasons. First, it suggests that, at least for the RNN, the primary signal for confusion is in the gaze pattern. In fact, when gaze is combined with just one other signal, the result remains the same (as in the case of G+HD) or even decreases (as in the case of G+P). The second reason these results are surprising is that for the Random Forest classifier, pupil size was the single strongest signal [31], while (as just mentioned) we see pupil size decreasing performance when combined with gaze alone. From this analysis, we can conclude that for the RNN, the gaze signal is the most important.

Lallé et al. provided a relative comparison of how these different signal combinations affected the performance of their Random Forest classifier [31]. We have reproduced this in Table 4 (but removed mouse events) and added the same comparison for the RNN. The statistical significance of these differences was evaluated in [31], which is signified in Table 4 with an underline: any signal combinations included in the same underline have no statistical difference between them, while the greater-than symbol indicates that a given signal combination had higher performance than those to the right of the symbol. Note that as we did not complete statistical significance testing, there is no equivalent underlining scheme for the RNN portion of the table; it reports just the trends. In addition to P being the highest-performing single signal for the Random Forest (discussed above) but the lowest for the RNN (as seen in Table 3), we also see that for the RF, P+HD performed as well as G+P+HD did, whereas for the RNN, P+HD is the lowest performer for sensitivity and the second-lowest for specificity. In conclusion, we see that the RNN and the RF use very different signals when classifying confusion. This is an indication that the RNN and RF trained on separate feature combinations might be used together in an ensemble (with another classifier) to greater effect than either considered alone; a task for future work.

Random Forest
  Sensitivity:  G+P+HD > P+HD > P > G > HD
  Specificity:  G+P+HD > P+HD > P > G > HD
RNN
  Sensitivity:  G > G+HD, HD > G+P+HD > G+P > P > P+HD
  Specificity:  G+P+HD > G+HD > G, G+P > P+HD > P, HD

Table 4: Effect of feature combination on model performance.

Since G+P+HD did no worse than G considered alone, we chose to continue using all three signals in our investigation of VTNet, in case these additional signals proved useful with the addition of the CNN sub-model and scanpath input. Future work should conduct an investigation of signal importance for the VTNet, similar to the one presented in this chapter.

6.2 Understanding the Contents of the Learned Representation

Understanding the contents of the representation learned by a DL model is one goal of XAI [41]. Most of the work done toward this goal thus far has been directed at CNNs. The two general approaches that have been proposed are visualizing the learned parameters of the early convolutional layers (which often correspond to various edge and colour detectors) and visualizing filter activations for the images producing the strongest response [55]. One way that we attempted to understand the contents of the RNN's learned representation is by comparing the performance of the RNN when trained with raw sequences versus when trained with fixation-based sequences. This can be thought of as a data ablation.
Just as an ablation study can reveal the importance of components in a model's architecture [14], we use data ablation to understand what the RNNs are learning from the eye-tracking data. To see why, one can think of fixation-based data as the result of removing the raw samples related to small, possibly spurious, shifts in gaze that are not due to actual redirection of attention. By comparing the resulting differences in performance between models using raw versus fixation-based data, we may be able to gain some understanding of the contents of the learned representations.

As mentioned in Chapter 5.2, SMOTE increased the performance of the RNNs. We thus conduct this experiment with LSTMs trained with SMOTE-augmented data at a rate of 200%, the rate used in [31]. One such LSTM is trained with raw data (LSTM-raw) as before, while the other is trained with fixation-based data (LSTM-fixation). Both sequence types include the last five seconds of the interactions (as in Chapter 4.1). However, the cyclical split that we used to quadruple the size of the raw sequence dataset cannot be used similarly with the fixation-based sequences. The summarizing nature of fixation sequences (some raw samples are clustered into fixations, and the remaining samples between fixations are not included) means that the sequences resulting from such a split would all be quite different from each other and would actually split up the overall pattern in the original sequence. Not performing this split means that the fixation-based dataset is a quarter of the size of the raw sequence dataset. Both of these RNNs and the RF were trained as described in Chapter 5.1, namely with 10 iterations of 10-fold CV for the RNNs and nested CV for the RF. The results of this experiment are shown in Table 5.

Classifier       Sensitivity    Specificity    Combined
RF               0.53           0.80           0.67
LSTM-raw         0.76           0.82           0.79
LSTM-fixation    0.75           0.80           0.78

Table 5: Comparison of the RF and the LSTMs trained with raw and fixation-based sequences, all with SMOTE-augmented data.

We see in Table 5 that the mean of all three metrics is higher for both LSTMs than for the RF, while the two LSTMs have similar performance. To test the significance of these differences we run a one-way ANOVA for each dependent measure, with classifier type (LSTM-raw+SMOTE, LSTM-fixation+SMOTE, RF+SMOTE) as the factor. The ANOVAs show a significant effect of classifier type on sensitivity (F(2,27) = 26.18, p < .0005, ηp² = .66), but not on specificity (F(2,27) = 2.80, p = .079, ηp² = .17). There is also a significant effect on combined accuracy (F(2,27) = 20.37, p < .0005, ηp² = .60). In all cases of significant effects, post hoc testing shows that the difference is between the LSTMs and the RF, but not between the LSTMs.

Before considering these results with respect to the goal of understanding the contents of the RNNs' learned representations, there are a few remarks to be made concerning these results and possible avenues of future work related to them. First, it is interesting that the LSTM trained with fixations did as well as the one trained on raw sequences, considering that the latter had a dataset four times as large. Future work should consider using SMOTE at a higher augmentation rate for the fixation-based sequences, in order to bring the size of the dataset to that of the raw sequences and see if this affects relative performance.
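As background for the raw-versus-fixation comparison in this section, the sketch below shows a simplified dispersion-threshold style of collapsing raw gaze samples into fixations. It is purely illustrative: our fixation-based sequences came from our existing processing pipeline, not from this code, and the thresholds shown are arbitrary.

```python
# Simplified dispersion-threshold fixation detection (illustrative only).
# Raw samples are (timestamp_ms, x, y); compact windows of samples become
# fixations, and the samples between fixations are dropped.

def detect_fixations(samples, dispersion_px=35, min_duration_ms=100):
    """Return a list of (centroid_x, centroid_y, duration_ms) fixations."""
    fixations, window = [], []

    def close_window(win):
        if win and win[-1][0] - win[0][0] >= min_duration_ms:
            fixations.append((
                sum(s[1] for s in win) / len(win),
                sum(s[2] for s in win) / len(win),
                win[-1][0] - win[0][0],
            ))

    for sample in samples:
        window.append(sample)
        xs = [s[1] for s in window]
        ys = [s[2] for s in window]
        dispersion = (max(xs) - min(xs)) + (max(ys) - min(ys))
        if dispersion > dispersion_px:
            close_window(window[:-1])  # window is no longer compact
            window = [sample]
    close_window(window)
    return fixations
```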
For the next point, consider Table 6, which summarizes the length of fixation-based and raw (cyclically split) sequences, both for the full sequences and for the last 5 seconds only.

Sequence Type              Min.    Max.    Mean     Std. Deviation
Raw (full)                 90      4018    411.0    363.0
Raw (last 5 sec.)          90      150     145.2    13.0
Fixations (full)           1       404     40.4     34.9
Fixations (last 5 sec.)    1       51      18.8     5.7

Table 6: Comparison of sequence length between fixation-based sequences and raw, cyclically split sequences.

Fixation sequences are necessarily shorter than their corresponding raw sequences, even though the latter have been cyclically split to reduce their length. While we cannot perform a cyclical split on the fixation data (as discussed above), Table 6 shows that the relatively short average length of fixation-based sequences means that we do not need to reduce their length further. The shorter length of fixation-based sequences gives more room in the sequence to incorporate a duration longer than 5 seconds, which may provide more useful information for classifying confusion.

Concerning our goal of understanding the representation learned by the RNN, consider that our reason for using raw data was that no potential discriminators of confusion would be excluded by abstracting to a higher level of data. However, the fact that the fixation-based data necessarily removes intra-fixation details and the accompanying interactions, yet still achieves the same results as the raw data, means that the data lost in the abstraction process arguably contains nothing that the RNN can learn for classifying confusion. This suggests that the RNNs trained on raw data are learning to disentangle fixation-based information (explicitly available in fixation-based data), which must then be somehow reflected in the learned representation used for classifying confusion.

Another possibility is that the RNNs trained on raw data learn an independent, equally effective signal for confusion that is different from the signal contained in fixation-based data. If this is the case, the two independent signals could be used to improve accuracy, as in an ensemble method. One possible way to test this hypothesis would be to use LSTM-raw and LSTM-fixation as distinct sub-models of a VTNet, along with the CNN component, where an observed performance increase would be indicative of the independence of the signals learned by the two LSTMs. In any case, the results presented in this section provide a first step towards an interpretation of what the RNNs are learning from the eye-tracking data.

6.3 Understanding the Decision Process

The work on visualizing and understanding an RNN's decision process is a relatively small field, but we found the work of van der Westhuizen and Lasenby [46] to be particularly helpful in providing an approach that could be readily adapted to our problem domain. Their work involves visualizing LSTMs trained to classify electrocardiograms (ECGs) as belonging to one of four heartbeat classes (normal, or one of three types of irregularity). While previous methods existed for visualizing LSTMs trained on discrete data (e.g., [26, 34, 32]), van der Westhuizen and Lasenby are the first to apply such methods to continuous data like that of ECGs [46], or in our case, raw eye-tracking sequences. Of the methods investigated, they concluded that the most effective was to learn an input deletion mask that optimally reduces the prediction of the ground truth class.
The goal of the input deletion mask is to show which parts of an input sequence are most important to the classifier when making its classification decision. The visualizations produced in [46] show examples where the learned mask corresponds with what cardiologists look at when making the same ECG classifications as the LSTM. The right side of Figure 8 is one such example. The left side of Figure 8 shows a simplified generic portion of a single-lead ECG signal, accentuated to demonstrate potential points of interest. These points of interest are labelled P through T and correspond to parts of the signal that cardiologists use to identify irregularities that may or may not show up in a patient's actual ECG signal. It is this use of the landmarks that signifies their importance. The right side of Figure 8 shows the input mask learned over an ECG signal of a patient with a condition called a premature ventricular contraction (PVC). The importance of a portion of the signal is indicated with colour, given in the scale on the far right of Figure 8, where maroon indicates maximal importance and dark blue indicates minimal importance. We see that the signal between landmarks Q and R and the area around landmark S are of the highest importance to the LSTM in this case.

Figure 8: Simplified ECG signal and landmarks used by cardiologists in making diagnoses (left), and an example of a learned input mask for a signal with the PVC label (right).

A learned input mask, m, is applied to a given input vector, x, by elementwise multiplication, i.e. m ⊙ x, where ⊙ is the Hadamard operator. The mask consists of elements between 0 and 1 (inclusive): the closer a mask value is to 0, the more it has the effect of removing the corresponding element of the input. Ideally, the mask would contain only 0s and 1s, but we must allow for all values in between in order to learn the mask via backpropagation. To apply the optimal input mask technique to our data we had to contend with the fact that our sequences contain up to three input signals (P, G, and HD), one of which is 2D (the x and y coordinates making up G). Since we showed above (Chapter 6.1) that gaze is the single most predictive signal for the RNNs, we limit our efforts to this signal alone. To account for the fact that gaze is 2D, we learn a mask, m, of size 2×T, whereas the mask for the ECG sequence is of size 1×T, where T is the length of the sequence. The mask itself is learned by minimizing the loss function, J, defined as follows:

J = \arg\min_{\mathbf{m}_{2\times T}} \; \lambda_1 \lVert \mathbf{1} - \mathbf{m}_{2\times T} \rVert_1 \; + \; \lambda_2 \sum_{t=1}^{T-1} \lvert \mathbf{m}_{t+1} - \mathbf{m}_t \rvert \; + \; s_c(\mathbf{m} \odot \mathbf{x})

where s_c is the model's score with respect to class c. We can see from this function that the first term rewards masks with minimal non-one-valued elements, encouraging the learning of a minimal mask. The second term, by penalizing changes in consecutive mask values, encourages the effective mask elements to be co-located, so as to provide what is hopefully a more visually interpretable mask. The final term is what encourages the mask to minimize the model's ability to predict the correct class. We learn this mask for a given trained RNN via stochastic gradient descent, using the elements of the test set. We begin the analysis of our results by presenting examples of the masks learned for two correctly and two incorrectly classified datapoints for confusion and not confusion (i.e. four examples in total).
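To make the mask-learning step concrete, the following is a minimal PyTorch-style sketch of optimizing an input deletion mask for a single gaze sequence under the loss J above. The model interface, λ values, learning rate, and number of steps are illustrative assumptions, not the exact settings used in this work.

```python
# Illustrative sketch of learning an input deletion mask for one gaze sequence.
# Assumes `model` is a trained RNN classifier that maps a gaze sequence of
# shape (T, 2) to a vector of class scores; lambdas, lr, and steps are placeholders.
import torch

def learn_input_mask(model, x, target_class, lam1=0.1, lam2=0.1, steps=500, lr=0.01):
    model.eval()
    mask = torch.ones_like(x, requires_grad=True)      # start by keeping everything
    optimizer = torch.optim.SGD([mask], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        m = mask.clamp(0.0, 1.0)                          # keep mask values in [0, 1]
        score = model(m * x)[target_class]                # s_c(m ⊙ x)
        sparsity = lam1 * (1.0 - m).abs().sum()           # reward minimal deletion
        smoothness = lam2 * (m[1:] - m[:-1]).abs().sum()  # co-locate deleted elements
        (sparsity + smoothness + score).backward()        # minimize J
        optimizer.step()

    return mask.detach().clamp(0.0, 1.0)
```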
The masks are visualized in Figures 9 to 12, where the x and y coordinates of the gaze location are plotted in separate graphs. In these graphs, the value of the coordinate (vertical axis) is plotted over time (horizontal axis); the value of a coordinate at any given time is the value given to the RNN at that point in the sequence. The colour of the coordinate value line indicates the importance of that portion of the input, as determined when solving J. In the figures, yellow indicates a component of the highest importance while purple indicates no importance. In the examples presented here, there were no mask components with values in between maximal and minimal importance (i.e. all are either purple or yellow).

Figure 9: Visualization of the input mask learned for an example correctly classified as 'confused'. Yellow signifies the most important mask components, while purple signifies the least important.

Figure 10: Visualization of the input mask learned for an example incorrectly classified as 'not confused'. Yellow signifies the most important mask components, while purple signifies the least important.

Figure 11: Visualization of the input mask learned for an example correctly classified as 'not confused'. Yellow signifies the most important mask components, while purple signifies the least important.

Figure 12: Visualization of the input mask learned for an example incorrectly classified as 'confused'. Yellow signifies the most important mask components, while purple signifies the least important.

Our hope was that by visually inspecting the masks of correctly and incorrectly classified examples in each class, we would be able to determine a discernible pattern of what the model is looking at when making a classification. The approach of visually inspecting visualizations aimed at explaining deep learning methods, and then relying on the human capacity for pattern recognition to derive insights, is standard in current XAI literature (e.g., [46, 55, 26, 34, 32, 50, 43]). Unfortunately, we did not identify any such pattern by visually inspecting all of the confused-labelled items of the test set and an equal number of randomly selected not-confused-labelled items.

As looking at individual images of the learned input masks did not result in any tangible insights, we looked at summary visualizations of the pixel coordinates of mask components. Our motivation was that by viewing the masks in aggregate, information about their commonalities might become apparent. Figure 13 shows the x and y coordinates of the important mask components (the yellow components in Figures 9-12) in blue and green (respectively) as a function of time, with items labelled confused on the left and those labelled not confused on the right.

Figure 13: x (blue) and y (green) pixel coordinates of confused (left) and not confused (right) class item mask components over time (horizontal axis).

Our thought was that if mask components centred around specific coordinates and/or times, we would have identified a part of the ValueChart visualization interaction that is associated with confusion (or lack thereof). This was not the case, however, as we see that mask components are spread across the screen, regardless of time. Additionally, we looked at both classes together by having colour represent the class and considering the x and y coordinates in separate graphs (Figure 14). This approach also did not reveal any clearly relevant pattern.
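The aggregate views described here (Figures 13 and 14) can be produced with a short plotting routine. The sketch below is illustrative only; it assumes the learned masks, the matching gaze coordinates, and the class labels are available as NumPy arrays, and treats mask values below a threshold as the important (deleted) components.

```python
# Sketch of an aggregate view of important mask components (cf. Figures 13-14).
# Assumes masks of shape (n_items, T, 2), gaze of the same shape, and binary labels.
import numpy as np
import matplotlib.pyplot as plt

def plot_important_components(masks, gaze, labels, threshold=0.5):
    fig, (ax_x, ax_y) = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
    for label, colour in [(1, "blue"), (0, "green")]:  # confused vs. not confused
        for m, g in zip(masks[labels == label], gaze[labels == label]):
            important = m < threshold          # low mask value = important component
            t = np.arange(m.shape[0])
            ax_x.scatter(t[important[:, 0]], g[important[:, 0], 0], c=colour, s=4)
            ax_y.scatter(t[important[:, 1]], g[important[:, 1], 1], c=colour, s=4)
    ax_x.set(title="x coordinate", xlabel="time step", ylabel="pixel")
    ax_y.set(title="y coordinate", xlabel="time step", ylabel="pixel")
    fig.tight_layout()
    return fig
```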
Finally, we consider all important mask components together, regardless of whether the component is an x or y coordinate (Figure 15). While in this case we see several mask components occurring at the beginning of the interactions (left side of Figure 15), as well as several mask components occurring at a 0-valued coordinate (bottom right of Figure 15), we do not think that anything definitive can be said about this.

Figure 14: x (left) and y (right) coordinates of confused (blue) and not confused (green) mask components over time (horizontal axis).

Figure 15: Pixel coordinates of confused (blue) and not confused (green) mask components over time (horizontal axis).

One difficulty we had in interpreting the results of the learned mask, besides contending with a 2D signal, is that we do not know a priori what confusion should look like in the signal. This is in contrast to the single 1D sequence used in [46] to represent the ECG signal. In the ECG case, cardiologists have well-established rules that they use to classify the signal, which simplifies the task of understanding the RNN's decision process: one can simply check whether the learned mask components correspond with the rules employed by cardiologists. In the absence of similarly well-established rules, and without obvious or definitive patterns in the masks, we are not able to say anything definitive about the RNN's decision process in classifying confusion.

Chapter 7: Additional Attempts to Increase Performance

The data-intensive nature of our proposed approach and our small dataset make augmentation beyond that achieved by the cyclical split (Chapter 3.2) desirable. This chapter records two additional attempts in this direction that did not improve over the results in Chapter 5, but nonetheless form part of the record of what has been tried to date. All such attempts were made on the RNN alone and occurred as part of early work on this project.

7.1 Adding Noise to Inputs

Adding random values to training data input values, as proposed in [4], is a simple heuristic and a form of data augmentation that can be used with most machine learning methods, including those of deep learning. The purpose of adding a small amount of noise to each input value is to induce the model to ignore such random variation in values, thus reducing overfitting. This occurs because the model cannot fit each datapoint precisely: each time it sees the datapoint, it does so with small variations in its values. To apply this method to our dataset (Chapter 3), for a given training set datapoint, we first determine the average value of each column in the datapoint. Then, for each element (row) of a given column, we choose a random value between 0 and 0.25, multiply this value by the column mean, and add the result to the original element value. The result is that each element of the datapoint is changed by a small amount, commensurate with the average value of the column that the element belongs to. The result of our experiment was neutral, as there was no substantial change in the RNN's performance when performing this extra step.
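A minimal sketch of this noise-injection step is given below, assuming each datapoint is stored as a NumPy array of raw samples (rows) by features (columns); the function name is ours and the implementation is illustrative rather than the exact code used.

```python
# Illustrative sketch of the noise augmentation in Section 7.1: each element is
# perturbed by a random fraction (0 to 0.25) of its column's mean value.
import numpy as np

def add_column_scaled_noise(datapoint, max_fraction=0.25, rng=None):
    """datapoint: array of shape (n_samples, n_features)."""
    rng = rng or np.random.default_rng()
    column_means = datapoint.mean(axis=0)  # one mean per feature column
    noise = rng.uniform(0.0, max_fraction, size=datapoint.shape) * column_means
    return datapoint + noise

# Example: perturb one 150 x 8 raw sequence before it is fed to the RNN.
# noisy_sequence = add_column_scaled_noise(sequence)
```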
7.2 SMOTE Hidden States

We mentioned in Chapter 5.1 that there was some success in applying SMOTE as a way to augment the relatively small dataset of raw sequences used to train the RNN for classifying confusion. The SMOTE algorithm operates by pairing neighbouring input vectors and then creating a new vector by choosing, for each element, a value that lies between the two paired vectors at the corresponding position. The nearness of vectors can be measured in a variety of ways but is most commonly done using the k-Nearest Neighbours algorithm: taking the Euclidean distance between a given vector and all others and ranking nearness accordingly. A possible issue with using SMOTE to augment the sequences used to train the RNN is the long length of the sequences themselves: flattening a raw input sequence of 150 samples (i.e. five seconds of data) of 8 elements each results in a vector of length 1200. Measuring closeness is the key step in the SMOTE algorithm, yet due to the curse of dimensionality, we know that vector space becomes increasingly sparse as the number of dimensions increases, making the distance between vectors increasingly meaningless as a measure of closeness [21]. As it turns out, this possible issue did not prevent SMOTE applied to the raw sequences from increasing the accuracy of the RNN model; however, we wanted to see whether SMOTE would have even better results if applied to shorter input vectors.

The RNN has a fixed-size hidden state. In this work, we used a hidden state of size 256 (as described in Chapter 4.1). Thus, upon processing a raw sequence of size 150 × 8, the sequence is encoded into the RNN's hidden state as a vector of 256 elements. Our idea was to take a trained RNN, encode all training examples as vectors of 256 elements by taking the resulting hidden state, and then apply SMOTE to these much shorter vectors. We then trained a new RNN to take these augmented datapoints as input, in hopes of improving performance. The result of this process was a slight decrease in sensitivity, with no change in specificity. We only tried using SMOTE at a rate of 200% (as in [31]), but future work should consider trying other augmentation rates, as well as different ways of pre-SMOTE dimensionality reduction (e.g., an autoencoder).
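A minimal sketch of this hidden-state SMOTE pipeline follows. It assumes a trained PyTorch GRU encoder built with batch_first=True and uses the SMOTE implementation from imbalanced-learn; the names are illustrative and the 200% augmentation rate is expressed as a target class count.

```python
# Illustrative sketch of SMOTE applied to RNN hidden states (Section 7.2).
# Assumes `encoder` is a trained torch.nn.GRU (hidden size 256, batch_first=True)
# and `sequences` is a float tensor of shape (n_items, 150, 8).
import numpy as np
import torch
from imblearn.over_sampling import SMOTE

def encode_hidden_states(encoder, sequences):
    """Run each raw sequence through the trained GRU and keep its final hidden state."""
    with torch.no_grad():
        _, h_n = encoder(sequences)   # h_n: (num_layers, n_items, 256)
    return h_n[-1].cpu().numpy()      # last layer's hidden state, one 256-d vector per item

def smote_hidden_states(encoder, sequences, labels, minority_class=1):
    hidden = encode_hidden_states(encoder, sequences)
    n_minority = int(np.sum(labels == minority_class))
    # 200% augmentation: two synthetic items per original minority item.
    target = {minority_class: 3 * n_minority}
    hidden_aug, labels_aug = SMOTE(sampling_strategy=target).fit_resample(hidden, labels)
    return hidden_aug, labels_aug     # train a new classifier on these 256-d vectors
```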
Chapter 8: Conclusion & Future Work

In this thesis, we presented a novel approach that leverages deep learning for detecting user confusion from raw sequences of eye-tracking (ET) data. Our work contributes to the research on automatic detection of user affective states, with the long-term goal of creating intelligent interactive systems that can respond to these states to improve user experience. We focus on user confusion because it is a state that is well known to affect user satisfaction and performance with interactive systems (e.g., [36]); thus, it would be highly valuable to empower such systems with the ability to detect confusion and provide appropriate interventions to resolve it.

The approach we presented to detect user confusion from ET data combines the strength of CNNs in spatial reasoning with the strength of RNNs in temporal reasoning. The resulting model (VTNet) outperforms its constituent models considered on their own when tested on a dataset capturing episodes of confusion for users interacting with a visualization-based interactive system for decision support (ValueChart). VTNet also largely outperforms a previous model based on Random Forests on the same dataset [31], bringing a 22% increase in combined sensitivity and specificity, with the bulk of the increase (49%) being in detecting confusion when it occurs (79% sensitivity), which is remarkable considering that our dataset contained only 2% confused datapoints.

Deep learning has proven very effective in domains with large datasets, showing, for instance, 16-23% improvements when initially applied to speech recognition and a 41% reduction in error rate when applied to object recognition [3]. Our results provide encouraging evidence that deep learning can be useful even with the smaller datasets usually available for predictive tasks involving hard-to-collect interaction data (e.g., ET data) and complex user states (e.g., affective reactions), if there are strong discriminators present in the data. As such, our work extends existing preliminary work on using deep learning approaches for predicting user affect from user interface actions, by predicting the specific affective state of confusion from eye-tracking data.

Our approach also extends previous work on combining CNNs and RNNs, by integrating the two in a manner that suits the specific sequential and visuospatial nature of the ET data, where a temporal sequence of raw samples in a given timeframe can also be represented as a single visual scanpath for that timeframe. Our results provide evidence that there is a benefit to modelling sequential data local to a confusion episode while having access to an image of the gaze activity over a longer span of interaction prior to confusion, indicating that there are important yet distinct signals in both representations, which, when combined, give stronger results than either signal considered alone.

To ascertain how VTNet was able to perform as well as it did, despite our small dataset, we undertook preliminary XAI work with the RNN component. We looked at understanding the importance of the three input signals (gaze location, pupil size, and head distance) by seeing how various combinations of the signals changed the performance of the model. We found that using gaze alone worked as well for the RNN as using all three signals. Next, we sought to understand the contents of the representation learned by the RNN by comparing performance between an RNN trained on raw sequences and one trained on sequences of fixations. Preliminary indications of comparable performance lead us to believe that the RNN trained on raw sequences may be learning to look at fixations for confusion discriminators. Finally, we attempted to understand the decision process used by the RNN by learning an input deletion mask that maximally reduces the performance of a trained model by minimally deleting components of the input, thus revealing the most important parts of the input; a technique that was found to work well when applied to ECG signals. Our results here were inconclusive, as we were not able to observe any meaningful patterns in the input masks.

We also documented two attempts we made to increase performance beyond the results noted above. First, we performed data augmentation via the addition of a small amount of noise to the input values of the training set. This did not affect performance. Having seen some success applying SMOTE for data augmentation, we observed that a limitation of using this approach in our context is the long length of our sequences.
To overcome this problem and obtain shorter input vectors, we encoded our dataset by running each item through a trained RNN and taking the resulting hidden state vector as the new data item. We applied SMOTE to these shorter vectors and then trained a new RNN with the resulting augmented dataset. Doing this resulted in a slight decrease in sensitivity with no change in specificity. In conducting all of our SMOTE experiments, we used a 200% augmentation rate, based on what was found to work in [31]. We think that future experiments using SMOTE on input sequences and hidden states should include both higher and lower augmentation rates, to get a more complete understanding of the value of SMOTE in this context.

Moving forward, we will explore methods for increasing VTNet performance, such as increasing receptive field size via dilated convolutions. We will integrate our predictors of confusion into ValueCharts and investigate responses designed to mitigate confusion as it is detected during a user's interaction with the system. We will also investigate whether our results generalize to predicting confusion in other interactive tasks, and to predicting other states and traits relevant to ascertaining user experience with an interactive task. Along these lines, we plan to test the VTNet approach on other ET datasets that have been used to predict user states and traits, such as cognitive abilities like reading proficiency and visual literacy [2], learning [25], affective valence [30], as well as early-stage Alzheimer's disease [13]. Finally, we believe that our VTNet approach could be applied to other data modalities that have been used for affect detection. For instance, we are interested in looking at speech and EEG data, where VTNet could be adapted to learn from the combination of the temporal signals with the related spectrogram (for speech) or with a heatmap representation of the signal over the brain (for EEG).

Future work concerning XAI should include an investigation of signal importance for VTNet, similar to the one we performed for the RNN. If it is shown that using gaze alone does as well as using all three signals, then performance at predicting confusion may be improved further by creating an ensemble that includes models trained on independent signals, including VTNet trained with gaze only and a Random Forest trained with pupil size only.

The other XAI component of our work that warrants further investigation is understanding the contents of the representation learned by the deep learning models. We hypothesized that an RNN trained on raw sequences is learning to recognize fixations in the data and then learning to classify confusion from there. This hypothesis was based on the comparable performance achieved by the RNN trained on fixation-based sequences and the one trained on raw sequences. We argued that if, on the other hand, the two models were learning independent signals, then an ensemble containing both RNNs would lead to an increase in performance beyond either model considered alone. To further support this argument, future work should extend VTNet to include an additional RNN sub-model, which would be trained on fixation-based sequences. If there is no increase in performance, this would support the argument that the representation learned by the RNN trained on raw sequences is fixation-based.
We believe that evaluating the hypothesis in this way is preferable to including the raw-sequence-trained RNN and the fixation-based-sequence-trained RNN in an ensemble with a third classifier, as this third classifier would confound the results.

Additionally, future attempts at understanding the representation learned by VTNet should take advantage of the work done on this subject for CNNs. Popular visualization techniques used to understand what a CNN has learned can be divided into two areas: first, visualizing the parameters learned by the CNN, and second, visualizing which images lead to the strongest activations at the various layers in the CNN [18]. For visualizing learned parameters, one simply treats the weights as pixel values and inspects the resulting images. What is often found is that in the first layers, the visualizations show various edge, corner, and colour conjunction patterns, while higher layers show instances of more class-specific components (the parts that make up the classes being detected) [55]. For the second method, images are processed by the CNN and the values of the neurons at each layer are summed, giving a number representative of the activation strength of a given image at a given layer. When the images leading to the strongest activations are considered together, the commonalities between the images can become apparent, which tells us what is being detected at the given layer. Using these two techniques together can provide further insight when neither approach provides clarity on its own [55]. We believe that applying the technique of parameter visualization to the two convolutional layers of the CNN in VTNet will provide some evidence as to what the model is learning to focus on, and may provide insights beyond what is achievable when considering the RNN component alone.

Finally, as we conducted the raw versus fixation-based sequence experiments with a fixed five seconds' worth of data, the fixation-based sequences were considerably shorter than the 150 steps of the raw sequences. Future work should consider using more of the interaction in the fixation sequences, which would bring their length in line with that of the raw sequences and possibly allow for the inclusion of valuable information for classifying confusion.

Bibliography

[1] Mohamed R. Amer, Behjat Siddiquie, Colleen Richey, and Ajay Divakaran. 2014. Emotion Detection in Speech Using Deep Networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3724–3728.
[2] Oswald Barral, Sébastien Lallé, Grigorii Guz, Alireza Iranpour, and Cristina Conati. 2020. Eye-Tracking to Predict User Cognitive Abilities and Performance for User-Adaptive Narrative Visualizations. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI'20).
[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
[4] C.M. Bishop. 1995. Neural networks for pattern recognition. Oxford University Press.
[5] Robert Bixler and Sidney D'Mello. 2015. Automatic Gaze-based Detection of Mind Wandering with Metacognitive Awareness. In International Conference on User Modeling, Adaptation, and Personalization. Springer, 31–43.
[6] Anthony F. Botelho, Ryan S. Baker, and Neil T. Heffernan. 2017. Improving Sensor-free Affect Detection Using Deep Learning.
In International Conference on Artificial Intelligence in Education. Springer, 40–51.
[7] Giuseppe Carenini and John Loyd. 2004. ValueCharts: Analyzing Linear Models Expressing Preferences and Evaluations. In Proceedings of the Working Conference on Advanced Visual Interfaces. 150–157.
[8] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research (2002), 321–357.
[9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078 (2014).
[10] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.
[11] Sidney D'Mello, Blair Lehman, Reinhard Pekrun, and Art Graesser. 2014. Confusion can be Beneficial for Learning. Learning and Instruction (2014), 153–170.
[12] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. 2015. Recurrent Neural Networks for Emotion Recognition in Video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. 467–474.
[13] Thalia S. Field, Sally May Newton-Mason, Sheetal Shajan, Oswald Barral, Hyeju Jang, Thomas Soroski, Zoe O'Neill, Cristina Conati, and Giuseppe Carenini. 2020. Machine learning analysis of speech and eye tracking data to distinguish Alzheimer's clinic patients from healthy controls. In 2020 Alzheimer's Association International Conference. ALZ, 2020.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 580–587.
[15] Zhichen Gong and Huanhuan Chen. 2016. Model-based Oversampling for Imbalanced Sequence Classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1009–308.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[17] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[18] F. Hohman, M. Kahng, R. Pienta, DH Chau. 2018. Visual analytics in deep learning: An interrogative survey for the next frontiers. IEEE transactions on visualization and computer graphics. 25(8):2674-93.
[19] Stephen Hutt, Joseph F. Grafsgaard, and Sidney K. D'Mello. 2019. Time to Scale: Generalizable Affect Detection for Tens of Thousands of Students Across an Entire School Year. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–14.
[20] Aulikki Hyrskykari, Päivi Majaranta, Antti Aaltonen, and Kari-Jouko Räihä. 2000. Design Issues of iDICT: A Gaze-assisted Translation Aid. In Proceedings of the 2000 Symposium on Eye tracking Research & Applications. 9–14.
[21] P. Indyk and R. Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing (pp. 604-613).
[22] Natasha Jaques, Cristina Conati, Jason M. Harley, and Roger Azevedo. 2014.
Predicting Affect from Gaze Data During Interaction with an Intelligent Tutoring System. In International Conference on Intelligent Tutoring Systems. Springer, 29–38.
[23] Yang Jiang, Nigel Bosch, Ryan S. Baker, Luc Paquette, Jaclyn Ocumpaugh, Juliana Ma Alexandra L. Andres, Allison L. Moore, and Gautam Biswas. 2018. Expert Feature-engineering vs. Deep Neural Networks: Which is Better for Sensor-free Affect Detection? In International Conference on Artificial Intelligence in Education. Springer, 198–211.
[24] M.A. Just, P.A. Carpenter. 1976. Eye fixations and cognitive processes. Cognitive Psychology.
[25] Samad Kardan and Cristina Conati. 2015. Providing Adaptive Support in an Interactive Simulation for Learning: An Experimental Evaluation. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 3671–3680.
[26] A. Karpathy, J. Johnson, and L. Fei-Fei, 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
[27] Andrej Karpathy and Li Fei-Fei. 2015. Deep Visual-semantic Alignments for Generating Image Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128–3137.
[28] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[29] Sander Koelstra, et al., 2012. DEAP: A Database for Emotion Analysis Using Physiological Signals. IEEE Transactions on Affective Computing 3.1: 18-31.
[30] Sébastien Lallé, Cristina Conati, and Roger Azevedo. 2018. Prediction of Student Achievement Goals and Emotion Valence During Interaction with Pedagogical Agents. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, 1222–1231.
[31] Sébastien Lallé, Cristina Conati, and Giuseppe Carenini. 2016. Predicting Confusion in Information Visualization from Eye Tracking and Interaction Data. In IJCAI. 2529–2535.
[32] J. Lanchantin, R. Singh, B. Wang, and Y. Qi, 2017. Deep motif dashboard: Visualizing and understanding genomic sequences using deep neural networks. In Pacific Symposium on Biocomputing 2017 (pp. 254-265).
[33] Sukwon Lee, Sung-Hee Kim, Ya-Hsin Hung, Heidi Lam, Younah Kang, and Ji Soo Yi. 2015. How do People Make Sense of Unfamiliar Visualizations? A Grounded Model of Novice's Information Visualization Sense Making. IEEE Transactions on Visualization and Computer Graphics, 1, 499–508.
[34] J. Li, X. Chen, E. Hovy, and D. Jurafsky, 2015. Visualizing and understanding neural models in nlp. arXiv preprint arXiv:1506.01066.
[35] Xiang Li, Dawei Song, Peng Zhang, Guangliang Yu, Yuexian Hou, and Bin Hu. 2016. Emotion Recognition from Multi-channel EEG Data Through Convolutional Recurrent Neural Network. In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 352-359.
[36] Sucheta Nadkarni and Reetika Gupta. 2007. A Task-based Model of Perceived Website Complexity. MIS Quarterly (2007), 501–524.
[37] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. 2016. Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In Advances in Neural Information Processing Systems. 3882–3890.
[38] Joshua Newn, Ronal Singh, Fraser Allison, Prashan Madumal, Eduardo Velloso, and Frank Vetere. 2019. Designing Interactions with Intention-Aware Gaze-Enabled Artificial Agents. In Human-Computer Interaction – INTERACT 2019 (Lecture Notes in Computer Science). Springer International Publishing, Cham, 255–281.
[39] R.W. Picard. 2000. Affective computing. MIT Press.
[40] Guido Pusiol, Andre Esteva, Scott S. Hall, Michael Frank, Arnold Milstein, and Li Fei-Fei. 2016. Vision-based Classification of Developmental Disorders Using Eye-movements. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 317–325.
[41] W. Samek, T. Wiegand, and K.R. Muller. 2017. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296.
[42] K. Simonyan, A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[43] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, 2014. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.
[44] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised Learning of Video Representations Using LSTMs. In International Conference on Machine Learning. 843–852.
[45] Ben Steichen, Cristina Conati, and Giuseppe Carenini. 2014. Inferring Visualization Task Properties, User Performance, and User Cognitive Abilities from Eye-gaze Data. ACM Transactions on Interactive Intelligent Systems (TiiS'14), 1–29.
[46] J. Van Der Westhuizen and J. Lasenby. 2017. Techniques for visualizing LSTMs applied to electrocardiograms. arXiv preprint arXiv:1705.08153.
[47] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning. 2048–2057.
[48] J.S. Yi. 2008. Visualized Decision Making: Development and Application of Information Visualization Techniques to Improve Decision Quality of Nursing Home Choice. Ph.D. Thesis, Georgia Institute of Technology.
[49] Yilong Yang, et al., 2018. Emotion Recognition from Multi-channel EEG Through Parallel Convolutional Recurrent Neural Network. International Joint Conference on Neural Networks (IJCNN).
[50] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson, 2015. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.
[51] Haiyang Yu, Zhihai Wu, Shuqin Wang, Yunpeng Wang, and Xiaolei Ma. 2017. Spatiotemporal Recurrent Convolutional Networks for Traffic Prediction in Transportation Networks. Sensors'17.
[52] Dalin Zhang, Lina Yao, Xiang Zhang, Sen Wang, Weitong Chen, Robert Boots, and Boualem Benatallah. 2018. Cascade and Parallel Convolutional Recurrent Neural Networks on EEG-based Intention Recognition for Brain Computer Interface. In Thirty-Second AAAI Conference on Artificial Intelligence.
[53] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep Learning for Sentiment Analysis: A Survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, no. 4: e1253.
[54] Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Russ R. Salakhutdinov, and Yoshua Bengio. 2016. Architectural Complexity Measures of Recurrent Neural Networks. In Advances in Neural Information Processing Systems. 1822–1830.
[55] M.D. Zeiler, R. Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818-833). Springer, Cham.
[56] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2015. Appearance-based Gaze Estimation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4511–4520.
Appendices

Appendix A  ICMI 2020 Reviews

Reviewer 1

Rating
Definite accept: I would argue strongly for accepting this paper.

Relevance to the ICMI
Relevant

Clarity
Very clear

Originality
Very original

Soundness
Very solid

The Review

Overview
This paper presents a deep learning approach to detecting confusion from eye-tracking data. By combining an RNN and CNN sub-model, the authors present a novel approach to handling raw eye-tracking data that does not rely on fixation filtering or regions of interest. The authors then proceed to show that this approach outperforms prior work detecting confusion from the same dataset. The paper concludes with a thorough discussion of the approach and potential future work.

Strengths
The paper is excellently structured, providing clear motivation, and situating the work well within the literature. The paper presents a clear narrative throughout that makes it both easy and interesting to read. The data pre-processing is a creative step that will be of great interest to the ICMI community. By considering raw gaze as opposed to fixation filtered gaze it would be easy to assume that data would be lost or saturated. By applying the creative cyclical split, the authors not only reduce repetition in their dataset but also create more instances, as suitable for a deep learning approach. Model building procedures are well described, with complete detail. In cases where hyperparameters are set as opposed to tuned, justification is provided. The discussion of results is thorough and complete, giving the reader insight into how the model is performing with enough detail to support interpretations. By providing code for this approach the authors add significantly to the contribution of this work by allowing other researchers to build upon it, without the (often tedious) process of recreating from text.

Weaknesses
Given the dependence of this work on the prior RF model, it may be necessary to provide the reader with a more detailed summary of how that model was trained. Further I would like to see a more detailed discussion of how the authors feel this approach could be applied to other domains and detection environments. As additional context, I would like to see models compared to some kind of chance baseline also. A stratified dummy classifier or similar. Though this work is not strictly multimodal (it considers solely eye gaze) I feel that it will be of considerable interest to the ICMI community.

Minor issues
Figures and tables often appear before they are referenced in the text.

Questions to authors
(blank)

Reviewer 2

Rating
Definite accept: I would argue strongly for accepting this paper.

Relevance to the ICMI
Relevant

Clarity
Very clear

Originality
Quite creative

Soundness
Reasonable

The Review

Summary: The paper presents a novel way of using deep learning to predict confusion from eye tracking data. The use of an RNN and CNN to combine visuospatial and temporal information is novel and interesting. The method shows promise by outperforming baselines on this challenging task. For these reasons, I would argue for the paper's acceptance.

Strengths: The use of a CNN on a scanpath directly is an interesting way to tackle the spatial information. While the authors claim the temporal information is lost using such an approach, I wonder if there is value in using a colour gradient to encode the temporal signal (e.g green for beginning of session/window and red for end of it with a gradient in between).
The paper is clearly written, well-motivated, and easy to follow.

Weaknesses: I would love to see more analysis of failure modes. Where does the model still fail? What are the limitations? Also what conditions did the baseline fail in, but the proposed model succeeded, was temporal or spatial information critical in those. It would be great to include an F1 metric for the evaluations as it is a decent (but by no means perfect) way to assess performance on unbalanced data. While not a weakness as such, but I would suggest authors use the terminology of precision/recall which more readers in the community might be familiar with (as opposed to specificity and sensitivity). Not a problem as such but just a suggestion. While not using deep learning a paper that might be of interest to authors is - Vail et al. "Visual attention in schizophrenia: Eye contact and gaze aversion during clinical interactions", as it also uses gaze signal for affect analysis.

Questions to authors
I would love to see more analysis of failure modes. Where does the model still fail? What are the limitations? Also what conditions did the baseline fail in, but the proposed model succeeded, was temporal or spatial information critical in those.

Reviewer 3

Rating
Definite accept: I would argue strongly for accepting this paper.

Relevance to the ICMI
Relevant

Clarity
Very clear

Originality
Very original

Soundness
Very solid

The Review
The paper describes how a combination of recurrent networks and CNN can be used to detect confusion in viewers of visualizations in a decision-making tool. I very much like the paper. The described approach is deceptively simple, yet well-motivated and - to my knowledge - novel. The approach also seems to generalize easily to the detection of other concepts of human cognition from gaze data. The research is clearly described and well-motivated. I especially appreciate that the authors often go back to first principles and concrete examples to explain their design choices (e.g. when deriving the network architecture or explaining the evaluation approach), which makes the paper both thorough and accessible and instructive. The authors also promise the release of their source code, which is important for reproducibility. The analysis of the method is compact, but convincing.

Questions to authors
(blank)

Appendix B

We process the data before training our models by filtering out items shorter than 2s or with less than 65% valid rows. We define a row to be invalid if both ValidityLeft and ValidityRight indicate that the eye-tracker is not confident in the data it has captured. These columns are part of the raw data items but are not shown in Figure 3 for simplicity. We minimize the number of invalid values by identifying rows where at least one of ValidityLeft or ValidityRight is True, and then replacing the invalid features with the valid ones, as feature values related to the left and right eye are similar at a given point in time. The steps described up to this point mimic part of the process that the software used in [31] performs to prepare high-level features. Applying these steps, we discard 26 confused and 1328 not confused trials (similar to the number of trials discarded in [31]). All remaining invalid values are replaced with -1 (a value not occurring in our data otherwise).
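A minimal sketch of the filtering and validity repair described above is shown below, using pandas. The column names and boolean validity flags are assumptions based on the description in this appendix, and the actual pipeline may differ.

```python
# Illustrative sketch of the preprocessing in Appendix B (column names assumed).
import pandas as pd

MIN_DURATION_MS = 2000      # discard items shorter than 2 s
MIN_VALID_FRACTION = 0.65   # discard items with less than 65% valid rows

def preprocess_item(item: pd.DataFrame, left_cols, right_cols):
    """item: raw samples for one trial; left_cols/right_cols are paired feature columns."""
    valid_left = item["ValidityLeft"].astype(bool)
    valid_right = item["ValidityRight"].astype(bool)
    valid_any = valid_left | valid_right

    duration = item["Timestamp"].iloc[-1] - item["Timestamp"].iloc[0]
    if duration < MIN_DURATION_MS or valid_any.mean() < MIN_VALID_FRACTION:
        return None  # item is discarded

    item = item.copy()
    # Where only one eye is valid, copy its feature values over the invalid eye's.
    left_only = valid_left & ~valid_right
    right_only = valid_right & ~valid_left
    item.loc[left_only, right_cols] = item.loc[left_only, left_cols].values
    item.loc[right_only, left_cols] = item.loc[right_only, right_cols].values

    # Remaining invalid values become -1, a value that never occurs otherwise.
    item.loc[~valid_any, left_cols + right_cols] = -1
    return item
```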
