Predicting Affect in an Intelligent Tutoring System

by

Natasha Jaques

B.Sc. Hon., University of Regina, 2012
B.A., University of Regina, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

August 2014

© Natasha Jaques, 2014

Abstract
In this thesis we investigate the usefulness of various data sources for predicting emotions relevant to learning, specifically boredom and curiosity. The data was collected during a study with MetaTutor, an intelligent tutoring system (ITS) designed to promote the use of self-regulated learning strategies. We used a variety of machine learning and feature selection techniques to predict students' self-reported emotions from eye tracking data, distance from the screen, electrodermal activity, and an ensemble of all three sources. We also examine the optimal amount of interaction time needed to make predictions using each source, as well as which gaze features are most predictive of each emotion. The findings provide insight into how to detect when students disengage from MetaTutor.

Preface
This work is based on an ongoing project in the S.M.A.R.T. Lab at McGill University. The dataset used in this thesis was shared by collaborators from McGill University; we did not participate in designing or conducting the study. The data analysis portion of the project was conducted in UBC's Laboratory for Computational Intelligence under the supervision of Cristina Conati. The eye gaze data extraction and validation was originally conducted by Daria Bondareva. My role in the project has been in extracting and validating the distance and electrodermal data, and using these in addition to gaze data to conduct the machine learning experiments.

A version of the research described in Chapter 8 has been published as: Jaques, N., Conati, C., Harley, J., & Azevedo, R. (2014). Predicting Affect from Gaze Data During Interaction with an Intelligent Tutoring System. In Proceedings of ITS 2014, 12th International Conference on Intelligent Tutoring Systems, Springer.

Table of contents
Abstract
Preface
Table of contents
List of tables
List of figures
Acknowledgements
1 Introduction
1.1 Thesis goals and approach
1.2 Contributions of the work
1.3 Outline
2 Related work
2.1 Student modeling
2.2 Affective student modeling
2.3 Eye tracking and affect modeling
2.4 Posture and affect modeling
2.5 Electrodermal activity (EDA) and affect modeling
2.6 Affect modeling with multiple modalities
2.7 Summary
3 MetaTutor and self-regulated learning
3.1 Self-regulated learning
3.2 Overview of the environment
3.3 Description of study
4 Eye-tracking data
4.1 Gaze data validation
4.2 Eye-tracking features
4.2.1 Detailed vs. compressed AOI representation
4.2.2 EMDAT scenes
4.3 Distance features
5 Electrodermal activity features
5.1 EDA data validation and feature extraction
6 Machine learning experiments
6.1 Classification labels
6.2 Machine learning algorithms
6.2.1 Support vector machines
6.2.2 Random forests
6.2.3 Naïve Bayes
6.2.4 Logistic regression
6.2.5 Multilayer perceptron
6.2.6 Ensemble classification
6.3 Cross-validation
6.4 Results measures
6.5 Statistical analysis
7 Feature selection and overfitting
7.1 Effects of testing data contamination and overfitting
7.2 Comparison of feature selection techniques
8 Eye tracking results
8.1 Effect of time interval on affect prediction using gaze
8.1.1 Boredom
8.1.2 Curiosity
8.2 Analysis of eye-tracking features
8.3 Effects of self-report time on affect prediction
8.3.1 Boredom
8.3.2 Curiosity
8.4 Results of compressed areas of interest
8.4.1 Effects of window length with compressed AOIs
8.4.2 Important features for compressed AOIs
8.4.3 Effects of compressed AOIs on individual reports
8.4.4 Compressed AOI conclusions
9 Results of including additional distance features
9.1 Effect of window length on distance results
9.1.1 Boredom
9.1.2 Curiosity
9.2 Effect of report time on distance results
9.2.1 Boredom
9.2.2 Curiosity
9.3 Conclusions
10 Electrodermal activity results
10.1 Effect of window length on EDA results
10.1.1 Boredom
10.1.2 Curiosity
10.2 Effect of report time on EDA results
10.2.1 Effect of report time
10.2.2 Effects of report time and interaction effects
10.3 Conclusions
11 Summary of individual data sources
12 Combining all data sources
12.1 Ensemble classification of individual reports
12.1.1 Boredom
12.1.2 Curiosity
12.2 Ensemble classification of the entire dataset
12.2.1 Boredom
12.2.2 Curiosity
12.3 Conclusions
13 Conclusions and future work
13.1 Thesis goals satisfaction
13.1.1 Which data source is the most valuable for predicting affect in MetaTutor?
13.1.2 What do gaze features tell us about students' attention patterns in MetaTutor?
13.1.3 Can affect be predicted reliably by combining several sources?
13.1.4 Can curiosity be predicted reliably?
13.1.5 How much time is needed to detect an affective state?
13.2 Limitations
13.3 Future work
Bibliography
14 Appendix

List of tables
Table 2.1: Previous work in classifying affect from a single modality
Table 2.2: Previous work in classifying affect from multiple modalities
Table 3.1: Likert scale ratings for each emotion
Table 3.2: Proportion of 4 and 5 ratings for each emotion
Table 4.1: Gaze Features
Table 5.1: EDA Features
Table 8.1: Effects of window size on classifying boredom using gaze features
Table 8.2: Effects of window size on classifying curiosity using gaze features
Table 8.3: Effects of report time on classifying boredom using gaze features
Table 8.4: Effects of window on classifying boredom using compressed gaze features
Table 8.5: Effects of window on classifying curiosity using compressed gaze features
Table 8.6: Effects of report on classifying boredom using compressed gaze features
Table 8.7: Effects of report on classifying curiosity using compressed gaze features
Table 9.1: Effects of window size on classifying boredom using distance features
Table 9.2: Logistic Regression analysis of distance features
Table 10.1: Effects of window size on classifying boredom using EDA features
Table 10.2: Logistic regression analysis of EDA features
Table 10.3: Statistical analysis of the effects of report time using EDA data
Table 11.1: Best results in predicting boredom for each previous window test
Table 11.2: Best results in predicting curiosity for each previous window test
Table 11.3: Best results in predicting boredom from individual reports
Table 11.4: Best results in predicting curiosity from individual reports
Table 12.1: Individual report classifiers in the ensemble
Table 12.2: GLM results of ensemble classification of boredom by report
Table 12.3: GLM results of ensemble classification of curiosity by report
Table 12.4: The best classifier and window length combinations for each emotion and data source
Table 14.1: Effects of report time on classifying curiosity using gaze features
Table 14.2: Effects of window size on classifying curiosity using distance features
Table 14.3: Effects of report time on classifying curiosity using distance features
Table 14.4: Effects of window size on classifying curiosity using EDA features

List of figures
Figure 3.1: The MetaTutor interface
Figure 3.2: Default layout with image thumbnail
Figure 3.3: Full view layout in MetaTutor
Figure 3.4: MetaTutor's input mode
Figure 3.5: Dialogue in which a student sets two subgoals
Figure 3.6: Embedded notepad interface
Figure 3.7: Main Session Timeline
Figure 4.1: A Tobii eye tracker
Figure 4.2: Eye gaze data features
Figure 4.3: Seven AOIs in the detailed representation
Figure 4.4: Five AOIs in the compressed representation
Figure 6.1: SVM decision boundary
Figure 6.2: Example MLP
Figure 7.1: An overfit decision boundary
Figure 7.2: Difference in boredom accuracy results in performing feature selection on the entire dataset (solid lines), or using nested cross validation (dashed lines)
Figure 7.3: Difference in curiosity accuracy results in performing feature selection on the entire dataset (solid lines), or using nested cross validation (dashed lines)
Figure 8.1: Boredom accuracy as a function of the amount of interaction time used to train the classifiers
Figure 8.2: Curiosity accuracy as a function of the amount of interaction time used to train the classifiers
Figure 8.3: Curiosity kappa as a function of the amount of interaction time used to train the classifiers
Figure 8.4: Bored students show less attention on the image
Figure 8.5: Bored students show less attention on the OLG
Figure 8.6: Curious students show less attention on the Agent
Figure 8.7: Curious students (but not bored ones) attend to the TOC
Figure 8.8: Performance of boredom classifiers for each report
Figure 8.9: The features selected by WFS change with the self-report time
Figure 8.10: Performance of curiosity classifiers for each report
Figure 8.11: Boredom prediction accuracy achieved by the best classifiers using detailed AOIs (blue) vs. the best classifiers using compressed AOIs (red)
Figure 8.12: Curiosity prediction accuracy achieved by the best classifiers using detailed AOIs (blue) vs. the best classifiers using compressed AOIs (red)
Figure 8.13: Depiction of gaze trends detected with compressed AOIs, for engaged (yellow) and disengaged (red) students
Figure 8.14: Best individual report results for predicting boredom from a) the detailed AOI representation (blue) and b) the compressed AOI representation (red)
Figure 8.15: Best individual report results for predicting curiosity from a) the detailed AOI representation (blue) and b) the compressed AOI representation (red)
Figure 8.16: Curiosity prediction results for individual reports, obtained with compressed AOIs
Figure 9.1: Distance feature accuracy in predicting boredom by window length
Figure 9.2: Distance feature accuracy in predicting curiosity by window length
Figure 9.3: Distance feature kappa in predicting curiosity by window length
Figure 9.4: Distance feature accuracy in predicting boredom by report time
Figure 9.5: Distance feature accuracy in predicting curiosity by report time
Figure 10.1: EDA feature accuracy in predicting boredom by window length
Figure 10.2: EDA feature kappa in predicting boredom by window length
Figure 10.3: EDA feature accuracy in predicting curiosity by window length
Figure 10.4: EDA feature kappa in predicting curiosity by window length
Figure 10.5: EDA feature accuracy in predicting boredom by report time
Figure 10.6: EDA feature kappa in predicting boredom by report time
Figure 10.7: EDA feature accuracy in predicting curiosity by report time
Figure 10.8: EDA feature kappa in predicting curiosity by report time
Figure 12.1: Ensemble results for classifying boredom from individual reports
Figure 12.2: Ensemble results for classifying curiosity from individual reports
Figure 12.3: Overall boredom prediction accuracy for the ensemble and each data source
Figure 12.4: Overall boredom kappa scores for the ensemble and each data source
Figure 12.5: Overall curiosity prediction accuracy for the ensemble and each data source
Figure 12.6: Overall curiosity kappa scores for the ensemble and each data source
Figure 13.1: Inter- vs. intra-individual relationship between sleep and migraine headaches
Figure 14.1: Kappa scores for predicting boredom from eye gaze data

Acknowledgements
I would like to thank Daria Bondareva for her work on the gaze data processing and validation portion of this work. I am extremely grateful to Cristina Conati for her advice and supervision, and for providing me with many valuable opportunities to disseminate my work. This research was supported by the Natural Sciences and Engineering Research Council of Canada, the Microsoft Research Graduate Women's Scholarship Program, and the University of British Columbia Affiliated Fellowships program.

I would like to thank my parents, Kevin Jaques and Paula Sostorics, for their patience and understanding, and the encouragement and support they have given me throughout the years. I am incredibly grateful to my grandmother Mavis Jaques for her generous and unfailing support of my education. Finally, I would like to thank Andrew Schonhoffer for the comfort and encouragement he has given me throughout my Master's degree.

1 Introduction
Emotion plays a critical role in human behavior, interaction, and cognition [81]. Not only does it affect our attention, thoughts, actions, and motivation, but it affects how we relate to and communicate with others [83] [90]. An intelligent computer system that can recognize and interpret emotional cues may be able to interact more effectively with its users. The desire to cultivate this ability has led to the emergence of Affective Computing [55], which seeks to create interfaces that can react and adapt to clues about the user's emotional state. Affect-adaptive systems have already been shown to increase task success [125], motivation [77], and user satisfaction [73].

Affect sensitivity can be especially beneficial in educational contexts, where maintaining positive emotions can lead to increased learning [90] [65] [99] [16]. The research presented in this thesis focuses on emotion modeling in the context of an Intelligent Tutoring System (ITS). ITSs are educational environments capable of adapting to each learner's needs in order to provide personalized instruction and promote learning [131]. Since emotions impact learning [91] [60], recognizing and reacting to the user's emotional state could substantially improve an ITS, leading to a large body of research in this field (e.g. [27] [31] [105] [4]).
This thesis focuses on modeling affect experienced during learner interactions with MetaTutor, an Intelligent Tutoring System (ITS) designed to teach concepts about the human circulatory system, while supporting effective self-regulated learning (SRL) [6]. SRL requires that a student monitor her progress towards learning objectives, and apply strategies to manage cognition, behaviour, and emotion during learning. Studies on MetaTutor have collected a variety of student data, including think-aloud protocols, human-agent dialog, interaction log files, embedded quizzes, facial expressions recorded using FaceReader 5.0 software, Electrodermal Activity (EDA) collected using an Affectiva Q sensor, and eye gaze data collected with a Tobii T60 eye tracker [8]. Of the data sources collected, so far only facial expressions have been explored in the context of detecting affect [51]. While interaction logs could also be a fruitful avenue of inquiry, by their nature, any findings about specific actions available in the MetaTutor interface will not generalize easily to other systems. Therefore this thesis will examine the effectiveness of using EDA and eye tracking data to predict learner-centric emotions such as boredom and curiosity in MetaTutor.

1.1 Thesis goals and approach
The goal of this thesis is to examine the usefulness of various data sources for predicting affect with MetaTutor. We will compare the utility of eye tracking data, distance from the screen (an estimate of user posture), and EDA, and examine the type of results that can be achieved by combining these disparate sources. If affect can be predicted reliably, it would indicate that building an affect-adaptive ITS which used these sensors is worthwhile. Such a system could use gaze, distance or EDA to detect when the learner disengaged from MetaTutor, and intervene in order to maintain positive affect, thus improving the user learning experience. In addition to examining the relative efficacy of each data source, we will conduct tests that address other practical questions relevant to building an ITS that can respond to user affect. These include the interval of interaction time that should be used to detect when a student experiences an affective state, as well as whether multiple classifiers need to be trained for different stages of the MetaTutor learning experience.

The research questions we address are as follows:
1. Which data source is the most valuable for predicting affect in MetaTutor?
   a. Eye gaze?
   b. Distance from the screen?
   c. EDA?
2. What do gaze features tell us about students' attention patterns in MetaTutor?
3. Can affect be predicted more reliably by combining all sources?
4. Can curiosity be predicted reliably?
5. For each data source, what amount of interaction time is most effective for computing features predictive of affect?

1.2 Contributions of the work
The main contribution of our work is that we explore the usefulness of gaze data to predict learner affect, both as a single source compared to other common affect prediction sources, and in combination with other sources. The only other research that has investigated using eye-tracking in an affect-adaptive system [86] [124] has been limited to using hand-engineered heuristics to generate gaze-based affective interventions [124] (whereas we use machine learning methods), or has focused on non-gaze features such as pupil dilation to distinguish between positive and negative affect [86].
Unlike pupil dilation, gaze features provide insight into the user's attention to various interface elements, and are not sensitive to changes in luminosity. We also compare the usefulness of gaze against EDA, a common data source for affect prediction. We are not aware of any research that has compared the affect prediction performance of gaze data against that of EDA, or distance from the screen.

We find that gaze data provides valuable information for predicting learner affect. This conclusion is especially relevant given that software to perform eye tracking via simple webcams is being released under open-source licenses (http://sourceforge.net/projects/opengazer/), and Tobii (the maker of the eye tracker used in this study) has plans to release smaller and less expensive eye trackers embedded into laptops, or in the form of USB peripherals (http://arstechnica.com/gaming/2013/01/laser-vision-using-tobiis-gaze-tracker-to-control-games-with-my-eyes/). This makes eye tracking a much more viable way to collect data about a widespread population of users.

A second important contribution is that we show how a commercial eye tracker can be easily used to obtain information about the user's distance from the screen, and in turn, how this can be used to predict affect. Although a variety of work has shown that affect (in particular, interest) can be detected from posture (e.g. [85] [68] [32] [67]), this previous work has relied on complex matrices of pressure sensors embedded into a specially constructed chair. Such a system can often cost thousands of dollars. Our work shows that simply using distance from the screen allows boredom to be predicted reliably. Only a simple infra-red depth device is necessary to detect this feature, such as the Microsoft Kinect ($275; http://www.microsoft.com/en-us/kinectforwindows/).

Other contributions relate to goals 2, 3, and 4 outlined above. By uncovering which features are most predictive of each emotion, we gain insights into students' self-regulated learning behaviours, as well as effective methods for constructing an affect-adaptive MetaTutor. We test whether combining all three data sources into an ensemble provides a performance benefit. Further, we investigate curiosity, an emotion not frequently studied in the affective computing literature. Curiosity is considered an emotion related to interest [109], and was included in the study based on Pekrun's research into learner emotions and motivation [82]. We are aware of only a few other studies focusing on curiosity [27] [31] [98]. Finally, we contribute to the affective computing literature by testing the commonly held assumption that a short (20 second) interval should be used for affect labeling and prediction [45]. Rather than rely on this time interval, we test a range of intervals from 8.4 seconds up to 14 minutes, and demonstrate that the often-used 20 second interval may not always be appropriate.

1.3 Outline
The remainder of this thesis is organized as follows. Chapter 2 discusses related work on the topics of user modeling, student modeling, affective student modeling, and the use of sources such as EDA and posture for affect modeling. Chapter 3 introduces the MetaTutor learning environment and describes the study that was used to collect the data under investigation. The features we extract from the data are described in Chapters 4 (eye gaze features) and 5 (EDA features).
Our machine learning classification experiments are described in Chapter 6. Chapter 7 discusses the importance of feature selection methods for high-dimensionality data such as ours, and the consequences of fitting an overly complex model to such data. Chapters 8 through 12 present the results obtained: Chapter 8 presents classification results using gaze data, Chapter 9 using distance from the screen, and Chapter 10 using EDA data; Chapter 11 summarizes the results for the individual data sources, and Chapter 12 gives the final results obtained by combining the three data sources. In the last chapter, we state our conclusions and discuss possible limitations and future work.

2 Related work
This research is situated within the broad domain of user modeling, which concerns endowing interactive systems with the ability to model characteristics of their users, such as preferences, expertise, goals, cognitive abilities, or affective state [39] [74] [127]. There are numerous benefits to an effective user model. Firstly, the goal of human-computer interaction (HCI) is to facilitate humans' use of computers by making usable, useful interfaces [39]. With an effective user model, an interface may adapt itself to the user's needs, preferences, and abilities, allowing the user to perform tasks more efficiently, easily, and enjoyably [66] [23] [21]. User-adaptive systems also have widely recognized commercial value; for example, recommender systems (e.g. the winner of the $1,000,000 Netflix prize [76]) are designed to recommend products (or advertisements for products [57]) that users might enjoy. Even without an adaptive system built on top of the model, user models can still provide insight into the cognitive processes that underlie users' behaviour [127].

Although user models have many useful applications, the focus of this thesis is on modeling the user of an educational system, sometimes termed student modeling. Further, we focus on modeling the user's affective states. The rest of this chapter will outline research in both areas, as well as work that has been done on using eye tracking, posture, and electrodermal activity (EDA) for the purposes of student and user modeling. For work closely related to our own, we will report statistics related to the classification accuracy or kappa scores (whichever is available). Unless otherwise noted, all scores reported exceed a chance-level baseline. A summary of these scores is presented in Section 2.7.

2.1 Student modeling
Student modeling is a popular area, with research into student models dating back to the early 1980s [127]. Often it is the students' domain knowledge that is modeled, perhaps with a popular framework such as Bayesian Knowledge Tracing (e.g. [12]). Because student models are a core component of Intelligent Tutoring Systems [122], there are numerous examples of systems that contain a student model; it is beyond the scope of this thesis to review them all. For this reason we simply refer the interested reader to an overview of the topic [122], and limit our discussion to student models that are relevant to the research involved in this thesis.

Because MetaTutor is an ITS designed to scaffold self-regulated learning (SRL), we will briefly outline some studies related to student modeling and SRL. Highly relevant to our work is that of Bouchet, Azevedo, Kinnebrew and Biswas [20], who model behaviour in MetaTutor by clustering students based on logs of their interactions with the system.
Kinnebrew and Biswas [71] have also done work examining sequences of student behaviours in an ITS called Betty's Brain, which is very similar to MetaTutor; it is a hypermedia environment designed to scaffold SRL through the use of pedagogical agents. In terms of actually modeling SRL, Sabourin et al. used demographic information and interaction log data to predict students' use of SRL strategies, as labelled by human raters [107].

Eye tracking has been used in a variety of contexts related to education, including predicting learning gains [17] [69] [70], motivation [98], problem solving [3], and reading performance [115]. An early study [55] used eye fixations to model the underlying processes students used in tackling arithmetic word problems; it sought to understand whether the students constructed an internal model of the problem, or simply transcribed it directly. Anderson and Gluck [3] give descriptive statistics related to the eye movements generated by students using an ITS designed to tutor algebra. A discussion of how these statistics could be used to build a better, more proactive ITS is provided in [44]. Empirically, eye tracking data has been shown to significantly improve the performance of a probabilistic model designed to detect students' meta-cognitive behaviour [28].

2.2 Affective student modeling
Emotions experienced in an academic setting are related to students' motivation and academic achievement [21]. Further, the presence of an empathetic and supportive tutor or pedagogical agent has been shown to enhance learning [30], and reduce stress [22]. This provides strong evidence that an emotionally supportive ITS can enhance student achievement, motivation and enjoyment.

Adapting to boredom could be especially beneficial, because boredom has been associated with decreased learning gains [78] and off-task behaviour [107]. In fact, boredom tends to precede and co-occur with gaming the system, a behaviour so strongly linked to decreased learning gains that it has been found to be just as predictive of learning as a student's prior knowledge and academic achievement [11]. While boredom and disengagement correlate negatively with both user satisfaction and task success [42], engagement has been linked to increased user satisfaction [41], and decreased off-task behavior [107].

For these reasons, researchers have begun investigating how to detect and respond to learners' emotional states. Conati and Maclaren [27] used information about learners' personalities and interaction logs to model student emotions that occurred while playing an educational game for learning math. Using a Dynamic Bayesian Network, they achieved accuracies of 69% and 70% for predicting joy and distress, respectively. They also used the model to predict feelings of admiration or reproach towards a pedagogical agent, as per the Ortony-Collins-Clore (OCC) model of emotions [88], but could not predict reproach with accuracy above chance. Forbes-Riley et al. applied machine learning techniques to the problem of predicting human-created labels of disengagement from acoustic and dialog features in a spoken dialog ITS [41]. It should be noted that although the best learning algorithm (a random forest) achieved an accuracy of 83.1%, it did not exceed the majority-class baseline of 83.79% [41].

Physiological sensors have also been used to predict affect in an ITS context, including wireless skin conductance bracelets, pressure-sensitive seat cushions, and accelerometers [4].
By combining several data sources, including heart rate, skin conductance, posture, questionnaires and interaction logs, Sabourin et al. achieved prediction accuracies of 75% for boredom and 85% for curiosity, using a Dynamic Bayesian Network [106]. Affect can also be detected with a single sensor; D'Mello et al. obtained 60%, 64%, and 70% accuracy in predicting boredom in a dialog-based tutor using facial expressions, dialog, and posture, respectively [31]. In comparing the usefulness of different types of data for predicting boredom, confusion, delight, flow and frustration, they found that frustration is best detected with dialog features, boredom and flow with posture, and delight and confusion with facial expressions.

In a different study however, facial expressions have been found to be more predictive of frustration and boredom than dialog cues or body movement [35]. Interestingly, D'Mello and Graesser also found that the most useful sensor depended on whether the emotion label was collected regularly at a fixed interval, or arose from a spontaneous display of emotion [36]. For example, classifiers trained using facial expressions were not able to predict emotion with accuracy above the baseline unless the emotion arose spontaneously. A study which used facial expressions to predict boredom, confusion and frustration achieved kappa scores of .04, .22 and .23, respectively [19]. Arroyo et al. compared facial expressions detected with MindReader software to a variety of sensors, including a wireless skin conductance bracelet, pressure-sensitive seat cushions, and accelerometers, for predicting interest, excitement, confidence and frustration [2]. They found that facial expressions and actions taken by the students within the interface were the most useful for predicting emotions [2].

Closely related to our study is the work by Harley, Bouchet and Azevedo [51] on correlating the emotions experienced during interactions with MetaTutor with output from FaceReader 5.0 software. Because the FaceReader emotions do not map directly to the emotion self-reports used in the study, the authors had to develop their own mapping scheme, but still achieved 75.6% agreement. This suggests that the emotion self-reports collected during the MetaTutor study closely matched participants' actual behavior [51]. However, the necessity of the mapping scheme shows that FaceReader output cannot be directly used to detect the emotions we are interested in, namely boredom and curiosity. An important finding revealed by this study was that positive learning-related emotions (including curiosity) declined over the course of the interaction, demonstrating a need for affective interventions in MetaTutor [51].

2.3 Eye tracking and affect modeling
Eye tracking has been used extensively to understand the cognitive processes involved in searching and reading (e.g. [79] [63] [47]), and in user-adaptive visualization (e.g. [80] [118] [121] [120]), as well as in the educational applications mentioned above. However, so far only a few studies outside of psychology have addressed the relationship between eye gaze and affect, despite the fact that findings from psychological research have suggested that eye gaze signals such as blinking often or a lack of fixations on interface text may be indicative of emotions like boredom [117].

The link between pupil dilation and emotion has been well documented [86] [124]; increased pupil diameter may be indicative of stronger emotion [126].
This finding was incorporated in an affect-sensitive ITS that used eye tracking to respond to the user in real time [124]. When the user displayed signs of boredom such as decreased pupil size or wandering gaze, the system reacted using emotional displays to regain her attention [124]. However, these indications of user boredom were not learned from data, but rather were heuristics manually developed by the researchers. This study can be considered a proof-of-concept that a system can react in real time to a user's gaze patterns, but whether it will be accepted by users has yet to be fully evaluated.

Gaze Tutor [33] is another ITS that reacts to learners in real time using gaze data. It uses heuristics to trigger adaptive interventions designed to help users sustain attention to the ITS. For example, if a student does not look at the tutor or the pedagogical content for ten seconds, Gaze Tutor delivers an intervention message. While these interventions did cause students to fixate on the tutor more, students who received too many interventions were slower to re-orient attention, and the interventions did not result in higher learning gains [11].

2.4 Posture and affect modeling
In addition to eye tracking data, we also test features related to the students' distance from the screen. We expect that this simple measure will provide an approximation of posture, which has been linked to emotion [24], particularly learner interest [85] [68].

Most posture studies (e.g. [85] [31]) use a Body Posture Measurement System (BPMS), which typically consists of a chair equipped with two matrices of pressure sensors on the seat and back cushion. Such a system was used by D'Mello et al. [31] to detect boredom, confusion, flow, and frustration with accuracies of 70%, 65%, 74% and 72%, respectively, in college students using an ITS called AutoTutor. The posture features were obtained via a Tekscan automated BPMS, and the affect labels were rated by learners and trained judges.

Posture may in fact be one of the most effective emotion-detection modalities; Kapoor and Picard [67] found that posture was more informative than both facial features and information about the state of the interface for predicting interest. Using posture alone, they achieved 60.1% accuracy in detecting the interest level of children performing a computer-based learning task. Similar research by Mota and Picard [85] trained a neural network on the pressure data, and achieved 76.5% accuracy in rating three categories of learner interest. This study was later expanded [68] so that data from the chair was used to classify posture into broad categories (such as sitting upright, leaning back, etc.) and also to classify activity level.

2.5 Electrodermal activity (EDA) and affect modeling
The term electrodermal activity (EDA) refers to an increase in the activity of sweat glands, which can be detected by an increase in the conductivity of an electrical circuit attached to the skin [95]. Nerves in the sweat glands respond to arousal and fear [26], as well as a variety of emotional stimuli, making EDA a popular choice in affect modeling.

Many of the studies we have already described include EDA as one of several modalities used to detect affect. EDA was one of the data sources involved in a study [58] comparing affect detection performance of emotions experienced in an ITS (AutoTutor), against those solicited by asking participants to view emotionally arousing pictures from the International Affective Picture System (IAPS).
Another study used EDA and skin temperature to predict mind wandering, achieving a kappa score of 0.22 [15]. In a study of the impact of tutorial feedback, models based on EDA features obtained 52.56% accuracy in predicting whether a student had received positive or negative feedback [97].

Some researchers have experienced difficulty in extracting information from the raw EDA signal. There can be a great deal of inter-individual variation in participants' electrodermal response [97]. One study which attempted to distinguish a variety of emotions using EDA and other signals found that the physiological patterns within a single emotion could vary more greatly from day to day than the physiological patterns between two emotions [123]. Further, EDA can sometimes appear to be less informative than other modalities [97] [29] [4]. For example, when linear regression was used to select the most informative features out of EDA, facial expression software, and pressure sensors in a study on a geometry ITS [29], the authors found that none of the four EDA features they included (mean, SD, min and max) were selected.

This illustrates the importance of designing meaningful features when using EDA. Healey and Picard [53] found that the derivative of the EDA signal was more informative than the simple mean value, and that EDA tends to gradually increase for arousing emotions, and decrease for 'peaceful' emotions. They were able to combine EDA with other signals such as heart rate and respiration to distinguish between a wide range of emotions including anger, grief, and love. Other studies stress the importance of including features related to EDA peaks; that is, a sharp spike in the EDA signal [109] [14]. We believe that with careful feature design, EDA is a promising avenue for detecting affect.

2.6 Affect modeling with multiple modalities
Rather than predicting affect from a single data source, many researchers have found it fruitful to integrate data from several sources (e.g. [106]). However, this integration itself may be challenging, especially when there are numerous features available. Typically, the data sources are combined in one of two ways: feature-level fusion, or decision-level fusion. Feature-level fusion simply refers to combining the features from each data source into one large vector for each participant [68]. Most of the studies we have described so far which use multiple modalities combine them using a form of feature fusion. For example, Hussain et al. [58] combined EDA, heart rate, respiration, and facial muscle activity to obtain a total of 214 features. Because having such a large number of features was impractical for their problem, they used chi-square feature selection to retain only the most informative features, and discarded the rest.

The other method is decision-level fusion or ensemble classification, which refers to combining the decisions of several classifiers trained on different data sources [37]. The decisions can be combined in a variety of ways: using a simple majority vote, a weighted majority vote, Bayes' rule [103], or, if the classifiers provide probabilistic estimates of the class label, using the sum, product, maximum or minimum [72]. In a majority vote, as long as each of the classifiers in the ensemble can predict the classification label with accuracy slightly above chance, then the performance of the ensemble as a whole should exceed that of any of its members [43].
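As a concrete illustration of the decision-level fusion just described, the following minimal sketch combines the predictions of three hypothetical single-source classifiers with an unweighted majority vote and with a weighted vote. This is not code from this thesis or from the studies cited in this chapter; the labels, variable names, and weights are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_vote(predictions: List[str], weights: List[float]) -> str:
    """Return the label with the largest total weight across the base classifiers.

    Ties are broken arbitrarily by max(); a real implementation would need a tie rule.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def majority_vote(predictions: List[str]) -> str:
    """Unweighted majority vote: each base classifier contributes one vote."""
    return weighted_vote(predictions, [1.0] * len(predictions))

# Hypothetical per-source predictions for a single emotion self-report.
gaze_pred, distance_pred, eda_pred = "bored", "not_bored", "bored"
predictions = [gaze_pred, distance_pred, eda_pred]

print(majority_vote(predictions))                   # "bored": two of the three sources agree

# Placeholder weights standing in for per-classifier confidence or training performance.
print(weighted_vote(predictions, [0.4, 0.9, 0.4]))  # "not_bored": the distance vote outweighs the rest
```

Weighted voting of this kind is used in the study discussed next, where the weights are derived from each classifier's training error; the weights in the sketch above are arbitrary.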
Kapoor and Picard [67] tested several combination methods, including majority vote, in predicting interest from facial expressions, posture, and interface state. They achieved an accuracy of 67.8% by weighting the votes of each classifier based on its training error.

D'Mello and Graesser [36] tested both feature-level fusion and decision-level fusion, and found that the results were comparable. In the end they used feature-level fusion to combine features from conversational cues, body language, and facial features to classify boredom, engagement, confusion, frustration and delight. An interesting contribution of their work is that they investigated whether combining the different data sources gives "super-additive" results; i.e., the results of the combination are superior to a simple additive combination of the sources. They find that the combination of facial features and posture together is super-additive, meaning more information can be extracted from combining the two sources than either offers alone.

2.7 Summary
The following tables give a summary of the work reviewed in this chapter and the results obtained. In most cases the authors did not report both a kappa score and an accuracy statistic, so we provide whichever was given in the original work. In some cases the models are not able to predict emotion better than a chance-level baseline, as indicated in the final column of the tables.

Table 2.1: Previous work in classifying affect from a single modality

Author | Data sources | Classifying | Accuracy | Kappa | Exceeded baseline?
D'Mello et al. | Facial expressions | Boredom, confusion, frustration | 60, 76, 74 | - | Yes
D'Mello et al. | Dialog, acoustic features | Boredom, confusion, flow, frustration | 64, 63, 74, 70 | - | Yes
D'Mello et al. | Posture | Boredom, confusion, flow, frustration | 70, 65, 74, 72 | - | Yes
Mota & Picard | Posture | Interest | 76.5 | - | Yes
Kapoor & Picard | Posture | Interest | 60.1 | - | Yes
Pour et al. | EDA | Positive or negative feedback | 52.6 | < 0 | No
Bosch et al. | Facial expressions | Boredom, confusion, frustration | - | .04, .22, .23 | No, Yes, Yes
Sabourin et al. | Text-based status updates | Self-regulated learning | 54.5 | - | Yes

Table 2.2: Previous work in classifying affect from multiple modalities

Author | Data sources | Classifying | Decision or feature fusion? | Accuracy | Kappa | Exceeded baseline?
Conati & Maclaren | Personality traits, interaction patterns | Joy, distress, admiration, reproach | Feature | 69, 70, 66, 47 | - | Yes, Yes, Yes, No
Forbes-Riley et al. | Text, acoustic features, demographics | Disengagement | Feature | 83.1 | < 0 | No
Sabourin et al. | Heart rate, EDA, posture, personality, demographics, interaction patterns | Boredom, curiosity | Feature | 75, 85 | - | Yes
Kapoor & Picard | Facial expressions, posture, interaction patterns | Interest | Decision | 67.8 | - | Yes
Healey & Picard | EMG, EDA, heart rate, respiration | Arousal, valence | Feature | 83.5, 64 | - | Yes
Hussain et al. | EDA, heart rate, respiration, facial muscle activity (EMG) | Arousal, valence | Feature | - | 0.23, 0.35 | Yes
D'Mello & Graesser | Conversational cues, body language, and facial features | Distinguish between boredom, engagement, confusion, frustration and delight | Feature | 50.6 | 0.382 | Yes
Blanchard et al. | EDA, skin temperature | Mind wandering | Feature | - | 0.22 | Yes

3 MetaTutor and self-regulated learning
MetaTutor is an adaptive ITS designed to encourage students to employ meta-cognitive self-regulated learning strategies, while teaching concepts about the human circulatory system [7].
Self-Regulated Learning (SRL) is the ability to actively and efficiently manage learning through monitoring and strategy use, including regulating aspects of cognition, behaviour, emotions and motivation to achieve learning objectives [6]. SRL can be a powerful predictor of students' learning gains and academic success [107]. MetaTutor, which was developed by Roger Azevedo's research group at McGill University [7], is designed to scaffold SRL by providing tools that allow students to evaluate their understanding of the content, and assess progress towards learning goals. The data we analyze in this thesis was collected from participants using MetaTutor.

This chapter will give an overview of SRL, followed by an introduction to the MetaTutor learning environment. Finally, we will describe the study used to collect the data analyzed in the rest of the thesis.

3.1 Self-regulated learning
Self-regulated learning (SRL) refers to a set of self-directed processes through which students efficiently manage their own learning [133] [48]. A widely accepted definition of SRL is the degree to which students actively participate in the learning process by adapting their cognition, motivation, and behaviour [133]. Self-regulated learners are proactive, and engage in strategy use and self-monitoring in order to attain their learning goals [48]. In general, self-regulated learners can be characterized as displaying initiative, perseverance and adaptive skill [133]. These beneficial traits are obtained in part because of motivational beliefs and metacognitive strategies held by the learner, but are also mediated by the learning context. Essentially, SRL is a constructive process in which learners use their past experience and information about the current environment to set goals, and regulate their behaviour to meet these goals [48].

Teaching environments that encourage or scaffold the use of SRL strategies have the potential to be highly beneficial, as a number of studies have shown that SRL leads to improved learning [48]. For both college students [113], and students in the seventh grade [93], a higher ability to monitor the entire learning process (self-monitoring ability) has been linked to better academic performance. Goal setting, particularly when coupled with this type of meta-cognitive awareness, also leads to better performance for college students [84] [100]. Even for graduate students, teaching goal setting has beneficial effects [114].

Unfortunately, many learners are not effective at monitoring their learning [48], and do not generate SRL strategies spontaneously [133]. This is problematic, because SRL is especially important for open-ended, personally directed forms of learning, such as discovery learning, self-selected reading, social learning, and seeking information from electronic sources, such as learning online [133]. A 1996 study [132] found that the performance of students who scored poorly on SRL ability decreased when they had more control over the learning process. The study varied how much control students had over how and what they learned using a hypermedia learning environment (one which contains hyperlinks that allow for non-linear navigation through the content). Low SRL students who were given a high degree of control performed worse than both high SRL students, and low SRL students who did not have control over the learning process.
Therefore it is necessary to help students gain the ability to regulate their learning process, so that they are able to guide their own learning from resources such as books and the internet. Fortunately, when students are trained to use SRL strategies, it is effective in producing superior learning [133].

Such research provides excellent motivation for creating Intelligent Tutoring Systems such as MetaTutor, which promotes the use of SRL strategies for students using a hypermedia learning environment. MetaTutor is based on Winne and Hadwin's SRL model [129] [128], which has been extended by Azevedo and colleagues [5] [9]. Like many theories of SRL, Winne and Hadwin's model includes phases of learning, which are as follows:
1) Task definition;
2) Goal setting and planning;
3) Studying tactics;
4) Adaptations to metacognition.
However, the Winne and Hadwin model makes an additional contribution by including a set of processes, based on Information Processing Theory (IPT), which are thought to influence student learning within each phase [48]. These processes include conditions, operations, products, evaluations and standards (COPES).

A brief overview of how the COPES factors interact within the phases of the Winne and Hadwin [129] [128] model is as follows: conditions include factors related to the task (such as instructional cues, time allotted, resources), and to the student's cognitive state (such as motivation, beliefs, and knowledge). These conditions lead the student to develop a set of standards that prescribe the quality of performance on the learning task that the student wishes and expects to achieve. Essentially, standards can be thought of as learning goals. Once the student begins the learning task, she uses a variety of operations - searching, monitoring, assembling, rehearsing, translating (SMART) - which lead her to develop a set of products. The products depend on the learning phase; for example, in phase two they are a set of goals and plans, while in phase three they are attempts at learning. The products are then compared to the student's standards or learning goals. Self-monitoring occurs when the student evaluates the fit between the products she developed and her standards. External evaluations may also play a role, but they may also fail to influence the student's behaviour if her own standards are met. If the products do meet the standards, then the student will move on to the next phase; otherwise, she will re-cycle through the current and previous phases in order to improve the products. It is important to note that this model is not linear or sequential; it is fluid, and there is no assumption that phases listed earlier in the model must occur before later phases.

Using Winne and Hadwin's model, Azevedo and colleagues constructed the MetaTutor environment to scaffold the use of SRL strategies like monitoring and goal setting in the context of a hypermedia learning environment. The next section will describe MetaTutor in detail.

3.2 Overview of the environment

Figure 3.1: The MetaTutor interface

MetaTutor (Figure 3.1) is a hypermedia learning environment that contains 38 pages of text and diagrams about the human circulatory system. The content is organized via the table of contents (TOC) on the far left, which allows students to navigate to a new topic by clicking on the topic name.
The information related to the topic under study is dis-played in the center of the screen in two panes; the text panel on the left, and the image panel on the right. By default, only a thumbnail version of the image is displayed; in or-der to see the full version, the student must manually click the thumbnail (see Figure 3.2). There is also a full view mode, in which only the text and image contents are dis-played using the entire screen, and there is no access to the other components (Figure 3.3).   19   Figure 3.2: Default layout with image thumbnail  Figure 3.3: Full view layout in MetaTutor MetaTutor has a variety of components designed to scaffold the use of SRL, in-cluding pedagogical agents, an overall learning goal and subgoals completion bar, and the Learning Strategies Palette, which allows the student to initiate a variety of actions related to SRL strategies. There are four pedagogical agents (PAs) which appear in turn in the top right corner of the screen, with one agent present at all times. Each of the agents have a different tutorial role, and the time at which they appear is based on the 20  learner-system interaction. Pam the Planner appears at the beginning of the interaction to help the student choose two learning subgoals, and every time thereafter when the student needs to set new learning goals. Mary the Monitor‘s role is to aid the student in monitoring progress towards the current learning goal, an important task in SRL. Gavin the Guide helps the student to navigate the content, and Sam the Strategizer gives suggestions on how to use the available learning tools. The agents interact with the stu-dent through spoken prompts and feedback. The student can respond by selecting one of several options in a multiple choice question, or by typing text into the input field, as shown in Figure 3.4. When MetaTutor is awaiting typed feedback, no other interface actions are available, and the student must provide an answer be-fore proceeding to the next task. However the relevant agent will provide prompts in order to guide the student to the desired response.   This concept is illustrated in further detail in Figure 3.5 [18]. Goal setting is an important part of the way that MetaTutor scaffolds SRL. Firstly, each student is provided with the same overall learning goal (OLG) (located at the top of the interface) which is set by the system administrator. It tells the student that her goal is to learn as much as she can about the human circulatory system, and outlines specific concepts that the student should address. This could be likened to a task definition in the Winne and Hadwin model. Further, each student is required to set subgoals related to the OLG, which consist of topics personally chosen by the student that she would like to learn about. Figure 3.5 shows a student setting two learning goals with the help of the Pam the Planner agent. The goals are modified to ensure they are neither too specific nor too broad. All students are required to set two subgoals at the beginning of the interaction, but can update or change their goals at any time. Students can also complete a subgoal by passing a quiz. The subgoal completion bar (located at the top of the screen under the OLG), allows students to view their progress toward completing their subgoals.  
Figure 3.4: MetaTutor's input mode

Figure 3.5: Dialogue in which a student sets two subgoals

The learning strategies palette (LSP) is located beneath the PAs, and allows the user to initiate interactions with the agents that often involve the use of SRL strategies [19]. For example, all students are encouraged to take notes on what they learn using MetaTutor. Note taking is accomplished in two ways: 1) through an embedded note-taking interface (shown in Figure 3.6), and 2) via a digital pen connected to the computer. Other actions available in the LSP include writing summaries of the current content, or taking quizzes to evaluate the student's current understanding of the content. Students can also answer questions about the relevance of the content to the current subgoals. While students can initiate all of these actions themselves, they may also be prompted to perform them by a pedagogical agent.

Figure 3.6: Embedded notepad interface

3.3 Description of study
The data used in this analysis was collected from a 2012 study of 67 undergraduate students (82.8% female, 72.4% Caucasian) conducted at McGill University [17]. The goal of the study was to collect multichannel data to shed light on the role of the cognitive, metacognitive and affective processes that occur during learning with MetaTutor. To achieve this goal, participants were recorded using audio, video, a Tobii T60 eye tracker, and an Affectiva Q skin conductance sensor bracelet [8]. The study was conducted using a desktop computer with a Core 2 Duo 2.80GHz processor, 2GB of RAM, and a 17" monitor with a 1024x768 resolution. The MetaTutor application was run in full-screen mode.

The study was designed as an experiment that included two conditions, based on the type of feedback each learner received. Participants were randomly assigned to either an adaptive feedback or non-adaptive feedback condition. In the former, the prompts and feedback the student received were adapted to suit his or her performance. In this adaptive condition, the pedagogical agents (PAs) gave more prompts, directing students in when to use the SRL tools. In the non-adaptive condition, students received generic prompts and feedback, the PAs intervened in the learning process less, and students had more independence in selecting when and how to use SRL tools. Apart from feedback from the PAs, all other functionality remained the same. In general, the effects of the experimental condition are not the focus of this work. Earlier work which investigated the eye tracking data from this study [18] found no difference between the students in the adaptive and non-adaptive conditions. Although it is possible that the PAs' feedback affected the students' emotional states, we do not investigate this research question. Rather, we are interested in detecting the students' emotional state, regardless of how it is generated.

The study was conducted over two sessions. In the first, participants took 30 minutes to complete a pre-test on the circulatory system, a demographics questionnaire, and several self-report measures, including the trait academic emotions questionnaire [90]. An approximate timeline of the second session, averaged over participants, is shown in Figure 3.7 [18]. The session began with calibrating the recording instruments for the participant, including the eye tracker. Once the instruments were calibrated, a welcome video introduced participants to the major components of the MetaTutor interface.
Other tutorial videos could be viewed during later points of the interaction. Students began interacting with MetaTutor by being asked to set two subgoals, as described in the previous section. This took an average of 15 minutes [18]. The bulk of the second session consisted of a 60-minute learning session where participants browsed the learning material and interacted with the system features. Each participant was offered a 5-minute break in the middle of the session, and this time was not included in the 60-minute learning period. After the learning session ended, participants completed a post-test that assessed their learning gains.

Figure 3.7: Main Session Timeline

The 60-minute duration of the learning session does not include the time taken by participants to complete self-report questionnaires. Students self-reported their concurrent emotions using an Emotions-Value questionnaire (EVQ) developed by researchers at McGill University [51]. The EVQ consists of 19 basic and learning-centered emotion items, and is based on a subscale of Pekrun's Academic Emotions Questionnaire [90]. Each item consists of a statement about an emotion (e.g., "Right now I feel bored"), and was rated on a 5-point Likert scale where 1 indicated "strongly disagree" and 5 indicated "strongly agree" [51]. The EV questionnaire was filled out at the beginning of the learning session, and every 14 minutes thereafter during the one-hour learning session with MetaTutor [51], for a total of 5 self-reports per student.

The purpose of the present work is to use the multichannel data collected in this study to predict participants' self-reported emotions. Specifically, we will use data collected with the Tobii T60 eye tracker and Affectiva Q bracelet to predict participants' self-reported feelings of boredom and curiosity. We focus on these two emotions because a) they are learning-centred emotions, highly relevant to students' engagement with and ability to learn from MetaTutor, and b) they are among the most strongly reported emotions (those most frequently rated as 4 or 5 on the Likert scale). Table 3.1 shows the total number of each Likert scale rating obtained for each emotion. In Table 3.2, the number and proportion of strong ratings (4 or 5) are shown for each emotion. We can assume that strongly rated emotions were most frequently elicited by the MetaTutor environment. Aside from neutral, curiosity has the strongest rating. Although hope is slightly more strongly rated than boredom, we felt that boredom was an emotion that required more direct intervention than hope in order to improve the learning experience. The average Likert-scale rating for boredom was 2.60 (SD = 0.69), while for curiosity it was 2.93 (SD = 0.71) [51].
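For concreteness, the counts in Table 3.1 and the strong-rating proportions in Table 3.2 can be derived from the raw questionnaire responses with a few lines of code. The sketch below is an illustration only, using hypothetical variable names and toy data rather than the study's actual processing scripts.

```python
from collections import Counter

def rating_summary(ratings):
    """Summarize Likert ratings (1-5) for one emotion item, pooled over all
    self-reports: the count of each rating value, and the number and
    proportion of 'strong' ratings (4 or 5)."""
    counts = Counter(ratings)
    strong = counts[4] + counts[5]
    return {
        "counts": {r: counts[r] for r in range(1, 6)},
        "sum_4_or_5": strong,
        "proportion_4_or_5": strong / len(ratings),
    }

# Hypothetical pooled ratings of the item "Right now I feel bored".
boredom_ratings = [1, 3, 4, 2, 5, 3, 1, 4, 2, 3]
print(rating_summary(boredom_ratings))
```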
Table 3.1: Likert scale ratings for each emotion

                  1     2     3     4     5
Happy             76    115   275   122   31
Joy               90    149   209   127   44
Hope              72    95    252   144   56
Pride             113   128   249   101   28
Anger             407   92    90    24    6
Frustration       280   120   92    95    32
Anxiety           251   134   120   90    24
Fear              470   99    40    7     3
Shame             424   98    60    27    10
Hopelessness      415   106   68    18    12
Boredom           163   150   124   120   62
Surprise          342   103   105   61    8
Contempt          347   97    100   62    13
Disgust           520   59    35    3     2
Confusion         311   141   104   52    11
Curiosity         100   87    172   203   57
Sadness           484   88    39    4     4
Eureka            354   116   93    51    5
Neutral           77    64    231   123   124

Table 3.2: Proportion of 4 and 5 ratings for each emotion

                  Sum of 4 or 5    Proportion of 4 or 5
Happy             153              0.247173
Joy               171              0.276252
Hope              200              0.323102
Pride             129              0.208401
Anger             30               0.048465
Frustration       127              0.20517
Anxiety           114              0.184168
Fear              10               0.016155
Shame             37               0.059774
Hopelessness      30               0.048465
Boredom           182              0.294023
Surprise          69               0.11147
Contempt          75               0.121163
Disgust           5                0.008078
Confusion         63               0.101777
Curiosity         260              0.420032
Sadness           8                0.012924
Eureka            56               0.090468
Neutral           247              0.399031

4 Eye-tracking data
The gaze data analyzed in this study was collected with a Tobii T60 eye-tracker. The T60 is an unobtrusive eye tracker embedded in a 17" monitor. Sixty times a second, the Tobii records a sample that contains a time stamp, and the coordinates on the screen where the user's gaze is fixated. At a distance of 65 cm from the user, the Tobii is able to track gaze with an accuracy ranging from 0.4º-0.7º, and a precision between 0.18º-0.36º [119]. In contrast with head-mounted eye trackers, the Tobii is completely non-invasive, having the same appearance as a typical 17" monitor (see Figure 4.1). This allows the Tobii to track eye gaze while the user interacts normally with the computer interface under investigation. In fact, the user has a 44 x 22 x 30 cm range of motion in which the tracking functions. This natural mode of interaction helps the data collected to be a more valid representation of students' actual emotional behaviour in a naturalistic setting. However, this additional user freedom comes at the price of a greater number of noisy or invalid eye gaze samples.

Figure 4.1: A Tobii eye tracker

After data collection, Tobii Studio software can be used to export raw gaze data in the form of .tsv files. To process these files, we use an open source package for gaze data analysis developed in our lab, the Eye Movement Data Analysis Toolkit (EMDAT)4. EMDAT can compute a variety of aggregate gaze features, as well as perform data validation. Section 4.1 will discuss the validation process in more detail, Section 4.2 will explain the gaze features we were able to extract using EMDAT, and Section 4.2.2 will detail a modification that was made to EMDAT for this project, which allows for the computation of features related to the participant's distance from the screen.

4 http://www.cs.ubc.ca/~skardan/EMDAT/index.html

4.1 Gaze data validation
Validation is necessary in order to ensure that the gaze data collected for each user is reliable enough to extract the user's actual attention patterns. The validation process taken for the gaze data used in this study was originally developed in [69], and later expanded and described in detail in [18]. We will briefly summarize it here.
Interaction logs were used to discard portions of the interaction during which the user was taking a break or watching tutorial videos about how to use the interface, with the reasoning that these are irrelevant to research questions about how students learn or express emotion while using MetaTutor. Since Tobii cannot track gaze when the user looks away from the screen, portions of the interaction in which the student was taking notes were also removed, because they involve the student looking away from the screen frequently.    Despite manually removing portions of the interaction that contain problematic behaviours, the gaze data can still contain many invalid samples. EMDAT is able to compensate for some of these invalid samples by inferring the user‘s gaze position. For example, if the user is fixated at a certain point on the screen before and after an invalid sample, it is likely that her gaze remained in that position, and thus the invalid sample is replaced. However, a continuous sequence of invalid samples spanning more than 300ms is automatically removed by EMDAT in a step called autopartioning. The result of this step is a series of usable interaction segments produced by EMDAT.  The full gaze validation process is as follows: first, remove any participants who had less than 75% valid gaze samples overall. The 75% threshold was chosen as a good trade-off between eliminating too many participants, and being left with unreliable data. Then, use EMDAT to autopartition the data into segments, and discard any seg-ments which have less than 75% valid samples, since segments with sparse valid sam-ples are not likely to be reliable. Finally, remove any participant that did not retain at least 80% of their initial gaze data after the autopartitioning and segment removal pro-cess, because if a large proportion of the data for one participant is removed, it is un-likely that the remaining data contains a robust trend. After applying this process to our original 67 participants, we were left with gaze data from a total of 51 participants.  28  4.2 Eye-tracking features  Figure 4.2: Eye gaze data features Eye gaze data takes the form of fixations on a single point, and saccades, which are the paths between two consecutive fixations. Figure 4.2 provides a visualization of these concepts. The circles, such as A, represent fixations, while the line segments (e.g. B) represent saccades. The angles marked x and y in Figure 4.2 correspond to the abso-lute path angle (x), which is the angle between a saccade and the horizontal plane, as well as the relative path angle (y), which is the angle between two consecutive sac-cades.  EMDAT allows us to extract aggregate statistics relating to these features. The fixation- and saccade-related features we chose to include are motivated by work by Goldberg and Helfman [45], and are listed in the first column of Table 4.1. This set of features has been used successfully to predict learning with MetaTutor [18], and learn-ing in an interaction simulation [69].  Tobii also collects data about the user‘s pupil size at each time step, and the abil-ity to extract that information has recently been added to EMDAT. Although there is a documented link between pupil dilation and emotion [86] [124] [126], we did not include features related to pupil dilation in this study because the data was collected in a room with a window, and pupil dilation is more sensitive to luminance than to affect [126]. 
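To make the saccade-based measures from Table 4.1 concrete, the short sketch below computes a few of them from a toy fixation sequence. It is only an illustration of the definitions above (with simplified angle handling), not EMDAT's actual implementation, and the fixation data are hypothetical.

```python
import math

def saccade_features(fixations):
    """Compute basic saccade statistics from a list of (x, y, duration_ms) fixations."""
    lengths, abs_angles, rel_angles = [], [], []
    for (x0, y0, _), (x1, y1, _) in zip(fixations, fixations[1:]):
        dx, dy = x1 - x0, y1 - y0
        lengths.append(math.hypot(dx, dy))      # saccade length in pixels
        abs_angles.append(math.atan2(dy, dx))   # absolute path angle (vs. the horizontal)
    for a0, a1 in zip(abs_angles, abs_angles[1:]):
        rel_angles.append(abs(a1 - a0))         # relative path angle (simplified: no wrap-around)

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "mean_saccade_length": mean(lengths),
        "mean_abs_path_angle": mean(abs_angles),
        "mean_rel_path_angle": mean(rel_angles),
    }

# Hypothetical fixation sequence: (x, y, duration in ms).
print(saccade_features([(100, 200, 250), (400, 210, 180), (380, 500, 300), (120, 480, 220)]))
```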
A major benefit of eye gaze data is that it reveals the user's attention patterns to different parts of an interface. In addition to aggregate eye gaze features, we are able to use EMDAT to compute features describing attention to specific Areas of Interest (AOIs) within the MetaTutor interface. The Tobii software allows us to dynamically define these AOIs by describing a polygonal area around some component of the interface (for example, the text content in the centre of the screen). EMDAT can then compute gaze features related to the user's attention to that AOI, as well as features related to gaze transfers between each pair of AOIs. If a user fixates on the text, and then fixates on the table of contents (TOC), that constitutes a transfer from text to TOC. Gaze transfer features and other AOI-specific features that we have included in our dataset are listed in the second column of Table 4.1. Based on previous work [18] [69], we do not include saccade features for each AOI, as this would lead to an explosion in the number of features. Unlike previous work with this data [17], we also include time to first fixation (the time in milliseconds it takes for the user to fixate on the AOI after it first appears on screen) and time to last fixation. We felt that these features might provide additional information about how the user's attention is attracted to different parts of the interface. In the next two sections, we will discuss two different approaches for defining AOIs in the MetaTutor interface.

Table 4.1: Gaze Features

Basic Gaze Features                        AOI-specific Gaze Features
Fixation rate                              Fixation rate
Number of fixations                        Number of fixations
Mean fixation duration                     Longest fixation
Std. dev. fixation duration                Proportion of fixations
Mean saccade length                        Proportion of time
Std. dev. saccade length                   Time to first fixation
Mean absolute saccade angles               Time to last fixation
Std. dev. absolute saccade angles          Number of transfers to every other AOI
Mean relative saccade angles               Proportion of transfers to every other AOI
Std. dev. relative saccade angles

4.2.1 Detailed vs. compressed AOI representation

Figure 4.3: Seven AOIs in the detailed representation

Our first AOI representation contains seven AOIs, which correspond to the main components of MetaTutor discussed in Section 3.2. These AOIs are shown in Figure 4.3, and include the Text Content, Image Content, Overall Learning Goal (OLG), Subgoals, Learning Strategies Palette (LSP), Agent, and Table of Contents (TOC). We chose these seven components based on their successful use in [17] to predict learning. This representation, which we will hereafter refer to as the detailed AOI representation, gives us a total of 157 gaze features.

Although the detailed representation did prove useful in our initial tests, we had several concerns which led us to develop and test an alternative set of AOIs. First, by having so many AOIs we increase the number of features that are input to our model. As Chapter 7 will explain, this increases our chance of overfitting the data, reducing our ability to build a model that is able to generalize to new users. It also makes it more difficult to add features from other sources to the model. Further, while presenting this work, we received a useful suggestion: why not include an AOI for the Clock in the top left corner of the screen? It seems reasonable that a student who is bored might spend more time watching the clock to see when the session will be over.
However, when we tried simply adding the Clock AOI to our initial de-tailed representation, we found that it negatively impacted our results. We felt that the most likely explanation was that there were simply too many features from which to build a generalizable model. Therefore we sought a sparser representation of the AOIs to which we could more easily add features. We collapsed some of the contiguous AOIs that were conceptually related, resulting in a new representation with five AOIs: Goals, Learning Tools, Table of Contents, Content, and Clock. These AOIs are shown in Fig-ure 4.4. In the compressed AOI representation we have a total of 95 gaze features.  Figure 4.4: Five AOIs in the compressed representation 4.2.2 EMDAT scenes In addition to computing features for each AOI, EMDAT can compute features over dif-ferent time intervals, called scenes. As Section 4.1 explained, segments are time inter-vals that contain valid eye gaze data from which we would like to extract information. Multiple segments can be grouped into one scene, which spans a longer time interval – perhaps several minutes. When EMDAT computes the gaze features for each partici-pant, it will also provide gaze features for each scene. In this experiment, scenes are used to compute features for each time interval that has an emotion classification label. 32  Since participants self-reported their emotions every 14 minutes, we create a scene for each 14 minute interval preceding an emotion self-report. We then attach a classifica-tion label based on the report to the features computed for each scene.  4.3 Distance features In addition to data about the user‘s fixations and saccades, the Tobii T60 also uses in-frared to collect measurements of the distance to the participant‘s eyes from the eye tracker (which is also the technology that allows the Microsoft Kinect to collect distance data). A distance measurement in millimetres is given for each eye at each timestamp. Because this distance to the screen is an indication of whether the participant is leaning forward or sitting back, we felt that it would be a good approximation of his or her pos-ture. Previous research has successfully used posture collected from matrices of pres-sure sensors on the seat and back of a chair to predict learner interest [85]. Although the eye tracker does not provide this level of detailed information, the distance infor-mation provides an estimation of posture that is easy to extract, since it is automatically obtained with the Tobii as the gaze data is being collected.  EMDAT includes capabilities for calculating many features from the raw eye gaze data collected with Tobii. As part of this research, we added functionality to EMDAT that computes additional distance features, and the source code is now freely available on Github5. Distance for each timestamp is computed as the average of the distance to each eye, if Tobii collected valid samples for both eyes. If data could not be collected for one eye, the distance to the remaining eye is used. After the distance has been com-puted for each time stamp, EMDAT creates aggregate distance features for each seg-ment, scene and participant. We have added the ability to calculate the following dis-tance features to EMDAT: mean, standard deviation, maximum, minimum, start, and end (meaning the distance measurement collected at the start and end of the interac-tion). 
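A minimal sketch of how these aggregate distance features can be computed from the raw per-timestamp samples is shown below. It illustrates the logic only, using hypothetical data, and is not the code that was added to EMDAT.

```python
from statistics import mean, pstdev

def distance_features(samples):
    """Aggregate distance features from per-timestamp Tobii samples.

    samples: list of (left_mm, right_mm) eye-distance pairs, where an invalid
    eye is recorded as None. Each timestamp's distance is the average of the
    valid eyes; timestamps with no valid eye are skipped.
    """
    distances = []
    for left, right in samples:
        valid = [d for d in (left, right) if d is not None]
        if valid:
            distances.append(sum(valid) / len(valid))
    return {
        "mean": mean(distances),
        "stddev": pstdev(distances),
        "max": max(distances),
        "min": min(distances),
        "start": distances[0],
        "end": distances[-1],
    }

# Hypothetical sequence of four samples, one with a missing left-eye measurement.
print(distance_features([(640.0, 642.0), (None, 655.0), (630.5, 633.1), (648.0, 650.2)]))
```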
For the purposes of our research, we focus only on the first four features, because we are wary of needlessly increasing the dimensionality of our feature set. Based on previous research [85], we hypothesize that distance, to the extent that it approximates the user‘s posture, will provide enough information to predict user engagement.                                              5 https://github.com/ATUAV/EMDAT/tree/DistanceFeatures/src 33  5 Electrodermal activity features This chapter will describe the electrodermal activity (EDA) data that was collected using the Affectiva Q sensor6, a wrist-worn bracelet which records various physiological sig-nals that may relate to the user‘s emotional state. EDA is a measure of electrical signals that are sent from the brain to the skin, often when a person is experiencing emotional arousal, increased cognitive workload, or physical exertion. Many studies have used EDA data for predicting affect (eg. [58] [53] [123] [4]). EDA is one of the few physiological signals that is controlled purely by activity in the sympathetic nervous system [95]. The sympathetic nervous system works in concert with the parasympathetic nervous system, which together make up the autonomic nerv-ous system (ANS). While the parasympathetic system deals with conserving and restor-ing bodily energy, the sympathetic system responds to external stressors by elevating heart rate and blood pressure, redirecting blood to the muscles, lungs, heart and brain, and of course, increasing sweating or EDA. Colloquially, the parasympathetic nervous system is referred to as ―rest and digest‖, while the sympathetic nervous system is ―fight or flight‖. The hypothalamus, which is influenced by structures in the limbic system, con-trols the arousal of the sympathetic nervous system. Because these structures are in-fluenced by emotion, measuring activation of the sympathetic nervous system allows us to detect emotional changes.  When a person becomes emotionally aroused, nerve fibers from the sympathetic nervous system, which surround eccrine sweat glands, increase their activity and there-by increase sweat secretion [95]. Because sweat is an electrolyte and a good conduc-tor, increased sweat secretion can measurably increase the conductance of an applied current. The measured changes in conductance on the surface of the skin are referred to as electrodermal activity (EDA), which is measured in microSiemens (µS). The data for this study consists of measurements of EDA (in µS) collected every 125 millisec-onds. We compute features from this raw EDA signal, and test these in our classification experiments. The remainder of this chapter will discuss how we performed data valida-                                            6 http://www.qsensortech.com/resources/#eda-data  34  tion for participants who provided EDA data, and how we computed the EDA features that we will use as input to our classification algorithms.    5.1   EDA data validation and feature extraction The average EDA over all participants was 1.78 µS (SD = 3.95). We found two partici-pants who had an average EDA more than two standard deviations from the group av-erage (9.92 and 28.07 respectively), and chose to exclude these participants as outliers. Further, four participants had missing data and could not be included. This left us with a total of 56 participants to be used in the EDA classification experiments. In order to design our EDA features, we surveyed a number of studies on the topic [109] [14] [53] [58] [29] [123]. 
Following [53], we began by normalizing the EDA signal for each participant as follows:

$E_{norm}[t] = \frac{E[t] - \mu}{E_{max} - E_{min}}$

where $E[t]$ is the raw EDA sample at time t, $\mu$ is the mean EDA for that participant, and $E_{max}$ and $E_{min}$ represent the maximum and minimum values, respectively. Note that this normalization method uses statistics related to the data collected over the entire interaction, whereas ideally we would want the tutor to adapt to changes in a student's EDA signal before the interaction was complete. Therefore this can be considered a proof of concept; if the system were actually implemented, a baseline measurement would need to be collected from each participant in order to use this technique. Each participant would have to undergo a calibration phase, in which their min, max, and average EDA signal was computed independently of using the tutor.

The first four features we extracted were the mean, standard deviation, minimum and maximum of the normalized signal. Although many studies simply include these four basic features, researchers have found that features relating to the slope or derivative of the signal provide valuable additional information for classifying emotion [53]. Therefore we also calculated the mean, standard deviation, minimum and maximum of the first derivative of the EDA. For our dataset, the average EDA derivative over all users was found to be -.0000377 µS/s, while the median was -.0000152. The negative derivative indicates that the participants' EDA tended to decay over time, which was also observed in [14] when participants were at rest.

Sano et al. [109] suggest computing features relating to EDA peaks, where a peak is defined as a point at which the derivative of the EDA measure exceeds 0.5 microSiemens (µS) per second. However, this peak threshold was used with a dataset that measured EDA over the course of an entire night's sleep, and may not be appropriate for our application, which concerns a much shorter time frame. Blain et al. [14] suggest using a threshold of 0.05 µS in 5 seconds. This is equivalent to a derivative value of 0.01 µS/s. Using Blain's threshold, we found that the average number of peaks per participant over the entire interaction was 4847.4 (SD = 2767.4).

At first, we felt that this sort of absolute peak threshold might not be as appropriate as a relative threshold, computed differently for each participant. Therefore we also tested a threshold that defines a peak for a given user as a point at which the EDA derivative is greater than one standard deviation above that user's mean EDA derivative. However, in our initial tests we found that this user-specific definition did not perform as well as Blain's absolute threshold. Therefore we use the 0.01 µS/s threshold for computing peaks in our final results.

The peak features we calculate include the number of peaks over the entire interaction interval. We also compute additional features by dividing the interval into 30-second epochs, as in [58]. We compute the number of peaks for each epoch, and then take the mean, median, standard deviation, minimum and maximum of this peaks/epoch data series.
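The feature-extraction pipeline just described can be summarized in a short sketch. The code below is a simplified illustration assuming NumPy, with a hypothetical synthetic signal; it counts every sample whose derivative exceeds the threshold as a peak, and is not the exact script used for the thesis.

```python
import numpy as np

SAMPLE_PERIOD_S = 0.125   # EDA sampled every 125 ms
PEAK_THRESHOLD = 0.01     # derivative threshold in microSiemens per second [14]
EPOCH_S = 30              # epoch length in seconds

def eda_features(eda):
    """Compute the EDA features described above for one participant's raw signal."""
    eda = np.asarray(eda, dtype=float)
    norm = (eda - eda.mean()) / (eda.max() - eda.min())   # per-participant normalization
    deriv = np.diff(eda) / SAMPLE_PERIOD_S                # first derivative, in uS/s

    # Count 'peaks' (samples whose derivative exceeds the threshold) in 30-second epochs.
    per_epoch = int(EPOCH_S / SAMPLE_PERIOD_S)
    is_peak = deriv > PEAK_THRESHOLD
    peaks_per_epoch = [int(is_peak[i:i + per_epoch].sum())
                       for i in range(0, len(is_peak), per_epoch)]

    return {
        "norm_mean": norm.mean(), "norm_std": norm.std(),
        "norm_min": norm.min(), "norm_max": norm.max(),
        "deriv_mean": deriv.mean(), "deriv_std": deriv.std(),
        "deriv_min": deriv.min(), "deriv_max": deriv.max(),
        "total_peaks": int(is_peak.sum()),
        "peaks_epoch_mean": float(np.mean(peaks_per_epoch)),
        "peaks_epoch_median": float(np.median(peaks_per_epoch)),
        "peaks_epoch_std": float(np.std(peaks_per_epoch)),
        "peaks_epoch_min": int(min(peaks_per_epoch)),
        "peaks_epoch_max": int(max(peaks_per_epoch)),
    }

# Two minutes of hypothetical, synthetic EDA data (960 samples at 8 Hz).
rng = np.random.default_rng(0)
signal = 1.5 + 0.3 * np.sin(np.linspace(0, 6 * np.pi, 960)) + 0.02 * rng.standard_normal(960)
print(eda_features(signal))
```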
Using all of these methods, we have the ability to compute 14 EDA features. However, it is not clear that it is necessary to include all of these features as input to our classifiers. Since correlated features can interfere with the performance of several of the classification algorithms we will use (e.g. [61]), we computed a correlation matrix on these features. We found five correlations that had a Pearson's r value above .9, and consequently removed the following features to ensure no further strong correlations exist within the data: the mean number of peaks, median peaks, and maximum peaks. This leaves us with a total of 11 EDA features, as shown in Table 5.1.

Table 5.1: EDA Features

                     Mean    Std. Dev.    Max    Min    Total Number
Normalized Signal    ✓       ✓            ✓      ✓
First Derivative     ✓       ✓            ✓      ✓
Peaks                        ✓                   ✓      ✓

6 Machine learning experiments
This chapter presents the design of the machine learning experiments we conducted for this thesis. We discuss how we frame the classification problem, the algorithms and techniques we use for classification, and the measures and statistical tests we use to report the results.

6.1 Classification labels
Intuitively, boredom and curiosity might seem like opposing, mutually exclusive states, and indeed some research efforts have treated them as such (e.g. [40]). However, the data in our study does not support this assumption. While there was a significant negative correlation between the ratings of boredom and curiosity (r = -.333, p < .001), in 18% of the self-reports both curiosity and boredom were rated as present simultaneously, and in 13% they were both absent. Therefore we cannot assume that boredom is the opposite of curiosity in the sense that when one is bored one can never be curious, or vice versa. For this reason we prefer to separate the ratings of boredom and curiosity into two independent, binary classification problems.

Classification labels were based on the EV self-reports that students completed every 14 minutes. We did not include the first round of reports, because they were collected before participants began using the learning environment, and thus reflect participants acclimatizing to the interface rather than learning. Ratings of 3 or higher were labeled as Emotion Present (EP), and ratings of less than 3 were labeled as Emotion Absent (EA), as in [51]. We consider a rating of 3 to be a moderate expression of emotion, because statements in the EV questionnaire assert the presence of an emotion (e.g., "Right now I feel bored"); thus if participants did not feel the emotion they would choose a rating of 1 or 2 to express disagreement with the statement.

6.2 Machine learning algorithms
For the classification experiments described in the next sections, we report results from five algorithms available in the Weka data mining toolkit: Random Forests (RF) [22], Naïve Bayes [61], Logistic Regression [25], Support Vector Machines (SVM) [94], and Multilayer Perceptron (MLP) [130]. We initially focused on a broader range of classifiers which were found to be the most effective for predicting learning by previous research on the same dataset [17], but we eliminated the Simple Logistic Regression classifier when it did not show promising performance in initial tests involving eye gaze data. Further, we eliminated the MLP classifier from gaze tests because its performance was not impressive with this data. We added the SVM classifier based on the multitude of other research that has successfully used support vector machines with physiological data (e.g. [64]) and for affect detection (e.g. [82]). We will now give some theoretical background on each of the algorithms.
6.2.1 Support vector machines
Support Vector Machines (SVMs) have recently gained in popularity because they are a powerful classifier that subsumes other types of neural network and polynomial classifiers as special cases, and yet they are a simple enough representation to be amenable to theoretical analysis [54]. A SVM is a binary classifier that corresponds to a linear decision function of the form:

$f(x) = \mathrm{sign}(w \cdot x + b)$

where w represents a hyperplane, or a vector of weights applied to the input data x, and b is the offset of the hyperplane from the origin. For a given dataset that is linearly separable, there are many possible hyperplanes that could divide the data into positive and negative examples. However, SVM is unique in that it attempts to maximize the margin, or the distance between the decision boundary hyperplane and the point closest to it. Figure 6.1 presents a visualization of this concept. Maximizing the margin is desirable because in general, the generalization error decreases as the size of the margin grows larger [54].

Figure 6.1: SVM decision boundary

Obviously not all data are linearly separable. Fortunately, the SVM algorithm is able to deal with this problem by mapping the data into a higher-dimensional feature space, and finding a maximum margin decision boundary in this space. Thanks to the kernel trick this mapping need not be explicit, and no computations in the higher-dimensional space need occur. SVMs use a kernel function, which can be represented in the following form:

$K(x, y) = \Phi(x) \cdot \Phi(y)$

This implies that only the dot products on the input data need to be evaluated. It has been proven that any kernel which can be represented as a positive matrix has this property [54]. The kernels available in Weka include a polynomial kernel of the form:

$K(x, y) = (x \cdot y + 1)^p$

or a radial basis function (RBF) kernel of the form:

$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right)$

where p and γ are kernel parameters. The final decision function is then:

$f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$

where the $\alpha_i$ are learned coefficients, and the $x_i$ and $y_i$ are the training points and their labels. The Weka implementation of SVM is based on Platt's paper on Sequential Minimal Optimization (SMO) [94]. Normally, training a SVM requires finding the solution to a large quadratic programming (QP) optimization problem involving a matrix that has n² elements, where n is the size of the training set. SMO instead analytically solves a series of QP problems of the smallest possible size. This reduces the memory requirements so that they are linear in the size of the training set, and produces an efficient SVM implementation, especially for linear SVMs and sparse datasets.

6.2.2 Random forests
Random Forests are an ensemble method, in which many weak decision tree classifiers are trained, and vote for the most popular classification label [22]. The trees are constructed with a certain degree of randomness, hence the name. This can sometimes involve bagging, in which the data points used to construct the tree are randomly sampled from the data set with replacement. Another technique is random split selection, in which the split used at each node within a tree is randomly sampled from among the k best splits. The Weka implementation of Random Forests is based on a 2001 paper by Leo Breiman [22], which advocates for the use of bagging along with random feature selection, where a random selection of features is used to split each node in the tree. Breiman's method yields forests with error rates that improve upon Adaboost, and are more robust to noise [22]. Breiman also gives bounds on the generalization error of random forests.
Firstly, he uses the Law of Large Numbers to show that the generalization error converges as the number of trees in the forest becomes large, and therefore that random forests do not overfit as more trees are added. Secondly, he shows that the upper bound on the generalization error depends on the strength of the individual trees and a lack of correlation between them. He gives empirical results demonstrating that trees which use more features (and thus have more decision nodes) do not tend to provide a sufficient increase in strength to compensate for the increased correlation with other trees using the same features. He concludes that in general, trees which involve a smaller number of features lead to superior performance. However, this may not be the case in a situation where there are a large number of weak features, of which no single feature or small group of features can distinguish between the classes. In this case, a large number of features per tree was found to be more effective. Breiman states that "forests seem to have the ability to work with very weak classifiers as long as their correlation is low" (p. 18). This lack of correlation can be achieved by having many weak features for the trees to select from. Note that the Weka default is 10 trees which each make use of int(log2(D) + 1) features, where D is the total number of features in the input data.

6.2.3 Naïve Bayes
Naïve Bayes is a probabilistic algorithm for supervised induction based on Bayes' rule [61]. It makes important simplifying assumptions that lead to efficient computation of the probabilities of each class label given the training data: 1) that the features are conditionally independent given the class, and 2) that no hidden attributes influence the prediction process. While these assumptions might seem restrictive, empirically Naïve Bayes achieves impressive performance on real-world data. Unlike many Naïve Bayes classifiers, the Weka implementation based on [61] does not assume that each continuous feature comes from a Gaussian distribution. Rather, it builds the distribution using a series of Gaussian kernels, one for each training example. This method is known as kernel density estimation. More formally,

$p(x = v \mid c) = \frac{1}{n_c} \sum_i g(v; \mu_i, \sigma_c)$

where i ranges over the training points of attribute x in class c, g is a Gaussian function, and $\mu_i = x_i$. The standard deviation is $\sigma_c = 1/\sqrt{n_c}$, where $n_c$ is the number of training instances in class c. The added space and time complexity required by this representation is made worthwhile by the fact that the classifier can model features that do not have a standard Normal distribution. For example, it can easily construct multimodal distributions for the probability density functions of the input features.

In the case where there are many weak features, none of which can distinguish between the classes, but the underlying probability of each class is higher for certain features, Naïve Bayes should show optimal performance [22]. However, this rests on the assumption that the features are independent of each other, which may not be the case with our data.

6.2.4 Logistic regression
Logistic regression (LR) is a method for modelling binary data in which a weight is learned for each input feature [96]. The equation for logistic regression is calculated using the sigmoid or logit function:

$p(X_i) = \frac{\exp(\beta \cdot X_i)}{1 + \exp(\beta \cdot X_i)}$

where $p(X_i)$ gives the probability that the sample $X_i$ is classified as 1, and $\beta$ is the vector of weights applied to the input data.
The weights are computed using the maximum likelihood estimate (MLE) of the log-likelihood of this equation. The time complexity of LR can be as high as O(nd³), where d is the number of features and n is the number of data points.

The implementation of LR in Weka is a slight modification of [25], which uses ridge regression to regularize the weights. Unregularized LR suffers from unstable parameter estimates, and will tend to overfit the data if the number of features is large compared to the number of data points. Ridge regression corrects for this problem by introducing a penalty term into the MLE of the parameters. The penalty is a constant, gamma, multiplied by the norm of the weight vector. This term shrinks the coefficients of the features toward 0, increasing the bias and decreasing the variance of the classifier, thus reducing overfitting [25].

6.2.5 Multilayer perceptron
A multilayer perceptron (MLP) is a feedforward neural network made up of nodes connected in layers, as shown in Figure 6.2 [130]. Each layer is fully connected to the next by a series of weighted edges [108]. Each node or neuron receives as input the values of all nodes in the previous layer, multiplied by their edge weights. The node then performs an activation function on this input, which is often sigmoidal in nature, as in logistic regression. The edge weights within the network are learned using backpropagation [130], in which the network's output is compared to the target output, and the differences are used to compute gradients for each weight. The weights are updated based on a percentage of the computed gradient, according to a learning rate parameter set for the algorithm. MLPs or artificial neural networks have been shown to be highly effective in classification of high-dimensional data [56].

Figure 6.2: Example MLP

6.2.6 Ensemble classification
As a final step, we combine the decisions of the best classifiers trained on each of our three data sources (gaze, distance from the screen, and EDA) in an ensemble [37]. For each data point, each classifier in the ensemble provides a decision. The decisions are pooled, and the ensemble classifier chooses the classification label that has the most votes. This is known as a majority vote, and has been shown to provide accuracy exceeding that of any of the members of the ensemble, provided each member has classification accuracy that is slightly better than chance [43]. We also performed tests of feature fusion, which involves using all of the available features from every data source to train one classifier. Although we used feature selection to reduce the number of features, we found that the results were considerably worse than those obtained with the ensemble.

6.3 Cross-validation
In order to make claims about how well an affect-adaptive system based on our model would perform for new users, we need to be able to test our model on data that has not been used in building it. Best machine learning practice typically involves reserving a dataset for model validation and model testing. However, training an effective model requires as much data as possible, and with so few participants in this study we cannot afford to remove enough records to form a separate testing dataset. In this case, a common approach is to rely on resampling techniques such as k-fold cross validation (CV) [75]. The process of k-fold CV involves splitting the data into k equal subsets called folds.
A classifier is trained on k-1 of the folds and tested on the remaining fold. This process is repeated k times, so that each fold is used as the testing dataset once. In this way, no classifier is ever tested on data which has been used to train it, but all the data can be used for training at least one of the classifiers.   A naïve approach to k-fold CV could result in testing sets with a very different distribution of class labels than the training set. In order to compensate for this problem, we use stratified random CV [75]. Not only do we randomize which data points are placed in which folds, but we ensure that an even distribution of class labels is present in each fold. This is accomplished by dividing the datapoints into two sets, one for each class, and randomly distributing each set evenly over the folds.   Research has shown that as opposed to more computationally expensive meth-ods such as leave-one-out CV (in which each data point is used as a fold) 10-fold CV is the most effective for model selection [75]. Therefore we use 10-fold CV for selecting features, tuning the parameters of the model, and finally training the classifier. For each of the 10 training folds, we perform feature selection and tune the parameters of the model by finding which features and which settings lead to the lowest training error. We then test those settings on the corresponding testing fold. At no point is testing data 44  used to select features or tune parameters, or are the selected features and settings shared between different training folds.  In the majority of our experiments (those for distance features, EDA features, and ensemble classification), we conduct 10 rounds of 10-fold CV, creating the training and testing sets randomly each time. This repeated procedure assures robust estimates of the performance for each test, since any variance relating to getting a ―lucky‖ or ―un-lucky‖ partitioning of the data into folds will be averaged out. We use the results ob-tained from one round of 10-fold CV, which describes the average performance of the 10 classifiers trained in that round, as a single data point in our statistical tests. Howev-er, for our experiments with eye gaze data, the nested wrapper feature selection pro-cess (which will be discussed in Chapter 7), is far more time consuming. Therefore we conduct only 5 rounds of 10-fold CV. This technique could lead to a reduction in statisti-cal power, but it is far more computationally tractable.  6.4 Results measures  We report classification results in terms of both accuracy (percentage of correctly classi-fied data points), as well as Cohen‘s kappa, a measure of classification performance that accounts for correct predictions occurring by chance [8]. Kappa scores are 1 when classification labels exactly match the ground truth values, and less than or equal to 0 if the predictions were no more accurate than chance. A good kappa score for trained human judges rating emotion might be .5 [41] or .6 [30], while a typical score for a ma-chine predicting emotion might be .3 [30] [59] [2] or .2 [15] [19]. 45  6.5 Statistical analysis In order to determine whether the differences found in our experiments are due to ran-dom chance, or real, replicable effects, we use General Linear Models (GLMs), often referred to as an ANOVA, or analysis of variance [38]. We conduct post-hoc analysis using Tukey‘s Honestly Significant Difference (HSD) test, which allows us to make mul-tiple pairwise comparisons without increasing the risk of Type I error [38]. 
The Tukey HSD test is significant when

$\frac{M_1 - M_2}{SE} \geq q_{crit}$

where $M_1$ and $M_2$ are the two means being compared, SE is the standard error of all the examples, and $q_{crit}$ is the critical value obtained from a Studentized Range Distribution table. This critical value increases with the number of comparisons being made, thus compensating for alpha inflation. The post-hoc analysis compares the levels of each factor in the GLM; for example, if 'classifier' is a factor in the model, Tukey's test would be used to compare each pair of classifiers to detect significant differences. We also use Tukey's HSD to compare the best result against the baseline for each test. In the case of an interaction effect, sometimes additional pairwise comparisons are required. In this event, we use t-tests with a Bonferroni correction, as this is known to be a more conservative procedure than Tukey's HSD [38].

7 Feature selection and overfitting
Due to the low cardinality and high dimensionality of our data, our classifiers will tend to overfit the data [52] without an effective feature reduction method. Overfitting occurs when the complexity of a model is so great that it begins to fit the idiosyncrasies of the particular data sample, rather than capturing a generalizable trend [10]. Figure 7.1 shows a decision boundary that is overfit to the particular dataset. It perfectly describes the training data, in that all of the examples belonging to the first class (represented by the green triangles) have been completely separated from those in the second class (red circles). However, the extremely complex line that has been drawn to accomplish this task is likely to generalize poorly to new data. The single green triangle that has fallen on the right side of the space is likely an outlier, yet the model has fit the decision boundary to that point. When a new data point appears in the right side of the space near that triangle, it may be classified as belonging to the first class, when in all likelihood it is a member of the second. In general, it is possible to detect when a model has overfit the data by examining the classification error on the training set (training error) and on the test set (generalization error). If the training error is very low, but the generalization error is high, then it is likely the model has overfit.

Figure 7.1: An overfit decision boundary

Our eye gaze dataset contains a large number of features (157), but very few data points (approximately 50-200, depending on the machine learning experiment, which are described in greater detail in Chapter 8). Models trained on this data are extremely prone to overfitting, because the high ratio of features to data points means it is easy to build an overly complex model that perfectly describes the training data, but cannot generalize to new users. For this reason, we seek an effective method to reduce the number of features used as input to our classification algorithms. This section will discuss various feature reduction techniques and the implications of overfitting.

Initially, we used wrapper feature selection to remove uninformative or redundant features. Unlike simple filter methods, which are classifier-independent and only examine the correlation between features and output labels, wrapper feature selection (WFS) is classifier-specific, and assesses which subsets of features will be most useful in combination with each other, by treating the classifier as a black box [50].
Greedy methods work well with wrapper feature selection because they can quickly search the space of all subsets and are robust to over-fitting [50]. In order to obtain more robust feature sets, we used 10-fold CV, conducting 10 rounds of wrapper feature selection and selecting only those features that appeared in more than 10% of the folds. This process was initially performed on the entire dataset, and treated as a separate step from training the classifiers. Therefore only one set of features was found for each clas-sifier and this same feature set was used in training and testing on each cross-validation fold. 7.1 Effects of testing data contamination and overfitting Unfortunately, our initial process which performed WFS on the entire dataset led to test-ing data contamination. When we performed 10-fold CV for classification, the data from each testing fold had already been involved in selecting the features that the classifiers were using. It would appear that this methodological error is somewhat commonplace in the affective computing community; many researchers appear to perform feature selec-tion on the entire dataset as a preliminary step, or at least fail to mention using cross-validation or setting aside a validation set in order to select features (e.g. [58]). Although it is acknowledged that feature selection performed on the entire dataset could artificial-ly inflate classification results, general opinion seems to be that this effect is not very severe. In our research we have found evidence that strongly contradicts this assumption. To correct for our initial error, we modified the wrapper feature selection (WFS) method in order to prevent testing data contamination. Rather than using the whole dataset to perform 10-fold cross-validated WFS, we conducted this process on each training set used in classification. For each of the CV folds used in classification, we perform WFS on the training data only. The training set is subdivided into a further 10 ―nested‖ sub-48  folds, which are used to conduct our original 10-fold cross-validated WFS process. This method is known as nested cross validation. As before, we select those features that appear in more than one of the folds, which are in this case the nested sub-folds. This process produces a different set of features for each of the ten training sets. The com-puted feature sets are then used as input to the classifiers, which are trained on the same training data used to select the features, and tested on novel data in the test set. In this way, the data from the testing set was not involved in the feature selection pro-cess.  The effect of this correction on our results was extreme. For one set of tests, our accuracy in predicting boredom dropped by an average of 11.57% (SD = 5.57), while the accuracy in predicting curiosity dropped by an average of 10.24% (SD = 4.47). Worse yet, a second test saw accuracies drop by an average of 21.33% (SD = 8.19) for boredom, and 19.21% (SD = 6.86) for curiosity. This dramatic difference can be seen in Figure 7.2 and Figure 7.3, where solid lines represent the results obtained by improperly performing feature selection and contaminating the testing data, and the dashed lines represent the new, more rigorous feature selection process. The difference in results here can be seen as an extreme example of overfitting; the selected features were able to perfectly describe the specific quirks of our dataset, but could not generalize to un-seen data [10]. 
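To make the corrected, nested procedure concrete, the sketch below shows the general pattern: feature selection runs inside each outer training fold, with its own inner cross-validation, so the outer test fold never influences which features are chosen. It is a simplified illustration using scikit-learn's greedy forward selection on synthetic data (an assumption of this example), not the Weka-based wrapper feature selection pipeline used in the thesis, which instead keeps features that appear in more than one nested sub-fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the gaze dataset: few data points, many features.
X, y = make_classification(n_samples=100, n_features=30, n_informative=5, random_state=0)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in outer.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Feature selection uses the training fold ONLY, scored with its own inner CV.
    clf = LogisticRegression(max_iter=1000)
    selector = SequentialFeatureSelector(clf, n_features_to_select=3,
                                         direction="forward", cv=10)
    selector.fit(X_train, y_train)

    # Train on the selected features and evaluate on the untouched test fold.
    clf.fit(selector.transform(X_train), y_train)
    accuracies.append(clf.score(selector.transform(X_test), y_test))

print("Nested-CV accuracy: %.3f" % np.mean(accuracies))
```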
It is clear that at least for our dataset, attention to proper feature selection methodology is extremely important for the generalizability of the results.

Figure 7.2: Difference in boredom accuracy results when performing feature selection on the entire dataset (solid lines), or using nested cross validation (dashed lines)

Figure 7.3: Difference in curiosity accuracy results when performing feature selection on the entire dataset (solid lines), or using nested cross validation (dashed lines)

Although the results of our more rigorous feature selection procedure are not as impressive, they have the important characteristic of being representative of the results we would achieve on novel data – in other words, they are generalizable. Our research question pertains to the performance we can expect from an affect detection system trained on gaze, and the new results give a more realistic answer. However, these results do highlight the importance of a proper feature selection technique in predicting affect from gaze. Therefore, we decided to investigate several other feature reduction techniques, which are described in the next section.

7.2 Comparison of feature selection techniques
In addition to Wrapper Feature Selection (WFS), we examined several methods for reducing the high dimensionality of our feature set. As described above, we prevent contamination of testing data by incorporating the feature selection process into the standard CV process we used to construct our classifiers.

One common feature-reduction technique that we tried is Principal Component Analysis (PCA), which looks for existing structure in the feature space by seeking underlying components that explain subsets of the features. PCA finds highly correlated subsets of features, and creates new components from a combination of those features, effectively reducing the dimensionality of the data [38]. The computational complexity of PCA is significantly less than that of WFS.

Researchers have investigated the effectiveness of manually selecting features using domain knowledge [110], or using a priori knowledge from previous research [36]. For these reasons, we also attempted to perform feature selection by hand, by removing features that were conceptually related to one another. For example, we removed features relating to the number of gaze transitions between two AOIs, if the proportion of transitions was also a feature. We expected this method would improve our results, because although the wrapper feature selection process is designed to remove correlated features, it might inconsistently choose between feature types in different folds in the nested CV process. Note that PCA would also address this problem.

Finally, we investigated a random projection technique called the Fast Johnson-Lindenstrauss Transform [1]. This method is based on the famous Johnson-Lindenstrauss (JL) lemma [62], which states that an arbitrary dataset of dimensionality d can be projected into a feature space of dimension k << d such that the lengths of vectors and the distances between them can be approximately preserved in Euclidean space. This is accomplished simply by multiplying the original matrix of data points by a matrix with entries sampled from a standard Normal distribution. More formally, the lemma states that

$(1 - \epsilon)\lVert x - y \rVert^2 \leq \lVert F(x) - F(y) \rVert^2 \leq (1 + \epsilon)\lVert x - y \rVert^2$

where F(x) is the result of applying the JL transform to the vector x, and $\epsilon$ is a small error term introduced into the Euclidean norms under the transformation. The Fast JL Transform (FJLT) computes a JL embedding efficiently, using a sparse transformation matrix and a Fast Fourier Transform.
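As an illustration of the lemma, the sketch below applies the plain (dense Gaussian) JL construction described above to a synthetic stand-in for the gaze feature matrix; the FJLT achieves the same guarantee more efficiently with a sparse, structured matrix, which is not reproduced here.

```python
import numpy as np

def gaussian_jl_projection(X, k, seed=0):
    """Project an (n x d) data matrix to k dimensions with a random Gaussian matrix.

    The 1/sqrt(k) scaling preserves squared Euclidean norms in expectation."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ R

# Synthetic stand-in: 50 data points with 157 gaze features, projected to 30 dimensions.
X = np.random.default_rng(1).standard_normal((50, 157))
Z = gaussian_jl_projection(X, k=30)
print("original distance:  %.2f" % np.linalg.norm(X[0] - X[1]))
print("projected distance: %.2f" % np.linalg.norm(Z[0] - Z[1]))
```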
This is accomplished simply by multiplying the original matrix of data points by a matrix with entries sampled from a standard Normal distribution. More formally, the lemma states that for any two points x and y in the dataset,

\[(1 - \varepsilon)\,\lVert x - y \rVert^{2} \;\leq\; \lVert F(x) - F(y) \rVert^{2} \;\leq\; (1 + \varepsilon)\,\lVert x - y \rVert^{2}\]

where F(x) is the result of applying the JL transform to the vector x, and ε is a small error term introduced into the Euclidean norms under the transformation. The Fast JL Transform (FJLT) computes a JL embedding efficiently, using a sparse transformation matrix and a Fast Fourier Transform.

The results of testing the different feature selection techniques were mixed. The performance of the manually created feature set was not notably better than performing no feature selection at all. However, the other three methods (PCA, FJLT, and WFS) all led to markedly higher performance. The FJLT random projection had surprisingly high performance, in some cases dramatically surpassing other methods. This is interesting given the fact that FJLT does not use any information about the dataset or classifiers to reduce the dimensionality of the feature set. In contrast, PCA looks for underlying structure within the data, and WFS takes the performance of each classifier into account when selecting features. However, the performance of FJLT was highly unreliable, in some cases showing accuracy that was approximately 20% worse than the other methods, including no feature selection. For this reason we do not feel it is a good choice of feature reduction technique for an affect-adaptive system. Both WFS and PCA offer consistently good results. The methods are comparable, in that for some tests PCA offered better performance, while for others it was WFS. We choose to focus on WFS for the remainder of the paper, because the features it selects are more interpretable and intuitively understandable than the components generated by PCA.

8 Eye tracking results

In this section we present the results of several classification experiments. We begin by training classifiers using all available self-reports regardless of when they were generated, and gaze features computed using various time intervals preceding each report. We discuss the features chosen as most predictive by WFS in this classification task. Next we report results on training separate classifiers for each of the four self-report periods, to ascertain whether time of self-report affects classification accuracy.

The first three sections present results related to using the seven detailed AOIs (Text Content, Image Content, Overall Learning Goal (OLG), Subgoals, Learning Strategies Palette (LSP), Agent, and Table of Contents (TOC)). The final portion of this chapter (Section 8.4) presents the results of using the compressed AOI representation (Content, Goals, Learning Tools, TOC, and Clock). For more details about the two representations please see Section 4.2.1.

8.1 Effect of time interval on affect prediction using gaze

As part of this experiment, we wanted to ascertain how much of the interaction is needed to reliably predict affect. Many studies make use of an interval of 20 seconds for affect labeling [49]. In a study of the same dataset, Harley et al. [51] used only 10 seconds of facial expression data preceding the self-report to analyze whether the emotions expressed were congruent with those reported. Therefore we wished to determine the optimal observation window length; that is, the amount of gaze data preceding the self-report that should be used for prediction.
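To make the notion of an observation window concrete, the sketch below computes a few aggregate gaze statistics from only the fixations falling in a trailing window before one self-report. The column names and the handful of statistics shown are illustrative stand-ins; the feature set actually used in this work is considerably larger.

```python
import pandas as pd

def window_features(fixations: pd.DataFrame, report_time: float, window_s: float) -> dict:
    """Aggregate gaze statistics over the window_s seconds preceding one self-report.

    Assumes columns 'start' (seconds since session start), 'duration' (s) and 'aoi'.
    """
    w = fixations[(fixations['start'] >= report_time - window_s) &
                  (fixations['start'] < report_time)]
    if w.empty:                   # participant had no valid gaze data in this window
        return {}
    return {
        'num_fixations': len(w),
        'mean_fixation_duration': w['duration'].mean(),
        'fixation_rate': len(w) / window_s,
        'prop_time_on_image': w.loc[w['aoi'] == 'Image', 'duration'].sum()
                              / w['duration'].sum(),
    }

# A toy fixation log for one participant (synthetic values).
fix_df = pd.DataFrame({
    'start':    [880.0, 884.2, 890.5, 897.1, 899.3],
    'duration': [0.31, 0.22, 0.45, 0.28, 0.19],
    'aoi':      ['Text', 'Image', 'Text', 'TOC', 'Image'],
})

print(window_features(fix_df, report_time=900.0, window_s=14 * 60))  # 100% window (14 min)
print(window_features(fix_df, report_time=900.0, window_s=8.4))      # 1% window (8.4 s)
```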
We tested window lengths ranging from 100% of the available data (14 minutes) to 1% (8.4 seconds). We used a 5 (classifier) x 6 (window length) General Linear Model (GLM) to analyze the results, treating the accuracy (or kappa score) obtained for one round of 10-fold CV as a single data point. We include the baseline as a classifier in the ANOVA in order to test whether the other classifiers significantly surpass it at any of the windows. Note that the baseline corresponds to the accuracy expected from always guessing the most frequently occurring class: a majority-class baseline. For this reason, the baseline depends on the distribution of class labels in the data, which may change as the window length gets smaller. This is because not all participants have valid gaze data for some of the smaller time intervals, and their self-reports are thus removed from the analysis at those windows, changing the baseline. The number of data points available for each emotion at each time window, in decreasing order of window length, are 204, 203, 203, 198, 170, and 112.

In total, we ran four GLM models, one with each of boredom accuracy, boredom kappa, curiosity accuracy, and curiosity kappa as the dependent variables. We apply a Bonferroni correction to adjust for family-wise error, and report the significance values after the adjustment has been applied. In cases where the kappa results are consistent with those found for accuracy, we present only the accuracy results. We will present the results for each emotion separately.

8.1.1 Boredom

Figure 8.1: Boredom accuracy as a function of the amount of interaction time used to train the classifiers

Figure 8.1 shows the results of different window lengths on the accuracy of each classifier (the kappa scores followed a similar trend; the kappa figure is included in the Appendix). The GLM results for boredom are shown in Table 8.1. There is a main effect of window length, indicating that the length of interaction used to compute gaze features is an important consideration when designing an affect-sensitive MetaTutor. There are three windows at which the best classifier achieves accuracy which significantly exceeds the baseline: 14 minutes, 10.5 minutes, and 1.4 minutes. However, the best window lengths are the longest; windows of 10.5 minutes and 14 minutes achieved the best results, and were significantly better than the three smallest windows. These results are interesting, given the fact that a large body of previous affect prediction research has focused on using a 20 second interval for affect labeling [49]. Although certain classifiers (like RF) can still achieve good performance with a small interval of data, in general it seems that more gaze data generates better results. Therefore this study provides empirical evidence that a 20 second interval may not always be appropriate for predicting boredom, depending on the data under investigation.
Table 8.1: Effects of window size on classifying boredom using gaze features

Outcome Measure    Effect              F-Ratio            Effect Size   Sig. Value
Boredom accuracy   Window              F(5,96) = 13.012   η2 = .404     p < .001
                   Classifier          F(4,96) = 8.670    η2 = .265     p < .001
                   Window*Classifier   F(20,96) = 5.301   η2 = .525     p < .001
Boredom kappa      Window              F(5,96) = 10.441   η2 = .352     p < .001
                   Classifier          F(4,96) = 8.776    η2 = .268     p < .001
                   Window*Classifier   F(20,96) = 5.359   η2 = .528     p < .001

There is also a main effect of classifier, which is qualified by an interaction effect of larger effect size (see Table 8.1). The interaction effect results from the fact that some classifiers significantly exceed the baseline at certain windows, but not others. The best result of 57.38% (kappa = .139) was obtained with the SVM classifier at a window of 14 minutes, and this significantly exceeds the majority-class baseline for this window, t(4) = 6.94, p < .01. Therefore we can conclude that gaze data contains enough information for the classifiers to build a model which can predict boredom with accuracy that significantly exceeds guessing, provided that the right amount of data is used to make the predictions.

8.1.2 Curiosity

Figure 8.2: Curiosity accuracy as a function of the amount of interaction time used to train the classifiers

Figure 8.3: Curiosity kappa as a function of the amount of interaction time used to train the classifiers

The accuracy and kappa scores obtained for curiosity are shown in Figure 8.2 and Figure 8.3, while the results of the GLM are given in Table 8.2. Once again we see that the window length has a strong and significant effect on both accuracy and kappa scores. The longest window of 14 minutes achieved the best accuracy (M = 61.16%, SD = 2.84), which was significantly better than all other windows. However, for kappa scores, the best window of 14 minutes (M = .133, SD = .069) was statistically equivalent to the second best window of 8.4 seconds (M = .129, SD = .074). This effect is due to the fact that the baseline accuracy is lower at the 8.4 second window, which affects the kappa scores because they are a measure of how much the predictions exceed chance (the baseline). For both the 14 minute and 8.4 second windows, the best classifier was able to achieve accuracy which significantly exceeded the baseline.

Table 8.2: Effects of window size on classifying curiosity using gaze features

Outcome Measure      Effect              F-Ratio            Effect Size   Sig. Value
Curiosity accuracy   Window              F(5,96) = 14.071   η2 = .423     p < .001
                     Classifier          F(4,96) = 5.073    η2 = .174     p < .005
                     Window*Classifier   F(20,96) = 3.230   η2 = .402     p < .001
Curiosity kappa      Window              F(5,96) = 20.016   η2 = .510     p < .001
                     Classifier          F(4,96) = 3.808    η2 = .137     p < .05
                     Window*Classifier   F(20,96) = 2.917   η2 = .378     p < .001

There is also a main effect of classifier. As with boredom, this main effect is qualified by an interaction effect of greater magnitude, which results from the fact that the classifiers exceed the baseline only at a window of 14 minutes. Although the previous findings on window length suggested that an 8.4 second window did not have significantly worse kappa scores than the 14 minute window, we found that no classifier had accuracy exceeding the baseline at this small window length.
In contrast, the best results of 63.9% (kappa = .215), obtained with a 14 minute window and the RF classifier, were significantly better than the baseline, t(4) = 4.107, p < .05. For this reason we conclude that for curiosity, as well as boredom, a long interval of gaze data is most effective. Since the classifiers can exceed the majority-class baseline, we conclude that gaze data can provide enough information to predict curiosity in MetaTutor, provided a long enough interval of data is used.

8.2 Analysis of eye-tracking features

In this section we examine the features that were selected by the WFS process. We focus on the features selected most frequently for the windows that achieved the best performance, reasoning that these features must have been the most informative.

The first trend we found evidence for is that bored students do not make use of the image AOI. Compared to students who do not indicate feeling bored, bored students spend a smaller proportion of time looking at the image, have fewer fixations on the image, have fewer gaze transfers within the image AOI, and have fewer transfers from the text to the image. These features are depicted in Figure 8.4, where arrows indicate gaze transitions and circles indicate features related to the AOI itself (circle size increases with the number of features found).

Figure 8.4: Bored students show less attention on the image

Bored students also do not attend to the Overall Learning Goal (OLG) as much as their non-bored counterparts. Their maximum fixation length on the OLG is shorter, and they have fewer image-to-OLG, text-to-OLG, and OLG-to-subgoals transfers.

Figure 8.5: Bored students show less attention on the OLG

While bored students may not be paying attention to the image and OLG, curious students are not paying attention to the Agent AOI. As compared with students who do not report feeling curious, curious students have a shorter average fixation time on the Agent. They also transfer their gaze less frequently between the Agent and the image; they have fewer image-to-Agent and Agent-to-image transfers. This finding is especially interesting given results from [6] obtained with the same dataset, which showed that the Agent was the only AOI not predictive of learning gains.

Figure 8.6: Curious students show less attention on the Agent

Finally, we found that while curious students attend to the Table of Contents (TOC), bored students do not. Curious students have an increased fixation length on the TOC, and more TOC-to-TOC transfers. Conversely, bored students have a lower fixation rate within the TOC, fewer OLG-to-TOC transfers, and fewer Learning Strategies Palette (LSP)-to-TOC transfers. This may suggest that use of the TOC is indicative of engagement, insofar as students who are curious may be actively using the TOC to seek out new learning material, while bored students do not.

8.3 Effects of self-report time on affect prediction

In our initial tests, the data from all self-reports was pooled together and used in the training set with no indication of the point during the interaction when the report occurred. In this section we treat each self-report time as its own classification task, to see if and how this timing information affects prediction accuracy. In other words, we predict the 1st self-report values using only eye gaze data from approximately 1-15 minutes into the interaction, the 2nd using minutes 16-30, etc.
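A minimal sketch of this per-report setup is shown below, with synthetic values standing in for the gaze features and labels; the real experiments also rerun the nested wrapper feature selection described in Chapter 7 within each report's subset, which is omitted here for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: one row per self-report (4 reports x 51 participants),
# gaze features from the 14 minutes preceding that report, and a binary label.
X = rng.normal(size=(204, 40))
y = rng.integers(0, 2, size=204)
report_idx = np.tile([1, 2, 3, 4], 51)   # which of the four reports each row belongs to

for report in (1, 2, 3, 4):
    mask = report_idx == report          # ~51 data points per report
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X[mask], y[mask],
                             cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                             scoring='accuracy')
    print(f"report {report}: mean 10-fold accuracy {scores.mean():.3f}")
```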
Previous work has found it benefi-cial to train separate classifiers for distinct phases of the tutoring interaction [89], and we would like to investigate whether that approach is viable here. Based on the findings from the previous section, we use the full 14 minute interval preceding each report to compute the gaze features. This leaves a total of 51 data points for each report time. Finally, we use a similar 5 (classifier) x 4 (report time) General Linear Model to analyze the data. We will group our discussion of the results into sections for each emotion be-ing predicted. Figure 8.7: Curious students (but not bored ones) attend to the TOC 59  8.3.1 Boredom  Figure 8.8: Performance of boredom classifiers for each report  The results of predicting boredom at each self-report are shown in Figure 8.8, and the statistical analysis is presented in Table 8.3.  Table 8.3: Effects of report time on classifying boredom using gaze features Outcome Measure Effect F-Ratio  Effect Size Sig. Value Boredom accuracy Report F(3,64) = 6.390 η2 = .231 p < .005 Classifier F(4,64) = 11.355 η2 = .415 p < .001 Report*Classifier F(12,64) = 6.499 η2 = .549 p < .001 Boredom kappa Report F(3,64) = 6.102 η2 = .222 p < .005 Classifier F(4,64) = 12.173 η2 = .432 p < .001 Report*Classifier F(12,64) = 5.905 η2 = .525 p < .001  There is a main effect of report time, which suggests that classification is easier at some self-report times than others. Classification at self-report three shows higher per-3540455055606570751 2 3 4% Accuracy Self Report Boredom - Accuracy by Report Time Logistic RF NaiveBayesBaseline SVM60  formance across both measures, although both self-report 3 and 4 produced results that were significantly better than the baseline. The variance in the ability to detect boredom across reports suggests that there is a more consistent relationship between gaze pat-terns and boredom at different times during the MetaTutor interaction. There is a main effect of classifier, however there is no classifier that consistently outperforms the others for both accuracy and kappa. There is an interaction between classifier and report time that modulates the main effect. While the classifiers do not significantly exceed the baseline at every self-report, the interaction indicates that at some self-reports this difference does reach significance. Specifically, boredom predic-tion accuracy reaches a height of 69.0% (kappa = .379) at self-report 3 with the Naïve Bayes classifier, significantly surpassing the majority-class baseline, t(4) = 8.634, p < .001. It is also able to significantly exceed the baseline at self-report four.  It is interesting to note that the aver-age results obtained for boredom by re-stricting focus to a single self-report are markedly higher than those obtained when all report times are classified together, as in the previous section. Overall, the results of this section seem to indicate that the rela-tionship between gaze and boredom varies with progress through MetaTutor. If this were true, we would expect that different gaze features would be more informative at different report times. Indeed, we examined the features chosen by WFS for each re-port, and found that there was considerable variability (see Figure 8.9). We have grouped the features into categories depending on whether they pertain to a specific AOI, or are application-independent statistics relating to fixations or saccades. 
The rele-vant features change along with time spent with MetaTutor; for example, the image con-tent becomes decreasingly relevant for predicting boredom over time. The fact that the features change depending on the amount of time that has passed demonstrates that different patterns of behavior are indicative of boredom at different times.  Figure 8.9: The features selected by WFS change with the self-report time 61  Perhaps this relates to the fact that different content is displayed in the MetaTutor environment at different times. Although different students do not necessarily view the same content in the same order, it could be that the pedagogical content displayed af-fects the meaning of the relationship between certain gaze features and affective states. For example, if the student pays little attention to the image at the beginning of the in-teraction, this may not be a problem, as the first part of the interaction involves setting learning goals. However, once the student has set goals and is beginning to browse through the learning material, if she is still paying little attention to the image it may indi-cate that she is bored and not engaged with the learning material. A future research di-rection would be to look at combining gaze features with interaction logs, to help uncov-er this type of context-dependent effect. The results in this section suggest that it may be worthwhile to look at training classifiers specific to particular phases of the interaction with MetaTutor (e.g. the setting subgoals phase).  8.3.2 Curiosity  Figure 8.10: Performance of curiosity classifiers for each report As Figure 8.10 shows, training the classifiers on each self-report separately impaired the ability to predict curiosity. There were no classifiers that offered performance signifi-62  cantly exceeding the baseline in terms of accuracy or kappa score at any report. A sim-ple explanation for this phenomenon is that in the individual self-report tests we restrict the dataset to 1/4th its original size, leaving us with little data with which to perform ma-chine learning. It appears that for curiosity, the benefits obtained by restricting focus to a single self-report are not outweighed by the lack of data. For this reason we present the rest of the GLM results in the Appendix 8.4 Results of compressed areas of interest In this section we discuss the results obtained using gaze features extracted from fewer AOIs. Rather than the seven AOIs described above, we now describe results obtained with 5 compressed AOIs, which are described in detail in Section 4.2.1. The new AOIs encompass essentially the same regions, except that they include an additional Clock AOI. Note that since this section uses the same raw data as the previous section on de-tailed AOIs, we have the same number of data points in each experiment. In order to assess whether there are significant advantages to choosing one AOI representation over the other, we include an addition factor in our statistical models: AOI set. We con-duct both the window length and individual report tests (as in the previous sections) with the new AOIs, in order to determine if our previous findings still hold. 8.4.1 Effects of window length with compressed AOIs As in Section 8.1, we test window lengths ranging from 14 minutes to 8 seconds. We analyse the results with a 2 (AOI type) x 4 (classifier) x 6 (window) GLM. We will group our discussion of the results into sections based on the emotion predicted. 
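The kind of factorial GLM used throughout these analyses can be sketched with statsmodels: fit an ordinary least squares model containing the factors and their interactions, then read off the ANOVA table. The accuracy values below are synthetic placeholders; in the real analyses each row is the accuracy (or kappa) obtained from one round of 10-fold CV.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
rows = []
# One synthetic accuracy per CV round for every AOI set x classifier x window cell.
for aoi, clf, window in itertools.product(['detailed', 'compressed'],
                                          ['Logistic', 'RF', 'NaiveBayes', 'SVM'],
                                          [14, 10.5, 7, 3.5, 1.4, 0.14]):
    for _ in range(2):                   # two rounds of CV per cell
        rows.append({'aoi_set': aoi, 'classifier': clf, 'window': window,
                     'accuracy': 55 + rng.normal(scale=3)})
df = pd.DataFrame(rows)

# 2 (AOI set) x 4 (classifier) x 6 (window) GLM on the accuracy scores.
model = smf.ols('accuracy ~ C(aoi_set) * C(classifier) * C(window)', data=df).fit()
print(anova_lm(model, typ=2))            # F-ratios and p-values for each effect
```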
63  8.4.1.1 Boredom  Figure 8.11: Boredom prediction accuracy achieved by the best classifiers using detailed AOIs (blue) vs. the best classifiers using compressed AOIs (red) The best results obtained for each AOI representation are shown in Figure 8.11, in which classifiers trained using the detailed AOI representation are shown as blue lines, and the compressed AOI classifiers are shown as red lines. In order to simplify the presentation, we do not show all of the classifiers that were tested. Rather, we focus on those classifiers that achieved the highest accuracies and exceeded the majority-class baseline. The statistical analysis, however, was conducted using all the data, and the significant effects that were detected are presented in Table 8.4. Note that exactly the same effects were detected for both accuracy and kappa.   There is a main effect of window length. The two longest window lengths (14 and 10.5 minutes) are significantly better than every other window length, and produce re-sults significantly better than the baselineThis replicates the findings from the previous section, which indicated that a long interval of time is most appropriate when computing gaze features. However, unlike with the previous representation, a window of 8.4 se-conds is also able to produce accuracy significantly exceeded the baseline using com-pressed AOIs.   35404550556065701 0.75 0.5 0.25 0.1 0.01% Accuracy Window Length Boredom - Detailed vs. Compressed AOIs RF-7AOIsSVM-7AOIsLogistic-5AOIsSVM-5AOIsBaseline64  Table 8.4: Effects of window on classifying boredom using compressed AOIs Outcome Measure Effect F-Ratio  Effect Size Sig. Value Boredom accuracy Window F(5,192) = 17.17 η2 = .309 p < .001 Classifier F(4,192) = 5.70 η2 = .106 p < .001 Window*Classifier F(20,192) = 6.27 η2 = .395 p < .001 AOI set*Classifier F(4,192) = 5.97 η2 = .111 p < .001 AOI set*Classifier*Window F(20,192) = 2.23 η2 = .189 p < .05 Boredom kappa Window F(5,192) = 14.14 η2 = .269 p < .001 Classifier F(4,192) = 5.41 η2 = .101 p < .001 Window*Classifier F(20,192) = 6.48 η2 = .403 p < .001 AOI set*Classifier F(4,192) = 6.03 η2 = .112 p < .001 AOI set*Classifier*Window F(20,192) = 2.18 η2 = .185 p < .05   The main effect of classifier is weak, and heavily qualified by several interaction effects involving classifier. Once again we see an interaction effect between classifier and window length with a large effect size, indicating that the choice of window length has a strong impact on whether the classifiers achieve accuracy exceeding the base-line.   Although there is no main effect of AOI representation, this factor is involved in two interaction effects. The first is an interaction between the AOI representation and classifier. As is evident in Figure 8.11, different classifiers give the best results, depend-ing on the AOI set. For example, RF achieves excellent results when using the detailed representation, but it cannot offer competitive performance with SVM and LR when the compressed AOIs are used. Perhaps this is because RF is better equipped to deal with large feature sets [22]. The second interaction effect involves all three factors: AOI rep-resentation, classifier, and window length. There are significant differences in classifica-tion performance between the two AOI representations, but only at certain windows with certain classifiers. For example, the compressed representation significantly exceeds the detailed representation at a window of 8 seconds. 
Further, the compressed repre-sentation offers better peak performance, reaching a height of 58.12% (kappa = .159) with LR at a window of 14 minutes (although this difference does not reach signifi-cance). 65  In light of the fact that the compressed AOI representation has fewer features it is therefore also less computationally expensive. Although the two representations have comparable performance for boredom prediction, in terms of computational efficiency the simpler representation may be the better choice. 8.4.1.2 Curiosity  Figure 8.12: Curiosity prediction accuracy achieved by the best classifiers using detailed AOIs (blue) vs. the best classifiers using compressed AOIs (red) Table 8.5: Effects of window on classifying curiosity using compressed AOIs Outcome Measure Effect F-Ratio  Effect Size Sig. Value Curiosity accuracy Window F(5,192) = 10.40 η2 = .213 p < .001 Classifier F(4,192) = 7.34 η2 = .133 p < .001 Window*Classifier F(20,192) = 6.38 η2 = .399 p < .001 Window*AOI set F(5,192) = 5.83 η2 = .132 p < .001 Curiosity kappa Window F(5,192) = 22.37 η2 = .368 p < .001 Classifier F(4,192) = 11.63 η2 = .195 p < .001 Window*Classifier F(20,192) = 6.21 η2 = .393 p < .001 Window*AOI set F(5,192) = 6.19 η2 = .139 p < .001   35404550556065701 0.75 0.5 0.25 0.1 0.01% Accuracy Window Length Curiosity - Detailed vs. Compressed AOIs RF-7AOIsNaiveBayes-7AOIsNaiveBayes-5AOIsBaseline66  Figure 8.12 shows the classifiers that achieved the best accuracy exceeding the base-line for predicting curiosity. The significant effects found via the GLM are given in Table 8.5, and once again they are consistent for both accuracy and kappa scores.  As before there are main effects of both classifier and window length, but the ef-fect of window length is slightly different than our previous results. In this case, the best window lengths are the two longest (14 and 10.5 minutes), as well as the shortest (8.4 seconds). For kappa scores, the 8.4 second window is significantly better than all oth-ers. This difference from the previous results is due to the fact that the compressed AOI representation does not offer performance exceeding the baseline at any window other than 8.4 seconds, as can be seen in Figure 8.12. It seems that the best choice of win-dow length depends on AOI representation for curiosity.  There is an interaction effect of AOI set and window length, which indicates that there are significant differences between the two AOI sets, but only at certain window lengths. One example is a window length of 14 minutes; the detailed representation ob-tained an average of 61.16% (SD = 2.78) for this window, which was significantly better than the compressed representation‘s 55.89% (SD = 3.19), t(19) = 5.575, p < .001. It appears that when a full interval of 14 minutes is used, the detailed AOI representation contains information that is necessary to predict curiosity with accuracy exceeding the baseline, while the compressed representation does not. The same interaction between window and classifier as described in the previous section was found again here.  The peak accuracy overall was 63.89% (kappa = .215), and was obtained with the detailed representation and a window of 14 minutes. The compressed AOI repre-sentation cannot offer this type of performance, and it is also significantly worse than the detailed representation at this window. 
Therefore the additional complexity inherent in the detailed representation may be necessary in order to achieve the best curiosity prediction performance, although the compressed representation can offer accuracy exceeding the baseline at certain windows. 8.4.2 Important features for compressed AOIs Using the same process as described in section 8.2, we analyzed which features were selected most frequently by the wrapper feature selection process. We focus only on those window lengths that show the best performance with compressed AOIs (14 67  minutes for boredom and 8.4 seconds for curiosity), reasoning that features extracted using these window lengths are the most informative.  Interestingly, we found that for both boredom and curiosity, the most frequently selected feature was generic, and did not relate to a specific AOI. Therefore these pat-terns could potentially generalize to any system. For boredom, the most frequently se-lected feature was the mean absolute path angle, which is the angle between a saccade and the horizontal plane. We found that it was smaller for bored students, suggesting they tend to make more horizontal eye movements. For curious students, the most fre-quently selected feature was the mean relative path angle, or the angle between two consecutive saccades. It was greater for curious students, and also smaller for bored students, suggesting that students who are engaged appear to have less linear gaze patterns. Other generic features were selected as well, and the trends related to them are as follows: bored students had more fixations, while curious students had a longer mean fixation duration and a shorter fixation rate, meaning that curious students attend to one point for longer, while bored students change their fixation point frequently. Final-ly, curious students have a smaller saccade length standard deviation, which shows that the distance between the points they fixate on doesn‘t vary as greatly as it does for stu-dents who do not report feeling curious.   Confirming our findings from section 8.2, we find once again that curious students are making more use of the Table of Contents (TOC). The longest fixation on the TOC was far greater for curious students, as was the time to the last fixation. This shows that curious students spend more time attending to the TOC, and check it more frequently. Further, curious students spent a higher proportion of time on the TOC, and had a high-er number and proportion of gaze transfers from the TOC to the content. This is likely because they are actively using the table to navigate through the content in order to learn the material. However, we also found that bored students had more transfers back and forth between the TOC and the content. This finding is difficult to interpret, although it could be because students who are bored with the current content use the TOC to change it, look at the new content, and change it again if they are still bored. Adding information on student actions may help clarify this point. These overlapping findings demonstrate both that boredom and curiosity cannot be considered opposite emotions 68  in this study, but also the difficulty inherent in using gaze alone to distinguish between different emotions.  Yet in general, the patterns detected in the frequently selected features were con-sistent between boredom and curiosity. 
For instance, we found that both bored students and students who were not curious (hereafter referred to as disengaged students) at-tended frequently to the Learning Tools (LT) AOI, which includes the Agent. We found that while bored students had a larger number of transfers from the LT back to the LT again, this same statistic was smaller for curious students (as compared to not curious students). Bored students also had a higher number of fixations on the LT in general. We do not have the evidence to claim that the learning tools in MetaTutor are generat-ing feelings of boredom in students; rather, these students may simply be allowing their gaze to wander over the interface, and their interest in the LT could be viewed as at-tending to ‗seductive details‘ rather than the text [104]. The features selected from the detailed AOI representation showed that curious students fixated less on the Agent (a component of the LT AOI), but did not detect the opposite trend related to bored stu-dents attending more to elements of the LT. It appears that collapsing the Agent and Learning Strategies Palette into the LT AOI made these trends more prominent, and thus relevant for classification.  Another consistent finding was that engaged students (curious and not bored) at-tend to the content (which includes both the text and image). Curious students spend a higher proportion of fixations on the content, while bored students spend a smaller pro-portion of time on the content, and transfer gaze from the content to another point within the content AOI less frequently. This intuitive finding demonstrates that curious students are attending to the learning material more than bored students. Although section 8.2 had several findings that related to curious students attending to the text and image, the compressed AOIs have allowed us to disambiguate the many detailed features chosen previously, and detect a clear and obvious engagement pattern. Finally, we found that features related to the clock were chosen as important, showing that including the clock AOI was in some sense useful, even if it did not lead to significant performance gains at all windows. Actually, those classifiers that showed the best performance with the compressed AOIs (SVM and LR) were also those for which 69  clock AOI features were selected most frequently. However, the findings are somewhat counter-intuitive. The longest fixation on the clock was found to be shorter on average for bored students. Further, curious students had more transfers from the content to the clock, while bored students had fewer transfers from the TOC to the clock. This seems to suggest that it is engaged students who actually make the most use of the clock. In looking frequently from the TOC or the content to the clock, perhaps they are trying to manage their time in order to learn the material they need to complete their subgoals. Although this is conjecture, perhaps it is evidence of the type of management of the learning process expected from self-regulated learners.  Figure 8.13: Depiction of gaze trends detected with compressed AOIs, for en-gaged (yellow) and disengaged (red) students In summary, the findings we have discussed above are shown in Figure 8.13. As before, the width of the lines and circles indicates the number of findings related to this trend. 
In general the patterns are consistent with those detected using the detailed AOI representation; engaged students attend to the TOC and content (specifically, the im-age), while disengaged students attend to the LT (or the Agent). The findings related to the clock are obviously new. In contrast with the detailed AOIs, here we did not find that features related to the goals AOI were selected frequently. Perhaps this is because the findings related to the subgoals vs. the Overall Learning Goal (OLG) were inconsistent; 70  therefore when these two AOIs were collapsed into one there was no longer a meaning-ful relationship between the Goals and affect. 8.4.3 Effects of compressed AOIs on individual reports In this section we perform the individual self-report tests that were conducted in Section 8.3 using data from the compressed AOI representation. We compare the results to those of the original detailed representation using a 2 (AOI set) x 4 (classifier) x 4 (re-port time) GLM, and group the presentation of the results into sections for each emotion predicted.  8.4.3.1 Boredom  Figure 8.14: Best individual report results for predicting boredom from a) the de-tailed AOI representation (blue) and b) the compressed AOI representation (red) Figure 8.14 shows the classifiers that achieve the highest accuracy which exceeds the baseline at any report, for both the detailed AOI representation (shown in blue), as well as the compressed representation (in red). The effects detected by the GLM are pre-sented in Table 8.6. 303540455055606570751 2 3 4% Accuracy Report Boredom Accuracy by Report - Compressed AOIs  NaiveBayes-7AOIsSVM-7AOIsLogistic-5AOIsRF-5AOIsBaseline71  Table 8.6: Effects of report on classifying boredom using compressed gaze fea-tures Outcome Measure Effect F-Ratio  Effect Size Sig. Value Boredom accuracy Report F(3,128) = 23.23 η2 = .352 p < .001 Classifier F(4,128) = 4.38 η2 = .120 p < .01 AOI set F(1,128) = 6.60 η2 = .049 p < .05 Report*Classifier F(12,128) = 11.02 η2 = .508 p < .001 Report*AOI set F(3,128) = 4.31 η2 = .092 p < .05 AOI set*Classifier F(4,128) = 8.99 η2 = .219 p < .001 AOI set*Classifier*Window F(12,128) = 3.34 η2 = .238 p < .001 Boredom kappa Report F(3,128) = 23.23 η2 = .352 p < .001 Classifier F(4,128) = 4.38 η2 = .120 p < .005 Report*Classifier F(12,128) = 11.02 η2 = .508 p < .001 Report*AOI set F(3,128) = 4.31 η2 = .092 p < .01 AOI set*Classifier F(4,128) = 8.99 η2 = .219 p < .001 AOI set*Classifier*Window F(12,128) = 3.34 η2 = .238 p < .001  There is a strong main effect of report time, indicating that the relationship be-tween gaze and affect is much clearer at certain times than at others. The best times occur at report 1 and report 3, which are significantly better than the other two reports, and produce results that are significantly better than the baselineThere are also weak main effects of classifier and AOI set, which are qualified by a series of interaction ef-fects. There is an interaction effect between report time and classifier, indicating that there are significant differences between the classifiers and the baseline, but whether those differences exist depends on the self-report. For example, the compressed AOI representation is worse than the baseline at self-report two, but achieves an accuracy of 68.4% (kappa = .362) at self-report one, which significantly exceeds the baseline, t(4) = 4.33, p < .05. It also exceeds the baseline at self-report three. 
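The baseline comparisons reported throughout (such as the t(4) = 4.33 just cited) have the general shape of a t-test over the per-round CV accuracies. The sketch below uses invented placeholder values, and the exact variant applied in this work (paired versus one-sample) is not specified here, so treat it only as the form of the comparison.

```python
import numpy as np
from scipy import stats

# Accuracy from each of five rounds of 10-fold CV for one classifier at one report,
# and the majority-class baseline for that report (placeholder values).
classifier_acc = np.array([67.5, 68.9, 68.1, 69.3, 68.4])
baseline_acc = 62.7

t, p = stats.ttest_1samp(classifier_acc, popmean=baseline_acc)   # df = 4
print(f"t(4) = {t:.2f}, p = {p:.4f}")
```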
The interaction between AOI set and classifier shows that the choice of classifier depends on the AOI set, as was the case when all reports were pooled in Section 8.4.1.1. As Figure 8.14 shows, Naïve Bayes is the best classifier for the detailed AOI set, while Logistic is the best classifier for the compressed AOIs. There is an interaction between report time and AOI 72  set, showing that the AOI set offering the best performance changes, depending on the report.  Finally, there is an interaction between all three factors in the GLM: report, classi-fier, and AOI set. This demonstrates that there are significant differences between the two AOI sets that are only present for specific classifier-report combinations. For exam-ple, the best accuracy achieved by the detailed representation at self-report two (using the SVM classifier) significantly exceeds that of the compressed representation‘s best classifier for this window, t(4) = 2.82, p < .05, but the best detailed classifier does not exceed the best compressed classifier at self-report three.  The lack of clear consistent statistical effects demonstrating the superiority of one AOI representation is likely due to the high variance between the different self-reports. Both offer accuracy exceeding the baseline for two of the four reports. Therefore the decision of one representation over the other may depend on other concerns, such as computational complexity.  8.4.3.2 Curiosity  Figure 8.15: Best individual report results for predicting curiosity from a) the de-tailed AOI representation (blue) and b) the compressed AOI representation (red) Figure 8.15 gives a graphical representation of the best curiosity-prediction results for the two AOI sets, and Table 8.7 shows the effects detected by the GLM. Note that there are two effects which were found for the accuracy results but not for the kappa scores. 40455055606570751 2 3 4% Accuracy Report Curiosity Accuracy - Compressed AOIs   NaiveBayes-7AOIsRF-7AOIsRF-5AOIsSVM-5AOIsBaseline73  The differences are due to the skewed and variable nature of the baseline for this test. To clarify the discussion of the findings, we include additional graphs of the accuracy and kappa scores obtained for curiosity using the compressed representation in Figure 8.16.   Table 8.7: Effects of report on classifying curiosity using compressed gaze fea-tures Outcome Measure Effect F-Ratio  Effect Size Sig. Value Curiosity accuracy Report F(3,128) = 23.23 η2 = .352 p < .001 Classifier F(4,128) = 4.38 η2 = .120 p < .05 Report*Classifier F(12,128) = 11.02 η2 = .508 p < .001 Report*AOI set F(3,128) = 4.31 η2 = .092 p < .001 AOI set*Classifier F(4,128) = 8.99 η2 = .219 p < .001 AOI set*Classifier*Window F(12,128) = 3.34 η2 = .238 p < .05 Curiosity kappa Report F(3,128) = 9.21 η2 = .178 p < .001 Report*Classifier F(12,128) = 5.41 η2 = .337 p < .001 Report*AOI set F(3,128) = 15.56 η2 = .267 p < .001 AOI set*Classifier F(4,128) = 6.81 η2 = .175 p < .001  There is a main effect of report for both accuracy and kappa, demonstrating that the ability to predict curiosity is highly variable between self-report times. There is a small main effect of classifier on accuracy, but there is no classifier that shows consist-ently superior performance for the two measures. 
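Why can an effect appear for accuracy but not for kappa? Because kappa discounts the chance agreement implied by the class distribution, a skewed baseline can make a high accuracy worth very little. The toy example below (synthetic labels, not our data) illustrates the point.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)

# Skewed labels, as when most students report feeling curious.
y_true = rng.random(1000) < 0.8
y_majority = np.ones(1000, dtype=bool)        # always predict the majority class

print(accuracy_score(y_true, y_majority))     # ~0.80 -- looks respectable
print(cohen_kappa_score(y_true, y_majority))  # 0.0  -- exactly chance level
```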
74  -0.2-0.100.10.20.30.40.51 2 3 4Kappa score Self Report Curiosity - Kappa by Report Time Logistic RFNaiveBayes BaselineSVM   There is a strong interaction effect between classifier and report time for both kappa and accuracy, showing that the performance of the classifiers relative to each other and the baseline is dependent on the self-report. As before, there are interactions between report time and AOI set, and between AOI set. Finally, there is a triple interac-tion effect involving AOI set, classifier, and report time. There are significant differences that exist between certain classifiers and the baseline, which are only present for certain AOI sets at certain reports. From section 8.3.2 we know that there were no results from the detailed representation with curiosity prediction accuracy exceeding the baseline. However, with features from the compressed representation, the RF classifier reaches an accuracy of 71.30% (kappa = .398) at self-report four, significantly exceeding the baseline, t(4) = 4.34, p < .05. At self-report three, a different classifier (SVM) is able to achieve accuracy significantly exceeding the baseline. Therefore the ability to predict 3540455055606570751 2 3 4% Accuracy Self Report Curiosity - Accuracy by Report Time Logistic RFNaiveBayes SVMBaselineFigure 8.16: Curiosity prediction results for individual reports, obtained with compressed AOIs 75  curiosity with accuracy exceeding the baseline is dependent on all three factors: the re-port, the classifier in question, and the AOI representation.  Only the compressed AOI representation allows curiosity to be predicted with ac-curacy exceeding the baseline from individual reports. Further, the peak accuracy achieved is higher than any other test involving curiosity that has been conducted so far. This suggests that using compressed AOIs and individual reports to predict curiosity may be a fruitful avenue of inquiry for designing an affect-adaptive MetaTutor.  8.4.4 Compressed AOI conclusions Whether or not to use a compressed representation to calculate the gaze features is dependent on not only the emotion being predicted, but also the test. When individual reports are considered, compressed AOIs should be used to predict curiosity, but it is not clear that they offer a significant advantage for boredom. When all of the available data is used, the trend flips; compressed AOIs may be slightly better for boredom, but impair curiosity prediction performance. When choosing a representation, not only will these factors have to be taken into account, but also the added computational complexi-ty of using a larger, detailed feature set. Because it uses fewer features, the com-pressed representation also has the advantage of making it easier to add additional fea-tures from other sources to the classifiers, in a process known as feature fusion [36]. It is also worthwhile to point out that feature selection on the compressed representation was able to detect more consistent, interpretable trends, which became muddled by the large number of inconsistent features selected from the detailed representation.     76  9 Results of including additional distance features In this section we present the results of using only four features related to the partici-pants‘ distance from the screen over a relevant time interval (the mean, standard devia-tion, max and min) to predict boredom and curiosity. 
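For reference, a minimal sketch of computing these four features is given below; the sampling rate, units, and handling of dropped samples are illustrative assumptions rather than details of the study apparatus.

```python
import numpy as np

def distance_features(distance_cm: np.ndarray) -> dict:
    """Mean, SD, max and min of head-to-screen distance over the chosen window."""
    d = distance_cm[~np.isnan(distance_cm)]   # drop samples where the tracker lost the head
    return {'mean': d.mean(), 'std': d.std(), 'max': d.max(), 'min': d.min()}

# Toy signal: 7 minutes of distance samples at an assumed 60 Hz, centred near 57 cm.
rng = np.random.default_rng(0)
signal = 57 + rng.normal(scale=2.5, size=7 * 60 * 60)
print(distance_features(signal))
```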
Since the data was collected with the eye tracker, the number of data points in each of these experiments is consistent with the number of data points in the eye tracking experiments above. Because we had so few features, we did not perform wrapper feature selection or principle component analysis to reduce the dimensionality. In addition to the four classifiers used with eye gaze data described in the previous section, we also include the Multilayer Perceptron (MLP) algorithm in these tests. We felt that distance features might be sufficiently differ-ent from gaze to cause different classifiers to exhibit the best performance, and MLP classifiers have been shown empirically to have state-of-the-art performance in some domains [56]. We conduct ten rounds of 10-fold CV, and report the final results aver-aged over the ten trials. Once again, we present only the accuracy results in cases where the accuracy and kappa results are analogous. 9.1 Effect of window length on distance results We begin by investigating the optimal amount of interaction time that should be used to compute distance features in order to effectively predict affect. It is quite possi-ble that a different window length will be more appropriate with this new dataset; per-haps the standard length of 20 seconds often used in affective computing literature [49] will be more effective here. We use the same GLM design that was introduced in the previous gaze chapter, except that we include MLP as an additional classifier. This gives us a 6 (classifier) x 6 (window length) model. We will group our discussion of the results into two sections, one for each emotion.  77  9.1.1 Boredom  Figure 9.1: Distance feature accuracy in predicting boredom by window length The accuracies for predicting boredom with distance features calculated from various window lengths are graphed in Figure 9.1. The results of the GLM computed from the same data are presented in Table 9.1. Table 9.1: Effects of window size on classifying boredom using distance features Outcome Measure Effect F-Ratio  Effect Size Sig. Value Boredom accuracy Window F(5,270) = 31.93 η2 = .372 p < .001 Classifier F(5,270) = 19.48 η2 = .265 p < .001 Window*Classifier F(25,270) = 7.50 η2 = .410 p < .001 Boredom kappa Window F(5,270) = 25.84 η2 = .324 p < .001 Classifier F(5,270) = 21.92 η2 = .289 p < .001 Window*Classifier F(25,270) = 8.07 η2 = .428 p < .001  There is a main effect of window length for both boredom kappa and accuracy. The best window length (significantly better than any other) is 50% of the data, or 7 minutes. Once again, we have found evidence that a short interval of a few seconds may not be effective for boredom prediction, although for distance the optimal interval is several minutes shorter than for gaze. Further, distance appears to be more flexible to 45.0050.0055.0060.0065.0070.001 0.75 0.5 0.25 0.1 0.01% Accuracy Window Length Boredom - Accuracy by Window  Logistic RF NaiveBayesBaseline SVM MLP78  the choice of window length than gaze, since accuracies significantly exceeding the baseline were obtained at every window except the smallest window of 8.4 seconds.  There is a main effect of classifier for both kappa and accuracy. Over all windows, the Random Forests classifier is significantly better than all other classifiers, including the baseline. There is also a significant interaction between classifier and window, which shows that the relative rankings of the classifiers and the baseline change, de-pending on the window length. 
For example, the Logistic Regression classifier does not exceed the baseline at an 8.4 second window, but at a window of 7 minutes it reaches a peak of 60.30% (kappa = .197), which does significantly exceed the baseline, t(9) = 18.366, p < .001. It appears that distance from the screen provides sufficient information determine whether a student is feeling bored, with accuracy well above chance. This finding is supported by previous work showing that a student‘s posture can be used to predict boredom [85]. Therefore it is reasonable to conclude that distance from the screen approximates posture closely enough that the information can still be used to detect boredom. To better describe how distance from the screen relates to boredom and possibly posture, we would like to see which features are the most informative for predicting boredom. Since there are only four distance features we did not conduct WFS; instead we used a binary logistic regression analysis to detect which features contributed signif-icantly to predicting boredom, and present the results in Table 9.2. The LR analysis al-lows us to assess whether any of the 4 distance features differed significantly between students who report feeling bored and those that do not. The chi-squared statistic for the model was significant,    = 11.042, p < .05, and the Nagelkerke R2 value was .071, meaning that the LR model built with these features is significantly better than chance at detecting boredom. The column marked B in Table 10.2 gives the weights applied to each feature in the logistic regression model. They are in log-odds, and give the ex-pected amount of increase (or decrease) in the log-odds that a student will be bored, given a one unit increase in the feature.  Table 9.2: Logistic Regression analysis of distance features Distance feature B se Wald Significance Maximum -.008 .003 5.362 .021 79  Mean .011 .004 8.758 .003 Minimum -.003 .003 1.395 .238 Standard deviation -.002 .010 .029 .864 Constant .736 1.516 .236 .627  The last column of Table 10.2 gives the significance value of the Wald chi-square tests for each of the features, which essentially tells us whether the feature contributed significantly to the model. We see that only the maximum distance and the mean dis-tance reach significance. We investigated the values for these features and found that the average distance from the screen was 58.6 cm for bored students, but 56.6 cm for students who are not bored. Similarly, the maximum distance from the screen for any bored student was 85.6 cm, but it was only 82.7 cm for students who were not bored. Taken together, these values suggest that bored students tend to sit farther back from the screen than other students; perhaps they are leaning back because they are disin-terested. This may provide a simple and effective way for an affect-adaptive system to detect and respond to boredom.  9.1.2 Curiosity  4550556065701 0.75 0.5 0.25 0.1 0.01% Accuracy Window Length Curiosity - Accuracy by Window  Logistic RF NaiveBayesBaseline SVM MLP80  Figure 9.2: Distance feature accuracy in predicting curiosity by window length Figure 9.3: Distance feature kappa in predicting curiosity by window length Unlike boredom, curiosity cannot be predicted with accuracy exceeding the majority-class baseline using distance, as is evident in Figure 9.2. The effects detected by the GLM are therefore of little practical interest; for this reason we do not present the other curiosity results here, but instead include them in Table 14.2 of Appendix. 
9.2 Effect of report time on distance results In this section we present the results obtained by using distance to predict each self-report individually, as in Section 8.3. Once again we use a 6 (classifier) by 4 (report time) GLM to analyze the results. Section 9.2.1 will present the results obtained for classifying boredom, and section 9.2.2 the results for curiosity -0.2-0.100.10.20.31 0.75 0.5 0.25 0.1 0.01Kappa score Window Length Curiosity - Kappa by Window Length Logistic RF NaiveBayesSVM MLP Baseline81  9.2.1 Boredom   Figure 9.4: Distance feature accuracy in predicting boredom by report time The effect of classifier itself was not significant, and there were no classifiers which significantly exceeded the baseline over all reports. Unlike in the previous test, there was no classifier and report combination that produced results significantly ex-ceeding the baseline. Therefore we do not discuss the results of the analysis in detail. It is interesting that restricting focus to a single self-report worsened the predic-tion accuracies of models built using distance features, but improved those of the gaze models. Normally, machine learning algorithms benefit from more data; if there is a con-sistent, learnable concept within the data, the generalization error should drop towards 0 as the size of the data set approaches infinity [87] [111]. Therefore it should not come as a surprise to see reduced accuracy as the dataset size drops from 204 sample points in the window tests in the previous section, to 51 for each report in these tests. What is surprising is the fact that this did not happen for gaze; rather, we saw improved accura-cy in the individual report tests in Section 8.3. This may suggest that there is no single gaze pattern that is always indicative of boredom, but rather that these patterns change depending on the interaction time, and possibly the content, of MetaTutor. In contrast, 35404550556065701 2 3 4Kappa score Self Report Boredom - Accuracy by Report Time LogisticRFNaiveBayesSVMMLPBaseline82  distance from the screen may contain a consistent pattern that is better learned with more examples. 9.2.2 Curiosity  Figure 9.5: Distance feature accuracy in predicting curiosity by report time The results of the curiosity GLMs were similar; there was no combination of classifier and report that could provide results exceeding the baseline. For this reason we do not discuss the results in detail, but present the remainder in Table 14.3 of AppendixA. 9.3 Conclusions It would appear that with enough data points, distance from the screen contains suffi-cient information to predict boredom with accuracy above chance. However this same result is not true of curiosity, for which distance features do not appear to be informa-tive. It is possible that distance from the screen provides an approximation of posture, which has previously been linked to boredom [32]. We found that in general, bored stu-dents tend to sit farther away from the screen, perhaps because they are leaning back. In contrast to gaze, focusing on a single self-report worsens the performance of the dis-tance-based classifiers. Perhaps the relationship between gaze and affect is more unique to the current progress through MetaTutor; this would make sense, considering 35404550556065701 2 3 4Kappa score Self Report Curiosity - Accuracy by Report Time LogisticRFNaiveBayesSVMMLPBaseline83  that the visual contents of MetaTutor change along with the student‘s progress, thus potentially greatly affecting gaze patterns. 
The same cannot be said for distance from the screen.    84  10 Electrodermal activity results This chapter will present the results obtained with the Electrodermal Activity (EDA) fea-tures introduced in Chapter 5. We have a total of 11 features: the mean, standard devia-tion, maximum and minimum of the normalized EDA signal in microSiemens (µS), the mean, standard deviation, maximum and minimum of the first derivative of the signal in µS/s, and the standard deviation, minimum, and total number of peaks in the derivative. Once again, there are too few features to warrant feature selection or principal compo-nent analysis. Instead we simply perform classification using ten rounds of 10-fold CV, and report the results averaged over the 10 rounds. Note that because the participants with invalid or missing data differ between the EDA and gaze/distance datasets, the baseline is slightly different for these results. There were 56 participants available for this experiment, however one participant was missing one self-report. This led to a total of 223 data points for the window length experiments, and 53-54 data points for each self-report time.  10.1 Effect of window length on EDA results As in the previous chapter, we assess how much interaction time is needed to compute EDA features that will result in good classification accuracy, because this information is necessary for anyone seeking to build an affect-sensitive ITS using EDA features. We use a 6 (classifier) x 6 (window length) GLM to assess which classifier and window length choices would lead to the most effective system. We will organize our discussion of the results into two sections, one for each emotion being classified.  85  10.1.1 Boredom  Figure 10.1: EDA feature accuracy in predicting boredom by window length  Figure 10.2: EDA feature kappa in predicting boredom by window length Figure 10.1 and Figure 10.2 show the accuracy and kappa of predicting boredom with 404550556065701 0.75 0.5 0.25 0.1 0.01% Accuracy Window Length Boredom - Accuracy by Window  Logistic RF NaiveBayesSVM MLP Baseline-0.2-0.100.10.20.31 0.75 0.5 0.25 0.1 0.01Kappa score Window Length Boredom - Kappa by Window Length Logistic RF NaiveBayesSVM MLP Baseline86  EDA features computed using various time intervals, and Table 10.1 gives the results of the GLM. There is a main effect of classifier for both accuracy and kappa. The Random Forests (RF) classifier is significantly better than all other classifiers, and significantly better than the baseline, even when all windows are considered. We can conclude not only that RF would be a good choice of classifier when constructing a system to predict boredom from EDA, but also that such a system is worth building, as it would be able to predict when a student felt bored with accuracy exceeding simple majority-class guess-ing.  Table 10.1: Effects of window size on classifying boredom using EDA features Outcome Measure Effect F-Ratio  Effect Size Sig. Value Boredom accuracy Window F(5,270) = 9.34 η2 = .147 p < .001 Classifier F(5,270) = 47.17 η2 = .466 p < .001 Window*Classifier F(25,270) = 3.77 η2 = .259 p < .001 Boredom kappa Window F(5,270) = 7.96 η2 = .128 p < .001 Classifier F(5,270) = 41.24 η2 = .433 p < .001 Window*Classifier F(25,270) = 4.12 η2 = .276 p < .001  There is a main effect of window length, and the three longest windows (14, 10.5, and 7 minutes) significantly exceed the other three windows. 
There is also an interaction effect between classifier and window length, which indicates that the performance of the classifiers relative to each other and to the baseline varies with the window length. For example, the SVM classifier does not exceed the baseline at the smaller window lengths, but at a window of 14 minutes it reaches 56.64% (kappa = .095), which does significantly exceed the baseline, t(9) = 3.67, p < .01. Overall, the best accuracy was 59.91% (kappa = .198), found with RF at a 7 minute window.

We are interested in which EDA features differ between students who report feeling bored and those who do not. We used a binary logistic regression analysis to assess whether any of the 11 EDA features differed significantly between the two groups at the best window of 7 minutes, and found that they did not. Table 10.2 gives the results of the logistic regression analysis, for which the chi-squared statistic was χ² = 10.578, p > .05, and the Nagelkerke R² value was .062, which shows that the LR model did not provide a better-than-chance ability to detect boredom.

Table 10.2: Logistic regression analysis of EDA features
EDA feature | B | se | Wald | Significance
Mean | -2.235 | 2.237 | .923 | .337
Std. dev. | 1.168 | 6.553 | .032 | .859
Min | -2.577 | 2.676 | .928 | .335
Max | 1.772 | 1.593 | 1.238 | .266
Deriv. mean | 688.217 | 1478.038 | .217 | .641
Deriv. std. dev. | 1.595 | 17.704 | .008 | .928
Deriv. min | -.455 | .348 | 1.703 | .192
Deriv. max | -.137 | .321 | .182 | .669
Number of peaks | .001 | .001 | .487 | .485
Peaks std. dev. | -.025 | .082 | .097 | .756
Peaks min | -.105 | .190 | .308 | .579
Constant | -.616 | .256 | 5.793 | .016

Once again, the B column in Table 10.2 gives the weights applied to each feature in the logistic regression model, in log-odds. Normally we could claim that a higher B value meant that the EDA feature contributed more to the logistic regression model, thus suggesting it was important in discriminating between students who are bored and those who are not. However, as can be seen from the last column of Table 10.2, none of the Wald chi-square tests were significant for any of the features. Essentially this means that the differences found between these features could be attributed to chance. Taken together, these results suggest that none of the features in isolation provides a great deal of information about whether a student is bored. However, it appears that the classifiers are still able to use these features to build a function that can discriminate between students who are bored and those who are not, perhaps by building complex classification rules involving multiple features (e.g., as in Random Forests).

Although we cannot be sure that any of the features differ significantly between the two groups, we would still like to provide some insight into how the values differ. We found that the mean EDA signal was lower for bored students (M = -.024, SD = .121) than it was for students who remained engaged (M = -.006, SD = .117). Since EDA is linked to arousal [95], this finding is not surprising. Similarly, the minimum EDA response also tended to be lower for bored students (M = -.143, SD = .166) than for engaged students (M = -.100, SD = .126).
The minimum of the EDA derivative was also lower for bored students (M = -.715, SD = .947) than for engaged students (M = -.594, SD = .979), suggesting that bored students experienced steeper declines in their EDA signal than engaged students. However, we also found that bored students experienced more peaks (M = 201.41, SD = 328.01) than engaged students (M = 165.60, SD = 314.56). Normally a peak in the EDA signal is considered a moment of arousal, so the cause of this finding is not immediately obvious. Perhaps it is explained by the fact that bored students experienced sharper declines in EDA, so when their arousal returned to normal it was more often recorded as a peak.

10.1.2 Curiosity

Figure 10.3: EDA feature accuracy in predicting curiosity by window length

Figure 10.4: EDA feature kappa in predicting curiosity by window length

As is plainly evident in Figure 10.3, no classifiers were able to predict curiosity with accuracy exceeding the baseline. Therefore we present the rest of the results in Table 14.4 of the Appendix rather than discussing them here.

As with many of the previous tests, it appears that students' self-reported feelings of curiosity in MetaTutor cannot be predicted with accuracy exceeding the baseline, even using EDA features. This lack of success for most of the experiments may suggest that the construct validity of the curiosity measure is lacking. The measure is intended to assess whether the student is feeling engaged with the material during a short interval of time. However, the baseline is usually quite high, with a majority of students reporting that they feel curious. Further, these reports tend to remain fairly static for each participant. This may suggest that what the scale is really measuring is a more stable personality trait, i.e., whether the student sees themselves as a curious person in general. This could explain why the changing dynamics of students' gaze, posture, and EDA are not helpful in predicting whether the student is 'curious'.

10.2 Effect of report time on EDA results

In this section we describe the results of treating each self-report as its own classification problem. Note that the baselines vary drastically for these tests, so it may be more informative to attend to the kappa figures in determining the performance of the classifiers. The same 6 (classifier) x 4 (report time) GLM, which was introduced in previous chapters, was conducted on this data as well. To save repetitious description, we present the results for both emotions in Table 10.3. We group our discussion of the results into sections based on the effects we are discussing; Section 10.2.1 will present the effects of report time, while Section 10.2.2 will present the effects of classifier and the interaction effect.
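Because the majority-class baseline shifts from report to report, raw accuracy can look respectable even when a classifier has learned nothing. The sketch below, which uses hypothetical labels rather than the actual self-report data, shows how Cohen's kappa discounts the agreement expected by chance; a standard toolkit implementation (e.g., scikit-learn's cohen_kappa_score) behaves the same way.

```python
import numpy as np

def accuracy_and_kappa(y_true, y_pred):
    """Raw accuracy and Cohen's kappa for binary labels (1 = bored, 0 = not)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    # Agreement expected by chance, from the marginal label frequencies.
    p1_true, p1_pred = (y_true == 1).mean(), (y_pred == 1).mean()
    expected = p1_true * p1_pred + (1 - p1_true) * (1 - p1_pred)
    kappa = (acc - expected) / (1 - expected)
    return acc, kappa

# A heavily skewed report: 75% of students report boredom. Always guessing the
# majority class scores 75% accuracy but kappa = 0 (no skill beyond chance).
y_true = [1, 1, 1, 0, 1, 1, 0, 1]
print(accuracy_and_kappa(y_true, [1] * 8))                    # (0.75, 0.0)
print(accuracy_and_kappa(y_true, [1, 1, 1, 0, 1, 0, 0, 1]))   # fewer errors, positive kappa
```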
Figure 10.5: EDA feature accuracy in predicting boredom by report time

Figure 10.6: EDA feature kappa in predicting boredom by report time

10.2.1 Effect of report time

There is a main effect of report time for both performance measures and for both boredom and curiosity. Once again, we see that the time spent with MetaTutor has an effect on the relationship between a physiological measure and affect, in this case EDA. For curiosity, self-report 2 was significantly better than the other reports. For boredom, we found that both self-report 3 and self-report 4 had significantly higher accuracies than the other reports, but only self-report 3 had a significantly better kappa. This discrepancy is likely due to the skewed baseline; the high baseline accuracy at self-report 4 is causing the accuracy to increase, without a corresponding increase in kappa.

Table 10.3: Statistical analysis of the effects of report time using EDA data
Family | Component | F-ratio | Effect Size | Significance
Boredom Accuracy | Classifier | F(5,180) = 9.416 | η² = .207 | p < .001
Boredom Accuracy | Report | F(3,180) = 38.354 | η² = .390 | p < .001
Boredom Accuracy | Interaction | F(15,180) = 9.905 | η² = .452 | p < .001
Curiosity Accuracy | Classifier | F(5,180) = 3.554 | η² = .090 | p < .05
Curiosity Accuracy | Report | F(3,180) = 27.826 | η² = .317 | p < .001
Curiosity Accuracy | Interaction | F(15,180) = 8.543 | η² = .416 | p < .001
Boredom Kappa | Classifier | F(5,180) = 12.554 | η² = .259 | p < .001
Boredom Kappa | Report | F(3,180) = 14.742 | η² = .197 | p < .001
Boredom Kappa | Interaction | F(15,180) = 9.498 | η² = .442 | p < .001
Curiosity Kappa | Classifier | F(5,180) = 6.293 | η² = .149 | p < .001
Curiosity Kappa | Report | F(3,180) = 9.716 | η² = .139 | p < .001
Curiosity Kappa | Interaction | F(15,180) = 7.300 | η² = .378 | p < .001

10.2.2 Effects of classifier and interaction effects

For both boredom and curiosity, there is a main effect of classifier, which is in some cases quite weak (e.g., for curiosity accuracy), and there are no classifiers that offer accuracy significantly exceeding the baseline over all reports. However, there is an interaction effect between classifier and report time which provides further insight into this effect. Although no classifier was able to significantly exceed the baseline over all reports, at certain reports the difference did reach significance. For boredom, we found that Logistic Regression reached an accuracy of 61.96% (kappa = .225) at self-report three, which significantly exceeded the baseline, t(9) = 12.710, p < .001. Curiosity reached a peak of 60.36% (kappa = .196) at self-report four with the RF classifier, once again significantly exceeding the baseline, t(9) = 5.901, p < .001. There were no other reports at which the classifiers exceeded the baseline.

Figure 10.7: EDA feature accuracy in predicting curiosity by report time

Figure 10.8: EDA feature kappa in predicting curiosity by report time

These results do not provide a clear answer to the question of whether it is better to build an affect-sensitive MetaTutor by training the classifiers on small subsets of EDA data.
Using this method, it is possible to train classifiers that can distinguish between both emotions with accuracy exceeding chance. For boredom, those classifiers that were successful achieved slightly better prediction accuracies than in the first tests, in which all self-reports are trained together. The fact that no classifier was significantly better than the baseline over all reports means that this method will fail at certain times during the interaction with MetaTutor, which may be undesirable for detecting and responding when students disengage from the learning session. However, training using only a single self-report provides the only means of distinguishing curious students from EDA features with accuracy exceeding simple guessing.

10.3 Conclusions

Overall, we see that when all self-reports are combined, EDA provides enough information to predict boredom with accuracy above the baseline, using the RF classifier and a longer portion of interaction time to compute the features. In contrast, curiosity once again cannot be predicted with reliable accuracy.

Using the data from only a single self-report (or a restricted portion of the interaction) to train the classifiers leads to mixed results; sometimes it can improve performance for boredom, and in other cases harm it. For curiosity, however, some tests actually do exceed the baseline when only one report time is considered, unlike when all the reports are considered together. Perhaps the relationship between EDA and curiosity is less consistent across self-reports.

11 Summary of individual data sources

In this chapter we will review the results obtained in each of the previous three chapters, from the eye tracking, distance from the screen, and electrodermal activity features. After summarizing the results, we will briefly compare the efficacy of each data source, and provide some discussion that may help to explain the findings.

Table 11.1 and Table 11.2 show the best results obtained in each of the window tests for boredom and curiosity, respectively. The "Best Windows" column shows the window lengths that were found to be significantly better than the others, and the column to the right of it provides any additional windows that produced accuracies that significantly exceeded the baseline. For predicting boredom from gaze, the longest window of 14 minutes is consistently the most effective, whereas for the other two data sources the middle window length of 7.5 minutes proved best. In both cases the best window is a great deal longer than the standard 20-second window used in the literature [49]. For the most part, the same finding is true of curiosity; a 14 minute or 7.5 minute window was most effective when used with EDA, distance, and gaze – but only the gaze features computed with the detailed AOI representation. In contrast, the shortest window of only a few seconds was most effective with the compressed AOI representation.

Table 11.1: Best results in predicting boredom for each previous window test
Data Source | Peak Accuracy | Peak Kappa | Best Windows | Additional windows above baseline | Best Classifier | Exceeds baseline?
Eye Gaze - Detailed AOIs | 57.38% | 0.139 | 14 mins, 10.5 mins | 1.4 mins | SVM | Yes
Eye Gaze - Compressed AOIs | 58.12% | 0.159 | 14 mins, 10.5 mins | 8 secs | Logistic Regression | Yes
Distance from the screen | 60.30% | 0.197 | 7.5 mins | 14 mins, 10.5 mins, 3.5 mins, 1.4 mins | Logistic Regression | Yes
EDA | 59.91% | 0.198 | 14 mins, 10.5 mins, 7 mins | none | Random Forests | Yes

Table 11.2: Best results in predicting curiosity for each previous window test
Data Source | Peak Accuracy | Peak Kappa | Best Windows | Additional windows above baseline | Best Classifier | Exceeds baseline?
Eye Gaze - Detailed AOIs | 63.89% | 0.215 | 14 mins | 8 secs | Random Forests | Yes
Eye Gaze - Compressed AOIs | 61.01% | 0.229 | 8.4 secs (see Footnote 7) | none | Naïve Bayes | Yes
Distance from the screen | 60.30% | 0.136 | 7.5 mins | none | Random Forests | No
EDA | 60.22% | 0.058 | 14 mins | none | SVM | No

Footnote 7: While windows of 14 and 10.5 minutes are statistically equivalent to this window in terms of accuracy, they did not produce results that significantly exceeded the baseline.

At this point it appears that all data sources provide enough information to predict boredom with accuracy above the baseline, although distance from the screen and EDA provided better kappa scores than gaze. However, for curiosity, only gaze provided sufficient information to achieve accuracy that significantly exceeded the baseline. Neither distance nor EDA alone was sufficient to be able to predict curiosity.

A similar trend occurred with the individual report tests, the results of which are shown in Table 11.3 and Table 11.4. Boredom could be consistently predicted with accuracy significantly exceeding the majority-class baseline from each of the three data sources (with gaze, in this case, offering the highest performance at a single self-report). However, curiosity was difficult to predict once again; neither the detailed gaze representation nor distance from the screen provided enough information to predict curiosity at any of the self-reports.

Table 11.3: Best results in predicting boredom from individual reports
Data Source | Peak Accuracy | Peak Kappa | Best Reports | Additional reports above baseline | Best Classifier | Exceeds baseline?
Eye Gaze - Detailed AOIs | 69.00% | 0.379 | 3 | 4 | Naïve Bayes | Yes
Eye Gaze - Compressed AOIs | 63.50-68.40% | 0.262-0.362 | 1, 3 | none | Logistic Regression | Yes
Distance from the screen | 59.22% | 0.177 | 3 | none | Logistic Regression | No
EDA | 61.96% | 0.225 | 3 | none | Naïve Bayes | Yes

Table 11.4: Best results in predicting curiosity from individual reports
Data Source | Peak Accuracy | Peak Kappa | Best Reports | Additional reports above baseline | Best Classifier | Exceeds baseline?
Eye Gaze - Detailed AOIs | 66.33% | -0.01 | 1 | none | SVM | No
Eye Gaze - Compressed AOIs | 63.77-71.30% | 0.273-0.398 | 3, 4 | none | Random Forests | Yes
Distance from the screen | 66.67% | 0 | 1 | none | SVM | No
EDA | 60.36% | 0.196 | 4 (see Footnote 8) | none | SVM | Yes

Footnote 8: Self-report 2 actually had significantly better accuracy here, but did not produce results which significantly exceeded the baseline, whereas self-report 4 did.

It is possible that curiosity, as assessed in this study, is not a valid construct. Perhaps when students are asked about 'curiosity', they think of a more stable trait, rather than a momentary affective state. This could explain why the baseline for curiosity is unusually skewed toward positive answers, and why it is so difficult to use students' transient physiological signals to predict curiosity. This could be a limitation of our work.

Note that although the results of Table 11.3 and Table 11.4 seem impressive compared to those obtained with all the reports pooled together, they represent only the best self-report for that test.
In some cases, only one individual report showed better performance than the pooled reports. Therefore, when actually building an affect classifier, we believe a hybrid approach to training the models may be most effective. For portions of the interaction that can provide a clearer model during a particular phase (such as self-report 3 when predicting boredom from gaze), the system should use a classifier trained on only the data from that portion of the interaction. However, since these phase-specific models sometimes fail, when they cannot provide accuracy exceeding the classifier trained on the pooled self-reports, the system should switch to using this as the default classifier. This mixed approach would allow for the highest chance of detecting students' affective states.

12 Combining all data sources

The focus of this chapter is on building the best possible affect detection system for this dataset. In light of this goal, we train separate classifiers on each of the available data sources (gaze, distance, and EDA; see Footnote 9), and perform ensemble classification by combining their decisions in a majority vote [37]. Although we also tested using feature fusion, we found that the results were not promising, and have therefore decided to focus on ensemble classification. The results from the previous chapters are used to determine the optimal classifier and window length for each dataset. The idea is to demonstrate the level of performance that can be expected from an affect-adaptive system trained for use with MetaTutor from these data sources. Unfortunately, however, the external validity of the results is limited by the small size of the dataset.

Footnote 9: Note that because of time constraints we only used the detailed AOI representation for gaze, so that the ensemble experiments could be run in parallel to the compressed AOI experiments.

In order to combine data from all three sources of information in our classification system, we have to include only those participants that have valid data for each source. After computing information for all participants, we found that only 39 had valid gaze, distance, and EDA data, as well as emotion self-reports. This extremely small sample size is obviously not optimal for machine learning, so the results obtained for ensemble classification should be taken as a proof of concept, rather than as a definitive answer to the question of what level of performance can possibly be achieved by combining these data sources.

This chapter is organized as follows. Section 12.1 will present the results obtained by predicting each individual report separately using ensemble classification, and will be organized according to the emotions being predicted. Section 12.2 will follow by presenting the best results that we have achieved on this dataset, by combining all data sources, and using our previous findings to construct the optimal classifiers.
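As an illustration of the voting scheme used in this chapter, the sketch below trains one classifier per data source and combines their predictions with an unweighted majority vote. The scikit-learn estimators and the per-source feature matrices (`X_gaze`, `X_distance`, `X_eda`) are placeholders chosen for the sketch; they are not the exact models, preprocessing, or evaluation protocol used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 39                                   # participants with valid data for all three sources
y = rng.integers(0, 2, size=n)           # placeholder boredom labels (1 = bored)
X_gaze = rng.normal(size=(n, 15))        # placeholder feature matrices for each source
X_distance = rng.normal(size=(n, 4))
X_eda = rng.normal(size=(n, 11))

members = [
    (SVC(), X_gaze),
    (LogisticRegression(max_iter=1000), X_distance),
    (RandomForestClassifier(), X_eda),
]

# Fit one classifier per source, then combine their binary predictions with an
# unweighted majority vote (2 of 3 votes wins). In a real evaluation the fitting
# and voting would happen inside each cross-validation fold.
votes = []
for clf, X in members:
    clf.fit(X, y)
    votes.append(clf.predict(X))
ensemble_pred = (np.vstack(votes).sum(axis=0) >= 2).astype(int)
print(ensemble_pred)
```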
12.1 Ensemble classification of individual reports

We first present the results of treating each self-report time as a separate classification problem. The motivation for conducting this test is to determine if, when constructing an affect-adaptive MetaTutor, it is better to train several classifiers for different portions of the tutoring interaction, rather than simply one overall emotion classifier.

To construct the optimal classifier for each data source, we examined the results that were obtained on the individual self-report tests in each of the previous experiments. We chose the classifiers that had achieved the highest accuracy scores over all reports. The classifiers chosen for each emotion and data source are shown in Table 12.1. Note that in our original tests, not all of these classifiers offered performance exceeding the baseline (for example, the distance classifier cannot predict curiosity with accuracy above chance). Since the ability of an ensemble to provide an improvement in performance depends on each of the classifiers in the ensemble being able to predict the classification label with accuracy slightly better than random guessing [43], the low-performing classifiers in our ensemble could potentially be very detrimental. However, in this experiment we are also using a substantially different dataset; there are only 39 participants with data available for all three sources, and consequently only 39 data points at each self-report. Since many of the previous participants have been eliminated, we felt that the data might be sufficiently different to make the experiment worth performing.

Table 12.1: Individual report classifiers in the ensemble
Data Source | Boredom | Curiosity
Gaze | Naïve Bayes | Random Forests
Distance | Logistic Regression | Random Forests
EDA | Multilayer Perceptron | Support Vector Machines

However, we would still like to assess whether the ensemble classifier offers significant improvements in the ability to detect boredom and curiosity. Therefore, we use a 5 (classifier) x 4 (report time) GLM to analyze the results. The classifiers included in the model are the baseline, the classifiers trained separately on each data source (gaze, distance, and EDA), and the ensemble classifier. We present the results for each emotion separately.

12.1.1 Boredom

The results of performing ensemble classification of boredom on the individual reports are shown in Figure 12.1. The classifier for each data source is trained on the restricted dataset used in ensemble classification, and their classification of each data point is combined in a majority vote, shown as the bold line in Figure 12.1. The analysis of the results is shown in Table 12.2.

Table 12.2: GLM results of ensemble classification of boredom by report
Outcome Measure | Effect | F-Ratio | Effect Size | Sig. Value
Boredom accuracy | Report | F(3,144) = 17.64 | η² = .269 | p < .001
Boredom accuracy | Classifier | F(4,144) = 12.88 | η² = .264 | p < .001
Boredom accuracy | Report*Classifier | F(12,144) = 2.81 | η² = .190 | p < .01
Boredom kappa | Report | F(3,144) = 12.73 | η² = .210 | p < .001
Boredom kappa | Classifier | F(4,144) = 13.62 | η² = .275 | p < .001
Boredom kappa | Report*Classifier | F(12,144) = 3.02 | η² = .201 | p < .005

There is a main effect of report time. This replicates findings from each of the previous experiments, contributing to the evidence that the relationship between behaviour and boredom changes over the course of the interaction with MetaTutor. In this case it is clearest at self-report four, which has significantly better results than the other reports.

There is a main effect of classifier. The gaze classifier (trained only with gaze data) has significantly higher accuracy and kappa scores than both the EDA classifier and the distance classifier.
Figure 12.1: Ensemble results for classifying boredom from individual reports

Note, however, that both the EDA and distance features had better results in the previous individual report experiments described in Sections 9.2.1 and 10.2. The decreased performance is likely due to the decreased sample size, as described above. The performance of the ensemble classifier did not differ significantly from any of the other classifiers. While a majority vote has been shown to improve classification performance, this improvement depends on the condition that each classifier in the ensemble can provide performance that is slightly better than random guessing [43]. As Figure 12.1 shows, both the EDA and distance classifiers perform markedly worse than random guessing, dragging down the performance of the ensemble. However, at self-report 4 we see that when the classifiers do meet this basic accuracy assumption, the ensemble has performance exceeding that of any of its members. As a result, there is a significant interaction effect between classifier and report time. The ensemble classifier only significantly surpasses the baseline at report four, where it achieves an accuracy of 65.15% (kappa = .307), t(9) = 2.800, p < .05. It is not significantly better than the other classifiers.

12.1.2 Curiosity

Figure 12.2: Ensemble results for classifying curiosity from individual reports

The results of ensemble classification of curiosity from individual self-reports are graphed in Figure 12.2, and the statistical analysis is presented in Table 12.3. As usual, there was a main effect of report time, although the effect size was smaller in this experiment. A smaller effect size is encouraging, since it means that the performance is less variable between reports, and therefore that an affect-adaptive classifier trained in this manner could be more consistent and reliable.

There was a main effect of classifier, qualified by an interaction between classifier and report of larger effect size. Significant differences between the classifiers are not present at all reports. It is only at self-report four that the ensemble classifier exceeds the baseline, reaching an accuracy of 67.08% (kappa = .339), which significantly exceeds both the baseline, t(9) = 8.16, p < .005, and the EDA classifier, t(9) = 3.72, p < .05, but not the gaze or distance classifiers.

While the ensemble classifier is able to predict curiosity with more reliable performance than we have seen in the previous tests (averaging 63.75% (kappa = .129) over all reports and never dropping below 56%), it only offers performance that significantly exceeds the baseline at one self-report. Further, the performance of the ensemble does not significantly exceed that of its members, suggesting that the effort expended in collecting and combining three data sources may not be worthwhile. However, this negative result may be due to the size of the dataset.

Table 12.3: GLM results of ensemble classification of curiosity by report
Outcome Measure | Effect | F-Ratio | Effect Size | Sig. Value
Curiosity accuracy | Report | F(3,144) = 7.41 | η² = .134 | p < .001
Curiosity accuracy | Classifier | F(4,144) = 5.40 | η² = .130 | p < .001
Curiosity accuracy | Report*Classifier | F(12,144) = 4.24 | η² = .261 | p < .001
Curiosity kappa | Report | F(3,144) = 7.79 | η² = .140 | p < .001
Curiosity kappa | Classifier | F(4,144) = 12.44 | η² = .257 | p < .001
Curiosity kappa | Report*Classifier | F(12,144) = 5.02 | η² = .295 | p < .001

12.2 Ensemble classification of the entire dataset

In this section we attempt to classify data from all of the available self-reports using our ensemble of classifiers trained on gaze, distance, and EDA. Once again we use the findings from the previous sections to build the optimal classifiers. For each data source, the classifier and window combination that achieved the highest results for each emotion is used in the ensemble, and these combinations are shown in Table 12.4. Since Section 8.4.1 suggested that the compressed AOIs might be more effective for predicting boredom, we tested using them in the ensemble; however, we found that this did not improve the results. Therefore the ensemble is once again based on the detailed AOI representation.

Table 12.4: The best classifier and window length combinations for each emotion and data source
Emotion | Data source | Classifier | Window Length
Boredom | Gaze | SVM | 14 mins
Boredom | Distance | Logistic Regression | 7 mins
Boredom | EDA | Random Forests | 7 mins
Curiosity | Gaze | Random Forests | 14 mins
Curiosity | Distance | Random Forests | 7 mins
Curiosity | EDA | SVM | 14 mins

It is important to realize, however, that the best classifier found in the previous experiments may not remain the best on the ensemble dataset. The ensemble dataset contains fewer participants than those of the previous experiments, since it contains only those that have data for all three data sources. Rather than over 200 data points as in previous experiments, it has only 156. The smaller size could impair classification performance, but it may also now be lacking data points that were outliers, possibly improving classification performance. For these reasons, we see that the accuracies obtained by each classifier in the ensemble vary somewhat from the accuracies reported in the previous experiments. For this reason we felt it was worthwhile to include classifiers that did not necessarily achieve better-than-baseline performance in the previous experiments, in case the new data led to different results.

In the following two sections we report the ensemble results achieved for boredom and curiosity, respectively. We compare the results of the ensemble to the baseline, and to the results achieved by using each of the three data sources independently, using a 5 (data source) multivariate GLM, which included boredom accuracy, boredom kappa, curiosity accuracy, and curiosity kappa as dependent variables.

12.2.1 Boredom

Figure 12.3: Overall boredom prediction accuracy for the ensemble and each data source

Figure 12.4: Overall boredom kappa scores for the ensemble and each data source

The ensemble classifier had 60.15% accuracy (kappa = .186) in predicting boredom. There is a strong main effect of data source, F(4,37) = 13.985, η² = .602, p < .001. The ensemble classifier significantly exceeds the baseline, indicating that combining all data sources with features computed from the appropriate window lengths does provide a model that can perform reliable classification of boredom.
However, the results of the ensemble classifier did not differ significantly from those obtained by using either distance information or EDA data alone. In fact, the distance classifier actually had slightly higher accuracy than the ensemble: 62.11% (kappa = .235). This can be explained by the fact that the gaze classifier, when trained on the limited dataset used for this test, did not achieve accuracy exceeding the baseline. In order for the ensemble to outperform its members, each member needs to perform better than chance [43].

12.2.2 Curiosity

Figure 12.5: Overall curiosity prediction accuracy for the ensemble and each data source

Figure 12.6: Overall curiosity kappa scores for the ensemble and each data source

For curiosity, the ensemble classifier reached 62.88% accuracy (kappa = .134). As with boredom, there is a powerful main effect of data source, F(4,37) = 21.82, η² = .702, p < .001. Unlike boredom, however, the ensemble classifier for curiosity does not significantly exceed the baseline accuracy of 58.97%. This result is not surprising, given a) the skewed nature of the baseline itself, and b) the fact that both the EDA and gaze classifiers in the curiosity ensemble provide performance that is worse than the baseline, and in the case of gaze, significantly worse. The only classifier that significantly outperformed the baseline was the distance classifier, which reached an accuracy of 65.76% (kappa = .269).

12.3 Conclusions

Formal results have established that a majority vote of weak classifiers can achieve accuracy exceeding that of the individual classifiers in the ensemble [43]. A weak classifier is one that provides accuracy slightly exceeding chance. Unfortunately, our ensembles above do not meet this criterion; up to two of the classifiers in the ensemble do not exceed the baseline. Therefore they may effectively drag down the performance of the better-performing classifiers, and we see that the ensemble as a whole does not outperform its best member.

13 Conclusions and future work

We have examined the effectiveness of a variety of data sources for predicting learning-related affective states in MetaTutor, an ITS designed to scaffold SRL strategies. Our main contribution is that we examine the value of eye gaze data in predicting affect, both in combination with other sources and alone. To our knowledge, no other studies have performed machine learning on eye gaze attention patterns in order to predict affect in an ITS. Further, we show that distance from the screen is an effective means for predicting affect, in addition to being easy and inexpensive to collect.

13.1 Thesis goals satisfaction

We conducted a variety of experiments to answer five main research questions related to the value of each data source (alone and in combination), the gaze features that are predictive of boredom and curiosity, the ability to predict curiosity itself, and the time interval that should be used to train affect classifiers in MetaTutor.
The answers to these questions are discussed in the following sections.

13.1.1 Which data source is the most valuable for predicting affect in MetaTutor?

We conducted four experiments to determine which data source provides the most useful information for predicting affect in MetaTutor; testing each data source independently, and then combining and comparing them in the final experiment in Chapter 12. Which data source provided the best results depended on the experiment and the emotion being classified.

For predicting boredom using data points from all self-reports, we found that distance features led to the highest performance, but are not significantly better than EDA. When we tested each data source independently (using the set of participants with data for that source), we obtained the following boredom prediction accuracies: distance reached 60.29% (kappa = .197), EDA reached 59.91% (kappa = .198), and gaze reached 58.12% (kappa = .159). These results are comparable to previous research (listed in Table 2.1 and Table 2.2), which showed accuracies ranging from 60-70% in predicting boredom from other data sources, and slightly lower accuracies when focusing on EDA as a single data source. In the final experiment in Chapter 12 we tested each data source using a reduced set of participants that were common to all three of the previous experiments. We obtained the following results: distance reached 62.11% (kappa = .235), EDA reached 56.24% (kappa = .123), gaze reached 53.79% (kappa = .048). From this experiment we learned that the classifier trained with distance features significantly exceeds that of gaze, although it was not significantly better than EDA. Therefore we conclude that both distance and EDA are the best choices for predicting boredom using data collected throughout the interaction with MetaTutor.

It is interesting that only four simple features related to a crude measurement of posture (distance from the screen) can be used to predict boredom so effectively. Although previous research has established that affect can be predicted from posture [85] [68] [32] [67], the posture features are usually collected from expensive, proprietary equipment, whereas distance can be collected easily and cheaply using commonly available infrared depth devices. Our results are comparable to some of the accuracies obtained using Body Posture Measurement Systems in our research, as shown in Table 2.1.

Although gaze was not the most valuable resource when all self-reports were combined, we also discovered that the amount of time that has passed in the interaction has a strong effect on how gaze relates to affect, and that this effect is not as strong for the other two data sources. This is likely because the content changes along with the phase of the interaction; when the interface displays different content, the meaning of gaze features representing attention to different parts of the interface also changes. By computing a new gaze model for each phase of the interaction (or self-report), boredom prediction from gaze can reach an accuracy as high as 69.0% (kappa = .379), although this is not consistent across reports. By using these more specific models for portions of the interaction where the gaze/affect relationship is clearer, it would be possible to build a more accurate boredom classifier than by using only the model trained on the full dataset.

When it comes to predicting curiosity, we see that gaze data becomes an invaluable resource.
Neither distance nor EDA was able to provide enough information to predict curiosity with accuracy exceeding the baseline when all reports were considered. Only gaze data, which provided an accuracy of 63.9% (kappa = .215), could predict curiosity reliably. Even when only one report was used to predict curiosity, gaze data still offered the best performance, reaching a peak of 71.30% (kappa = .398) at self-report 4. Distance does not allow curiosity to be predicted with accuracy exceeding the baseline at any of the individual reports. EDA does, but only at one report, and only reaching an accuracy of 60.36% (kappa = .196). Somewhat confusingly, we see that in Section 12.2.2, the accuracy obtained with distance and EDA on the reduced dataset used in ensemble classification exceeds that of gaze. The best explanation for this phenomenon is that because the reduced dataset is so much smaller, overfitting becomes more of a problem. Because there are only four distance features, the distance classifier is less prone to overfitting. Despite the fact that we performed wrapper feature selection on the gaze features, typically 10-15 features were selected, giving the gaze classifier more opportunity to overfit the data, ultimately reducing its accuracy on the test set.

Overall, we see that the best data source depends on the emotion being predicted, and varies between distance and gaze. We have found that gaze data is a valuable resource for predicting affect; it provides accuracy exceeding the baseline for predicting both emotions, using either all reports or a single report, and in some cases is particularly valuable when other resources fail. It is also important to note that the distance features, which showed surprisingly good performance, were actually collected with the eye tracker. Taken together, this means that the Tobii eye tracker provides enough information to predict both boredom and curiosity with the highest accuracies we have achieved.

13.1.2 What do gaze features tell us about students' attention patterns in MetaTutor?

Using the gaze features selected by the wrapper feature selection process, we examined the differences between students who reported feeling bored vs. those who did not, and students who reported feeling curious vs. those who did not. For the most part, findings were consistent between curious and 'not bored' students (hereafter referred to as engaged), and bored and 'not curious' students (hereafter referred to as disengaged).

Some of the findings were intuitive; for example, we found that engaged students focus frequently on the learning material (the text and image content containing information about the circulatory system). Engaged students also made more frequent use of the Table of Contents (TOC), most likely because they used the TOC to actively select the material they were interested in learning about. Other trends appear counterintuitive; for example, students who fixate longest on the clock, and look most frequently to the clock from the TOC and content, actually report feeling engaged. Perhaps this is because they are using the clock to manage their time effectively. Because the TOC, clock, and text and image content are features that are likely to be part of a wide range of learning environments, we hope that these findings can generalize to other systems. Even more generalizable are the findings related to general eye tracking features that do not pertain to a specific MetaTutor AOI.
For example, we found that engaged students tended to have longer fixations, while disengaged students had shorter, more frequent fixations. Further, disengaged students had greater variation in the distance between two consecutive fixations. Taken together, these findings seem to suggest that disengaged students have more erratic fixation patterns, while engaged students are more focused and systematic.

We also detected patterns specific to MetaTutor's components that are designed to scaffold self-regulated learning. We found that increased attention to the Agent and Learning Strategies Palette (LSP) tends to be indicative of disengagement. We are cautious about using these findings to conclude that the learning tools are ineffective. Since these tools are located on the periphery of the screen, attention to them may simply be a symptom of disengaged students' wandering gaze [104]. Or, because the LSP contains a variety of options for progressing through the content, perhaps bored students are using it once they are no longer interested in a topic. However, it is possible that looking at the Agent does not provide educational value to the students; a similar study of gaze in MetaTutor found that it was the only AOI not predictive of learning gains [17].

13.1.3 Can affect be predicted reliably by combining several sources?

Unfortunately, the results of our ensemble classification experiments were disappointing. While in general we should expect the ensemble classifier as a whole to outperform its member classifiers, this guarantee rests on the assumption that each member of the ensemble can provide better-than-chance prediction accuracy [43]. For our dataset this was not the case, likely because of the extremely limited number of participants that had valid data for both the eye tracker and the skin conductance bracelet. We will further discuss these limitations in Section 13.2. However, it is worth noting that both boredom and curiosity can be predicted with accuracy exceeding the baseline using the ensemble; for boredom, using one model trained on all available data, and for curiosity, using separate models for each individual report.

13.1.4 Can curiosity be predicted reliably?

Throughout this work we have experienced difficulty in constructing a model that can predict curiosity with accuracy above the baseline. It is possible that this is due to the self-report measure of curiosity having poor construct validity. Rather than measuring a temporary state of learner interest that changes along with the participant's learning experience, it is possible that it measures a long-term trait. When asked if they are curious, perhaps students choose their answer based on their perception of themselves as a curious person, more than how they are feeling at that moment. This would explain the low within-subjects variability for the curiosity reports, the high baseline, and the fact that the changing EDA and distance measures are not predictive of curiosity as it was reported using this measure.

13.1.5 How much time is needed to detect an affective state?

Many studies have, seemingly by convention, continued to use a 20-second interval for affect identification (e.g., [34] [13] [46] [102] [101], as cited in [49]). We have consistently found that this interval is not appropriate for predicting self-reported emotion in MetaTutor, whether we are predicting boredom or curiosity, or using gaze, distance, or EDA data as a source.
In predicting boredom using gaze we found that the optimal interval was the full 14 minutes, suggesting that the more gaze data is available, the better. For the other two sources (distance and EDA), we found that an interval of 7 minutes was most appropriate for predicting boredom. For curiosity, the only data source that could provide accuracy exceeding the baseline was gaze, and once again we found that a 14 minute interval was most effective. The only exception to this rule was found when predicting curiosity using compressed AOIs; in this case an approximately 10 second interval could provide performance exceeding the baseline. However, the performance was still worse than the full 14 minute interval used with detailed AOIs. Overall these results suggest that researchers should be cautious in assuming that a 20-second interval will be most appropriate for predicting affect.

As mentioned above, we also found that the relationship between a data source and the affective state it is being used to predict may change as the time spent in the learning session increases. Focusing on only a single time interval (or self-report) can in some cases provide much better accuracy than is possible from predicting affect using data from the entire interaction. This effect was found to be particularly strong for the gaze results, where the important features changed with the self-report. It is likely that this relationship is due to the fact that gaze attention patterns are related to the content displayed on the screen, which changes over the course of the learning session. Building models for each time interval can result in much better accuracy in some cases, but this improvement does not occur consistently for each interval. Therefore the most effective affect classifier would be trained using a hybrid approach. The default classifier would be trained using all the data, but for those time intervals that show better performance than this default, the individual report classifier could be substituted.

13.2 Limitations

The greatest limitation of this study is the small sample size. Machine learning is dependent on having a large data set from which it is possible to detect generalizable trends; the generalization error has been shown to decrease as the number of data points increases [87] [111]. Our dataset is so small as to make effective machine learning extremely difficult in some cases. For example, in the ensemble tests we have only 39 participants contributing data. This problem is compounded by the fact that classifiers are much more likely to overfit the data when the number of samples is low, especially if there are a large number of features available [10].

As explained in Section 13.1.4 above, predicting curiosity with accuracy exceeding the baseline proved difficult. Only gaze data was able to provide reliable curiosity prediction performance, while EDA and distance for the most part did not outperform chance. Therefore the construct validity of curiosity may have been a limitation of our study; further research into the validity of curiosity as a transient affective state may be required.

Finally, affect prediction in general is a difficult problem. As Pekrun explains in a 2006 paper [92], sometimes when the relationship between affect and a behavioural signal is calculated across individuals it is not representative of the actual relationship that exists within the data points for each individual. A well-known example [112] is the relationship between frequency of migraine headaches and duration of sleep (pictured in Figure 13.1). If this relationship is examined only across individuals, it appears to be positive: the longer an individual sleeps, the more headaches she appears to suffer. This could lead to the erroneous advice that sleeping less will lead to fewer headaches. However, when we look at the data points for each individual alone, we see that the relationship is actually strongly negative: as each participant sleeps more, s/he experiences fewer headaches. This discrepancy between inter-individual and intra-individual differences makes learning a generalizable pattern extremely difficult in some domains [112], including affect detection [92].

Figure 13.1: Inter- vs. intra-individual relationship between sleep and migraine headaches
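This inter- versus intra-individual discrepancy can be reproduced with a few lines of synthetic data. The numbers below are invented purely to mimic the sleep/headache pattern just described; they are not the data from [112], nor anything from our dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical people with different baselines: heavier sleepers also
# happen to report more headaches, but within each person extra sleep reduces them.
people = []
for base_sleep, base_headaches in [(5.0, 2.0), (7.0, 5.0), (9.0, 8.0)]:
    sleep = base_sleep + rng.normal(0.0, 0.5, size=30)
    headaches = base_headaches - 1.5 * (sleep - base_sleep) + rng.normal(0.0, 0.3, size=30)
    people.append((sleep, headaches))

pooled_sleep = np.concatenate([s for s, _ in people])
pooled_headaches = np.concatenate([h for _, h in people])

print(np.corrcoef(pooled_sleep, pooled_headaches)[0, 1])        # positive across individuals
print([round(np.corrcoef(s, h)[0, 1], 2) for s, h in people])   # negative within each individual
```

A model trained only on the pooled points would learn the misleading positive trend, which is one reason per-participant modelling approaches are attractive for affect detection.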
13.3 Future work

In order to improve this research, we would like to collect data from more participants. As explained above, machine learning is best performed with as much data as possible.

Beyond data collection, there are several other ways to improve the current research. Firstly, we found that our ensemble classifier showed disappointing performance. One way to address this problem would be to use the ensemble weighting method described in Kapoor and Picard's 2004 paper [67], in which each classifier is given a weight based on the error it produced on the training data. We would also like to explore methods for dealing with the problem of inter-individual vs. intra-individual differences in affect expression. Multi-Task Multi-view Learning [18] is a method for training models not just for each data source, but for each participant as well. The models share parameters, and the degree to which the parameters are shared is representative of the degree of inter-individual commonality. Since there is a model for each participant, it can account for the intra-individual relationships. This technique could be particularly beneficial for detecting the affect of multiple participants across multiple modalities.

There are also a number of possible extensions to this work. For example, participants reported a total of 19 emotions that we could potentially study. While most of those emotions were infrequently reported, hope was actually reported with nearly the same frequency as boredom and curiosity, and the ability to predict hope could be studied using the same framework and code base we have developed for this project. Other somewhat frequently reported emotions include frustration, happiness, joy, pride, confusion, and neutral. There are also additional data sources that were collected during the study that could be leveraged to predict these emotions. Interaction logs were recorded that include students' goal setting, interactions with pedagogical agents, and quizzes. Using features calculated from this data could be a fruitful avenue of inquiry, as student actions such as overly fast responses have been shown to be linked to boredom [41].

The most important future direction of this research is to leverage our findings and predictive models to build an affect-adaptive version of MetaTutor. Once we are able to reliably detect student affect, we would need to research and develop interventions that the system could deploy if it sensed that the user was bored or disengaged. If these interventions were successful in maintaining student engagement, they would lead to increased task success and user satisfaction.

Bibliography

1.
14 Appendix

Table 14.1: Effects of report time on classifying curiosity using gaze features

Family               Component     F-ratio             Effect Size   Significance
Curiosity Accuracy   Classifier    F(4,64) = 4.897     η² = .234     p < .01
                     Report        F(3,64) = 13.083    η² = .380     p < .001
                     Interaction   F(12,64) = 5.557    η² = .510     p < .001
Curiosity Kappa      Classifier    F(4,64) = 4.400     η² = .216     p < .05
                     Report        F(3,64) = 3.576     η² = .144     p = .076
                     Interaction   F(12,64) = 4.195    η² = .440     p < .001

Table 14.2: Effects of window size on classifying curiosity using distance features

Family               Component     F-ratio             Effect Size   Significance
Curiosity Accuracy   Classifier    F(5,270) = 68.438   η² = .559     p < .001
                     Window        F(5,270) = 5.481    η² = .092     p < .001
                     Interaction   F(25,270) = 10.662  η² = .497     p < .001
Curiosity Kappa      Classifier    F(5,270) = 39.015   η² = .419     p < .001
                     Window        F(5,270) = 7.292    η² = .119     p < .001
                     Interaction   F(25,270) = 7.871   η² = .422     p < .001

Table 14.3: Effects of report time on classifying curiosity using distance features

Family               Component     F-ratio             Effect Size   Significance
Curiosity Accuracy   Classifier    F(5,270) = 27.436   η² = .433     p < .001
                     Report        F(3,270) = 77.317   η² = .563     p < .001
                     Interaction   F(15,270) = 16.364  η² = .577     p < .001
Curiosity Kappa      Classifier    F(5,270) = 11.660   η² = .245     p < .001
                     Report        F(3,270) = 32.517   η² = .351     p < .001
                     Interaction   F(15,270) = 19.495  η² = .619     p < .001

Table 14.4: Effects of window size on classifying curiosity using EDA features

Family               Component     F-ratio             Effect Size   Significance
Curiosity Accuracy   Classifier    F(5,270) = 144.860  η² = .728     p < .001
                     Window        F(5,270) = 19.165   η² = .262     p < .001
                     Interaction   F(25,270) = 25.085  η² = .699     p < .001
Curiosity Kappa      Classifier    F(5,270) = 28.882   η² = .348     p < .001
                     Window        F(5,270) = 10.723   η² = .166     p < .001
                     Interaction   F(25,270) = 8.353   η² = .436     p < .001

Figure 14.1: Kappa scores for predicting boredom from eye gaze data (kappa score plotted against window length for the Logistic, RF, Naive Bayes, SVM, and Baseline classifiers).
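If the effect sizes in the tables above are partial eta squared values (as is typical for repeated-measures ANOVA output), each one can be recovered from its F-ratio and degrees of freedom as η²_p = (F × df_effect) / (F × df_effect + df_error). The short Python sketch below is only a sanity check of that relationship, not code used in the thesis; the function name and the example values (taken from Table 14.1) are chosen purely for illustration.

```python
# Minimal sketch: recovering partial eta squared from a reported F-ratio.
# The helper name and the example values (from Table 14.1) are illustrative only.

def partial_eta_squared(f_ratio: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared: (F * df_effect) / (F * df_effect + df_error)."""
    ss_ratio = f_ratio * df_effect
    return ss_ratio / (ss_ratio + df_error)

if __name__ == "__main__":
    # Classifier main effect on curiosity accuracy, Table 14.1: F(4,64) = 4.897
    print(round(partial_eta_squared(4.897, 4, 64), 3))   # ~0.234
    # Report-time main effect on curiosity accuracy, Table 14.1: F(3,64) = 13.083
    print(round(partial_eta_squared(13.083, 3, 64), 3))  # ~0.380
```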
