Open Collections

UBC Theses and Dissertations

A model and adaptive support for learning in an educational game Manske, Micheline 2006



Full Text

A Model and Adaptive Support for Learning in an Educational Game

Data-driven refinement and evaluation of a student model and pedagogical interventions for Prime Climb

by

Micheline Manske
B.Sc., Queen's University, 1999
B.Ed., Queen's University, 2000

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in The Faculty of Graduate Studies (Computer Science)

The University of British Columbia
December 2005

© Micheline Manske 2005

Abstract

Educational games are highly motivating; however, there is little evidence that they can trigger learning unless game play is supported by additional activities. We aim to support students during game play via an intelligent pedagogical agent that intervenes to offer hints and suggestions when the student is lacking domain knowledge, but does not interfere otherwise, so as to maintain the engagement that computer games are known to bring about. Such an agent must be informed by an accurate model of student learning. In this thesis we describe research on data-driven refinement and evaluation of a probabilistic model of student learning for an educational game on number factorization, Prime Climb. An initial version of the model was designed based on teachers' advice and subjective parameter settings. We illustrate data-driven improvements to the model, and we report significant improvements in its accuracy. This model is used by an intelligent pedagogical agent for Prime Climb. We present results from an ablation study in which students played a version of the game that employed either a pedagogical agent acting on the original model, a pedagogical agent acting on the new model, or no pedagogical agent at all. Learning gains and students' subjective assessments of the agent are discussed.

Contents

Abstract ii
Contents iii
List of Tables vi
List of Figures x
Acknowledgements xii
1 Introduction 1
1.1 Thesis goals 3
1.2 Contributions of this research 4
1.3 Outline 4
2 Related Work 6
2.1 The computer as an educational tool 6
2.1.1 Intelligent tutoring systems 6
2.1.2 Educational computer games 6
2.2 Student Modeling using Bayesian Networks 9
2.2.1 Models designed using expert intuition 10
2.2.2 Models learned from data 11
2.2.3 Models which combine expert intuition and data learning 13
2.2.4 Our approach 15
2.3 Evaluation 16
2.3.1 Evaluation of ITS and Educational Games 16
2.3.2 Evaluation of student models 19
3 Prime Climb 26
3.1 Game Rules 26
3.2 Tools 27
3.3 Pedagogical Agent 29
4 Model of Student Learning 34
4.1 Original Student Model 34
4.1.1 Structure of the original model 34
4.1.2 Evaluation of the original student model 43
4.1.3 Problems with the original student model 47
4.2 New Model: Fixing the apportion of blame problem 49
4.2.1 Structure 49
4.2.2 Data-driven parameter refinement 53
4.2.3 Model Accuracy 56
4.2.4 Sensitivity to parameters 57
4.3 New Model: Adding common factor knowledge 63
4.3.1 Structure 64
4.3.2 Data-driven parameter refinement 65
4.3.3 Model Accuracy 68
4.4 Comparison of Models 71
4.5 Implementation 75
5 Changes to Agent Interventions 76
5.1 Original hinting strategy 76
5.2 New hinting strategy 78
5.3 Small pilot study of the new hinting strategy 96
6 Ablation Study 99
6.1 Study design 99
6.1.1 Assessment tools 102
6.2 Study Results 108
6.2.1 Hypothesis 1: The new model was not more accurate than the old model in the course of this study 110
6.2.2 Hypothesis 2: The test we used does not properly assess learning 131
6.2.3 Hypothesis 3: Learning has been obstructed by changes to the agent's interventions 136
6.2.4 Student perceptions of the agent 147
6.2.5 Experimenter questionnaires 149
6.2.6 Other mined student actions 151
6.2.7 Results conclusion 153
7 Conclusions and Future Work 155
7.1 Satisfaction of thesis goals 155
7.2 Future Work 157
7.2.1 Model of student learning 157
7.2.2 Learning gains with Prime Climb 158
7.2.3 Long term projects 158
7.3 Conclusion 159
Bibliography 160
A Test for original model 170
B Pre-test 172
C Post-test for agent conditions 174
D Post-test for no-agent condition 176
E Observation sheet for no-agent condition 178
F Observation sheet for agent conditions 179

List of Tables

3.1 Agent responses to student-requested questions using the "Help" icon 30
3.2 Unsolicited hints provided by the agent Merlin 32
4.1 CPT for factorization node after magnification 40
4.2 Questions on the pre and post test which assess specific skills 46
4.3 Parameter estimates from frequencies in the new model 54
4.4 Average training set accuracy across folds by threshold in the new model 56
4.5 Parameter estimates from cross-validation in the new model 56
4.6 Sensitivity, specificity, and accuracy by test fold in the new model 57
4.7 Sensitivity to parameters in the new model 58
4.8 Accuracy with more extreme parameter values 59
4.9 The influence of prior probabilities in the new model: Area under the curve comparison 62
4.10 Pairwise comparisons of differences between the three prior conditions using a z-score 62
4.11 Accuracy by prior probability setting in the new model 63
4.12 Parameter estimates from click frequencies 66
4.13 Average training set accuracy across folds by threshold 67
4.14 Parameter estimates which maximize the training set accuracy 67
4.15 Testing sensitivity, specificity and accuracy on factorization nodes, by fold 69
4.16 Testing sensitivity, specificity and accuracy on common factor nodes, by fold. Folds 6-8 show no value in the specificity column as none of the data points in these folds had a post-test assessment of unknown, hence specificity (and accuracy) cannot be computed for these folds 70
4.17 Average testing sensitivity, specificity and accuracy on common factor nodes across folds, by threshold 70
4.18 Area under the curve and standard error measures for each of the three models across factorization nodes 72
4.19 Models classified on two dimensions 75
5.1 Hints in the current game version 77
5.2 Optimal thresholds 79
5.3 Model threshold values 83
5.4 Pedagogical hinting strategy for three levels of hints progressing from general to specific for the Andes Physics tutor [80], the Cognitive tutors [29], and the new hinting strategy for Prime Climb 85
5.5 Progression of Common Factor and Factorization hints 86
6.1 Numbers appearing on factorization portion of the pre-test 103
6.2 Questions on student post-questionnaire 105
6.3 Questions on experimenter post-evaluation 106
6.4 Average pre-test score, post-test score, and learning gains by condition 109
6.5 Accuracy on previous and current study, by model - factorization nodes 111
6.6 Accuracy on previous and current study, by model - common factor node 111
6.7 Numbers assessed on post-test in each study 112
6.8 Accuracy during game play and after game play, by model - factorization nodes 119
6.9 Accuracy during game play and after game play, by model - common factor nodes 121
6.10 Agent accuracy, by model - factorization nodes 125
6.11 Confusion matrices for the old model and new model across factorization nodes 126
6.12 Agent accuracy, by model - common factor node 128
6.13 Confusion matrices for the old model and new model across the common factor node 129
6.14 Average pre-test score, post-test score, and learning gains by condition replicating the test in [28] 131
6.15 Average pre-test score, post-test score, and learning gains by condition when controlling for fatigue 133
6.16 Percentage of students that did not know the factorization of test numbers 135
6.17 Average number of times that each test number was encountered 137
6.18 Average number of total hints, common factors hints and factors hints given per student 139
6.19 Average number of total hints, common factors hints and factors hints given per student after removing two students with outlying behaviour 140
6.20 Average number of moves and hints per student per mountain 141
6.21 Average time spent reading hints each time through the cycle 143
6.22 Average time spent reading each hint 144
6.23 Comparison between old hinting strategy hints and new hinting strategy hints on the percentage of each hint type given, percent followed by correct action, and correlation of that hint with learning 145
6.24 Number of students in the no agent group that responded a) or b) to the question "If you played Prime Climb again, would you rather play". Results are shown for both the study discussed here and that of Conati and Zhao [28] 147
6.25 Average responses to student questions 148
6.26 Significant Pearson correlations between student-answered questions about Merlin 148
6.27 Average responses to experimenter questions 150
6.28 Magnifying glass use 151
6.29 Requested hints 152

List of Figures

3.1 A correct move 27
3.2 An incorrect move and fall 28
3.3 Recovery from a fall with a correct move 29
3.4 Factor tree in the PDA 31
3.5 Help dialogue box 33
3.6 The Prime Climb interface with pedagogical agent 33
4.1 Factorization nodes 36
4.2 Short-term student model: factorization nodes 37
4.3 Short-term student model: knowledge nodes 38
4.4 Dependencies between evidence and factorization nodes 39
4.5 Dependencies between two time slices in the original short-term student model 40
4.6 Roll-up of a non-root node in the original short-term student model 42
4.7 Roll-up from long-term to short-term student model 43
4.8 Relationship between the click and factorization nodes in the diagnostic direction 48
4.9 Dependency between click and factorization nodes in the causal direction 50
4.10 Roll-up in the new model 51
4.11 An ROC curve comparison of the influence of prior probabilities in the new model 60
4.12 Click configuration with Common Factor node 65
4.13 Comparisons of the three models using ROC curves 71
4.14 The original model including a common factor node 74
4.15 ROC curve for the alternative configuration of the original model with common factor node. Sensitivity and specificity are computed across the 10 factorization nodes 74
5.1 Hinting algorithm A 80
5.2 Hinting algorithm B 82
5.3 A level 1 hint being spoken by Merlin and shown in a speech bubble 86
5.4 Merlin speaking level 2 hints with corresponding examples 87
5.5 Agent hint with dialogue box asking if the student would like more hints 90
5.6 Subroutine to hint on common factors 91
5.7 Subroutine to hint on number factorization 92
5.8 a) Common factor stream of hints, b) factors stream of hints cycle 1, c) factors stream of hints cycle 2 93
6.1 Experimental Setup 100
6.2 ROC curves for accuracy after game play on factorization nodes 114
6.3 ROC curves for accuracy after game play on common factor node 115
6.4 Algorithm for accuracy during game play - factorization nodes 117
6.5 Algorithm for accuracy during game play - common factor node 123

Acknowledgements

A lot of people have helped me in writing this thesis.
I want to start with my family - mom, dad, Teresa and Mike, who have always supported me in everything that I chose to do, and who often proudly commend my achievements, even if they don't always understand them. I'd like to thank my supervisor Cristina Conati for helping me find a project that combined computer science with my love of teaching, for continued guidance throughout the project, for consoling me through the occasional breakdown in her office, and for pushing me to complete a polished final product. I'd like to thank the members of the UM-AI reading group for providing me with valuable discussion. In particular I want to thank Heather Maclaren, who assisted me with the user studies, helped me to understand the Prime Climb code, and often acted as my sounding board. I also want to thank my second reader Giuseppe Carenini for his valuable feedback. His prompt reading of my thesis during one of his busiest times was greatly appreciated. I'd like to thank all those people who put up with my complaints during thesis writing; over dinner at the Orphanage (Scott, Dima, Dan, Dave, Dustin), at Koerners (Sarah, Jeanette, Jim), on the ulti field (Scott, Dustin, Greg, Sarah, Asher, Dima), and over the coffee machine (Dima, Sarah, Brian, Andrew, Kasia, and Andrea). I'd like to thank the fantastic CS department at UBC that gave me the chance to participate so fully in department affairs and really made me feel that I belonged. In particular I want to both apologize to, and thank, the friends that encouraged me when I was most discouraged, and helped me to see the light at the end of the tunnel. Sarah M, a fantastic cousin and friend, whom I admire so much for her work ethic and her determination. Sarah C and Greg, my office mates who both provided their fair share of pep talks.
I can't think of a single time that I walked into the office or phoned when they didn't drop everything to hear me out and put me back on track (and I can't think of many times from August to December that I didn't call needing to talk). Thanks also to Sarah C for all the running around she did to help me get this thesis submitted while I was 4 provinces away - most people think it's enough of a burden to do it one time, but Sarah cheerfully submitted herself to it a second time. And finally, I could not have made it through this degree without Dustin. While writing this thesis I've lost count of the number of times that he dropped his own work to find a fix to my computer glitch, that he listened as I talked through a long-winded explanation of something just so I could get it straight in my own head, that he waited patiently because I was so preoccupied that I hadn't finished a sentence, that I broke our plans so I could work and he acknowledged my decision by fixing me a drink and making dinner, that he slipped an encouraging word into the discussion about the work that I'd done, or that he muffled my frustrated tears with a big hug. But mostly I credit him because he never pushed me; it was his gentle encouragement, when I'd given up on this degree, that got me to open that dreaded thesis again and give it another try. I couldn't have accomplished this without him.

Chapter 1

Introduction

As our society becomes more and more technologically savvy, kids turn to computer games for their entertainment. It's not uncommon for kids to spend hours a day in front of the computer playing games [53]. Why? Because they're fun! What could be more attractive for educators than to capitalize on the captivating nature of computer games to teach academic concepts and supplement classroom activities?
Several authors have suggested that video and computer games have potential as educational tools [51, 69], and many educational games are already in use in classrooms [53, 63]. However, although educational games are motivating, only a few studies indicate that they promote learning [63, 11]. In fact, it is often the case that students do not learn when playing educational games unless these games are coupled with additional supporting activities [44, 63, 49, 53]. However, students are often not under the direct supervision of a teacher when playing educational games in school, as computer resources are often limited to one or two computers per classroom [53], and students must use the computers sequentially. This makes it difficult for a classroom teacher to support students during game play and encourage them to think carefully about their actions. Although games are highly motivating - a property that is extremely desirable in any educational activity - the fast-paced game environment can work against learning as students rush to complete levels and don't pause to engage in proactive thinking about the underlying instructional material. This is further perpetuated by the fact that students are also often reluctant to ask for help when playing a game or working at the computer [44, 23, 2]. Even though an educational game's rules are based on instructional content, it is often possible for students to play a game successfully without reasoning about the underlying domain knowledge that the game is supposed to teach [24]. Conati and Lehman [24] show results indicating that students will often use superficial heuristics to progress in a game, rather than reasoning. Baker et al. [8] coin the term "gaming the system", although not specifically for educational games, to describe behaviour in which students advance by systematically taking advantage of regularities in the software's feedback and help.
Many students will play in this way unless expressly guided to think about their actions by an instructor or other educational agent. All of these factors point to a problem: although students find educational games very engaging, unless game play is supported, they do not learn. We argue that computer-provided individualized support based on careful assessment of student learning during game play can help overcome this limitation and make educational games a truly effective form of learning. The testbed for our research is the educational game Prime Climb, a computer game which aims to teach 6th and 7th grade students about number factorization. We propose adding to Prime Climb an emotionally intelligent pedagogical agent that uses a student's game actions to track her evolving knowledge, and uses this as a basis for generating timely, tailored interventions to trigger constructive reasoning. Providing this support during game play can be extremely challenging, however, because it requires careful tradeoffs between fostering learning and maintaining a positive affective state. An agent which intervenes too often or with the wrong advice will frustrate the student and jeopardize the motivating advantage of using a game rather than a traditional tutoring system to convey the instructional content. Thus, it is crucial for the agent to have accurate models of both student learning and affect. In practice, creating such models is hard. It requires understanding cognitive and affective processes on which there is very little knowledge, given the relative novelty of games as educational tools. Unless we interrupt the student to ask her what she knows and how she is feeling (which is unreliable, interrupts game play and would likely change the emotional state of the student), we must infer these states from her game actions alone.
Using only game actions to infer the knowledge and emotional state which brought about these actions is a process fraught with uncertainty. Thus, it is difficult to create an accurate student model, yet important if we want to use the model to generate interventions which facilitate student learning. The long-term goal of this project is to add a decision-theoretic agent to the Prime Climb educational game which uses both a model of student learning and a model of student emotional states to determine its interventions during game play. The agent would use the two models to reason about the predicted effect on learning and affect for each of its possible actions, and select an action that optimizes the balance between learning and engagement. These actions may be providing various forms of help, offering encouragement, or no intervention at all. In this thesis we take one step towards this goal by tackling the problem of modeling a student's learning when s/he is playing with Prime Climb. A model of student affect for the Prime Climb educational game is described in [25]. The pedagogical agent referred to in this thesis uses only the model of student learning at present [28], with a future goal of combining this model with a model of student affect using a decision-theoretic approach as described above. In the remainder of this thesis we will use the term model to mean "model of student learning".

1.1 Thesis goals

The goals for this thesis are as follows:

1. Use data from a user study to refine an initial model of student learning in Prime Climb built from expert knowledge [28]. Assess empirically the improvement in accuracy brought about by these changes.

2. Investigate the role that the various parameters and prior probabilities play in the model.

3. Assess the role that the student model plays in the learning outcomes achieved by the pedagogical agent's interventions.

4.
Assess the role the model plays in students' assessment of the pedagogical agent, and suggest improvements to the agent's interventions.

1.2 Contributions of this research

This thesis contributes a further step toward providing intelligent computer-based support for learning with educational games. In particular we offer the following contributions:

• We describe the design and evaluation of a student model to assess student learning during interaction with Prime Climb, an educational game for number factorization. Although much research has been devoted to creating student models for various types of computer-based support, little work exists on student modeling for educational computer games, a relatively new type of pedagogical interaction.

• We present a method for learning parameters from log file data when some of the variables in the model are unobservable, and for assessing a model's accuracy from post-test data.

• We improve the accuracy of the original model from 50.8 percent to as high as 82.8 percent with ideal parameter and prior probability settings.

• We present an ablation study to determine the role of the model in the student's interaction with the agent in an educational game, and provide suggestions for improving the effectiveness of the agent and student acceptance of its interventions.

1.3 Outline

We begin by presenting related work in the areas of educational games, student modeling, evaluation of student models and pedagogical agents in chapter 2. In chapter 3 we introduce the educational game that was used in this research - Prime Climb. In chapter 4 we present the student model for Prime Climb. First, in section 4.1 we describe the original model of student learning and present study results which test its effectiveness.
We then describe data-driven modifications made to this model to improve its accuracy, including changes to the dependency structure (section 4.2) and the addition of the modeling of common factor knowledge (section 4.3). Accuracy and sensitivity to model parameters are discussed. In chapter 5 we describe the new hinting algorithm developed for the agent to take into account the new model, which models common factor knowledge. In chapter 6 we present results from an ablation study in which the original and new models are compared using the new hinting algorithm. Finally we conclude, review the achievement of the thesis goals, and offer suggestions for future work.

Chapter 2

Related Work

2.1 The computer as an educational tool

2.1.1 Intelligent tutoring systems

Using computers to teach is not new - since the 1970s computers have been used by educators to provide one-on-one instruction [71]. However, early systems were not capable of adapting to different learners. The advent of Intelligent Tutoring Systems (ITSs) was seen as the solution to this problem. An ITS is a system that engages in computer-aided instruction and in addition possesses (i) knowledge of the domain (expert model), (ii) knowledge of the learner (student model), and (iii) knowledge of teaching strategies (tutor) [71]. These elements make the system able to adapt to different learners and provide the one-to-one instruction that has been shown to be the most effective way of teaching when administered by human tutors [17]. The most common forms of adaptation are selecting exercises appropriate to the student's level and offering hints and remediation when the student needs help with the learning task. Many ITSs are now regularly used in schools [61, 77, 60].

2.1.2 Educational computer games

A new direction for the ITS community is educational games [69, 53, 63, 54]. Roblyer, Edwards and Havrilik [65] state that a classroom without games would be a very boring classroom indeed.
Studies show that students find games more engaging than regular classroom instruction [63], and games can be used to motivate learners that might not otherwise be motivated to learn [47, 63, 48]. With new game engines such as UnReal [76] and BioWare [30], creating exciting games similar to those played by kids on their own time is now even easier. However, the challenge remains to design games that actually help students to learn. As was mentioned earlier, students often do not learn from educational games unless they are supported with additional activities [44, 63, 49, 53]. Many educational games have been developed to teach a wide range of subjects, including mathematics (e.g. [45, 74]), programming (e.g. [36]), physics (e.g. [9, 19]), language (e.g. [41]) and business (e.g. [37, 50]), as well as other games, such as Civilization [21], that were not initially developed as educational games but are being used for educational purposes [73]. Despite the increasing number of educational games being developed, there are relatively few examples of systems which combine the intelligent features of an ITS with the game environment, namely by maintaining a student model and adapting the structure and support the game offers based on this model. Beck et al. [10] note that user modeling in games is more difficult than in ITSs because the domain is not as well structured and analyzed. How the West was Won (WEST) [18] was one of the first educational games developed which provided adaptive support. WEST mimicked a board game in which students created arithmetic expressions out of numbers and advanced the equivalent number of spaces on the game board. The student model comprised counts of the number of times the student's choice of arithmetic expression matched or did not match an optimal one chosen by the system. Tutorial interventions were provided based on these perceived gaps in the student's knowledge.
However, WEST was never deployed beyond a prototype and its student model was quite simple. MITO [56] is a game which aims to teach students about Spanish orthography. Students solve exercises in different modules, and correct responses are rewarded with points and the movement of a character along a path to the goal. The student model, which is quite simple, keeps track of the number of correctly solved exercises in each module. It is used to determine when to provide more difficult questions and whether to offer help. Stacey et al. [74] have incorporated a student model into a series of games which teach concepts about decimal numbers. The student model, which is described in detail in the next section, is used to decide the sequencing of game items and when to present help. Adaptive support is also offered in some simulation games. KMQuest [50] is a simulation game for the knowledge management field. Groups of students work as members of a fictitious company and must respond to events that occur by making business decisions which affect business indicators such as 'the level of competence of marketing employees' or 'the number of patents pending'. These business indicators act as a student model, and feedback is provided to students based on the indicator values. Johnson et al. [41] have incorporated a student model into the Tactical Language Tutor. In the Mission Practice Environment students practice basic communication phrases in a foreign language by asking questions of characters that help them to accomplish a mission. A virtual character acts as an aide to the student and provides adaptive assistance based on the model's assessment of the student's mastery of the language. The student model is built based on the student's progress in the simulation environment as well as through exercises that the student completes prior to beginning the simulation. Tan et al.
propose an ecosystem adventure game [75] in which students query characters in a simulation environment in order to uncover the cause of a series of environmental problems. The student model, which incorporates past problems solved and aspects of the current problem yet to be solved, will be used to determine the level of help that will be given by characters in the game. However, this game has not yet been developed. Conati and Zhao [28] introduced a pedagogical agent to the educational game Prime Climb, which teaches students about number factorization. The agent provides adaptive support during game play based on a model of student learning. We extend this work to improve the model which the Prime Climb pedagogical agent uses to provide its interventions.

2.2 Student Modeling using Bayesian Networks

A student model is a fundamental component of any ITS [78]. The student model stores information on the student's cognitive and/or affective states, and it is this information that is used to provide adaptive support. The amount and quality of information available to the student model is termed bandwidth [78]. The higher the bandwidth (more information), the more accurate and fine-grained the model's assessment can be. The problem with assessing students' knowledge and affective states is that it is difficult to achieve high bandwidth without being obtrusive. We can acquire low-bandwidth information by simply observing students' actions, but higher bandwidth requires students to be more explicit about their solutions and feelings. The tradeoff between bandwidth and obtrusiveness is even more acute in educational games, where we wish to avoid interrupting game play as much as possible. In Prime Climb we take a low-bandwidth approach and do not require students to explicitly show their reasoning. In order to manage the resulting uncertainty in a principled way, we use Dynamic Bayesian Networks.
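The core inference this requires can be illustrated with a minimal single-time-slice sketch: a hidden "student knows the factorization" variable updated from one observed climbing move via Bayes' rule. All variable names and probability values below are illustrative assumptions, not the actual Prime Climb model or its parameters.

```python
# Minimal sketch: infer a hidden knowledge state from one observed action
# using Bayes' rule. All probabilities are illustrative assumptions, not
# the actual Prime Climb model parameters.

p_knows = 0.5                        # prior P(student knows factorization)
p_correct_move = {True: 0.9,         # P(correct move | knows)
                  False: 0.2}        # P(correct move | doesn't know), e.g. luck

def posterior_knows(move_correct: bool) -> float:
    """P(knows | observed move), computed by enumeration."""
    joint = {}
    for knows in (True, False):
        prior = p_knows if knows else 1.0 - p_knows
        likelihood = (p_correct_move[knows] if move_correct
                      else 1.0 - p_correct_move[knows])
        joint[knows] = prior * likelihood
    return joint[True] / (joint[True] + joint[False])

print(round(posterior_knows(True), 3))   # 0.818: a correct move raises belief
print(round(posterior_knows(False), 3))  # 0.111: an incorrect move lowers it
```

A dynamic model repeats this update at every time slice, carrying the posterior forward as the next slice's prior, which is how beliefs can evolve as the student plays.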
Bayesian Networks [33, 62] are a graph-based framework for reasoning under uncertainty in which nodes represent random variables and arcs represent direct probabilistic dependencies among them. In user modeling, nodes in a Bayesian network usually represent unobservable user states or observable actions. In our setting, a Bayesian Network is used by fixing the values for the observed states, and using a standard reasoning algorithm to arrive at probabilities for the unobserved states. The three steps involved in building a Bayesian Network are to

1. identify the important variables in the process to be modeled (represented by nodes) and the possible values that they can take on,

2. identify and represent the relationships between the variables in a network structure (the dependency arcs), and

3. parameterize the network by determining the conditional probability table for each node [34].

In user modeling, the first step of identifying the cognitive or affective states that will make up each of the nodes is traditionally done by an expert. The network structure and conditional probability tables are either determined by an expert's intuition (e.g. [27, 57]), learned from data (e.g. [52, 6]), or a combination of both (e.g. [74, 59, 25, 85]).

2.2.1 Models designed using expert intuition

HYDRIVE [57] is a tutoring and assessment system for developing troubleshooting skills for an aircraft's hydraulics system. As the student attempts to uncover a problem with the aircraft, the tutoring system applies rules to her actions in order to classify them into action sequences (e.g. serial elimination of problem causes). The system uses a Bayesian network to reason about the student's knowledge based on these observed action sequences. Nodes in the network represent the different action sequences, as well as aircraft system knowledge, procedure knowledge, and strategy knowledge.
The dependency arcs between the nodes reflect that observing a particular action sequence is dependent on the student's system, procedure, and strategy knowledge. The values in the conditional probability tables were initially set using input from a number of expert instructors and a task analysis methodology in which technicians solve a problem mentally and describe the reasons for their actions at each step. Modifications were made to the initial conditional probability tables based on reasonableness checks. The system uses the model's assessments to provide feedback to the student as she progresses with the tutor.

In the ANDES [27] physics tutor, students solve physics problems and also study worked-out examples. A Bayesian network tracks their long-term progress, and also their current solution path (the sequence of rules that can be applied in order to solve the current problem). Nodes in the network are either (a) domain-general, representing long-term knowledge of physics rules, or (b) problem-specific, representing facts, goals, rule applications, and strategies relevant to the current problem or example. The prior probabilities for the long-term knowledge are either set by population pre-test data or represent subjective estimates of the designers. These probabilities are updated after each problem that is solved or example that is studied. For the problem-specific portion, the Bayesian network encodes all possible solution paths and the student's progress through the solution thus far. The conditional probability tables for the network are set using the intuition that a student will apply a rule if she has knowledge of the rule, a goal of applying the rule, and if all of the facts necessary for the rule application are present in the current problem. However, in this situation there is also a non-zero probability that she will not apply the rule due to some noise, a set-up referred to as a Noisy-AND.
A second class of conditional probability table in the network encodes that a student has a particular goal if she is observed applying at least one of the rules necessary for achieving that goal. However, there is a non-zero probability that she has the goal even if she applied none of the rules relevant for that goal, a set-up referred to as a Leaky-OR. The student model in ANDES is used to signal errors the student makes while completing problems and to provide mini-lessons on rules for which the student has low mastery. The goal recognition capabilities of the network enable the system to react to a student's problem-solving impasse and provide procedural help. Additionally, during example-studying, the system prompts the student to self-explain (clarify and elaborate on new information contained in the example) if the model indicates that she is not doing so [26].

2.2.2 Models learned from data

CAPIT [52] is a constraint-based tutor designed to teach capitalization and punctuation. Students must find and correct the capitalization and punctuation errors in each problem. The tutor uses a Bayesian Network to represent its long-term student model. The nodes in the model represent the problems, constraints about capitalization that may be violated or satisfied on each attempt at a problem, and feedback given to the student about each constraint. Whether a student will satisfy a constraint depends on the problem that is being attempted, which of the constraints she satisfied last time, and whether she was given feedback on that constraint. Rather than connecting every node to every other node, which would be very computationally expensive, the authors used a mutual entropy maximization algorithm to determine which links would be included in the structure.
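The Noisy-AND and Leaky-OR set-ups described above for ANDES can be written as simple rules for filling conditional probability table entries. This is one common reading of those terms, with illustrative noise and leak values rather than ANDES's actual parameters:

```python
def noisy_and(parents, noise=0.1):
    """P(child = True | parent values). The child (e.g. 'rule applied') fires
    only when every precondition holds, and even then a slip ('noise')
    may prevent it from firing."""
    return (1.0 - noise) if all(parents) else 0.0

def leaky_or(parents, leak=0.1):
    """P(child = True | parent values). Any true parent (e.g. an observed
    rule application) switches the child (the goal) on; the 'leak' lets
    the goal be present even when no relevant rule was applied."""
    return 1.0 if any(parents) else leak

print(noisy_and([True, True, True]))   # 0.9: knowledge, goal, and facts present
print(leaky_or([False, False]))        # 0.1: the leak probability
```

With noise = 0 the Noisy-AND reduces to an ordinary deterministic AND, and with leak = 0 the Leaky-OR reduces to an ordinary OR.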
Using data from log files of 3300 student actions previously collected, this algorithm learns a network structure by removing unnecessary links, keeping only links between nodes which share a minimum amount of information in the data set. The authors investigated six different network structures, characterized by the strength of information required between nodes in order to connect them, and by whether prior information was always included (i.e., whether the node representing the previous attempt at a constraint was always connected to the node for the current attempt at the constraint). The conditional probability tables for the nodes were set using frequencies obtained in the data set. For each of the candidate models, the structure and conditional probability tables were learned using 80% of the data. The remaining 20% of the data was used to test the models. The model structure that was most accurate at predicting student actions in the test set was selected. The student model in CAPIT is used to select the next problem that the student will encounter, by predicting the student's performance on each future problem and selecting a problem which falls in the zone of proximal development (not too hard but not too easy). The model is also used to select error messages, by pretending that feedback was given and seeing how this changes the resulting network, in order to select the error message which maximizes learning.

Arroyo and Woolf use a Bayesian network to model the attitudes of students that use Wayang Outpost, a multimedia tutoring system for high school mathematics [6]. Nodes in the network represent attitudes, such as wanting to challenge oneself or not wanting to enter a wrong answer, and evidence of student actions, such as the average number of hints used per problem or the time between question attempts. Relations between the nodes imply that interface actions are the direct result of a student's attitudes.
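Keeping only links between variables that share enough information, as in the CAPIT structure search above, can be approximated with pairwise mutual information estimated from data. The threshold and the exhaustive pairwise scan below are illustrative simplifications, not the published algorithm:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in bits from paired observations of two variables."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_links(names, rows, threshold=0.05):
    """Keep a link between two variables only if their estimated mutual
    information in the data exceeds the threshold."""
    links = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            xi = [r[i] for r in rows]
            xj = [r[j] for r in rows]
            if mutual_information(xi, xj) > threshold:
                links.append((names[i], names[j]))
    return links

# A and B always agree (1 bit of shared information); C is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(select_links(["A", "B", "C"], rows))  # [('A', 'B')]
```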
The network structure was learned using data from a study involving 230 students. The data consisted of log files containing information such as the number of problems seen and the time spent on each, a pre-test and post-test on mathematics, and a survey on attitudes and motivation. The structure was determined by linking any two nodes which had a significant correlation in the data, eliminating correlation links among evidence nodes, and removing any links that created cycles (leaving in the links with the higher correlation strength). The role of different nodes in the network was investigated by leaving out each node and calculating the subsequent decrease in the accuracy of the model at predicting students' attitudes. The conditional probability tables for each of the nodes were parameterized using frequencies observed in the data set, as all nodes were observable in the data. Currently the student model is not being used to provide support during interaction with the tutor. There are also many examples of student models learned from data which do not use a Bayesian network, including [12, 15, 13, 43, 81, 31].

2.2.3 Models which combine expert intuition and data learning

There are advantages and disadvantages to the methods presented thus far for building models. Models built by experts have the advantage of being grounded in theory and thus more intuitive; however, these models are very time-consuming to build and require expert intervention. On the other hand, models learned from data can be less tedious to build, but they require a great deal of data, and the resulting network may not have a structure that makes sense intuitively. Thus, more recently there has been interest in building models which combine expert intuition with data learning.

Stacey et al. [74] use a Bayesian network to diagnose students' misconceptions as they work their way through a series of games about decimal numbers.
The basis for the student model is a common decimal test, the DCT, in which a student's answers on different types of questions map to misconceptions they may have about decimal numbers. In the network, nodes represent a student's answers on different types of questions, a coarse classification of her misconception, a fine classification of her misconception, and evidence from particular games, such as how many attempts the student took at a question. In [59], the authors compared networks built for this system which varied in the degree of expert design and data learning. In the purely expert-designed model, nodes were linked by intuition; responses on question types and game actions were made dependent on the student's coarse and fine misconception. In this network the conditional probability tables were parameterized using the DCT test's mapping of questions to misconceptions as a guide, and experts' estimates of how often students make a slip when answering test questions. In a second model, the expert-designed structure was used, but the parameters were learned from frequencies observed in data, collected from over 2500 students, of each student's test responses and an expert's classification of her misconception. A third model used the same data to learn both the structure (using the Causal MML algorithm [32]) and the parameters in the conditional probability tables. Each of the three networks and other variants were assessed by (a) comparing the model's assessment of each student's coarse and fine misconception with the assessment by the standard DCT test, and (b) using the model to predict a student's response on a test question in the data set after the model has seen evidence of the student's responses on other test questions. The expert-designed model with learned frequencies was the closest match to the DCT test. The three models exhibited similar prediction accuracies.
In the games, the student model is used to pick the next type of question the student will encounter, with the strategy of selecting questions that the student will get correct at the beginning of the interaction and increasing in difficulty, with preference given to questions that will disambiguate between conflicting misconceptions [74]. In addition, the network is used to determine when to provide help and when to move on to the next game.

A Bayesian network is used to assess students' emotions as they play the educational game Prime Climb [85, 25]. The network was designed based on an expert theory, then modified using data. The OCC theory of emotions states that emotions are caused by one's cognitive appraisal of a situation and how it fits with one's goals. The network contains nodes for different emotional states, goals, and perceived satisfaction of these goals, as well as nodes for personality traits, observed game actions, and help interventions. Links were added between nodes by the intuition that (a) personality traits cause students to have certain goals, (b) goals, help interventions, and game actions affect the student's perceived satisfaction of her goals, and (c) perceived goal satisfaction causes specific emotions according to the OCC theory. The structure was then refined using data from several studies which included personality and goal questionnaires and log files of actions taken during game play. A structure was selected which maximized the log likelihood (a measure of how well a structure fits the data) and maintained links between nodes which had significant correlations in the data. The conditional probability tables were parameterized using frequencies found in the data and smoothing techniques to account for the sparse data.
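One standard way to combine observed frequencies with smoothing for sparse data, as described above, is additive (Laplace) smoothing. The cited work does not specify its exact smoothing technique, so this is only one plausible option:

```python
def smoothed_distribution(counts, values, alpha=1.0):
    """Estimate P(value) for one conditional probability table row from raw
    counts. Adding a pseudo-count 'alpha' to every value keeps rarely (or
    never) observed outcomes from receiving probability zero."""
    total = sum(counts.get(v, 0) for v in values) + alpha * len(values)
    return {v: (counts.get(v, 0) + alpha) / total for v in values}

# Three 'satisfied' and one 'violated' observation for some parent configuration:
row = smoothed_distribution({"satisfied": 3, "violated": 1},
                            ["satisfied", "violated"])
print(row)  # roughly {'satisfied': 0.667, 'violated': 0.333}
```

A parent configuration never seen in the data yields a uniform distribution, which is exactly the fallback behaviour one wants when the data are sparse.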
The model of students' emotions is not currently in use in Prime Climb, although the authors plan to use it in conjunction with a model of student learning to determine when to intervene to provide help to students.

2.2.4 Our approach

One of the contributions of this thesis is the use of data to refine the structure and parameterize a model of student learning for the educational game Prime Climb. The model was initially designed from expert intuition. The work by Nicholson et al. [59] is similar in that the Bayesian network student model for their educational game is learned from data. However, there is one significant difference; the data for network learning in [59] come from students' performance on a traditional test to detect decimal number misconceptions. When parameters like the probability of an error of distraction (slip) or a lucky guess are learned from these data, their values do not reflect student performance in actual game playing. Thus, there is an underlying assumption of a strong similarity between test and game performance. This assumption is not always justified; several studies have shown that students can be successful game players by learning superficial heuristics rather than by reasoning about the underlying domain knowledge. Furthermore, students may make more slips than during formal testing, because they get distracted by the game aspect of the interaction. Thus, game performance is likely to be a less reliable reflection of student knowledge than traditional tests. As we will see in a later section, in the work presented in this thesis we learn the model from data coming from actual interaction with Prime Climb, and we report model accuracy in assessing students' knowledge from actual game playing. Thus, the parameters in our model provide us with insights on how students learn and interact with this type of educational system, in itself a contribution given the relative lack of understanding of these mechanisms. Furthermore, in our model some of the variables are unobservable, so although we learn some of the parameters from frequencies, as done in [52], [6], and [59], we also use an alternative method based on cross-validation.

2.3 Evaluation

2.3.1 Evaluation of ITS and Educational Games

Evaluation is becoming increasingly common in the ITS community as researchers realize the need to justify claims that their software has educational benefit. Most studies are concerned with learning gains, commonly assessed either from pre- and post-tests or from performance during interaction with the system and its fit to a power curve for learning. Other measures include learning efficiency (how long it takes to learn a predefined body of material), user attitudes towards the software, and evaluation of how the system is used in practice. There are many different methodologies for evaluating ITS software, including comparing the ITS to regular classroom instruction (e.g. [70, 46, 55, 1, 44, 11]), varying software features within the ITS and comparing it to itself (e.g. [3, 5, 26, 28, 44]), comparing the ITS with an ablated version which lacks a particular feature (e.g. [68, 14, 20]), and use in real contexts (e.g. [48, 83]). Many of these evaluation methods have also been used for educational games [63].

Comparing the game to regular instruction

Klawe [44] describes an evaluation of the mathematics adventure game Phoenix Quest conducted with five grade 5 and 6 classes. Three of the classes played the game and also worked through supporting activities, one class played the game and received lectures on the material covered in the game, and one class received only lectures. All students wrote a pre-test and post-test on mathematics concepts that the game and lectures aimed to teach. Researchers found that only the group of students that played with the game and worked through materials directly related to game play improved significantly from pre-test to post-test. The group that received lectures and played the game did not improve at all, and the group that did not play the game actually did worse on the post-test. Klawe concludes that game play helps students to learn, but only if it is supported with additional activities which link concepts learned in the game to those studied in class.

Students learning conversational Arabic with the Tactical Language Tutor participated in a study to assess whether playing a simulation game enhanced language learning [11]. 21 adults were randomly assigned to one of two conditions: ITS only, or ITS and simulation game. All students worked with the system for six hours, after which they wrote a post-test on Arabic. Researchers found that although students liked playing the game more, students in the ITS-only condition did better on the post-test than students that played the game and worked with the ITS. They hypothesize that this is because students playing the simulation game used cheat sheets to help them with the game, which hindered their learning of the material [40].

Comparing the game to a varied or ablated version

A study is described in [44] in which 116 grade 6 students played with different versions of the mathematics game Super Tangrams, in order to assess what aspects of the game helped them to learn transformational geometry. In the 3x2 study, students were randomly assigned to play one of three different versions of Super Tangrams with different levels of scaffolding, with an interface that either did or did not have additional embellishments such as graphics and music.
Students wrote a pre-test and post-test on transformational geometry and also filled out a post-questionnaire on their attitudes towards the game. Researchers found that students in the condition with the most scaffolding learned significantly more than those in the other two groups.

Christoph et al. [20] performed an ablation study to determine what effect the addition of a task model had on students' acquisition of knowledge in a simulation game for knowledge management, KM Quest. The 46 students that participated in the study were randomly assigned to one of two conditions: those that played a standard version of the game, and those that played the game with an additional task model which they could use to guide their decision-making during game play. All students wrote a pre-test and a post-test which assessed their declarative and procedural knowledge, and played the simulation game for approximately 8 hours. Researchers found that although there was a significant effect of learning for both declarative and procedural knowledge (students in both conditions improved from pre-test to post-test), there were no significant differences between the group that had access to the task model and the group that did not.

Evaluating game use in real contexts

Klawe [44] describes an ethnographic study in which over 10,000 students were observed interacting with computer games at a science museum over a period of two months. One finding of these observations is that most students do not make connections between the skills that they learn while playing the game and the same skills that they learn in school. The researchers also noted that there is a gender difference in game preferences; girls prefer games with story lines and characters, whereas boys prefer action or adventure games.

Lee et al. [48] investigated classroom culture and adoption of software in a study of educational game use in context.
63 grade two students were given hand-held game consoles on which to play the game Skills Arena, a game in which students do addition and subtraction drills in order to acquire points. Researchers observed students in the classroom as they played the game, and also collected statistics on the number of questions attempted, the number of questions correctly solved, and the time spent playing. They found that students were highly motivated to play the game, and completed three times more questions on average during a regular classroom period than they did with pencil-and-paper exercises. They also found that students explored the software and incorporated the use of other game features that had not previously been introduced to them. Researchers observed cooperation amongst students and creative story-telling brought about by game use. However, the study did not investigate whether the use of Skills Arena improved students' addition and subtraction skills.

Squire [72] describes three case studies, carried out with a high school class, a camp, and an after-school club, investigating the potential for a computer simulation game to support learning in the social sciences. In each study, students played with Civilization III [21], a simulation game which incorporates elements of geography, history and economics, for approximately 20 hours. Researchers carried out discussions and interviews with the students, and observed them as they played the game. Initially they found that students were confused by the game, and suggested that in future incarnations of the program instructors should start each day with a lecture and discussion to get students thinking about what they will be learning. Squire points out that although the students did ask more questions and show an increased interest in social sciences during the program, they still needed to be encouraged to seek out answers to their questions and to use resources to solidify their knowledge.
Squire advocates the use of computer games in the classroom, but stresses that game play should be supported by teacher-mediated interventions which aim to get students to formulate and test hypotheses about the material in the game.

2.3.2 Evaluation of student models

When evaluating an ITS or educational game which provides adaptive support to the student, it is important to evaluate not only the complete system, but also to investigate what role the student model plays in the learning gains that are observed. A student model can be evaluated in two ways [22]:

• Directly: by assessing model accuracy at predicting student responses or test scores, or
• Indirectly: via an ablation study in which the system with a student model is compared to the same system with a random or no student model.

Direct assessment of model accuracy

Weibelzahl and Weber suggest that the two methods for assessing the accuracy of a user model are (a) to compare the model's assessment of the user's hidden states to the assessment of these same states by an external test (or expert), and (b) to use the model to predict the user's behaviour and compare the prediction to the behaviour actually displayed by the user [82]. Although Weibelzahl and Weber were speaking generally of user models, their classification is applicable to student modeling as well. We broaden their use of the term user, however, to describe both students and simulated students.

VanLehn and Niu [79] use the method of comparing to an external test with simulated students to assess the accuracy of the Bayesian student model for the ANDES physics tutor. Students were simulated by randomly deleting rules from the knowledge base and using this reduced knowledge base to solve problems. Recordings of the solver's actions while solving a problem make up one simulated student.
Sensitivity analysis was performed by modifying parameter values and structural features of the model assessor and determining the resulting effect on the model's accuracy at assessing which rules the "student" had mastered. In this case, the model's assessment is compared to the true state of the reduced knowledge base. Although this analysis is useful for investigating the role of different aspects of the model, it does not give us an indication of the accuracy the model would have at predicting real students' physics mastery.

Arroyo and Woolf [6] assess the accuracy of their model at determining students' attitudes given log files of interaction with the ITS Wayang Outpost and pre-test and post-test scores. Using 10-fold cross-validation, they train their model with data from student interactions with the system, pre- and post-tests on mathematics, and attitude questionnaires, and report the test-set accuracy at predicting the student attitudes reported on the questionnaires. However, this model is not currently being used in the Wayang Outpost system to provide adaptive support.

Nicholson et al. [59] evaluate their student model, which assesses student misconceptions about decimal numbers, in several different ways. In the case-based evaluation, experts were given the opportunity to interact with the system, simulate different student responses, and investigate the resulting posterior probabilities of misconception classification. In the comparison evaluation, the model's predictions of which misconception a student had were compared with expert assessments of the same students. In the prediction evaluation, the model was used to predict how a student would respond to a particular question. For each class of problem, the model was shown evidence of five student responses, and used to predict the sixth.
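The k-fold cross-validation used by Arroyo and Woolf follows a generic pattern: hold each fold out once, fit the model on the remaining folds, and average the held-out accuracy. A schematic version, not their implementation:

```python
def k_fold_accuracy(data, k, fit, predict):
    """data: list of (features, label) pairs. Each fold is held out once as
    a test set while the model is fit on the remaining k-1 folds; the mean
    held-out prediction accuracy is returned."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = fit(train)
        hits = sum(predict(model, x) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / k

# Toy check: a constant-label dataset with a trivial "model" that just
# memorizes the first training label is predicted perfectly.
data = [(i, 1) for i in range(20)]
print(k_fold_accuracy(data, 10, lambda tr: tr[0][1], lambda m, x: m))  # 1.0
```

Because every example serves as test data exactly once, the averaged score uses the whole data set without ever testing a model on data it was trained on.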
However, the data used in these evaluations were collected from a pencil-and-paper test, rather than from actual interaction with the game. Thus, it is possible that student modeling during game play is more difficult, because responses given during actual system use may differ from those given in a pencil-and-paper test due to the excitement of game interaction.

Anderson et al. [4] evaluate the student model for their Lisp tutor by using the model to predict the midterm and final exam scores of students who used the tutor. Although this method provides an estimate of the overall accuracy of the student model, it has the disadvantage of grouping all of the model's assessments on individual items into one crude score, rather than predicting mastery on individual knowledge items. Such a prediction does not lend insight into which aspects of the model contributed to the overall accuracy, and allows small defects in the student modeler to go undetected.

Indirect evaluation: Ablation studies

The second way to evaluate a student model is indirectly, by assessing the effect that the addition of the student model has on the overall instructional capabilities of the system. This is commonly done via an ablation study, where learning gains of students who use the complete system with the student model are compared against those of students that use a system with a random or no student model. The rationale is that if the student model is correctly assessing the student's knowledge, the with-model condition will learn significantly more than the no-model condition. However, this approach has one limitation: if a student model which is accurately modeling students' knowledge is not being used effectively by the system to provide adaptive support, significant learning gains might not be observed in the with-model condition, even if the model is extremely accurate. Beck et al.
[14] use an ablation study to evaluate the student model for the ITS Whale Watch, which aims to teach grade 6 students about fractions. The model uses belief vectors for each of the fraction sub-skills to represent the model's belief that the student is at a particular level. The model is used to provide feedback to the student by selecting a hint for the sub-skill at the lowest level (weighting each level by its belief) [16]. 60 students participated in the ablation study. Each student was randomly assigned to either the full feedback condition, which used the Whale Watch student model to provide feedback, or the no feedback condition, which provided students with vague messages such as "Try again" when they answered a question incorrectly. The system was designed to increase students' perceptions of their mathematical abilities, thus students were given pre and post questionnaires which targeted their feelings of self-confidence about mathematics. However, the study found no significant differences between the feedback and no-feedback groups, and in fact found no gain in self-confidence between pre-test and post-test for any of the groups.

Conati and Zhao [28] conducted an ablation study to investigate the effect that a pedagogical agent using a student model for the educational game Prime Climb has on learning. In the study, 20 grade 7 students played with either a version of Prime Climb which had a pedagogical agent providing hints based on a model of student learning, or no pedagogical agent. Students were given a pre-test and post-test which assessed their knowledge of how to factorize numbers. Researchers found that students playing with the game that contained the pedagogical agent learned marginally significantly more than the group playing with no help interventions. This study is described in more detail in section 4.1.2.

Models evaluated directly and indirectly

Both direct and indirect evaluations of student models are important, but for different reasons. The direct evaluation of model accuracy gives a numerical estimate of how good the model is at determining students' hidden knowledge and/or affective states, or at predicting students' responses. This evaluation is useful from the user modeling perspective. The indirect evaluation gives a sense of how useful the model is in tailoring support to the system user, and helps us to answer the question of whether adaptive support can improve learning. Both types of evaluation are necessary. Without direct evaluation, we have no idea whether our model is actually modeling what it is supposed to. Without indirect evaluation, we have no idea whether use of the model provides an improvement to the system. Only a few student models, however, have been evaluated in both ways.

Baffes and Mooney [7] performed several evaluations of their student modeler ASSERT. ASSERT models student errors (bugs), and captures novel bugs with the construction of a bug library, as students solve problems in the ITS for C++ programming, C++ Tutor. In the indirect ablation study, 100 university students were randomly assigned to one of 4 conditions: (a) full ASSERT modeler used to provide feedback, (b) ASSERT modeler without the bug library used to provide feedback, (c) no modeling, topics for feedback chosen randomly, and (d) no feedback given. The first three groups were given the same amount of feedback; however, the model determined which C++ rules the feedback related to. Every student was given a pre-test and post-test of multiple choice questions in which they had to identify the error (or lack thereof) in a piece of C++ code. Researchers found that students learned more in condition (a) than (b), more in (b) than (c), and more in (c) than (d).
However, the difference between the full ASSERT modeler and the modeler without the bug library was not significant.

The direct evaluation of ASSERT used log file data of questions answered by students using the C++ Tutor (without feedback). The questions were split into a 50% training set and a 50% testing set. For each student, the model was shown the training data for that student, and was used to predict the student's responses on the test set data. This was repeated for both the full ASSERT modeler and the ASSERT modeler with no bug library. The full ASSERT modeler had an accuracy of 62.4%; however, there were no significant differences between the full ASSERT modeler and the one without the bug library. The accuracy of an induction learner using the same training and test data was also computed, and was significantly worse at 49.4%. To test the usefulness of the bug library, 20 students were simulated by modifying the correct knowledge base with bugs. This buggy knowledge base was used to generate student answers to test questions. These answers were then given to the ASSERT modeler to determine how well ASSERT could recreate the original "buggy" knowledge bases. Different parameters in the model were modified to investigate how the parameters affected the model accuracy.

The student model for the CAPIT capitalization and punctuation tutor [52] was evaluated directly and indirectly. In the direct evaluation, data on student actions were collected from students using the tutor to solve problems. The data were split into a training set and a test set. The model was presented with training evidence of actions for each student, and used to predict the actions the student would take (which rules they would violate). These predictions were compared to the actions in the test set. This evaluation was carried out during the development of the model to select amongst competing model structures.
An ablation study with three classes of 9 and 10 year olds was also carried out with CAPIT. Students were randomly assigned to one of three conditions: no tutor; tutor with random problem and error message selection; or tutor with problems and error messages determined by a decision-theoretic approach using the student model assessments. Students were given a pre-test and post-test on capitalization and punctuation. The results of the study showed that students learned significantly more with the version of the tutor which used the student model. The model had an effect size of 0.557.

In this thesis we present work in which we evaluate a student model both directly and indirectly. In chapter 4 we present results on the accuracy of the student model at predicting students' knowledge on test items. In chapter 6 we present the results of an ablation study in which learning gains are compared between students who received feedback tailored using the student model, feedback tailored using another model, and no feedback at all. We begin by introducing the game Prime Climb.

Chapter 3

Prime Climb

In this section we describe the Prime Climb educational game: the game rules, the tools available to students, and the pedagogical agent which provides support. Prime Climb was devised by the Electronic Games for Education in Math and Science (EGEMS) group at the University of British Columbia. It is designed to teach number factorization to 6th and 7th grade students.

3.1 Game Rules

Prime Climb is a 2-player game in which students pair up to climb a series of mountains. Each mountain is divided into numbered hexes (see figure 3.1), which students can click on to make their way up the mountain. Each player should move to numbers that do not share any common factors with the number her partner is on. If a wrong number is chosen, the climber falls and swings from the rope until she can grab onto a correct number.
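The core climbing rule — a move is correct only when the clicked number and the partner's number share no common factor greater than 1 — amounts to a coprimality check, which can be sketched in a few lines (function and variable names here are illustrative, not Prime Climb's actual code):

```python
from math import gcd

def move_is_safe(clicked, partner):
    """A move is correct iff the clicked number and the partner's
    number share no common factor greater than 1 (i.e. are coprime)."""
    return gcd(clicked, partner) == 1

# The examples from the text:
assert move_is_safe(8, 9)        # 8 and 9 share no common factors
assert not move_is_safe(42, 9)   # 42 and 9 share the common factor 3
```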
A fall also causes a loss of points for the pair of players. For example, in figure 3.1, the player at the bottom moves to the number 8 while her partner is on the number 9. This is a correct move because 8 and 9 share no common factors. If the student then moves to the number 42 (figure 3.2), she will fall and swing, because 42 and 9 have a common factor of 3. She will continue to swing across the numbers 3, 19 and 2, until she clicks on one of the numbers she is swinging past. In figure 3.3, the student recovers from the fall by clicking on the number 2. The hexes that the student may move to are outlined in green. The student is not able to move to hexes with rocks or trees on them. The player may not move more than two hexes away from his/her partner, as there is a rope connecting them.

Figure 3.1: Correct move: The student moves to the number 8 while her partner is on the number 9

However, players are free to move up, down and sideways. The two players do not have to take turns; they may go in any order. When either player makes it to the top of the mountain, both players move on to the next level. Hypothetically this should encourage the students to work together, although this is not always observed in practice - students are very competitive. There are ten levels in all, which increase in both size and difficulty of the numbers on the mountain.

3.2 Tools

To help students with the climbing task, Prime Climb includes a few tools which can be accessed from the PDA in the upper right corner of the screen (see figure 3.6). If the student clicks on the magnifying glass icon, she accesses the Magnifying Glass tool, which can be used on any number on the mountain to display the first level of a factor tree for that number in the PDA. A factor tree is a common representation used in mathematics textbooks to visualize the factors of a number. A number z is
first decomposed into two numbers x and y such that z = x*y. These two numbers form the first layer of the factor tree. This process is repeated recursively for x and y until prime factors are reached, in order to form the subsequent layers.

Figure 3.2: Incorrect move: The student attempts to move to the number 42 while her partner is on the number 9. She falls and swings across the numbers 3, 19 and 2.

On the first use of the magnifying glass, only the first layer of the factor tree is displayed (figure 3.4a). Subsequent clicks on the numbers in the tree reveal the factor tree in increasing levels of detail (figures 3.4b and 3.4c). There is more than one way to lay out a factor tree. For example, the factor tree displayed in figure 3.4 could have had as its first layer the numbers 3 and 14. However, the factor tree is always shown with the decomposition that generates the most balanced, and hence shortest, factor tree, so that it can be displayed easily in the PDA.

Students can also use the Help icon found on the PDA. Clicking on this button brings up a help dialogue box in which the student can ask specific questions such as "Why am I falling?" and "How can I use the Magnifying Glass?" (figure 3.5). These questions are answered by the Prime Climb pedagogical agent.

Figure 3.3: Correct move after fall: The student clicks on the number 2 while swinging

3.3 Pedagogical Agent

Each student has a pedagogical agent (see figure 3.6) - the agent Merlin - that was added to the game to provide individualized support, both on demand and unsolicited, for those students who tend not to learn effectively from the unstructured interaction that the game supports. In general the agent is there to provide help when a student gets stuck, through student-demanded and spontaneously generated hints, and also to encourage the student to slow down and think about her actions when she gets caught up in the game aspect of Prime Climb.
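The balanced factor-tree layout described in section 3.2 — always splitting z = x*y with x and y as close in size as possible — can be sketched recursively. This is an illustration of the idea, not the game's actual layout code:

```python
def balanced_factor_tree(n):
    """Return n's factor tree as nested tuples (n, left, right),
    choosing at each step the split n = x * y with x and y closest in
    size; primes (and 1) are returned as bare leaves."""
    x = int(n ** 0.5)                 # largest candidate divisor <= sqrt(n)
    while x >= 2 and n % x != 0:
        x -= 1
    if x < 2:                         # no proper divisor: n is a leaf
        return n
    return (n, balanced_factor_tree(x), balanced_factor_tree(n // x))

# 42 splits into 6 * 7 (not 3 * 14), as in the text's example.
assert balanced_factor_tree(42) == (42, (6, 2, 3), 7)
```

Because the first divisor found at or below the square root is the most balanced one, the resulting tree is also the shortest, matching the display constraint mentioned above.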
The agent provides answers to the student-demanded questions as shown in table 3.1. For the first two questions, the agent first provides a general hint, then subsequently more detailed hints if the student clicks "Further help?" (see figure 3.5). The responses are shown in a bubble coming from Merlin's mouth and also spoken aloud. In addition to answering the student-demanded questions, the agent uses a model of student learning to determine whether to provide unprompted hints, after both incorrect and correct moves. The hinting strategy used in the first version of the agent [28] is described in detail in chapter 5, but we highlight the main points here.

What shall I do now?
  Ans1: "click a green highlighted hex to continue"
  Ans2: "use the magnifying glass to check the highlighted hexes around you, find one that doesn't share common factors with your partner's number"
  Ans3: "move to x" or "wait for your partner"
Where can I move?
  Ans1: "choose a green highlighted hex which doesn't share a common factor with your partner's number"
  Ans2: "use the Magnifying glass to help you"
  Ans3: "move to x" or "wait for your partner"
Why am I falling?
  Ans: "you fall only if you click a number which shares common factors with your partner's number"
How can I stop swinging?
  Ans: "click a number you are swinging through"
How can I use the Magnifying glass?
  Ans: "click the button with a Magnifying glass on the PDA, and then click the number you want to see the factor tree of"

Table 3.1: Agent responses to student-requested questions using the "Help" icon

Figure 3.4: A factor tree displayed in the PDA after a student uses the magnifying glass on a) the hex 42 on the mountain, b) the number 6 on the factor tree, and c) the number 7 on the factor tree

The hints provided are shown in table 3.2.
If the model indicates that the student does not know the factors of the number she is on or her partner's number, the agent intervenes with one of three hints which increase in detail from general to specific. If the model indicates that the student does know the factors, yet moves incorrectly, it is assumed that she does not understand the concept of common factors, and hence the agent provides one of three hints on common factors, beginning with the most general, and progressing to more specific hints on subsequent hinting opportunities. Finally, if the student is judged not to have the relevant factorization knowledge, yet makes a correct move, the agent provides a hint which encourages her to slow down and think about her actions. Again, the hints are shown in a bubble coming from the agent's mouth and are also spoken aloud. In chapter 5 we describe a new algorithm and hints we devised to improve the agent's pedagogical interventions.

To provide well-timed and appropriate interventions, the agent must have an accurate model of student learning. In the next chapter we describe the original model of student learning, and the data-driven improvements we made to this model.

Model indicates that the student does not know the factors of the player's or partner's number(s)
  Hint 1: "Think about how to factorize the number you clicked on"
  Hint 2: "Use the Magnifying glass to help you"
  Hint 3: "It can be factorized like this: x1*x2*...*xn"
Model indicates that the student knows the factors, but moves incorrectly
  Hint 1: "You cannot move to a number which shares common factors with your partner's number"
  Hint 2: "Use the Magnifying glass to see the factor trees of your and your partner's numbers"
  Hint 3: "Do you know that x and y share z as a common factor?"
Model indicates that the student does not know the factors, yet moves correctly
  Hint: "Great, do you know why you are correct this time?"
Table 3.2: Unsolicited hints provided by the agent Merlin

Figure 3.5: Help dialogue box which appears when the student clicks 'Help' on the PDA

Figure 3.6: The Prime Climb interface with pedagogical agent

Chapter 4

Model of Student Learning

In this section we first describe the original model of student learning devised for Prime Climb. We then present two new models with data-driven improvements. We conclude by comparing the three models using receiver operating characteristic (ROC) curves and accuracy scores.

4.1 Original Student Model

We begin by describing the structure of the original student model for Prime Climb, as well as a user study that indirectly tested its effectiveness. Both the model and the study are described in [84], and represent the starting point of the work that is the focus of this thesis. We then describe the first step in this thesis work, i.e. a direct evaluation of the model that we ran to gain a better understanding of the model's role in the system's effectiveness. From this evaluation we isolate two problems with the existing structure, which are addressed in sections 4.2 and 4.3.

4.1.1 Structure of the original model

One of the difficulties in modeling student knowledge in educational games is the high level of uncertainty involved. In Prime Climb the only actions we use to infer student knowledge are the student's moves and tool uses, and from these we reason about the domain knowledge that caused these actions. This low bandwidth approach frees students from being required to answer questions about their knowledge or to provide explicit solution steps, but at a cost: the assessment is fraught with uncertainty. Game actions give us ambiguous insight into student knowledge. A correct move can be evidence of student knowledge, or it might just be a guess. In fact, students
often play Prime Climb well without good factorization knowledge by resorting to superficial heuristics rather than reasoning about their moves. In addition, students are prone to errors of distraction, in which they may have the underlying knowledge necessary to make a correct move, and yet they err because they become distracted by the game aspect of the interaction. We use Dynamic Bayesian networks [33, 66] to handle this uncertainty in a principled way. We also use an iterative process of design and evaluation that starts from a simplified network structure based on the designer's and teachers' intuition, and then refines it based on a formal evaluation of its performance.

In Prime Climb, there is a Dynamic Bayesian network for each mountain that the student climbs (the short-term student model). A Dynamic Bayesian network consists of time slices [33, 67]. Each time slice represents a relevant temporal state in the process to be modeled. In our case, a time slice is created in the network after every student action, to capture the evolution of student knowledge as the climb proceeds. The assessment at the end of each mountain/level is stored in a long-term student model [84], which is used to initialize the short-term model for the next mountain to be ascended.

Short-term student model

Each short-term model includes the following random binary variables:

• Knowledge Nodes
  - Factorization Nodes (Fx): An Fx node represents whether the student has mastered the factorization of number x down to its prime factors. Fx nodes have two states: Known (K) and Unknown (U). The short-term student model includes factorization nodes for each of the numbers on the mountain as well as for all of their factors.
  - Knowledge of Factor Tree Node (KFT): The KFT node models whether the student understands the factor tree representation. The KFT node has two states: Known (K) and Unknown (U). Each short-term student model contains one KFT node.
Figure 4.1: Dependency relationship between factorization nodes in the original student model. Fz = Fx * Fy

• Evidence Nodes
  - Click Nodes (Cx): Each Cx node models the correctness of a student's click on number x. Click nodes are introduced into the model when the student clicks on a number, and are immediately set to one of their two states: Correct (C) or Wrong (W).
  - Magnification Nodes (Magx): Each Magx node denotes the use of the magnifying glass on number x. Magnification nodes are introduced into the model when the student uses the magnifying glass, and are immediately set to Yes (Y).

Figure 4.1 illustrates the structure used in the original version of the model to represent the relationship between factorization nodes. A key assumption underlying this structure is that knowing the prime factorization of a number influences the probability of knowing the factorization of its factors. The opposite is not necessarily true - it is hard to predict if a student knows a number's factorization given that s/he knows how to factorize its non-prime factors. For example, if the student knows the factorization of the number 42, this implies that the student knows the factorization of the numbers 6 and 7. However, knowing the factorization of the number 6 does not imply that the student knows the factorization of the number 42. This assumption about the relationship between factorization nodes was derived from talking to 6th and 7th grade mathematics teachers.

Figure 4.2: Subset of a short-term student model showing the dependencies between factorization nodes in the original model of student learning

To represent this assumption, factorization nodes are linked as parents of nodes representing their non-prime factors. Figure 4.2 shows a sample subset of the short-term student model. The conditional probability table for each non-root factorization node (e.g.
F3 in figure 4.2) is defined so that the probability of the node being known is high when all the parent factorization nodes are true, and decreases proportionally with the number of unknown parents. The actual values in the CPT were based on the designer's subjective assessment.

The mountain shown in figure 3.6 would have a short-term student model with knowledge nodes as shown in figure 4.3. Note that there is a factorization node for each number on the mountain (e.g. the number 40) and for each of their factors (e.g. the number 4). The factors that each node is linked to are chosen using the same algorithm as is used to lay out the factor tree in the PDA in the most balanced way. There is one KFT node. A time slice of this network represents the model's assessment of the knowledge that the student has of each of the numbers' factorizations and the factor tree representation at a particular point in time.

Figure 4.3: The knowledge nodes in the short-term student model for the mountain shown above

Figure 4.4: Two time slices depicting the dependencies between evidence and factorization nodes. At time ti the student has clicked on the number 7 while her partner is on the number 10. At time ti+1 the student has used the magnifying glass on the number 42

Evidence nodes are introduced into the model when the student performs an interface action. With each evidence node, a new time slice is created. The two types of actions which are modeled include moves to a given hex and clicks to access the magnifying glass on a given number. The action of clicking on number x when the partner is on number y is represented by adding a click node as parent of nodes Fx and Fy (see for example figure 4.4, time slice ti with x=7 and y=10). If the click is correct, this has the effect of increasing the probabilities that the student knows the factorizations of the two numbers involved in the click.
If the click is incorrect, these probabilities are decreased. This structure, with evidence coming from click actions in the diagnostic direction, was adopted initially to prevent evidence on a number x from propagating upwards to the numbers that contain it as a factor. For example, if x is a factor of z, then the evidence node Clickx and the node Fz are d-separated, a configuration which has the consequence that they are conditionally independent given no evidence on Fx. In figure 4.4, time slice ti, evidence from the node Click7 does not propagate back up to node F42, because Click7 and F42 are conditionally independent given no evidence on F7. This was done to respect the insights on the relation between factor nodes that were provided by our teachers, namely that it is not the case that a student knows a number's factorization just because s/he knows how to factorize its non-prime factors.

  Magx   KFT   P(Fx) = K
  Y      K     p + 0.1
  Y      U     p

Table 4.1: Conditional probability table for a factorization node Fx after the magnifying glass has been used on the number x. p is the probability that node Fx was known in the previous time slice.

Figure 4.5: Dependencies between two time slices in the original short-term student model

Clicks to access the magnifying glass are represented as shown in figure 4.4, slice ti+1. When the student uses the magnifying glass on the number x (42 in figure 4.4, slice ti+1), this has the effect of adding a Magx node to the network. The Magx node and the KFT node are linked as parents of the corresponding factorization node Fx. The conditional probability table for the factorization node is shown in table 4.1, and is set up such that the probability that the student knows the factorization of the number x increases if she knows the factor tree representation. If the factor tree representation is unknown, the probability remains the same as it was in the previous time slice.
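The factorization-node CPTs described in this section — highest when every parent is Known, falling with the number of Unknown parents — could be generated mechanically, as in the sketch below. The 0.9/0.1 endpoints and the linear decrement are illustrative values, not the thesis's actual (subjectively set) parameters.

```python
from itertools import product

def factorization_cpt(n_parents, p_all_known=0.9, p_none_known=0.1):
    """Build P(Fx = Known | parent states) for a non-root factorization
    node: the probability starts at p_all_known when every parent is in
    state 'K' and falls linearly with the count of 'U' parents.
    Endpoint values are illustrative, not the thesis's parameters."""
    step = (p_all_known - p_none_known) / n_parents
    return {states: p_all_known - states.count("U") * step
            for states in product("KU", repeat=n_parents)}

cpt = factorization_cpt(2)
# cpt[("K", "K")] is 0.9; one Unknown parent drops it to about 0.5,
# and two Unknown parents to about 0.1.
```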
Roll-up

A Dynamic Bayesian network is considered dynamic because it captures the evolution of knowledge over time. This is done using time slices, which represent knowledge and evidence at various points in time, and dependency relationships between the multiple copies of the knowledge nodes in the different time slices (figure 4.5). However, we cannot keep an infinite number of slices, so at some point we must discard old ones. If we were to simply discard these time slices without somehow saving the probabilities represented in the nodes, we would lose all of the information obtained from the evidence introduced thus far. The solution to this is to use a procedure called roll-up [66]. Roll-up is the process of saving the posterior probabilities of the nodes in the slice that is terminated (e.g. slice ti in figure 4.5) into the corresponding nodes in the new slice that is created (e.g. slice ti+1 in figure 4.5).

One technique for roll-up is to roll up only the root nodes. The posterior probabilities of the root nodes in the old slice are simply saved as priors of the corresponding nodes in the new slice. This is appropriate if the root nodes are the only nodes whose states are believed to be changing over time. It is not appropriate for our situation, however, as many of the factorization nodes in our model are non-root nodes, and we assume that the student's factorization knowledge changes throughout the game as s/he learns. If we did not roll up these nodes, we would lose the information we had gleaned about them from the student's game play thus far. One possible roll-up solution, proposed by Schafer and Weyrath [67], is to maintain several time slices, and only do roll-up (remove the first slice of the sequence) when the evidence from that slice has a negligible effect on the most recent slice.
This approach, however, is computationally expensive and not appropriate for providing real-time modeling in a fast-paced interaction like that with Prime Climb. To reduce the computational complexity of evaluating the short-term model, at any given time we maintain at most two time slices in the Dynamic Bayesian network.

The roll-up for the non-root nodes in the original model is illustrated in figure 4.6. In the time slice before the action has taken place, the conditional probability table for the non-root node Fx is shown. In time slice ti, evidence E (either a click or magnification action) is introduced into the model. we represents the weight that this evidence brings to the conditional probability table entries for the node Fx; they are increased to account for positive evidence and decreased to account for negative evidence. In time slice ti+1, the evidence is removed, but the conditional probability table entries remain changed; thus the information from the evidence is maintained in the model. The disadvantage of this approach is that the conditional probability tables change, and thus the dependencies among the knowledge nodes are not consistent across different nodes and different time slices.

  Before the action:
    Fz   P(Fx) = K
    K    p1
    U    p2

  Slice ti, evidence E = C introduced:
    E   Fz   P(Fx) = K
    C   K    p1 + we
    C   U    p2 + we

  Slice ti+1, evidence removed:
    Fz   P(Fx) = K
    K    p1 + we
    U    p2 + we

Figure 4.6: Roll-up of a non-root node, Fx, in the original short-term student model

Long-term student model

The long-term student model connects the short-term student models from each mountain. After a student finishes climbing a mountain, the posterior probabilities for all of the numbers in the short-term student model are stored in the long-term student model. Before the next level is started, the probabilities stored in the long-term student model are rolled up into the corresponding nodes in the short-term student model for the next level.
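The per-action roll-up described for non-root nodes can be sketched as a direct update to the node's CPT entries. The weight value and data representation below are illustrative, not the model's actual parameters:

```python
def roll_up(cpt, positive, we=0.05):
    """Fold the evidence from a discarded time slice into a non-root
    node's CPT, as in the original model: every P(Known | parents)
    entry is raised by the evidence weight 'we' after positive evidence
    and lowered after negative evidence. we=0.05 is an illustrative
    value; results are clamped to the [0, 1] probability range."""
    delta = we if positive else -we
    return {parents: min(1.0, max(0.0, p + delta))
            for parents, p in cpt.items()}

cpt = {("K",): 0.8, ("U",): 0.3}      # entries p1, p2 before the action
cpt = roll_up(cpt, positive=True)     # after a correct click: p1+we, p2+we
```

The sketch also makes the stated disadvantage visible: after a few calls, two nodes that began with identical CPTs no longer agree, so the dependencies encoded in the tables drift apart across nodes and time slices.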
In the new short-term student model, root nodes (for example Fz in figure 4.7) are given prior probabilities directly from the long-term student model as described in the previous section. The process is slightly more complicated in the case of non-root nodes (for example, Fx in figure 4.7). With non-root nodes we add an additional parent node, Priorx, with a prior probability from the long-term student model. The conditional probability table for this node is shown in figure 4.7. The entries in the conditional probability table give more weight to the prior probability of each node from the long-term model than to the other parents of the non-root node. The values in this table were chosen by the designer's intuition.

  Priorx   Fz   P(Fx) = K
  K        K    0.99
  K        U    0.91
  U        K    0.5
  U        U    0.01

Figure 4.7: Roll-up from long-term to short-term student model for non-root nodes

This concludes the section on the structure of the original model. In the next section we describe how this model was indirectly evaluated prior to this thesis. We then discuss the direct evaluation we ran to assess model accuracy.

4.1.2 Evaluation of the original student model

Indirect evaluation of the original model

[28] describes a study that was designed to assess the pedagogical effectiveness of an agent acting on the original student model. In this study sixteen 7th grade students were randomly assigned to either the experimental group (N=9) or the control group (N=7). Students from the experimental group played with a complete version of Prime Climb which included a pedagogical agent that based its interventions on the original model of student learning. The control group played with a version of Prime Climb without a pedagogical agent. All students played for approximately 20 minutes. Before and after game play, students wrote a test on number factorization.
The dependent variable was learning, assessed as post-test score minus pre-test score. Using this measure of learning, the experimental group learned more than the control group, with a difference which is marginally significant (p=0.068) and a considerable effect size of 0.7. Students in the no-agent condition showed no improvement from pre-test to post-test, consistent with previous results on the limited pedagogical effectiveness of educational games without support. Although these study results show that students learn better with a pedagogical agent, we still did not know what part the model played in this improvement. The first step in this thesis's work was to conduct a second study to directly assess model accuracy.

Direct evaluation of the original model's accuracy

The goal of the second study was to assess model accuracy by comparing the model's assessment of student knowledge with a paper and pencil assessment of the same knowledge. Fifty-two 6th and 7th grade students from three local schools played Prime Climb for approximately 10 minutes each. All students played a version of Prime Climb which included the pedagogical agent, each with an experimenter playing as her partner. In order to obtain model predictions, all game actions were logged so that they could be replayed in simulations with different versions of the student model. For the purposes of this study, we used the original model of student learning, and ran simulations using different values of the model parameters in order to find an ideal set of parameter values.

The pencil and paper test which the students wrote both before and after game play was designed to assess the same knowledge as the model. This test can be seen in appendix A. The first ten questions assessed the students' knowledge of the factorization of the numbers in table 4.2. These numbers were chosen because they appear frequently on the first few mountains or as factors of many numbers that appear
on the mountains. The correctness of a student's response on a factorization question with the number x corresponds directly with the model node Fx. On an initial version of the test, students were asked to write out the factors of each of the ten numbers. However, after a pilot study with a class of grade 6 students, in which students spent a great deal of time pondering whether they had remembered to include all of the factors, questions with numbers which had more than four factors were changed to recall questions in which the students had to circle the factors from a list. All of these factors questions were marked 1 if they were correct, and 0 if the student was missing any of the factors or circled any incorrect factors.

Students were also assessed on whether they understood the concept of common factors. This was done by comparing their answers on 3 questions which asked them for the common factors of two numbers with their responses on questions which asked for the factorization of the same numbers. If the student got the factorization questions correct but the corresponding common factor questions incorrect, she was deemed not to know the common factor concept. If the student made errors on the common factor questions which were consistent with errors made on the corresponding factorization questions, she was judged to know the common factor concept. Students who answered some questions correctly and others incorrectly in no clear pattern, or who got all questions incorrect, were judged not to know the common factor concept.

Students were also assessed on their knowledge of factor trees. In the initial version of the test, students were asked to draw two factor trees. This was changed after the pilot study because the corresponding node in the model, the KFT node, assesses not the student's ability to draw a factor tree, but her ability to pick out factors from looking at a factor tree in the PDA.
Thus, in the study, these questions were changed to ones in which the student was given a factor tree and asked to list the factors of the number. It is possible that this does not directly assess student knowledge of the factor tree concept, as students may have already known the factors despite not understanding the tree at all; however, we use this question as a best measure for assessing students' knowledge of the factor tree representation, with the understanding that this knowledge might be over-estimated.

  Skill item                          Pencil and paper assessment
  Factorization of specific numbers   Ten questions on the factorization of the numbers 2, 3, 4, 11, 15, 36, 40, 42, 50, and 81
  Common factor concept               Three common factor questions and responses on factorization questions corresponding to the same numbers
  Factor tree representation          Two questions on factor trees

Table 4.2: Questions on the pre- and post-test which assess specific skills

Because the post-test questions correspond directly to nodes in the model, we can use the post-test answers as a gold standard against which to evaluate the model's assessment of student knowledge after game play. In the case of this assessment of the original student model, we compare the posterior probabilities for the 10 relevant factorization nodes with the responses to the 10 questions corresponding to the same numbers. The data on the common factor concept and factor tree representation are not used at this time, but these data are used later to set prior probabilities for the nodes (described in section 4.2.4) and to assess the accuracy of other versions of the student model (described in sections 4.3 and 4.4). In order to obtain model predictions to compare with post-test responses, we used the log files of students' actions to simulate student game play, and generated model node predictions for each student.
We repeated these simulations using many different values for the parameters in the model, in order to obtain a set of model parameters which maximized accuracy. The accuracy measure that we used was (sensitivity + specificity)/2, and is explained in more detail in section 4.2.2. In order to assess model accuracy, the model's predictions for each student for each relevant factorization node were thresholded to known or unknown, and compared to the answer the student gave for the corresponding question. The accuracy measure was computed with many different threshold values, and a threshold of 0.5 was chosen because it yielded the maximum accuracy. Despite using the threshold and parameter values that yielded the highest accuracy, this model's accuracy for predicting students' ability to factorize individual numbers was assessed as no better than chance (50.8 percent).

Study conclusions

The original student model was designed with decisions which simplified the model construction and model complexity, but at a potential cost to accuracy. These design decisions are discussed in the next section. We wished to begin with a simple model and see how far it would take us. From the direct evaluation of the model's accuracy it appears that these simplifications limit the model accuracy quite substantially, so that it is only slightly better than chance. However, the results of the indirect evaluation of the model indicate that even interventions based on an almost random model were sufficient to trigger some degree of student reflection and learning during game play. Still, the results on the post-test for the experimental group - an average of 76.8% - leave room for improvement. We hypothesize that an agent using a more accurate student model may yield even more substantial learning gains. Thus, we seek to improve the accuracy of the original student model.
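The (sensitivity + specificity)/2 measure and the threshold sweep described above can be sketched as follows; the grid of candidate thresholds is an illustrative choice, not the thesis's exact procedure:

```python
def balanced_accuracy(posteriors, answers, threshold):
    """(sensitivity + specificity) / 2: 'posteriors' are the model's
    P(known) values for the relevant factorization nodes, 'answers' are
    1 if the student got the corresponding post-test question right."""
    preds = [p >= threshold for p in posteriors]
    tp = sum(1 for pr, a in zip(preds, answers) if pr and a)
    tn = sum(1 for pr, a in zip(preds, answers) if not pr and not a)
    pos = sum(answers)
    neg = len(answers) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return (sens + spec) / 2

def best_threshold(posteriors, answers):
    """Sweep a grid of thresholds and keep the one maximizing the
    balanced accuracy (illustrative 0.01-step grid)."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda t: balanced_accuracy(posteriors, answers, t))
```

Averaging sensitivity and specificity rather than raw accuracy keeps the measure meaningful when, as here, known and unknown post-test answers are unevenly balanced: a model that always predicts "known" scores 0.5, i.e. chance.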
4.1.3 Problems with the original student model

In this section we discuss two limitations of the original student model that contribute to reducing its accuracy.

Apportion of Blame

The first problem with the original model is that it does not apportion blame for an incorrect click in a principled way. Because the relationship between the click and factorization nodes is in the diagnostic direction (see figure 4.8), the two factorization nodes involved in the click are blamed equally for an incorrect click (for example, the numbers x and y are blamed equally for an incorrect click on the number x in figure 4.8).

Figure 4.8: Relationship between the click and factorization nodes in the diagnostic direction. This relationship prevents evidence of a click action from propagating up to the non-prime superfactors of the number, Fz in this example.

More likely, the number with the lower posterior probability was the cause of the wrong move, and should be blamed more. Said another way, the two factorization nodes involved in a click should be conditionally dependent given the action, so that the node with the lower probability of being known can be blamed more for the incorrect move. The only way to encode this dependency in the above structure is to add a link between the two factorization nodes (e.g., Fx and Fy in figure 4.8). However, doing so increases the model's complexity, and in the original model design we wished to adopt as simple a model as possible. The structure with a diagnostic relation between the click and factorization nodes was chosen initially to avoid evidence from an incorrect click on a number propagating back up to the numbers that have that number as a factor (for example, Fz in figure 4.8).
This was adopted to keep in line with the mathematics teachers' assumption that if a student knows the prime factorization of a number, this does not imply that the student knows the factorization of the numbers which have it as a factor (its 'superfactors'). One possible solution to the apportion of blame problem is to change the structure to reflect a causal relationship between factorization and click nodes. Although this violates the mathematics teachers' assumption, we feel that the tradeoff may be justified by an increase in accuracy from correctly apportioning blame on incorrect clicks.

Common Factor Concept

The second limitation is that the model does not include a node to explicitly represent knowledge of the common factor concept, which is a key component of playing the game successfully. Although we were aware that this was an important skill to model, adding it would increase the model's complexity and potentially slow down the game. A fast-paced game is essential for maintaining engagement, so we were anxious to see how far a simple model could take us. However, the accuracy data suggest that this may have been too great a compromise. In sections 4.2 and 4.3 we describe improvements we made to the model to address these two concerns.

4.2 New Model: Fixing the apportion of blame problem

In this section we describe the modification made to the network to overcome the apportion of blame problem. We begin by describing the changes to the structure and the parameters involved. We then describe the data-driven method used to refine the parameters. Finally we present the accuracy of the model with this improvement.

4.2.1 Structure

The reader will recall from the previous section that one of the problems with the network structure in its initial incarnation is that it did not correctly apportion blame for an incorrect click action.
To solve this problem, we change the account of evidence about click actions from the diagnostic to the causal direction (see figure 4.9). In this new configuration, the probability that a student makes a correct or incorrect click is causally influenced by the student's knowledge of the factorization of the two numbers involved in the click. The one drawback of this new configuration, as mentioned previously, is that evidence from a click action is propagated up to the superfactors of a number (for example, Fz in figure 4.9). We acknowledge that this violates the assumptions about student knowledge laid out by our mathematics teachers, but wish to determine whether the gains from this new structure overcome this drawback.

Fx | Fy | P(Clickx = C)
K | K | 1 - slip
K | U | e_guess
U | K | e_guess
U | U | guess

Figure 4.9: Dependency between click and factorization nodes in the causal direction as a solution to the apportion of blame problem

The three parameters needed to specify the configuration in figure 4.9 are a, e_guess, and guess, as shown in the associated conditional probability table.

• a: The a parameter represents the probability of making an incorrect move despite knowing the factorization of the relevant numbers (Fx = Fy = K). With this configuration this could be due either to a simple slip, or to the student not understanding the concept of common factoring.

• guess: The guess parameter represents the probability that the click is correct when the factorizations of both numbers involved in the click are unknown (Fx = Fy = U), and the student must resort to guessing.

• e_guess: We introduce this parameter to account for the possibility that when the student knows the factorization of one of the two numbers involved in the click, it may be easier for her to guess correctly.

In addition to these three parameters, there is one further parameter which must be learned, associated with the roll-up procedure described in the next section.

Roll-up

When evidence is introduced into the model (e.g. node E in figure 4.10, time slice ti), it influences the posterior probabilities of the knowledge nodes in the network. To avoid losing the effect of this information when the time slice is removed, the knowledge nodes are rolled up (figure 4.10, time slice ti+1). For root nodes such as Fz in figure 4.10, the posterior probabilities in the old slice ti are simply saved as priors of the corresponding nodes in the new slice ti+1. For non-root nodes the approach is as follows: for every non-root factorization node that needs to be rolled up (e.g. Fx in figure 4.5) we introduce an additional Prior node in the new time slice (e.g. Priorx in figure 4.10). The Prior node is given as a prior the posterior of the corresponding factorization node in the previous time slice (e.g. the prior probability of node Priorx in slice ti+1 is the posterior probability of node Fx in slice ti).

Priorx | Fz | P(Fx = K)
K | K | 1
K | U | 1
U | K | max
U | U | 0

Figure 4.10: Roll-up in the new model. Root nodes such as Fz are rolled up directly. Non-root nodes such as Fx are rolled up by adding a Priorx node, which is given as a prior the posterior probability of Fx in slice ti. The conditional probability table for node Fx in slice ti+1 is shown.

The conditional probability table for the factorization node in the new slice is shown in the table in figure 4.10. The table is set up such that knowing the factorization in the previous time slice implies knowing the factorization in the current slice: if the Prior node is known in slice ti, the corresponding node is known (given a value of 1 in the conditional probability table) in slice ti+1. This may be a simplification of the actual situation.
It is possible that students forget knowledge over time, so a node being known in the previous time slice would not imply that it is still known in the current time slice. However, we think that this simplification is justified: students likely do not forget knowledge while playing Prime Climb, because game play is often short and the knowledge that is taught is conceptual, not factual. Thus, to reduce model complexity, we do not model forgetting.

Otherwise, the probability of the node being known is 0 when all the parent factorization nodes are unknown, and increases proportionally with the number of known parents to a maximum of max. The formula for the probability that the node is known (K) given that the Prior is unknown (U) is:

P(Fx = K) = (px / p) × max    (4.1)

where p is the number of parent nodes that Fx has, and px is the number of those parent nodes which are known. The max parameter represents the probability that the student can infer the factorization of x by knowing the factorization of all of x's parent nodes.

Using this method for roll-up, we do not lose the information gained from the evidence introduced in the slice that we now terminate. The advantage of this approach over the one described for the old model in section 4.1.1 is that the dependencies among the knowledge nodes are consistent across different nodes and different time slices. The reader will recall that in the old model's roll-up, the conditional probability tables themselves were changed, and hence the dependencies differed across numbers and time slices. We now describe how we learn the parameters a, e_guess, guess, and max from data obtained from the user study described in section 4.1.2.

4.2.2 Data-driven parameter refinement

We have identified four parameters for which appropriate values must be chosen.
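The two conditional probability tables introduced above, together with equation (4.1), locate all four of these parameters. A minimal sketch (my own code, writing the a parameter as slip to match the figure, and treating parent knowledge as observed, which the real network does not):

```python
def p_click_correct(fx_known, fy_known, slip=0.23, e_guess=0.5, guess=0.5):
    """P(Click = C | Fx, Fy) from the table in figure 4.9.  Defaults are the
    values eventually estimated in this section (tables 4.3 and 4.5)."""
    if fx_known and fy_known:
        return 1 - slip      # wrong only via a slip / missing the CF concept
    if fx_known or fy_known:
        return e_guess       # educated guess with partial knowledge
    return guess             # blind guess

def p_rollup_known(p_prior, n_known_parents, n_parents, max_param=0.0):
    """Marginal P(Fx = K) in the new slice: the figure 4.10 table combined
    with equation (4.1).  A known Prior forces K (no forgetting); otherwise
    the chance of inferring the factorization grows with the fraction of
    known parents, up to max_param."""
    infer = (n_known_parents / n_parents) * max_param if n_parents else 0.0
    return p_prior + (1 - p_prior) * infer
```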
To estimate these values we use the data from the study described in section 4.1.2: the log files describing the game actions of each of the 52 students that played Prime Climb, together with the students' results on a pre-test and post-test which evaluated their knowledge of the factorization of ten commonly appearing numbers and of the common factor concept. This data was used to estimate the parameters in two ways: first by frequencies, and second by cross-validation, as explained in the next two sections.

Estimating parameters by frequencies

When all of the nodes involved in a given conditional probability table are observable, the table values can be learned from frequency data. The factorization nodes involved in specifying the parameters a, e_guess, and guess are usually not observable; however, we have pre-test and post-test assessments on 10 of these nodes for each of our 52 students. If we consider data points in which the pre-test and post-test had the same answer, we can assume that the value of the corresponding factorization nodes remained constant throughout the interaction (i.e. no learning happened). We can then use those data points to compute the frequencies related to the conditional probability table entries involving a, e_guess, and guess. We found 58 such data points in our log files, yielding the frequencies in table 4.3. For example, when estimating the a parameter we looked for cases in the log files in which a student made a move involving two numbers whose factorizations she got correct on both pre-test and post-test. There were 44 such moves. Of these, 8 were incorrect moves, yielding a parameter estimate of 0.23 for a.

The frequency for the a parameter is based on 44 cases, so we feel confident fixing its value at 0.23. However, because we have far fewer cases for the e_guess and guess parameters (see table 4.3), we must estimate these parameters in another manner.
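The counting scheme just described can be sketched as follows (the move triples would come from aligning log files with pre/post-test answers; the representation is mine):

```python
def frequency_estimates(moves):
    """Frequency estimates for the click CPT parameters (section 4.2.2).

    moves: (fx_known, fy_known, correct) triples, restricted to numbers
    whose pre-test and post-test assessments agree (no learning).
    Returns {parameter: (estimate or None, number of cases)}.
    """
    outcomes = {"a": [], "e_guess": [], "guess": []}
    for fx_known, fy_known, correct in moves:
        if fx_known and fy_known:
            outcomes["a"].append(not correct)    # incorrect despite knowledge
        elif fx_known or fy_known:
            outcomes["e_guess"].append(correct)  # correct, one number known
        else:
            outcomes["guess"].append(correct)    # correct, neither known
    return {name: (sum(v) / len(v) if v else None, len(v))
            for name, v in outcomes.items()}
```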
Similarly, we cannot use frequencies to set the max parameter, as we do not have data on the Prior nodes, which represent the (possibly changing) student knowledge at any given point in the interaction. The method for estimating these parameters is described in the next section.

Parameter | Estimate | Cases
a | 0.23 | 44
e_guess | 0.75 | 12
guess | 0 | 2

Table 4.3: Parameter estimates from click frequencies

Estimating parameters by cross-validation

To select values for e_guess, guess, and max we attempt to fit the data to the answers on the students' post-tests. We fix the parameters to a specific (e_guess, guess, max) triplet, feed each student's log file to the model, and then compare the model's posterior probabilities over the 10 relevant factorization nodes with the corresponding post-test answers. Repeating this for our 52 students yields 520 (model prediction, student answer) pairs for computing model accuracy.

Since it would be infeasible to repeat this process for every combination of parameter values, we select initial parameter values by frequency estimates and intuition. Next we determine whether the model is sensitive to any of the three parameters, and if so, try other parameter settings. The values used initially for e_guess were (0.5, 0.6, 0.7), chosen using the frequency in table 4.3 as an upper limit and rounding to the nearest tenth. For guess there are too few cases to base the initial values on frequencies, so we rely on the intuition that they should be less than or equal to the e_guess values, and thus use (0.4, 0.5, 0.6). For max we use (0, 0.2, 0.4). A max of 0 would imply that knowledge of the factorization of a number is not influenced by the superfactors of that number; values of 0.2 and 0.4 imply an increasing influence of this relationship. We try all 27 possible combinations of these values and choose the setting with the highest model accuracy.
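Enumerating the 27 candidate triplets is straightforward; in the sketch below, score stands in for the expensive step of replaying all 52 log files and computing accuracy:

```python
import itertools

E_GUESS_VALUES = (0.5, 0.6, 0.7)   # upper-bounded by the 0.75 frequency
GUESS_VALUES = (0.4, 0.5, 0.6)     # intuition: guess <= e_guess
MAX_VALUES = (0.0, 0.2, 0.4)

def best_triplet(score):
    """Return the (e_guess, guess, max) combination with the highest
    accuracy under `score`, trying all 3 x 3 x 3 = 27 settings."""
    candidates = itertools.product(E_GUESS_VALUES, GUESS_VALUES, MAX_VALUES)
    return max(candidates, key=lambda c: score(*c))
```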
To avoid overfitting the data we perform 10-fold cross-validation, splitting our 520 data points to create 10 training/test folds. Each training set consists of 90 percent of the data, and is used to compute the accuracy of the 27 different parameter triplets, as described above. For each fold, we select the parameter setting which yields the highest accuracy on the training set, and report its accuracy on the test set. The goal is to select the parameter setting with the best training set performance across folds.

As our measure of accuracy, we chose (sensitivity + specificity)/2 [79]. Sensitivity is the true positive rate (the percentage of known numbers that the model classifies as such); specificity is the true negative rate (the percentage of unknown numbers classified as such). Thus, we need a threshold that allows us to classify model probabilities as known or unknown. To select an adequate threshold, we investigated behaviour for four different thresholds, starting with a high 0.95 and decreasing by 0.15 down to 0.5. We also investigated behaviour with the low threshold of 0.4. For each of these threshold values we chose the parameter triplet which maximized the average training set accuracy across folds. The threshold yielding the highest average training set accuracy across folds with an optimal parameter setting was 0.8 (see table 4.4). Although table 4.4 shows only the training set accuracies which arose from the optimal parameter settings, the accuracies across folds for the various parameter triplets with a threshold of 0.8 ranged from a low of 0.757 to a high of 0.772, better performance than seen at any of the other thresholds. Thus, we feel confident using a threshold of 0.8 for this model. Using a threshold of 0.8, the parameter settings which yielded the best training set performance averaged across all 10 folds are shown in table 4.5.
The fact that the two guess parameters are high confirms previous findings that students can often perform well in educational games through lucky guesses or other heuristics not requiring correct domain knowledge. The fact that they are equal indicates that there is no substantial difference in the likelihood of a lucky guess given different degrees of number factorization knowledge. The setting of 0 for max indicates that the teacher-suggested relation between knowing the factorization of a number and knowing the factorization of its non-prime factors may be too tenuous to make a difference in our model.

Threshold | Training Set Accuracy | Std. Deviation
0.4 | 0.624 | 0.010
0.5 | 0.697 | 0.009
0.65 | 0.753 | 0.007
0.8 | 0.772 | 0.007
0.95 | 0.725 | 0.006

Table 4.4: Average training set accuracy across folds by threshold, using the optimal parameter triplet for each threshold.

Parameter | Estimate
e_guess | 0.5
guess | 0.5
max | 0

Table 4.5: Parameter estimates which maximize the training set accuracy

4.2.3 Model Accuracy

At this point we fix the parameters and the threshold at the values which maximized the training set accuracy. Using these values, the model sensitivity, specificity, and accuracy on each of the test folds are shown in table 4.6. There is always a trade-off between sensitivity and specificity. In our data set of (model prediction, student answer) points, 77.5 percent of the points are known (as assessed on the post-test), so a model which predicts all items as known would obtain a very high sensitivity of 1. However, the resulting specificity would be 0, yielding an accuracy of only 0.5. Inspection of table 4.6 shows the sensitivity and specificity breakdown for each fold. We see that they remain relatively constant between folds and similar to one another. The average test set sensitivity is 0.767 (std dev. 0.071) and the average test set specificity is 0.786 (std dev. 0.093).
Test Fold | Sensitivity | Specificity | Accuracy
1 | 0.773 | 0.778 | 0.775
2 | 0.846 | 0.846 | 0.846
3 | 0.789 | 0.786 | 0.788
4 | 0.667 | 0.750 | 0.708
5 | 0.850 | 0.900 | 0.875
6 | 0.773 | 0.875 | 0.824
7 | 0.733 | 0.833 | 0.783
8 | 0.854 | 0.667 | 0.760
9 | 0.730 | 0.600 | 0.665
10 | 0.657 | 0.824 | 0.740
Average | 0.767 | 0.786 | 0.776
std dev. | 0.071 | 0.093 | 0.063

Table 4.6: Sensitivity, specificity and accuracy by test fold

The previous model of learning had an accuracy of 0.508, thus this is the accuracy we strive to beat. This new version of the model with optimal parameter settings achieves an average test set accuracy (across folds) of 0.776 with a standard deviation of 0.063. This is a substantial improvement over the 0.508 accuracy of the old model.

4.2.4 Sensitivity to parameters

Before addressing the second limitation of the model, we wish to briefly discuss the model's sensitivity to its various parameters. Collecting data and running log files to refine parameters requires a great deal of time, and it would be a considerable advantage to know which parameters in a model are most sensitive to changes so that efforts could be focused on these parameters. We discuss the sensitivity of the model to two types of parameters: the parameters a, e_guess, guess and max found in the conditional probability tables for the model nodes, and the initial prior probabilities of the factorization nodes themselves. We use as a guide the techniques found in [79].

Parameter | Maximum std dev. | Minimum std dev. | Average std dev.
e_guess | 0.007 | 0.001 | 0.003
guess | 0.008 | 0.003 | 0.005
max | 0.007 | 0.001 | 0.002

Table 4.7: Maximum, minimum, and average standard deviation of model accuracy across three values of each parameter, while the other two are fixed. The maximum, minimum, and average are computed across the 9 fixed values of the other two parameters.
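The computation summarized in table 4.7 can be sketched as follows (accuracy stands in for a full model run; pstdev is the population standard deviation — the thesis does not state whether population or sample deviation was used):

```python
from itertools import product
from statistics import pstdev

def sensitivity_of(accuracy, varied, fixed_a, fixed_b):
    """For one parameter: fix the other two, take the standard deviation of
    accuracy across the varied parameter's three values, and repeat for all
    9 combinations of the fixed parameters.  Returns (max, min, average)."""
    stds = [pstdev(accuracy(v, a, b) for v in varied)
            for a, b in product(fixed_a, fixed_b)]
    return max(stds), min(stds), sum(stds) / len(stds)
```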
Sensitivity to a, e_guess, guess, and max parameters

To investigate the model's sensitivity to changes in the parameters e_guess, guess and max, we fix two of the parameters and calculate the standard deviation of the model's accuracy across all three values of the third. The maximum, minimum, and average standard deviations across the 9 settings of the other two parameters are shown in table 4.7 for each of the parameters e_guess, guess, and max. For example, in the first row of table 4.7, the standard deviation of the accuracy across the three values of e_guess was calculated for each of the 9 (3x3) different combinations of parameter settings for guess and max. The largest of these 9 standard deviations was 0.007, the smallest was 0.001, and the average was 0.003. Similar calculations were done for the parameters guess and max.

The table values indicate that the model has a low sensitivity to small changes in these parameters. However, to rule out the possibility that the three values we initially chose for each parameter were not ideal, we tried a few more extreme values (0.3 and 0.1 for guess and e_guess; 0.6 and 0.8 for max). The accuracy across all (model prediction, student answer) points with these values (the other parameters kept at their ideal settings) is shown in table 4.8. Accuracy was worse with these extreme values, indicating that the model is sensitive to larger changes in these parameters.

Parameter setting | Accuracy
guess = e_guess = 0.1 | 0.648
guess = e_guess = 0.3 | 0.729
max = 0.6 | 0.765
max = 0.8 | 0.765
slip = 0.1 | 0.757
slip = 0.5 | 0.751

Table 4.8: Average accuracy across all (model prediction, student answer) points for extreme values of the parameters. Parameters which are not listed were set to their ideal values.
We also varied the value of the a parameter, first slightly, with little change in accuracy, and then to more extreme values (0.1 and 0.5), with a decrease in accuracy, as also seen in table 4.8. These results indicate that we were able to identify adequate value ranges for the parameters in our new model configuration, and that the model is not sensitive to small changes of these parameters within the given ranges. They also suggest that we could select a value slightly higher than 0 for the max parameter if we want to maintain in the model the teacher-suggested relationship among factor nodes. However, we can choose to ignore these relationships if we ever need to improve the efficiency of model update.

Sensitivity to prior probabilities

We were also curious to find out how sensitive the model is to the values chosen for the initial prior probability of each of the nodes. We tried three different settings:

• Default: 0.5 for each factorization node.

• Individual: Priors specific to each student, derived from the student's pre-test answers. Priors were set to 0.99 and 0.01 for nodes corresponding to correct and incorrect pre-test answers, respectively.

• Generic: Priors set to population data based on the frequency of all students' answers for each of the 10 numbers on the pre-test questions. The numbers not assessed on the pre-test have their corresponding factorization nodes set to 0.5.

[Figure 4.11: ROC curves investigating the influence of prior probabilities. The plot shows sensitivity against 1-specificity for the Default, Generic, and Individual prior settings.]

Note that all results presented thus far have used generic priors. To compare the three conditions, we plot each with a Receiver-Operator Curve (ROC) [35] [39], shown in figure 4.11.
An ROC curve plots sensitivity (true positive rate) against 1-specificity (false positive rate) at different thresholds, and is used to investigate the tradeoff between the two. Picking a low threshold, thereby classifying most items as known, results in high sensitivity but low specificity; picking a high threshold reverses these. A good classifier has good behaviour across thresholds, resulting in an ROC curve that lies in the upper left quadrant of the graph.

The data used to generate the ROC curves shown in figure 4.11 were the sensitivity and specificity computed across all of the (model prediction, student answer) points. We see from figure 4.11 that generic priors and individual priors appear to do better than default priors at most thresholds. To compare the three curves statistically, we use the area under the curve (AUC) metric. AUC is equal to the probability that a randomly selected known case will be given a larger posterior probability by the model than a randomly selected unknown case [39]. We compute AUC and the standard error of AUC using the formulas laid out in [39], shown below:

A' = (1 / (np · nn)) · Σ_{p∈P} Σ_{n∈N} C(sp, sn)    (4.2)

where np and nn are the number of positive and negative cases (known or unknown on the post-test), s is the score of a case (thresholded model output), and P and N are the sets of positive and negative test cases. The function C is defined as follows:

C(sp, sn) = 1 if sp > sn; 0.5 if sp = sn; 0 if sp < sn    (4.3)

The standard error is computed as

SE(A') = sqrt( (A'(1 - A') + DP + DN) / (np · nn) )    (4.4)

where

DP = (np - 1) · (A' / (2 - A') - A'²)    (4.5)

and

DN = (nn - 1) · (2A'² / (1 + A') - A'²)    (4.6)

The AUC and its standard error are shown in table 4.9 for each of the three prior conditions. [39] also provides calculations for comparing two curves statistically using a z-score:

z = (A'1 - A'2) / sqrt( SE(A'1)² + SE(A'2)² )    (4.7)
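Equations (4.2)-(4.7) translate directly into code; a sketch (my own), with scores taken to be the model's posterior probabilities:

```python
from math import sqrt

def auc(pos_scores, neg_scores):
    """A' of equations (4.2)-(4.3): fraction of (known, unknown) pairs in
    which the known case receives the higher score, ties counting 0.5."""
    def c(sp, sn):
        return 1.0 if sp > sn else 0.5 if sp == sn else 0.0
    total = sum(c(sp, sn) for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

def auc_se(a, n_pos, n_neg):
    """SE(A') of equations (4.4)-(4.6)."""
    d_p = (n_pos - 1) * (a / (2 - a) - a ** 2)
    d_n = (n_neg - 1) * (2 * a ** 2 / (1 + a) - a ** 2)
    return sqrt((a * (1 - a) + d_p + d_n) / (n_pos * n_neg))

def auc_z(a1, se1, a2, se2):
    """z statistic of equation (4.7) for comparing two curves."""
    return (a1 - a2) / sqrt(se1 ** 2 + se2 ** 2)
```

As a rough check, with 77.5 percent of the 520 points known (403 positive, 117 negative), auc_se(0.846, 403, 117) reproduces the 0.017 of table 4.9; the z-scores computed from the rounded table values come out in the same range as those reported in table 4.10, which were presumably computed from unrounded values.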
Priors | AUC | SE(AUC)
Default | 0.670 | 0.026
Specific | 0.838 | 0.018
Generic | 0.846 | 0.017

Table 4.9: Area under the curve and standard error of the area under the curve for each of the three prior probability conditions

Comparison | z-score | significance p
default vs generic | 5.517 | 0.01
default vs individual | 5.310 | 0.01
generic vs individual | 0.219 | not significant

Table 4.10: Pairwise comparisons of differences between the three prior conditions using a z-score

Pairwise comparisons between the three models were computed using this z-score, and are shown in table 4.10. Differences between the default and generic, and between the default and individual conditions, are both significant at the 0.01 level (2-tailed). There was no significant difference between the generic and individual conditions overall.

We also compare the three conditions at their threshold of maximum accuracy. By investigating the ROC curve, we select for each of the three conditions the threshold which maximizes accuracy. These results are shown in table 4.11.

Setting | Maximum Accuracy | Threshold
Default | 0.717 | 0.70
Generic | 0.776 | 0.85
Individual | 0.828 | 0.45

Table 4.11: Maximum accuracy achieved across all (model prediction, student answer) points, with each prior probability setting

We see that the optimal threshold chosen for the generic setting was very similar to the one selected by cross-validation; the thresholds are lower for the default and individual settings. Even with default priors, the maximum accuracy remains above 0.7, which shows that the model can still perform well when accurate priors are not available. On maximum accuracy, the model with individualized priors is better able to distinguish between known and unknown factorizations than the model with generic priors; on the AUC metric, however, there is no difference. The AUC metric gives us an overall estimation of how good our model is as a predictor of post-test knowledge. However, the accuracy with a specific threshold is more useful in practice, because the agent uses a threshold to determine whether to intervene or not based on the student model posterior probabilities.

In practice, however, it may not be feasible for the model to begin with individual priors, as this would require asking students to write a test before playing the game, which certainly makes the game much less fun! It is worth noting that when setting the generic priors, we had population data for only ten numbers (the remainder were given the default setting), while there are over forty numbers in the first three levels alone. Model accuracy might be further improved by collecting population data on the other numbers encountered in the game.

In this section we have presented a refined model of student learning, with changes to the dependency between click and factorization nodes and refined parameters. Although this model has shown significant gains in accuracy, there are still improvements to be made. One such improvement is adding knowledge of the concept of common factors as an additional component that influences student climbing performance in Prime Climb. We discuss this addition in the next section.

4.3 New Model: Adding common factor knowledge

The reader will recall that in Section 4.1.3 we cited two problems with the existing model: the apportion of blame problem and the lack of account of common factor knowledge. Changes made to address the first problem brought about significant increases in accuracy. We now turn our attention to the second problem: modeling the students' knowledge of the common factor concept.
In this section we describe the modification made to the new model to account for common factor knowledge in the network, and the subsequent changes to the structure and conditional probability tables for the click nodes. We then describe the outcome of parameter refinement using the method of the previous section. Finally we present results on the accuracy of the network for predicting factorization knowledge as well as the students' knowledge of common factoring.

4.3.1 Structure

Because the models presented thus far do not model common factor knowledge, when a student makes an incorrect move despite knowing the factorization of both numbers involved in the move, the model can only infer that the student either made a slip or does not know the concept of sharing common factors. This obviously limits the system's capability to provide precise feedback based solely on model assessment. However, modeling common factor knowledge increases model complexity, as we will show shortly. To see how much we can actually gain from this addition, we generated a new model that includes a common factor node (CF) representing the probability that the student understands the common factor concept. There is one common factor node in each short-term student model. When a click node is added to the model, the common factor node is linked as its parent, along with the factorization nodes corresponding to the two numbers involved in the click. This configuration is shown in figure 4.12. Intuitively, this structure reflects that a correct or incorrect move is caused by the student's knowledge of the factorizations of the two numbers involved in the move as well as her knowledge of the common factor concept. The parameters that must be learned for this configuration are shown in the table in figure 4.12.
Note that the conditional probability table entry corresponding to an incorrect action when all the parent nodes are known (K) now isolates the probability of a slip. As before, the guess and e_guess parameters in the conditional probability table reflect potential differences in the likelihood of a lucky guess given different levels of existing knowledge.

CF | Fx | Fy | P(Click = C)
K | K | K | 1 - slip
K | K | U | e_guess
K | U | K | e_guess
K | U | U | guess
U | K | K | guess
U | K | U | guess
U | U | K | guess
U | U | U | guess

Figure 4.12: Click configuration with Common Factor node

We make the assumption that if the student does not understand the concept of common factoring she must guess, regardless of whether she knows the factorization of the two numbers involved in the click. This is reflected in the guess parameter on the last four lines of the table in figure 4.12. While this type of guess may differ from the situation in which the common factor concept is known but the factorizations of the two numbers are unknown (line 4 of the conditional probability table in figure 4.12), introducing a fourth parameter into the table would require more data to populate it, so we chose not to make this distinction between types of guesses.

4.3.2 Data-driven parameter refinement

As in the previous model, we have four parameters, slip, guess, e_guess, and max, whose values we estimate using data from our study. As before, we begin with estimates based on frequencies, and use these estimates as a starting point for estimation by cross-validation on the parameters for which we do not have enough data to rely on the frequency estimate.
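The table in figure 4.12 can be written as a small function (my own sketch; the defaults are the values estimated below, in tables 4.12 and 4.14):

```python
def p_click_correct_cf(cf_known, fx_known, fy_known,
                       slip=0.19, e_guess=0.6, guess=0.6):
    """P(Click = C | CF, Fx, Fy) per figure 4.12.  Without the common
    factor concept the student must guess regardless of factorization
    knowledge (the last four rows of the table)."""
    if not cf_known:
        return guess
    if fx_known and fy_known:
        return 1 - slip   # slip now isolates pure errors of distraction
    if fx_known or fy_known:
        return e_guess
    return guess
```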
Estimating parameters by frequencies

As with the previous model, when slotting moves into cases, we look only at items for which the assessment did not change from pre-test to post-test (i.e., no learning occurred). Determining whether a node is known or unknown was done by looking at whether the student got the question(s) corresponding to that concept correct on the test. We use the frequency data to find the parameter estimates where possible; the results are shown in table 4.12.

  Parameter   Estimate   Cases
  slip        0.19       41
  e_guess     0.71       10
  guess       0.57       7

Table 4.12: Parameter estimates from click frequencies

For example, there were 41 cases in which a student made a move involving two numbers she correctly identified the factors of on pre-test and post-test, and the student was assessed as knowing the common factor concept on both pre-test and post-test. Of these cases, 33 were correct moves and 8 were incorrect moves, resulting in a slip estimate of 0.19. Note that this value is lower than the slip value for the previous model, consistent with the fact that the new slip models only the probability of an error of distraction. We have only enough cases to set the slip parameter. To estimate the other parameters (including max), we use cross-validation.

Estimating parameters by cross-validation

We use cross-validation to estimate the remaining parameters. Using their frequency estimates as a guide, we try values of 0.5, 0.6 and 0.7 for both guess and e_guess, and 0, 0.2 and 0.4 for the max parameter. We fix the values to a particular (guess, e_guess, max) triplet, run the log files for each student, obtain model predictions on ten factorization nodes and the common factor node, and compare the model
predictions to the students' post-test results on the corresponding questions to obtain a measure of accuracy. This is done with 90 percent of the data to form one fold, and repeated for each of 10 folds. The setting triplet which achieves the highest average accuracy across folds on the training sets is selected. The accuracy reported is the average accuracy on the remaining 10 percent of the data for each of the folds. This is repeated for the same thresholds that were used before, and the average accuracies across folds with the optimal parameter settings are shown in table 4.13.

  Threshold   Training Set Accuracy   Std. Deviation
  0.4         0.707                   0.009
  0.5         0.752                   0.007
  0.65        0.763                   0.007
  0.8         0.772                   0.007
  0.95        0.730                   0.006

Table 4.13: Average training set accuracy across folds by threshold

The same threshold of 0.8 is chosen for this model as for the previous one, as it maximizes the average training set accuracy. The parameter settings that resulted from this process are shown in table 4.14.

  Parameter   Estimate
  e_guess     0.6
  guess       0.6
  max         0

Table 4.14: Parameter estimates which maximize the training set accuracy

These estimates show extremely good consistency with the parameters in the model without the common factor node. Also like that model, the new model is not very sensitive to small changes in the parameters.

4.3.3 Model Accuracy

As before, when reporting accuracy, we report the average test-set accuracy, where accuracy is measured by comparing model predictions with student knowledge assessed from post-test results. With this new model we have the ability to assess the students' knowledge of the common factor concept as well as factorization knowledge. Thus, we break accuracy down into accuracy on the factorization nodes and accuracy on the common factor node.

Accuracy on factorization nodes

The sensitivity, specificity, and accuracy on the ten factorization nodes for which we had an assessment are shown in table 4.15, broken down by fold.
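The triplet search just described is a plain grid search over candidate parameter settings; a sketch (here `evaluate_fold` is a stand-in, assumed for illustration, for replaying the student log files under one setting and returning that fold's training accuracy):

```python
from itertools import product

def grid_search(evaluate_fold, folds, guesses=(0.5, 0.6, 0.7),
                e_guesses=(0.5, 0.6, 0.7), maxes=(0.0, 0.2, 0.4)):
    """Return the (guess, e_guess, max) triplet with the best
    average training-set accuracy across folds."""
    best, best_acc = None, -1.0
    for params in product(guesses, e_guesses, maxes):
        acc = sum(evaluate_fold(params, f) for f in folds) / len(folds)
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc
```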
  Test Fold   Sensitivity   Specificity   Accuracy
  1           0.773         0.778         0.775
  2           0.821         0.846         0.833
  3           0.789         0.786         0.788
  4           0.692         0.750         0.721
  5           0.850         0.900         0.875
  6           0.773         0.875         0.824
  7           0.733         0.667         0.700
  8           0.829         0.667         0.748
  9           0.730         0.600         0.665
  10          0.686         0.824         0.755
  Average     0.768         0.769         0.768
  std dev.    0.057         0.099         0.064

Table 4.15: Testing sensitivity, specificity and accuracy on factorization nodes, by fold

The average accuracy across folds is 0.768 (std. dev. 0.064). This accuracy is comparable to that of the model without the common factor node.

Accuracy on the common factor node

The sensitivity, specificity, and accuracy on the common factor node are shown in table 4.16, broken down by fold. The average accuracy across folds is 0.500 (std. dev. 0.000), with an average sensitivity of 1.000 (std. dev. 0.000) and an average specificity of 0.000 (std. dev. 0.000). Essentially, the model always predicts that the common factor node is known, resulting in a perfect score for sensitivity but a low score for specificity. These results would lead us to believe that the threshold for the common factor node is set too low. The threshold of 0.8 was chosen because it maximized the training accuracy across all factorization nodes and common factor nodes. However, this choice is biased towards factorization nodes, as there are ten factorization nodes for every common factor node in the data set. Since the factorization and common factor nodes model different skills, it is conceivable that they have different optimal thresholds. Thus, we look at the average accuracy across 10 folds on the test set of the common factor node at different thresholds. (When estimating parameters, the accuracy measurement used for the training sets was computed across all 10 factorization nodes and the common factor node.) The thresholds range from 0.65 to 0.95 in increments of 0.05, as shown in table 4.17. We also include the threshold of 0.5.
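The sensitivity, specificity, and accuracy figures in these tables follow the usual definitions; a sketch of the computation (a node is predicted known when its posterior reaches the threshold, matching the strictly-less-than rule for unknown described later in the thesis):

```python
def classify(probs, threshold):
    """Predict known (True) when the posterior reaches the threshold."""
    return [p >= threshold for p in probs]

def evaluate(predicted, actual):
    """Sensitivity, specificity and accuracy of known/unknown
    predictions against post-test labels (True = known).
    A rate with an empty denominator is returned as None, as in
    the folds that contain no unknown cases."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    pos = sum(actual)
    neg = len(actual) - pos
    sensitivity = tp / pos if pos else None
    specificity = tn / neg if neg else None
    accuracy = (tp + tn) / len(actual)
    return sensitivity, specificity, accuracy
```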
We report the test set accuracy here because the parameters have already been fixed; however, the training set accuracies at each threshold followed the same trend. With a threshold of 0.95, the common factor node has an average test set accuracy of 0.715 (std. dev. 0.334), with a sensitivity of 0.754 (std. dev. 0.167) and a specificity of 0.667 (std. dev. 0.516). This is considerably better than the accuracy of 0.5. The high standard deviation observed in the specificity is likely due to sparse data. With 52 students, there are only 5 or 6 common factor points in each of the 10 folds. The data is even sparser for the specificity calculations, as only 14% of the common factor points were unknown in the data set, which is likely why we see a high standard deviation for specificity. In the next section we compare the three models presented thus far, and comment on the relative benefits of each.

  Test Fold   Sensitivity   Specificity   Accuracy
  1           1.000         0.000         0.500
  2           1.000         0.000         0.500
  3           1.000         0.000         0.500
  4           1.000         0.000         0.500
  5           1.000         -             -
  6           1.000         -             -
  7           1.000         -             -
  8           1.000         -             -
  9           1.000         0.000         0.500
  10          1.000         0.000         0.500
  Average     1.000         0.000         0.500
  std dev.    0.000         0.000         0.000

Table 4.16: Testing sensitivity, specificity and accuracy on common factor nodes, by fold. Folds 6-8 show no value in the specificity column as none of the data points in these folds had a post-test assessment of unknown; hence specificity (and accuracy) cannot be computed for these folds.

  Threshold   Sensitivity   Specificity   Accuracy
  0.50        1.000         0.000         0.500
  0.65        1.000         0.000         0.500
  0.70        1.000         0.000         0.500
  0.75        1.000         0.000         0.500
  0.80        1.000         0.000         0.500
  0.85        0.967         0.000         0.472
  0.90        0.886         0.167         0.514
  0.95        0.754         0.667         0.715

Table 4.17: Average testing sensitivity, specificity and accuracy on common factor nodes across folds, by threshold
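The sweep reported in table 4.17 amounts to re-scoring the same posteriors at each candidate threshold; a self-contained sketch (illustrative, not the thesis code):

```python
def threshold_sweep(probs, actual, thresholds):
    """(threshold, sensitivity, specificity, accuracy) rows, one
    per candidate threshold, with None for undefined rates."""
    rows = []
    for t in thresholds:
        pred = [p >= t for p in probs]
        tp = sum(p and a for p, a in zip(pred, actual))
        tn = sum((not p) and (not a) for p, a in zip(pred, actual))
        pos = sum(actual)
        neg = len(actual) - pos
        rows.append((t,
                     tp / pos if pos else None,
                     tn / neg if neg else None,
                     (tp + tn) / len(actual)))
    return rows
```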
Figure 4.13: Comparisons of the three models and the baseline chance using ROC curves ("ROC - Model comparisons"; 1-specificity on the horizontal axis)

4.4 Comparison of Models

We compare the three models (the old model, the new model without the common factor node, and the new model with the common factor node) using Receiver Operating Characteristic (ROC) curves in figure 4.13. Note that the sensitivities and specificities used to compute the curves are across factorization nodes. We also plot the baseline chance model, which predicts known fifty percent of the time. The AUC metric is used to compare the three models (table 4.18).

  Model         AUC     SE(AUC)
  Old           0.602   0.030
  New - no CF   0.846   0.017
  New - CF      0.839   0.018

Table 4.18: Area under the curve and standard error measures for each of the three models across factorization nodes

Pairwise comparisons yield significant differences between the old model and the new model without the common factor node (z=7.096) and between the old model and the new model with the common factor node (z=6.847). Both differences were significant at the p=0.01 level. There was no significant difference between the new model with the common factor node and without (z=0.278). Thus, we have significantly improved the model accuracy over the old model; however, there is no change in accuracy brought about by the addition of the common factor node. Admittedly, the assessment accuracy on the common factor concept is not very high, as seen in the previous section. One may suggest that the addition of the CF node would not substantially increase the model's capability to support precise didactic interventions, and thus may not be worth the potential delays in model updates due to larger conditional probability tables. However, our current data may not be sufficient for accurate parameter learning in this more complex model.
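The pairwise comparisons can be reproduced from table 4.18 with a two-sample z statistic. The thesis does not state the exact test used; the formula below assumes independent AUC estimates, and with the rounded values from the table it lands close to the reported z=7.096.

```python
import math

def auc_z(auc1, se1, auc2, se2):
    """z statistic for the difference of two AUCs, assuming
    independent estimates with the given standard errors."""
    return abs(auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Old model vs. new model without the CF node (table 4.18):
z = auc_z(0.846, 0.017, 0.602, 0.030)  # roughly 7.1
```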
As noted in section 4.3.1, there may be different types of guesses, which we have lumped together into the parameter guess. With more data to fill the frequencies in the conditional probability table, this distinction could be made. We plan to gather more data and see if that improves the accuracy of the model's CF assessment; however, this work does not form part of this thesis. An additional source of inaccuracy is that the gold standard against which we compare the common factor node's assessment is itself a subjective assessment. Differently from how we assessed factorization knowledge, whether the student knows how to common factor was deduced by examining the pattern of their answers on the common factoring questions and comparing them to their answers on the factoring questions. It could be that our own assessment of whether students understand the common factor concept is flawed, and thus it is not a good basis for comparison.

Alternative configuration with a Common Factor node

At this point, the reader may wonder about a fourth model configuration that has not yet been assessed. In table 4.19 we classify the three models presented thus far on two dimensions: click-factorization relation and presence of the common factor node. Changes have been incremental, and the common factor node was added to the new model. What if we had chosen to add the common factor node to the old model? The configuration used for testing this fourth model is shown in figure 4.14. The ROC curve for this model, computed across the 10 factorization nodes, is shown in figure 4.15; however, upon comparison with the old model without the CF node using the area under the curve metric, the addition of the CF node brings no significant improvement to the accuracy of the model on factorization nodes (z=0.212).
The optimal threshold for this model is 0.5, with a sensitivity of 0.743 and a specificity of 0.257, yielding an accuracy of 0.617. The optimal threshold for the common factor node is also 0.5 for this model, with a sensitivity of 0.587, a specificity of 0.510, and an accuracy of 0.549. This accuracy is lower than the accuracy of 0.715 reported for the new model with the common factor node at predicting common factoring knowledge. Although the original model with the common factor node is not as accurate as the new model with the common factor node, it is not significantly different in accuracy from the original model without the common factor node when predicting factorization knowledge. Thus, we use the original model with the common factor node as our original model when we compare new and original models in the ablation study described in chapter 6, as both provide a means for assessing the students' factorization knowledge and common factoring knowledge. Now that we have shown that our new model is a relatively accurate student model, we turn to the question of whether this model can significantly improve the pedagogical agent's interventions to support students as they play Prime Climb. Before we discuss the study that we ran to investigate this, we explain a new algorithm that we developed to improve the agent's hinting strategy based on this model. This is discussed in the next chapter.

Figure 4.14: The original model including a common factor node

Figure 4.15: ROC curve for the alternative configuration of the original model with common factor node ("Original model with CF node"; 1-specificity on the horizontal axis). Sensitivity and specificity are computed across the 10 factorization nodes.

  CF node    Diagnostic   Causal
  absent     Old Model    New Model
  present    ?            New Model with CF

Table 4.19: Models classified on two dimensions: click-factorization relation and presence of a common factor node

4.5 Implementation

The student model is written in Visual C++. The Bayesian networks are built using the MSBNx toolkit (http://research.microsoft.com/adapt/MSBNx/).

Chapter 5

Changes to Agent Interventions

In this chapter we describe changes that we made to the agent's hinting strategy. The original hints that the agent provided [28] were based on a model without a common factor node; thus, the hinting strategy did not use a model's assessment of common factoring knowledge to tailor hints specifically for the case when the student does not understand the concept of common factors. As we had to update the agent's hints to take common factor knowledge into account, we took this opportunity to redesign the agent's hinting strategy.

5.1 Original hinting strategy

In the version of the game described in [28], the agent had seven hints grouped into three categories, as shown in table 5.1. The first column shows the hints that were provided if the model indicated that the student did not know the factorization of either the player's or the partner's number, and the student made an incorrect move. These hints progress from general to specific. As the previous model did not have a common factor node, it was assumed that if a student fell but the model indicated that she knew the factorizations of the two numbers involved in the move, the incorrect move must have been due to the student not knowing the concept of common factoring or not understanding the game rules. Hence, in this situation the agent provides a first hint on the game rule and second and third hints on common factoring, as shown in the second column of table 5.1. Again, these hints progress from general to specific. This "assessment by elimination" assumes that the only reason a student would make an error if she had the relevant
factorization knowledge is that she did not understand the rule or common factors, overlooking the possibility that she made a simple slip. However, without a model which assesses common factoring knowledge, a heuristic must be used to decide when to hint on common factors. With the new hinting strategy we will be able to use the model to assess when the student does not know the concept of common factoring, and hint regardless of whether the move was correct or incorrect. The model implicitly takes into account students' slips. However, the model does not capture whether the student understands the game rules; thus, the new hinting strategy will use a heuristic to determine when to provide a hint on the game rules. The final hint, shown in the third column of table 5.1, was provided for the situation in which the model indicates that a student does not know the factorization of one of the numbers involved in the move, yet she still moves correctly. This hint was designed to deal with lucky guesses by getting the student to slow down and think about her actions.

  Fs or Fp unknown, incorrect move:
    1. "Think about how to factorize the number you clicked on / your partner's number"
    2. "Use the Magnifying glass to help you"
    3. "It can be factorized like this: x1*x2*...*xn"
  Fs or Fp known, incorrect move:
    1. "You cannot move to a number which shares common factors with your partner's number"
    2. "Use the Magnifying glass to see the factor trees of your and your partner's numbers"
    3. "Do you know that x and y share z as a common factor?"
  Fs or Fp unknown, correct move:
    1. "Great, do you know why you are correct this time?"

Table 5.1: Hints in the current game version, where s is the number the student is on and p is the number the partner is on
5.2 New hinting strategy

With the addition of the common factor node to the model, we can now isolate the situations in which a student does not understand the common factor concept, as well as situations in which the student does not know the factorization of one of the two numbers involved in the move. We view the problem of designing an agent hinting strategy as three tasks:

1. Deciding which skill to hint on
2. Deciding when to hint on each skill
3. Deciding how to hint on each skill

Choosing which skill to hint on

When deciding what to hint on, we use the following strategy:

• If the common factor node is unknown and the factorization nodes relevant to the click are known, provide hints on common factors.
• If one or more of the factorization nodes relevant to the click are unknown and the common factor node is known, provide hints on factorization.
• If both the common factor node and one or more of the factorization nodes relevant to the click are unknown, either hint on common factors or factorization, alternating each time this situation arises.
• If all of the relevant nodes are known, do not hint.

  Node type             Old Model with CF node   New Model with CF node
  common factor node    0.50                     0.95
  factorization nodes   0.50                     0.80

Table 5.2: Optimal thresholds determined in chapter 4 for the common factor node and the factorization nodes for each of the two models

Deciding when to hint on each skill

In this strategy, we need to pick a threshold for determining whether the nodes are known or unknown. In section 4.3.3 we observed that different thresholds were appropriate for factorization nodes and for the common factor node, so in the hinting algorithm each of these types of nodes is given a different threshold.
This algorithm, which determines when to provide hints on common factors or on factorization, is shown in figure 5.1, where CFTHRESHOLD is the optimal threshold for the common factor node, FACTHRESHOLD is the optimal threshold for the factorization nodes, and model(Fx) returns the model's assessment of the student's knowledge of the factorization of x, i.e. the posterior probability of the node Fx (and similarly for model(cfNode)). Optimal thresholds were determined for both the original model with common factor node (in section 4.4) and the new model with common factor node (in section 4.3), and are shown in table 5.2.

INPUTS: num1, num2
INITIALIZATION:
1  bool cf_unknown = (model(cfNode) < CFTHRESHOLD)
2  bool fac_unknown = (model(F_num1) < FACTHRESHOLD) OR (model(F_num2) < FACTHRESHOLD)
HINTING ALGORITHM A:
3  if (cf_unknown AND (NOT fac_unknown))
4      hint_on_common_factors
5  else if (fac_unknown AND (NOT cf_unknown))
6      hint_on_factorization
7  else if (cf_unknown AND fac_unknown) {
8      if (last hint was on common factors)
9          hint_on_factorization
10     else
11         hint_on_common_factors
       end if
   }
   end if

Figure 5.1: Algorithm A: used to determine whether to give a hint after the student moves to num1 while the partner is on num2.

On lines 4, 6, 9, and 11 of the hinting algorithm in figure 5.1, the system provides a hint to the student about either common factors or factorization. Before turning our attention to the form these hints will take, we wanted to get a rough idea of how this timing strategy would play out during interaction with the game. Thus, we implemented the algorithm in figure 5.1 with prototype hints and asked graduate students in our computer science department to play the game and give feedback on the frequency of the hints.
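For concreteness, Algorithm A can be rendered as a runnable sketch. Python is used purely for illustration; `model` here stands in for querying the Bayesian network for the posterior probability that a skill is known, and the thresholds are the new-model values from table 5.2.

```python
class HintSelector:
    """Algorithm A: decide which skill, if any, to hint on."""
    CF_THRESHOLD = 0.95
    FAC_THRESHOLD = 0.80

    def __init__(self):
        self.last_hint = None

    def choose(self, model, num1, num2):
        # `model` maps "cf" or a number to the posterior P(known).
        cf_unknown = model["cf"] < self.CF_THRESHOLD
        fac_unknown = (model[num1] < self.FAC_THRESHOLD
                       or model[num2] < self.FAC_THRESHOLD)
        if cf_unknown and not fac_unknown:
            skill = "common_factors"
        elif fac_unknown and not cf_unknown:
            skill = "factorization"
        elif cf_unknown and fac_unknown:
            # Both weak: alternate each time this situation arises.
            skill = ("factorization"
                     if self.last_hint == "common_factors"
                     else "common_factors")
        else:
            return None  # all relevant skills known: no hint
        self.last_hint = skill
        return skill
```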
The overwhelming observation from this informal experiment was that the agent intervened too often after the player made a correct move. Examining the hinting algorithm in figure 5.1, there is no mention of whether the move is correct or not. Because we do not need to reason by elimination to determine whether common factoring is known, as was done in [28], we do not need to use the move's correctness to determine when to hint on common factors. If the model indicates that the student is lacking sufficient knowledge of a skill, the agent intervenes, regardless of whether the student made a correct move. This is important because one of the problems with educational games is that students can learn how to play well without having the underlying knowledge. However, from a practical standpoint, students may find it frustrating to be interrupted after they have executed a correct move, even if the interruption is justified. This issue can be resolved using a decision-theoretic approach, as was done in [38] and [52]. Decision theory provides us with a method of choosing between possible actions. In this framework we assign a numerical value to the utility of different actions given knowledge; in our case, to the utility of hinting when the student lacks the skill, hinting when the student possesses the skill, not hinting when the student lacks the skill, and not hinting when the student possesses the skill. These utilities are weighted with the model's assessments of the probability that the student possesses each skill to obtain the expected utility of each action. The action with the greatest expected utility is chosen. The model of student affect that is being developed in parallel to our learning model should allow us to make more informed decisions on how to balance affect
and learning by having both an accurate assessment of these measures and decision strategies that allow us to strike a reasonable tradeoff between the two. However, as our study participants were school children, we did not have time to implement a full decision-theoretic action selection strategy based on both the affective and learning models before the end of the school year in which we conducted the study. Thus, we devised a heuristic which uses the correctness of the move to temporarily work around the identified problem of intervening too often after a correct move.

INPUTS: num1, num2
INITIALIZATION:
1  bool wrong_move = shareCommonFactors(num1, num2)
2  if (wrong_move)
3      bool cf_unknown = (model(cfNode) < CF_wrongTHRESHOLD)
4      bool fac_unknown = (model(F_num1) < FAC_wrongTHRESHOLD) OR (model(F_num2) < FAC_wrongTHRESHOLD)
5  else
6      bool cf_unknown = (model(cfNode) < CF_correctTHRESHOLD)
7      bool fac_unknown = (model(F_num1) < FAC_correctTHRESHOLD) OR (model(F_num2) < FAC_correctTHRESHOLD)
   end if
HINTING ALGORITHM B:
8  if (cf_unknown AND (NOT fac_unknown))
9      hint_on_common_factors
10 else if (fac_unknown AND (NOT cf_unknown))
11     hint_on_factorization
12 else if (cf_unknown AND fac_unknown) {
13     if (last hint was on common factors)
14         hint_on_factorization
15     else
16         hint_on_common_factors
       end if
   }
   end if

Figure 5.2: Algorithm B: used to determine whether to give a hint after the student moves to num1 while the partner is on num2.

  Node type             Old Model with CF node           New Model with CF node
                        wrong move      correct move     wrong move      correct move
  common factor node    initial: 0.50   initial: 0.50    initial: 0.95   initial: 0.95
                        final: 0.50     final: 0.40      final: 0.95     final: 0.85
  factorization nodes   initial: 0.50   initial: 0.50    initial: 0.80   initial: 0.80
                        final: 0.50     final: 0.50      final: 0.80     final: 0.50

Table 5.3: Values for the four thresholds in each of the two models
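The threshold selection in Algorithm B, with the final new-model values from table 5.3, can be sketched as follows (illustrative only; `model` again stands in for the network query, and only this threshold selection differs from Algorithm A):

```python
# Final thresholds for the new model with CF node (table 5.3):
THRESHOLDS = {
    # (node type, move was wrong?) -> threshold
    ("cf", True): 0.95, ("cf", False): 0.85,
    ("fac", True): 0.80, ("fac", False): 0.50,
}

def unknown_skills(model, num1, num2, wrong_move):
    """Return (cf_unknown, fac_unknown) for this move. A node
    counts as unknown when its posterior is strictly below the
    threshold chosen by the move's correctness."""
    cf_unknown = model["cf"] < THRESHOLDS[("cf", wrong_move)]
    fac_t = THRESHOLDS[("fac", wrong_move)]
    fac_unknown = model[num1] < fac_t or model[num2] < fac_t
    return cf_unknown, fac_unknown
```

A correct move thus has to leave the model considerably less convinced before the agent interrupts, which is the engagement-preserving behaviour this heuristic is after.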
Using this heuristic, if the move is correct, we set the thresholds lower for the common factor node and the factorization nodes. There are therefore four thresholds in the model: CommonFactor-WrongMove, CommonFactor-CorrectMove, Factorization-WrongMove, and Factorization-CorrectMove. A node is judged to be unknown if its probability is strictly less than the relevant threshold. The pseudo-code for the algorithm which uses this heuristic is shown in figure 5.2. Note that this algorithm is identical to the one presented in figure 5.1, except for lines 1-7, in which we first check whether the move is correct or wrong in order to decide which thresholds to use to determine whether the skill is known. We chose initial values for the thresholds, then adjusted these values after observing their effect on the timing of the agent's hints during game play. The initial thresholds (see table 5.3) were set to the values chosen in section 4.4 for the old model and sections 4.3.2 and 4.3.3 for the new model, as shown in table 5.2. We then asked a number of graduate students to play with the game as we varied the thresholds. We asked the subjects to comment on the timing of the interventions and strove to achieve a balance in which we thought that the agent did not intervene too much, but did intervene when there was a problem. The threshold values were further adjusted after two pilot studies, which are described in the next section. The final values which resulted for each threshold are shown in table 5.3. The reader will note that in the case of an incorrect move, the thresholds did not change from initial to final values, indicating that we selected good initial estimates in chapter 4. In the case of the old model, only the threshold for CommonFactor-CorrectMove was changed. Although this method of parameter adjustment is not very rigorous, it has prior precedent in work by Mayo and Mitrovic [52].
In this paper, the authors set the parameter which determines how rapidly old evidence is discounted in the decision-theoretic action selection model for the CAPIT tutor by simulating student actions and observing what effect this has on the actions selected by the system. An expert observed the actions selected for different values of the parameter to determine the ideal parameter setting. Although graduate students cannot be considered experts in our field, we feel that they were able to provide valuable insight into the help interventions offered by the system. We acknowledge that using this heuristic to determine when to provide help means that the accuracy of the model interventions is now different from that reported in chapter 4. The ramifications of this change are discussed in the next chapter. Setting these thresholds highlights the tradeoff inherent in educational games. To maintain engagement we want to choose a low threshold that may favour student positive affect over pedagogical effectiveness. To ensure pedagogical effectiveness we want to choose a high threshold that allows the system to intervene in all cases in which there is a problem, even if that means intervening too often. Examining our human-adjusted thresholds indicates to us that in the Prime Climb game environment it seems more natural to prefer engagement over learning, as in all cases the threshold remained the same or went down from initial to final value. In other words, the penalty for giving a hint when it was not wanted was deemed to be quite high by our players. A decision-theoretic approach with a model of student affect would allow us to explore
this further.

Deciding how to hint on each skill

In the previous section we provided an algorithm for deciding when to hint on each skill, common factoring or factorization. We turn our attention now to the strategy for how to hint on these skills. In general, we wish to progress from general hints to specific hints, as is commonly done in several ITSs. In the Andes Physics tutor [80], three hints are available for each student error: a pointing hint, a teaching hint, and a bottom-out hint. The first hint directs the student's attention to the location of the error, the second gives a general description of how to solve the problem, and the third tells the student what to do (see table 5.4, column 1). In the Cognitive Tutors [29], there are also three levels of hints for each student error (see table 5.4, column 2). At the first level, the hint advises or reminds students of the appropriate goal at this stage of problem solving. At the second level, general advice is given on how to achieve the goal. At the third level, concrete advice is given on how to solve the goal in the current context. Each level may contain multiple hints. We model the hints in our new hinting algorithm after these tutors, as shown in the third column of table 5.4. We use this progression of hint categories regardless of whether we are hinting on factors or common factors.

  Level 1:
    Andes [80]: pointing hint which directs the student's attention to the location of the error
    Cognitive Tutors [29]: hint which advises the student of the goal she should be pursuing at this point in problem-solving
    Our approach: hint which focuses the student's attention on the appropriate skill; either remind about the concept of common factors or incite to think about the factors of the relevant numbers
  Level 2:
    Andes [80]: teaching hint consisting of a general description of how to solve the problem
    Cognitive Tutors [29]: general description of how to achieve the goal
    Our approach: general hint either teaching with a definition and general example or pointing to tools the student can use to find the answer
  Level 3:
    Andes [80]: bottom-out hint which tells the student what to do
    Cognitive Tutors [29]: hint with concrete advice on solving the goal in the current context
    Our approach: bottom-out hint giving a worked-out example using the numbers from the current context

Table 5.4: Pedagogical hinting strategy for three levels of hints progressing from general to specific for the Andes Physics tutor [80], the Cognitive Tutors [29], and the new hinting strategy for Prime Climb

  Level 1 (Focus):
    Common factoring: "You cannot click on a number which shares a common factor with your partner's number"
    Factorization: "Think carefully how to factorize the number you clicked on / your partner's number"
  Level 2 (Def. 1):
    Common factoring: "A common factor is a number that divides into both numbers without a remainder. Here's an example."
    Factorization: "Factors are numbers that divide evenly into the number. Here's an example."
  Level 2 (Def. 2):
    Common factoring: "A common factor is a factor of both numbers. Read this example."
    Factorization: "Factors are numbers that multiply to give the number. Look at this example."
  Level 2 (Tool):
    Common factoring: n/a
    Factorization: "You can use the magnifying glass to see the factors of the number you clicked on / your partner's number"
  Level 3 (Bottom-out), same for both skills: "You are right because x and y share no common factors / You fell because x and y share z as a common factor. x can be factorized as x1*x2*...*xn. y can be factorized as y1*y2*...*ym."

Table 5.5: Progression of Common Factor and Factorization hints

Figure 5.3: A level 1 hint being spoken by Merlin and shown in a speech bubble.

Figure 5.4: Merlin speaking level 2 definition hints with corresponding dialogue boxes showing examples: (a) common factors definition 1, (b) common factors definition 2, (c) factors definition 1, and (d) factors definition 2.
The specific hints given in each of these hint categories for the skills of common factoring and factorization are shown in table 5.5. In the first category of hints, we point the student's attention to the appropriate skill. For common factoring this is done by reminding the student of the game rule which states that the student cannot click on a number which shares a common factor with her partner's number. For factorization, focusing is done by encouraging the student to think about how to factorize the relevant number. For this hint, "the number you clicked on" or "your partner's number" is selected depending on which number, the player's or her partner's, has a lower probability in the model. Note that these two hints are the same as in the previous hinting strategy (table 5.1), although the wording has been changed slightly to more clearly express the goal that the hint conveys. These hints are spoken aloud by Merlin and shown as a speech bubble coming from his mouth (see figure 5.3). The second level contains three hints. The purpose of these hints is to give the student general advice which will help her to solve a problem by herself. To achieve this we provide both reteaching in the form of definitions and examples (see level 2, def. 1 and def. 2 in table 5.5), and pointers to tools that she can use to find her answer (see level 2, tool hint in table 5.5). The definitions and examples are new to this hinting strategy, as no similar hints exist in the original strategy. We reach this level of hint if the first level hints have been given and the model indicates that the specific skill, either common factoring or factorization, is unknown; thus the student should benefit from seeing a definition of the skill. In addition, a dialogue box appears a few seconds after Merlin speaks the definition. The box shows a worked-out example which reinforces the definition that was just given.
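The choice between the two wordings, based on which factorization node has the lower probability in the model, can be sketched as follows. This is a hedged illustration: the function name `pick_hint_target` and the plain-float interface are our own simplification of the model query, not the actual Prime Climb code:

```python
def pick_hint_target(p_player: float, p_partner: float) -> str:
    # p_player:  model's probability that the student knows the
    #            factorization of the number she clicked on
    # p_partner: the same probability for her partner's number
    # The hint mentions whichever number the model assesses as
    # less likely to be known.
    if p_player <= p_partner:
        return "the number you clicked on"
    return "your partner's number"

# The model is less sure about the partner's number here,
# so the hint refers to it.
print(pick_hint_target(0.8, 0.3))  # -> your partner's number
```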
The examples are general: they serve to solidify the student's understanding of the definition and can be used as a template for finding the factors or common factors of other numbers that the student sees on the mountain. Examples are shown for the two common factors definitions (figures 5.4a and 5.4b) and for the two factors definitions (figures 5.4c and 5.4d). As shown in table 5.5, we provide two different number factorization and common factors definitions at level 2. The reason for this is that there is no single unifying definition of either common factors or factorization. A search for "common factors" online brings up many websites providing either a definition relating to division (definition 1) or a definition using the factors (definition 2), or both. Similarly for "number factorization", the definitions are evenly split between those relating to division (definition 1) and those relating to multiplication (definition 2). For lack of a single unifying definition we chose to provide two hints at this level, one for each of the definitions. Note that in the accompanying example box, the general example is worked out in a way that is consistent with the definition currently given by Merlin. The tool hint at level 2 is applicable only to the factorization hints, as there is only a tool which helps the student find factors, i.e. the magnifying glass. This hint is the same as the one given in the original hinting strategy, however Merlin says "the number you clicked on" or "your partner's number" depending on which number, the player's or her partner's, has a lower probability in the model (see table 5.5). At the third level there is only one hint, the bottom-out hint (see table 5.5). At this level we provide an example in context, using the numbers that the student and her partner are on. This hint is the same for common factoring and factorization.
"You are right because x and y share no common factors" is given after the student makes a correct move, and "you fell because x and y share z as a common factor" is selected after the student makes an incorrect move. The factorizations of x and y are then given.

Figure 5.5: Agent hint with dialogue box asking if the student would like more hints.

Although we believe that the agent should still provide hints regardless of the correctness of the move if the model predicts that the student is lacking relevant knowledge, it may appear strange from the student's standpoint if she were to receive a help message after making a correct move. To make the hints in these situations appear more natural, they are prefaced with a line acknowledging the student's correct move, for example "Great move. However, you might still need some help" or "Well done. Here is another hint that will help you in the game". It is a common pedagogical practice to begin by praising a student's success and to follow with a suggested improvement.

Progressing through hints

We thought that it was important for students to be able to request more hints if they wished, as is done in several ITSs (e.g. [80, 29]). With the exception of the bottom-out hint, every agent-spoken hint is followed with a dialogue box asking the student if she would like to see another hint. This forms part of the dialogue box that displays the example in the case of the definition hints (see figure 5.4) and stands alone for the other hints (see figure 5.5). After each student's move, the agent uses the model to determine whether to hint and which skill - common factoring or factorization - to hint on, using Algorithm B from figure 5.2. If the system chooses to hint on common factors, the subroutine
SUBROUTINE: hint_on_common_factors
1  if (nextHint_cf = Focus)
2      give Focus-CF hint
3      nextHint_cf = Def1_cf
4  else if (nextHint_cf = Def1_cf)
5      give Definition1-CF hint
6      show Example1-CF dialogue box
7      seen Definition1-CF hint = true
8      nextHint_cf = Def2_cf
9  else if (nextHint_cf = Def2_cf)
10     give Definition2-CF hint
11     show Example2-CF dialogue box
12     seen Definition2-CF hint = true
13     nextHint_cf = Def1_fac
14 else if (nextHint_cf = Def1_fac) OR (nextHint_cf = Def2_fac) OR (nextHint_cf = Bottom_out)
15     if (NOT seen Definition1-Factor hint)
16         give Definition1-Factor hint
17         show Example1-Factor dialogue box
18         nextHint_cf = Def2_fac
19     else if (NOT seen Definition2-Factor hint)
20         give Definition2-Factor hint
21         show Example2-Factor dialogue box
22         nextHint_cf = Bottom_out
23     else
24         give Bottom-out hint
25         nextHint_cf = Bottom_out
       end if
   end if
26 if (nextHint_cf NOT= Bottom_out)
27     ask if student wants more hints
28     if (student_response = 'YES')
29         hint_on_common_factors
30     else
31         nextHint_cf = Def1_cf
       end if
   end if

Figure 5.6: Subroutine to hint on common factors, which is called by the hinting algorithm presented in figure 5.2. The specific hints are given in table 5.5.
SUBROUTINE: hint_on_factors
1  if (nextHint_fac = Focus)
2      give Focus-Factor hint
3      nextHint_fac = Def1_fac
4  else if (nextHint_fac = Def1_fac)
5      give Definition1-Factor hint, show Example1-Factor dialogue box
6      seen Definition1-Factor hint = true
7      nextHint_fac = Def2_fac
8  else if (nextHint_fac = Def2_fac)
9      give Definition2-Factor hint, show Example2-Factor dialogue box
10     seen Definition2-Factor hint = true
11     if (second cycle through hints)
12         nextHint_fac = Bottom_out
13     else
14         nextHint_fac = Tools_fac
15 else if (nextHint_fac = Tools_fac)
16     give Tools-Factor hint
17     if (second cycle through hints)
18         nextHint_fac = Def1_fac
19     else
20         nextHint_fac = Def1_cf
21 else if (nextHint_fac = Def1_cf) OR (nextHint_fac = Def2_cf) OR (nextHint_fac = Bottom_out)
22     if (NOT seen Definition1-CF hint)
23         give Definition1-CF hint, show Example1-CF dialogue box
24         nextHint_fac = Def2_cf
25     else if (NOT seen Definition2-CF hint)
26         give Definition2-CF hint, show Example2-CF dialogue box
27         nextHint_fac = Bottom_out
28     else
29         give Bottom-out hint
30         nextHint_fac = Bottom_out
       end if
   end if
31 if (nextHint_fac NOT= Bottom_out)
32     ask if student wants more hints
33     if (student_response = 'YES')
34         hint_on_factors
35     else
36         nextHint_fac = Tools_fac
37         second cycle through hints = true
       end if
   end if

Figure 5.7: Subroutine to hint on number factorization, which is called by the hinting algorithm presented in figure 5.2. The specific hints are given in table 5.5.

Figure 5.8: (a) Common factor stream of hints, (b) factors stream of hints, cycle 1, (c) factors stream of hints, cycle 2.

hint_on_common_factors is called. The pseudo-code for this subroutine is shown in figure 5.6. This subroutine provides the student with one of the common factor hints in table 5.5 and, if it is not the last hint, asks if she would like to see another hint.
Another hint is provided to the student if she clicks 'YES' to more help in the dialogue box (see figure 5.6, lines 27-29) or the next time the system chooses to intervene with a hint on common factors after a move (using hinting algorithm B in figure 5.2). Because the student can continue to ask for more hints, it is possible for the student to request every one of the hints in the common factor stream, right down to the bottom-out hint (at which point she will not be given the option of requesting another hint). Otherwise, the agent will only provide one unsolicited hint per move. At the next hinting opportunity, it will provide the next hint in the stream. This progression through the stream of hints for common factoring is shown graphically in figure 5.8a. The common factor stream progresses from general to specific, through the level 1 focus hint and the level 2 hints definition 1 and definition 2. Before giving the final bottom-out hint, however, we also provide the student with the two factor definition hints (see the hint_on_common_factors subroutine in figure 5.6 or the common factor stream in figure 5.8a). We do this because we want the student to see a general factors example before seeing the bottom-out hint, which provides the current problem's entire solution. This progression ensures that the first time through the common factor hint sequence the student will see each of the factor and common factor definitions. We could have chosen to only offer the factor hints if the relevant factorization nodes were below threshold, however we felt that this would further complicate the flow of the algorithm, so we use this heuristic instead. We will see in the next chapter that this did not generate too many unnecessary hints. After the student has reached the bottom-out hint, the next time the agent chooses to hint on common factors, the stream starts again at definition 1 - common factors.
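The first pass and the restart behaviour of the common factor stream can be illustrated with a small state machine. This is a simplified sketch of figure 5.8a in our own code (the class name `HintStream` is invented, and the handling of a 'NO' reply is omitted):

```python
class HintStream:
    # First pass: focus, both common-factor definitions, both factor
    # definitions, then the bottom-out hint (cf. figure 5.8a).
    FIRST_PASS = ["Focus-CF", "Def1-CF", "Def2-CF",
                  "Def1-Fac", "Def2-Fac", "Bottom-out"]
    # Later passes skip the focus hint and the factor definitions,
    # cycling through the common-factor definitions with new examples.
    LATER_PASS = ["Def1-CF", "Def2-CF", "Bottom-out"]

    def __init__(self):
        self.sequence = list(self.FIRST_PASS)
        self.pos = 0

    def next_hint(self) -> str:
        hint = self.sequence[self.pos]
        self.pos += 1
        if self.pos == len(self.sequence):
            # After the bottom-out hint the stream restarts at
            # definition 1 - common factors.
            self.sequence = list(self.LATER_PASS)
            self.pos = 0
        return hint

stream = HintStream()
for _ in range(6):           # exhaust the first pass
    last = stream.next_hint()
print(last)                  # -> Bottom-out
print(stream.next_hint())    # restart -> Def1-CF
```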
The reason we do not repeat the focus hint is that the student has already seen all of the hints, and thus should be focused on the skills already. This is different from the original hinting strategy, in which each one of the hints was repeated. We assume that at this stage the student will benefit from seeing the definitions again, this time with examples using different numbers. This second time through the sequence, however, the agent will not provide the hints on factors, as the student will have already seen them. The agent continues to cycle from definition 1 - common factors to the bottom-out hint in this sequence, and has a repertoire of many different examples. There is a similar progression for the factors hints, shown in pseudo-code in figure 5.7, and graphically in figures 5.8b and 5.8c. Figure 5.8b shows the first cycle through the stream, and figure 5.8c shows the second. The stream proceeds through the focus, definition 1, definition 2, and tool hints. The next hints in the sequence are the common factors definitions 1 and 2, if the student has not already seen them. Again, the purpose of showing these two hints is to ensure that the student has seen all of the definitions before moving on to the bottom-out hint. The stream ends with the bottom-out hint, after which the student cannot request more hints. The second time through the factors stream, after the bottom-out hint has been given, we proceed through the hints in a slightly different order (see figure 5.8c). Like the corresponding common factors stream, the level-1 focus hint is not repeated because the student should already be focused at this point in the interaction. However, the level-2 hints are given in a slightly different order the second time through the cycle. We begin with the tool hint, which reminds the student to use the magnifying glass to see factors of her number or her partner's number.
This is inserted before the definitions because at this stage the student has seen all definitions and examples at least once, and possibly needs only a reminder to use the magnifying glass in order to put together the pieces of what she has learned and find the answer for herself. The rest of the hints proceed as before, ending at the bottom-out hint (figure 5.8c). When a hint is given, we currently do not change the network in any way. In future, we may choose to update the model after showing a hint to reflect the fact that the student's knowledge increases from this event. However, as we did not yet know what effect the hints had on students' learning, we chose to leave this as future work.

5.3 Small pilot study of the new hinting strategy

The agent's new hinting strategy and the game interface were piloted with a grade five and a grade six student, one male and one female. One student played the game with the agent acting on the old model with the common factor node, and the other played the version with the new model with the common factor node. Each student played for approximately ten minutes, and an experimenter played as her partner. After game play, the experimenter verbally asked the students questions about the agent and the game, encouraging them to expand on any points they raised. The questions and excerpts from the responses which are relevant to the agent's hinting style are summarized below.

1. Did the agent Merlin intervene too often? too little?

S1: "in the early levels he didn't, but later on he did (intervene too often)"

S2: "when I was playing for longer he did. If you say you don't want help he shouldn't help you as much" (clarification: after saying 'NO' to more hints, the student felt that the agent should not intervene for a while afterwards, but could not be more specific on the amount of time. However, the student did not feel that Merlin should stop intervening altogether.)

2.
Did the agent Merlin say anything unusual? anything that you didn't understand?

S1: "I understood, but he repeats himself. It should be shorter"

S2: "I understood everything he said"

3. Which would you like better, if Merlin said things aloud and showed them written, or if you just saw them coming out of his mouth? (demo of both options for the same hint with Merlin)

S1: "His talking distracts me. I can't think when he's talking... although I might not listen to him otherwise"

S2: "You can't understand what he's saying anyhow, so I don't like the voice."

4. Is there anything that you didn't understand about the game? didn't like about the game? What would you change if you could?

S1: "No, I liked the game"

S2: "The last level seemed slow. Sometimes I would click and my player wouldn't move right away, so I would click again, and that was annoying"

In response to these comments from students in our target audience, we made several changes to the hinting style.

1. To address the responses to the first question, we adjusted the thresholds for each of the two versions so that Merlin did not intervene as often; the results were already described in section 5.2. However, we did not follow the second student's suggestion that the agent should not hint for a period of time after the student has said she does not want more hints. It would be possible to implement a heuristic which says that the agent will not intervene for x moves after the student has said she does not want more help. However, in the ablation study described in the next chapter, we wished to test the effectiveness of an agent which bases its interventions on the model alone, without a heuristic for affect. Thus, the agent does not provide another hint after the student says 'NO' to more hints; however, on her next move, assuming that the model assesses her knowledge as lower than the threshold, the student would be given another hint.

2.
To address the responses to the second question, we reduced the amount of text in the hints for the second time through the cycle. For example, the first time through the cycle, Merlin says "A common factor is a number that divides into both numbers without a remainder. Here's an example." The second time through he says "A common factor divides into both numbers without a remainder" and follows directly with the example box. We also changed the wording of the text that comes before hints that follow correct moves, so that each one starts off differently ("Good job", "Well done", etc.) and the hints do not seem as repetitive. This is in line with the findings presented in [80] and [4], which advise keeping help messages as short as possible in an ITS.

3. The responses to the third question clearly indicate that Merlin's voice is difficult to understand. Unfortunately the speech generator that we are using does not produce clear or lively speech. We chose to eliminate Merlin's voice and provide hints in text only.

4. Finally, we acknowledge that in the very high game levels, in which the model contains many large numbers with interconnected factors, the game did slow down noticeably.¹ In the user studies we explained to the students that this might happen and told them to only click once for each move they wished to make, instead of trying to get a response through repeated clicks that could further clog the system. This may be a concern that can be overcome with further optimization of the game code and faster computers.

We use this new agent hinting strategy in the ablation study which is described in the next chapter.

¹ We were running the game on two Pentium 3 laptop computers, 190MHz, 196 MB memory, running Windows 2000.

Chapter 6 Ablation Study

The purpose of the ablation study was to determine what role the Prime Climb student model plays in the pedagogical effectiveness of the game.
A previous study [28] determined that students playing a version of Prime Climb with a pedagogical agent learned more than those playing a version without. Although this indicates that the pedagogical agent had an impact on learning, it does not give us insight into the role that the student model, which had an accuracy of 50.8%, played in this learning. We would like to determine whether the learning gains are greater when the agent is working from a more accurate student model.

6.1 Study design

The study was run in two Vancouver elementary schools with 48 sixth grade and 4 seventh grade students. 27 of the students were female, and 25 were male. The students were randomly assigned to one of three conditions:

• No Agent - Students in the no agent condition played a version of Prime Climb that did not contain a pedagogical agent providing help interventions.

• Old Model - Students in the old model condition played a version of Prime Climb that was identical to the no agent version, but in addition contained a pedagogical agent. The agent provided unsolicited hints based on the original model of student learning with the common factor node presented in section 4.4. The hinting strategy used by the agent was the one described in chapter 5.

• New Model - Students in the new model condition played a version identical to the old model condition; however, the model that the agent used to select its interventions was the new model with the common factor node presented in section 4.3.

Figure 6.1: Experimental setup with students sitting opposite experimenters and cameras catching both the student's expression and the game screen (as seen on the experimenter's computer).

We chose to use the original model and new model both with the common factor nodes in order to provide a consistent hinting strategy as described in chapter 5.
Thus, in each of the with-model conditions the agent provided hints based on the model's assessment of the student's knowledge of the numbers in the mountain as well as her knowledge of the common factor concept. Since the same hints were being given in each condition, we were in a position to study the effect of model accuracy as reflected in the timing of the hints and the appropriateness of the choice of what to hint on. The first part of the study was conducted in the students' regular classroom on the morning of the study. The entire class wrote a pre-test with their teacher and the experimenter present. The pre-test was designed to assess the students' knowledge of factors and common factors, and is described in more detail below. The second part of the study was conducted with pairs of students, due to constraints on computer availability, in another room in the school. Students played with one of the three versions of Prime Climb for approximately 10 minutes. Each student played with an experimenter as her partner.¹ One of the experimenters was the author of this thesis; the other was a postdoctoral researcher in the same group who had previous experience with students playing Prime Climb. The experimenters played with the strategy that they would only fall about once per level, and gave the student lots of time to think about where to move rather than moving quickly themselves (as Prime Climb is not a turn-taking game, students can move multiple times before letting their partner move). The students were arranged so they could not see the experimenters' screens (see figure 6.1), and each student was not aware of which experimenter was playing as her partner. This was done to limit chatter and questions between the students and experimenters which could confound the results. Thus, all help was obtainable only from the pedagogical agent, when available.
When compared anecdotally with previous studies we have run with Prime Climb, this arrangement seems to limit questions from most students. Students were randomly assigned to an experimenter, and each experimenter played with an equal number of students from each condition. Most students were videoed, subject to parental consent. After game play, students wrote a post-test identical to the pre-test. In the old model and new model conditions they also filled out a questionnaire on their impressions of the pedagogical agent. Experimenters filled out a questionnaire on their impressions of the student's game play and the agent's hinting style, where appropriate. All of the assessment tools are described below in more detail. The study was a double-blind experiment with respect to the old model and new model conditions, and a blind experiment with respect to the no agent condition. That is to say that students did not realize which version of the game they were playing (they were not told that there were different versions). Experimenters did not know with which model they were playing at any given time; however, they did know which was the no agent version because they were able to see the help interventions and needed to administer a different questionnaire to these students after game play.

¹ Although it would have further reduced experimenter bias to have students play against a simulated player with a consistent playing pattern, we did not have time to implement this prior to the study.

6.1.1 Assessment tools

Pre-test and post-test on factors and common factors

The pre-test and post-test on factors and common factors that were completed by the students were identical. This multiple-choice test is shown in appendix B, and includes a section on finding the factors of a number and a section on finding the common factors of two numbers.
We intended to use the test to get an overall score representing the students' knowledge of factors and common factors in order to assess learning between the pre-test and post-test. However, the test was not designed with solely this purpose in mind. It was our intention to collect additional data on students' knowledge of the factorizations of specific numbers and of the common factor concept, in order to repeat the accuracy analysis of the new model conducted in section 4.3.3, and also to populate the model with more accurate prior probabilities obtained from population data, as future work towards our long-term goal of improving the accuracy of the student model. With this in mind we discuss the test design. The test is adapted from the test administered in [28], which had previously been used to assess learning in a similar study which compared the no agent and old model versions. The test in [28] contained

• 1 question which asked the student to list the factors of a given number,

• 7 multiple-choice questions in which students had to circle the common factors of two numbers, and

• 2 questions which asked the student to draw a factor tree for a given number.

Numbers corresponding to common factor questions: 9, 11, 14, 15, 25, 30, 27, 33, 42
Additional numbers: 31, 36, 81, 88, 89, 97

Table 6.1: Numbers appearing on the factors portion of the pre-test, including those that directly correspond to numbers in the common factors section, and additional numbers which are common in the first four game levels.
We replace this factors question with ones which ask students to select the factors of a number from a list. For the purpose of collecting the data for priors and accuracy assessment described above, an additional fifteen factors questions were added for the numbers shown in table 6.1. The first nine of these numbers correspond directly to numbers in questions in the common factors portion of the test. This was done in order to assess students' knowledge of the common factor concept by examining their pattern of responses to factorization and common factors questions about the same numbers, as discussed in section 4.1.2. The remaining six questions were chosen because the numbers they target commonly appear in the first four levels of the game and students' responses to questions involving these numbers would be useful data for determining population priors for the nodes in the network, also as future work. A l l but two of the numbers in table 6.1 appear at least once in the first four levels, and many appear several times. Of the seven common factor questions, we changed one and dropped two. We changed one of the common factor questions to a similar question (two numbers which had no common factors, yet whose factorizations were non-trivial) which involved num-bers appearing on the mountain because the existing question involved two numbers which did not appear in the game. We dropped two questions to reduce the length of the test because the numbers involved did not correspond to numbers that appear Chapter 6. Ablation Study 104 in the game, and for every common factor question we needed to add two factors questions corresponding to the two numbers. Despite this reduction, the new test was still considerably longer, with 23 questions rather than the previous 10. This test was piloted with a class of sixth grade students in the course of another study. 
After the pilot study we removed the two factor tree questions from the test because we found that many students misunderstood these questions, and they were difficult to mark because students' responses varied widely. As well, generating a factor tree is not a skill that the game teaches directly, nor is it something which the agent hints on; hence improvement on this skill should be unaffected by the no agent, old model, or new model conditions. Thus, the new test used in the study had 21 questions. For many of the questions, the correct answer had several components, and students were told to circle all correct responses. The test was marked by giving one mark for each correctly circled response and subtracting one mark for each incorrectly circled response. Thus, it was possible for a student to receive a negative score on the test if she circled more incorrect answers than correct ones. We chose this method of marking by giving the test to two sixth grade teachers and asking how they would have marked it. This was the method suggested by both teachers. Note that this is different from the marking scheme used in [28], in which correct responses were worth two marks and one mark was deducted for each incorrect response. The highest achievable grade on the new test was 30, and the lowest was -30. The pilot study subjects took approximately 5 minutes on average to complete the test containing 23 questions, with an average score of 84 percent. However, during the actual study with 21 questions, the students took approximately 10 minutes on average to complete the test, with an average score of only 66 percent, leading us to assume that our pilot subjects were not representative of the actual test subjects.
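The marking scheme just described (one mark per correctly circled response, minus one per incorrectly circled response) can be written as a small scoring function. This is our own illustrative sketch, not the scoring code used in the study:

```python
def score_question(circled: set, correct: set) -> int:
    # +1 for each correctly circled response,
    # -1 for each incorrectly circled response.
    return len(circled & correct) - len(circled - correct)

def score_test(responses, answer_key) -> int:
    # Sum the per-question scores; the total can be negative.
    return sum(score_question(c, k) for c, k in zip(responses, answer_key))

# Hypothetical example: the factors of 12 are {1, 2, 3, 4, 6, 12};
# circling {2, 3, 4, 5} earns three marks and loses one for the 5.
print(score_question({2, 3, 4, 5}, {1, 2, 3, 4, 6, 12}))  # -> 2
```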
helpful: I think the agent Merlin was helpful in the game
understands: I think the agent Merlin understands when I need help
play better: The agent Merlin helped me play the game better
learn factorization: The agent Merlin helped me learn number factorization
too often: The agent Merlin intervened too often
not enough: The agent Merlin did not intervene enough
liked: I liked the agent Merlin

Table 6.2: Questions on the post-questionnaire rated by students from 1 (Strongly disagree) to 5 (Strongly agree), with the keywords by which we will refer to them in this chapter.

Student subjective assessment

In addition to the post-test on number factorization and common factoring, the old model and new model students were asked seven questions about their perceptions of the agent and the agent's interventions. These questions were asked on the same page as the post-test and can be seen in appendix C. The students in the no agent condition answered one question about whether they would prefer to play again with help or without (see appendix D). The seven questions which the students in the two agent conditions answered were given as statements to which the student responded on a Likert scale from 1 (Strongly disagree) to 5 (Strongly agree). These questions are shown in table 6.2 along with the keywords by which we will refer to them in this chapter. The questions were designed to give an idea of whether the student finds the agent helpful, thinks the agent understands her, thinks the agent can help with the game or teach her something, thinks the agent intervened too often or not enough, and finally whether she likes the agent. These statements were adapted from those used in the study described in [28]. In particular, the statement "The agent Merlin intervened at the right time" was expanded into the two statements too often and not enough in table 6.2.
Table 6.3: Questions on post-evaluation on which the experimenter rated the student, and the keywords by which we will refer to them in this chapter.

Keyword                 Question
listened                Did the student listen to what the agent said?
action w_out listening  Did the student perform her own actions without listening to the agent?
like to listen          Did the student like to listen to the agent?
help too much           Did the agent try to help too much?
not enough help         Did the agent not provide enough help to the student?
magnifying glass        Did the student use the magnifying glass?
when magnified          When did the student use the magnifying glass?
ask for more hints      Did the student ask for more hints?
when asked              When did the student ask for more hints?

Experimenter subjective assessment

We also had the experimenters fill out a short questionnaire on each student that they played with. The questionnaire for the no agent condition is shown in appendix E, and the questionnaire for the old model and new model conditions is shown in appendix F. For the two with-agent conditions, the questions on this experimenter questionnaire are listed in table 6.3 with the keywords we will use to refer to them in this chapter. The first three questions, listened, action w_out listening, and like to listen, were designed to give us a picture of how the student reacted to the agent. The next two questions, help too much and not enough help, were designed to provide an experimenter's assessment of the timing of the agent interventions. There were two questions on the student's use of the facility to ask for more help, ask for more hints and when asked. For the when asked question, the possible responses were "before choosing a hex", "after falling down", "early in the game (levels 1 and 2)" and "late in the game". It was later
decided to mine the data that these two questions target directly from the log files, as this would yield more precise assessments of students' requests for more help. There were also two questions on magnifying glass use, magnifying glass and when magnified. The when magnified question had possible responses of "before choosing a hex", "after falling down" and "after agent suggestion". Again, the data for these two questions could be, and was, obtained from the log files after the study in order to achieve more precise assessments of magnifying glass use. The questionnaire for the students in the no agent condition contained the two questions about the magnifying glass, as well as the yes/no question "Did the student try to look for help during the game". The experimenters discussed each question before the study to ensure that they were interpreting them in the same way. Experimenters were encouraged to write additional notes on the page to explain why they had chosen a response if they felt that there was more than one way to interpret the situation. For example, several times both experimenters commented beside the help too much question that the agent intervened too much on one particular level in one area of the mountain. These comments were useful in the later analysis. As was mentioned earlier, both experimenters played with, and hence evaluated, an equal number of students in each of the three conditions. The study was piloted with a grade five student and a grade six student, in the course of evaluating the new agent hinting strategy as described in section 5.3. The students were run through the study as described above, although the setting for the two pilot studies was one experimenter's office for the first pilot subject, and the subject's home for the second. After the study, the experimenter verbally asked the students questions about the agent and the game, encouraging them to expand on any points raised.
The results of these interviews were summarized in section 5.3. As well, the experimenter asked the students if there was anything that they didn't understand about the test or how we could make it easier to complete. We also asked them to explain what they thought each of the seven questionnaire questions meant, to ensure that the questions were assessing what we had intended them to assess. There were no changes made to either the factors test or the questionnaire following these two pilot studies.

6.2 Study Results

We measure learning gains as the difference between post-test score and pre-test score. Going into the study we had the following hypothesis about the effect of the different conditions on learning gains:

Learning gains hypothesis: Students in the new model condition will learn significantly more than students in the old model condition, who will learn more than students in the no agent condition.

In our study we had 52 students play the game and complete the pre-test and post-test on number factorization. One student was removed due to an error during collection of the log data. Of the remaining 51 students, we removed one student whose pre-test score was an outlier across all students, and a further six with outlying learning scores: four in the old model condition and two in the new model condition. After removing the outliers, the average pre-test score was 24.2/31, with a standard deviation of 8.22. There were no significant differences between conditions on pre-test. To ensure that we would be able to pool all of our data, we checked for significant differences on either pre-test or post-test caused by gender, grade level, or school. There were no significant differences for any of these, so we feel comfortable grouping the data from boys and girls at the two schools and two grades together.
After removing the outliers we were left with

• 13 students in the no agent condition
• 14 students in the old model condition
• 17 students in the new model condition

The test results for these students are shown in table 6.4, broken down by condition.

Table 6.4: Average pre-test score, post-test score, and learning gains by condition. The maximum score is 30. Standard deviations are shown in brackets.

            No Agent          Old Model         New Model
Pre-test    μ = 20.62 (2.83)  μ = 25.53 (1.81)  μ = 25.77 (1.72)
Post-test   μ = 19.39 (3.41)  μ = 25.40 (1.88)  μ = 25.35 (1.84)
Learning    μ = -1.23 (1.33)  μ = -0.13 (0.42)  μ = -0.41 (0.64)

To test our hypothesis and determine if there were any differences between the three conditions, we performed an ANOVA using learning as the dependent variable and condition as a fixed factor. However, we found no significant differences between the three conditions on learning (p=0.645). Using a general linear model with post-test as the dependent measure, pre-test as a covariate, and condition as a factor, there are also no significant differences between conditions (p=0.803). Thus, we have not been able to confirm the learning gains hypothesis that we began with. However, Reiter et al. [64] argue that negative results are essential for progressing research in computer science, and should be analysed and published as much as positive results. They cite examples from medicine in which negative results are the main motivation for theorists seeking to improve on theories, as theory induction is better if one has access to negative as well as positive results. Thus, we take the view that there can still be great benefit from analysing negative results, and we set out to discover why we were unable to confirm our hypothesis, and to answer the more general question of whether a more accurate student model has an effect on learning.
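The one-way ANOVA used here (learning gains as the dependent variable, condition as a fixed factor) can be sketched from first principles. This is an illustrative reconstruction, not the analysis code from the study, and the group data in any usage would be placeholders rather than the real per-student scores.

```python
# One-way ANOVA sketch: F = (between-group mean square) / (within-group
# mean square). A p-value would then be read from the F distribution
# with (k-1, n-k) degrees of freedom.

def one_way_anova(groups):
    """Return the F statistic and degrees of freedom for k groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: group sizes times squared mean offsets
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

With three conditions of 13, 14, and 17 students, the degrees of freedom would be (2, 41); the study's p=0.645 corresponds to an F statistic well below any conventional significance cutoff.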
We have come up with three possible reasons for not observing a difference in learning gains amongst the three conditions:

H1: The new model was not more accurate than the old model in the course of this study, due to the changes we made to the optimal learning thresholds (described in section 5.2)

H2: The test we used does not properly assess learning

H3: Learning has been obstructed by changes to the agent's interventions

A possible fourth hypothesis for not observing learning gains is that students did not play the game for long enough. Unfortunately, playing time was dictated by the availability of the students. We cannot know whether playing time was a factor in learning without repeating the study with a longer playing time. We address the evidence for each of these hypotheses in the following three sections.

6.2.1 Hypothesis 1: The new model was not more accurate than the old model in the course of this study

Before we can answer the question of whether a more accurate student model has an effect on learning, we must answer the question of whether the new model was more accurate than the old model on this new data. While this was in principle answered in chapter 4, the reader should recall that upon testing of the new model and new hinting strategy we changed some of the model parameters to give more weight to student engagement, so we do not actually know the accuracy of the modified new model that was used in the user study. We calculate the accuracy of the new and old models over the factorization nodes and the common factor node in three different ways:

At the end of game play: This assessment is the same as the one performed in chapter 4, using the same optimal thresholds, and allows a direct comparison between the accuracy on the data set that was used to train and test the models and the accuracy on the new data from the ablation study.
During game play: This assessment allows us to compare how accurate the model is during game play, versus its accuracy after seeing all game actions. The analysis is done using the optimal thresholds from chapter 4.

Agent accuracy: In this analysis we compute the accuracy of the agent interventions in the course of this study, using the modified model thresholds that the agent used to determine whether to intervene and offer a hint.

Table 6.5: Comparison between sensitivity, specificity, and accuracy on the factorization nodes assessed in the previous study, for which results were presented in chapter 4, and the same measures assessed in the current study. Sensitivity, specificity and accuracy are assessed after game play across all data.

FACTORIZATION NODES
             Previous study accuracy     Current study accuracy
             Sens    Spec    Acc         Sens    Spec    Acc
Old model    74.3%   25.7%   61.7%       64.4%   37.4%   50.9%
New model    76.8%   76.9%   76.8%       16.0%   95.8%   55.9%

Table 6.6: Comparison between sensitivity, specificity, and accuracy on the common factor node assessed in the previous study, for which results were presented in chapter 4, and the same measures assessed in the current study. Sensitivity, specificity and accuracy are assessed after game play across all data.

COMMON FACTOR NODE
             Previous study accuracy     Current study accuracy
             Sens    Spec    Acc         Sens    Spec    Acc
Old model    58.7%   51.0%   54.9%       88.9%   14.3%   51.6%
New model    75.4%   66.7%   71.5%       88.9%    5.7%   47.3%

Accuracy at the end of game play

We wished to compare the accuracy of the old model and the new model on the data from the study described in section 4.1.2 with their accuracy on the data from the current study. In order to assess accuracy using the current data, for both the old and new model, we ran the log files of the students' actions from the ablation study through a simulator in order to obtain each model's assessment of each student's knowledge at the end of game play (i.e.
after having seen all of the actions in the log file for that student). Note that we can use the log files for all 44 students to test each model, since the student's stream of actions is not affected by the model they were playing with during the study. Specifically, we examine the posterior probabilities output by the model for the nodes which correspond to questions on the post-test. These posterior probabilities were thresholded to known or unknown using the optimal thresholds determined in chapter 4 and summarized in table 5.2. These model predictions were compared to the post-test assessment of the same knowledge. We pool all of the data, and compute sensitivity, specificity and accuracy across all students and all questions. As before, we report accuracy on factorization nodes and on the common factor node separately. The results for the factorization nodes are shown in table 6.5 and for the common factor node in table 6.6.

Table 6.7: Numbers whose factorizations were assessed on the post-test for the current ablation study described in this chapter and the previous study described in section 4.1.2.

Previous study: 2, 3, 4, 11, 15, 36, 40, 42, 50, 81
Current study: 9, 11, 14, 15, 25, 27, 30, 31, 33, 36, 42, 49, 81, 88, 89, 97

There are two main differences between the previous and current assessments of accuracy presented in tables 6.5 and 6.6. First, the previous assessment of accuracy was reported as the average accuracy across 10 cross-validation folds, whereas the current assessment is across all of the data. The second difference is that the numbers for which the students' factorization knowledge was assessed on the post-test in this study (see post-tests in appendices C and D) were different from the numbers assessed in the previous data (see post-test in appendix A).
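The thresholding and pooled scoring described above can be sketched as follows. This is our reconstruction, not the thesis code; the function name and list-based inputs are assumptions, and "accuracy" is taken here as the overall fraction of correctly classified node assessments.

```python
# Sketch: threshold each node's posterior into known/unknown, score it
# against the post-test label, and pool counts over all students and
# questions to get sensitivity, specificity, and accuracy.

def pooled_accuracy(posteriors, test_labels, threshold):
    """posteriors/test_labels are parallel lists; a label is True for 'known'."""
    tp = fn = tn = fp = 0
    for p, known in zip(posteriors, test_labels):
        predicted_known = p >= threshold
        if known and predicted_known:
            tp += 1      # known on test, model says known
        elif known:
            fn += 1      # known on test, model says unknown
        elif not predicted_known:
            tn += 1      # unknown on test, model says unknown
        else:
            fp += 1      # unknown on test, model says known
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy
```

Sensitivity is the fraction of test-known skills the model classifies as known, and specificity the fraction of test-unknown skills it classifies as unknown, matching the columns of tables 6.5 and 6.6.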
In the previous assessment, we report the model's accuracy at assessing the student's knowledge of the factorization of the numbers shown in the first row of table 6.7. In the current assessment we assess the model's accuracy at assessing the student's knowledge of the factorization of the numbers in the second row of table 6.7. The reader will note that the accuracy is lower for the factorization nodes in this assessment than it was on the previous assessment of accuracy. There are two possible explanations for the reduction in accuracy across the factorization nodes:

a) The models have been fine-tuned to optimize the accuracy at predicting the students' knowledge of the factorization of the numbers from the test used in the first study (table 6.7, first row), and the results do not generalize to accuracy on other numbers.

b) The students from this study played very differently, and the model was too finely tuned to the actions of the previous group of students.

Although it is possible that the cause is b, we do not think that the students in the second study behaved very differently from the students in the first study. The two groups of students came from similar backgrounds. One of the two schools that participated in the current study had also participated in the study described in section 4.1.2 (although with different classes), thus at least half of the students came from the same school. As well, all of the students that participated in the two studies were in the same two grades in school. The experimenters did not report a noticeable difference between the game play of the students in the first study and those in the second study. It is more likely that the model was fine-tuned to optimize accuracy on particular numbers.
The reader will recall from section 4.2.4 that many of the nodes were given generic priors derived from population data (the average of students' responses on the pre-test), whereas all other nodes were given priors of 0.5. As was seen in section 4.2.4, when we compared the model accuracy with different priors, the accuracy was significantly better with population priors than with default priors of 0.5. Since the previous study used the same pre-test and post-test, the numbers/nodes used in the calculation of accuracy were the same nodes which were given population priors. Thus, we would expect a more accurate model prediction for these nodes than for ones which were not given population priors. In the current assessment, students wrote the pre-test the same day that they played the game, thus there was no time to obtain generic prior values for the nodes corresponding to numbers on the new test. Therefore we had population priors for only five of the nodes corresponding to numbers which were subsequently evaluated on the post-test: the numbers 11, 15, 36, 42, and 81, which appeared on both the old and new tests (see table 6.7). Hence, in the current assessment we calculated accuracy across nodes which mostly started with default priors of 0.5, whereas in the previous assessment we calculated accuracy across nodes with generic population priors.

[Figure 6.2: Receiver-operator curve comparison (sensitivity vs. 1-specificity) of the accuracy after game play of the old model and new model across factorization nodes on the ablation study data. The accompanying table shows the area under the curve (AUC) metric for each ROC curve along with its standard error (SE): old model AUC 0.543, SE 0.025; new model AUC 0.637, SE 0.022; the difference is significant at p < 0.01.]
As the majority of the nodes that we are using to assess accuracy had prior probabilities of 0.5, in contrast to the previous assessment in which 80% of the nodes had generic prior probabilities above 0.5, we might expect that a different threshold would achieve higher accuracy for the current assessment. It is conceivable that because the priors on our current test nodes were lower to begin with, the corresponding posterior probabilities did not get as high by the end of game play. Thus, the optimal thresholds may have been too high to accurately classify known nodes. This is particularly pertinent for the new model, which has a high optimal threshold of 0.8 for factorization nodes (see table 5.2). It is this model which saw the greatest reduction in sensitivity due to known nodes being classified as unknown (see table 6.5).

[Figure 6.3: Receiver-operator curve comparison (sensitivity vs. 1-specificity) of the accuracy after game play of the old model and new model on the common factor node on the ablation study data. The accompanying table shows the area under the curve (AUC) metric for each ROC curve along with its standard error (SE): old model AUC 0.546, SE 0.110; new model AUC 0.562, SE 0.110; the difference is not significant.]

To determine whether the new model is in fact more accurate than the old model on factorization nodes across all thresholds, we examine ROC curves of the two models on factorization nodes, shown in figure 6.2. Picking the point of maximum accuracy on these curves, the old model achieves a maximum accuracy of 51.49% at a threshold of 0.6, whereas the new model achieves a maximum accuracy of 63.19% at a threshold of 0.48. The AUC metric that was described in section 4.2.4 was calculated for each curve (see the table in figure 6.2). Using this metric, we find that the difference between the two models is significant at the p=0.01 level.
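The AUC comparison can be sketched as follows. This is our reconstruction under stated assumptions, not the thesis code: AUC computed via the Mann-Whitney statistic, standard error via the Hanley and McNeil (1982) approximation, and a z-test on the difference that treats the two AUCs as independent (the thesis AUCs come from the same students, a correlation this simple test ignores).

```python
import math

# Sketch: AUC as the probability that a randomly chosen 'known' case
# scores above a randomly chosen 'unknown' case, with ties counting 1/2.

def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def hanley_mcneil_se(a, n_pos, n_neg):
    """Hanley-McNeil standard error approximation for a single AUC."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    var = (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return math.sqrt(var)

def z_difference(a1, se1, a2, se2):
    """z statistic for the difference of two (assumed independent) AUCs."""
    return (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)
```

Plugging in the figure 6.2 values (0.637 with SE 0.022 vs. 0.543 with SE 0.025) gives z of roughly 2.8, consistent with the reported significance at p < 0.01.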
Thus, we conclude that although the accuracy on factorization nodes is lower with this new assessment for both models, the new model is still more accurate than the old model on factorization nodes. Assessment for both models would likely improve upon collection of more prior population data. We now have generic pre-test data from this most recent study for the nodes corresponding to numbers on the new test, and our future work is to redo this assessment of accuracy to determine whether the generic priors improve accuracy. Examining table 6.6, we see that the accuracy of the common factor node assessed at the end of game play also decreased for both models, although more so for the new model. In both cases, the models did much more poorly on specificity with this study data than with the previous study data. We again turn to prior probabilities as a possible explanation. In the previous study, the majority of the students knew the common factor concept on the pre-test, thus the generic prior probability for the common factor node was high (0.83). On the post-test, most of the students also knew the common factor concept (86%), thus the accurate prior likely played a role in the high accuracy of the model at predicting the common factor node. However, in the current study only 22.2% of the students knew the common factor concept on the post-test, thus the generic prior was not very representative of this group. We hypothesize that one possible reason for this discrepancy between the two studies is student fatigue when answering the common factor questions, which were all found at the end of the test. Fatigue on the test is discussed in the next section, and may also be the reason why accuracy is low for this node, as the gold standard against which we compare the model predictions is not a true assessment of the students' knowledge. The ROC curves for the old and new models on the common factor node are shown in figure 6.3.
With a threshold of 0.84, the old model reaches a maximum accuracy of 55.56%. With a threshold of 0.92, the new model reaches a maximum accuracy of 56.94%. However, when we compare the accuracy of these two models using the area under the curve metric, the difference between the curves is not statistically significant (see the table in figure 6.3).

Accuracy during game play

Next we turn our attention to assessing accuracy during game play on the data from this ablation study. The analysis thus far has looked at how accurate the model is after having seen all game actions. We would also like to know how accurate the model is during the interaction. Do the models achieve high accuracy without seeing many student actions, or do they require many actions before a high level of accuracy is reached?

INPUTS:
  N = number of students playing version x
  log file for each student playing version x

ALGORITHM:
1   for i = each student playing version x {
2     testK_modelK = testK_modelUN = 0
3     testUN_modelK = testUN_modelUN = 0
4     for each move in log file i {
5       if (hex number known on pre-test and post-test)
6         if (model(FNum) >= FACTHRESHOLD)
7           testK_modelK++
8         else
9           testK_modelUN++
10      else if (hex number unknown on pre-test and post-test)
11        if (model(FNum) < FACTHRESHOLD)
12          testUN_modelUN++
13        else
14          testUN_modelK++
        end if
      }
15    sensitivity_i = testK_modelK / (testK_modelK + testK_modelUN)
16    specificity_i = testUN_modelUN / (testUN_modelUN + testUN_modelK)
    }
17  AvgSens = (sum over i of sensitivity_i) / N
18  AvgSpec = (sum over i of specificity_i) / N
19  Accuracy = (AvgSens + AvgSpec) / 2

Figure 6.4: Algorithm for computing average sensitivity and average specificity across students on the factorization nodes, during game play.
Determining accuracy during game play is difficult because we do not have a test assessment of students' knowledge during the game against which to compare the model's assessment of their knowledge. This is because we assume that knowledge evolves as students play the game (the students learn while playing Prime Climb). If a student did not know the factorization of x or the common factor concept on the pre-test but knew it on the post-test, we have no idea when her knowledge changed, or whether she knew the concept at a particular point in the interaction. Thus, for this measure of accuracy, we use only knowledge assessments on skill items for which the student's answer did not change from pre-test to post-test, i.e. the knowledge was constant throughout the interaction. As before, we compute the accuracy of the model separately for factorization nodes and the common factor node. For the factorization nodes, for each student, we look through the log file for moves in which the student encountered a number whose assessment was constant from pre-test to post-test, and compare the post-test answer with the model's thresholded assessment of the corresponding node at that point in the interaction. The thresholds used for each model were the optimal thresholds determined in chapter 4 and summarized in table 5.2. This was done each time the student encountered a relevant number, either by moving to the number or moving when the partner was on the number. Sensitivity, specificity and accuracy were computed for each student across all of the moves that were relevant. The sensitivity and specificity were then averaged across all students. The algorithm for computing the average sensitivity, average specificity, and accuracy during game play is shown in pseudo-code in figure 6.4.
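The per-student procedure of figure 6.4 can be rendered in Python as follows. This is a sketch with assumed data structures, not the thesis code: each move is represented as a (posterior, stable_known) pair, where stable_known is None for numbers whose test assessment changed between pre-test and post-test (such moves are skipped).

```python
# Sketch of figure 6.4: per-student sensitivity and specificity over all
# relevant moves, then averaged across students; the reported accuracy
# is the mean of the two averages.

def during_game_accuracy(students_moves, threshold):
    """students_moves: one list per student of (posterior, stable_known)."""
    sens, spec = [], []
    for moves in students_moves:
        tk_mk = tk_mu = tu_mu = tu_mk = 0
        for posterior, known in moves:
            if known is None:                # knowledge changed pre- to post-test
                continue
            model_known = posterior >= threshold
            if known:
                tk_mk += model_known         # test known, model known
                tk_mu += not model_known     # test known, model unknown
            else:
                tu_mu += not model_known     # test unknown, model unknown
                tu_mk += model_known         # test unknown, model known
        if tk_mk + tk_mu:
            sens.append(tk_mk / (tk_mk + tk_mu))
        if tu_mu + tu_mk:
            spec.append(tu_mu / (tu_mu + tu_mk))
    avg_sens = sum(sens) / len(sens) if sens else 0.0
    avg_spec = sum(spec) / len(spec) if spec else 0.0
    return avg_sens, avg_spec, (avg_sens + avg_spec) / 2
```

As in the thesis, a student whose relevant numbers were all known (or all unknown) contributes to only one of the two averages, which is why the N values in table 6.8 differ between sensitivity and specificity.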
Note that due to the way in which the log file data was collected, the data cannot be grouped all together, thus the calculations for the new model and old model are across only the students who played those versions respectively. Table 6.8 shows the average sensitivity, average specificity, and average accuracy across students during game play on the factorization nodes, computed using the algorithm in figure 6.4. For comparison, we also show the average sensitivity, average specificity and average accuracy across students at the end of game play². The standard deviation across students for each measure is shown in brackets. N is the number of students used in the calculation. The reason the N is smaller for the during game play measures is because these calculations were done only for the students who played with the old model and new model versions respectively. The N is 13 for the average during game play accuracy, despite the fact that there were 14 students for the sensitivity calculation and 13 students for the specificity calculation.

Table 6.8: Comparison between the average sensitivity, average specificity, and average accuracy on the factorization nodes across students after game play and during game play. The standard deviation is shown in brackets. N represents the number of students that make up the calculation. See the text for a further description.

FACTORIZATION NODES
             Accuracy after game play (by student)      Accuracy during game play (by student)
             Avg Sens      Avg Spec      Avg Acc        Avg Sens      Avg Spec      Avg Acc
Old model    66.0% (14.3)  26.8% (25.6)  46.5% (15.4)   94.0% (7.0)   3.3% (8.6)    49.4% (5.4)
             N=44          N=44          N=44           N=14          N=13          N=13
New model    17.0% (8.1)   95.6% (12.1)  56.3% (8.6)    15.4% (12.6)  95.7% (12.2)  55.6% (9.2)
             N=44          N=44          N=44           N=17          N=17          N=17

²These are different from the accuracies shown in table 6.5, as they represent the accuracy for each student, averaged across students, instead of the accuracy of all of the data pooled. However, the reader will note that they are quite similar to the results presented in table 6.5, indicating that the models are performing as well on a by-student basis as they are overall.
This is because only 13 students had both a sensitivity and specificity assessment, thus we could only compute accuracy for these 13 students. One student had only a measure of sensitivity because all knowledge items were known on pre-test and post-test. Before comparing the accuracies during game play with the accuracies at the end of game play, it is important to note some caveats. First of all, the students represented in the two calculations are different. The accuracies at the end are across all students, whereas the accuracies during game play are only across the students playing a particular version. Moreover, the old model and new model during-game accuracies represent different students playing Prime Climb with the two different versions. Secondly, the numbers (model nodes) which were used in the assessment are different in each case. At the end of game play, each student's accuracy was computed across the 16 numbers on the post-test (see table 6.7). For the during game play calculations, accuracy is computed across only the numbers that the student encountered in the game and had stable knowledge for. Numbers encountered more than once are counted in the calculation each time they were encountered. The final difference between the after game play accuracies and the during game play accuracies is that the during game play calculations represent only numbers for which there was no learning from pre-test to post-test. Despite these differences between the two calculations of accuracy, it is still interesting to examine the differences between the two models at determining student knowledge on the factorization nodes during game play and after game play. We see
in table 6.8 that for both models the accuracy during game play and after game play are about the same, indicating that both models achieve their maximum accuracy early in the interaction and remain there. We see that for the old model, although the accuracy is about the same, the breakdown between sensitivity and specificity is more marked during game play. At both points, we see that the old model has higher sensitivity and lower specificity, and the new model shows the reverse trend. Overall the new model has a slightly higher accuracy. Looking at the variances across students, we see that after game play the old model has a much higher variance than the new model, thus the old model performs quite differently for different students. During game play the variances for the two models are much more similar. We now turn to the same calculations for the common factor node, presented in table 6.9. There are a few points to mention about these calculations.

Table 6.9: Comparison between the average sensitivity and average specificity on the common factor node across students after game play and during game play. The standard deviation is shown in brackets. In each case, N represents the number of students that make up the calculation (see text). The accuracy reported for each model is the mean of the average sensitivity and average specificity, as average accuracy per student has no meaning. See text for details.

COMMON FACTOR NODE
             Accuracy after game play (by student)     Accuracy during game play (by student)
             Avg Sens      Avg Spec      Acc           Avg Sens      Avg Spec      Acc
Old model    88.9% (33.3)  14.3% (35.5)  51.6%         46.8% (40.5)  49.0% (37.9)  47.9%
             N=9           N=35                        N=3           N=11
New model    88.9% (33.3)  5.7% (23.6)   47.3%         90.3% (11.6)  4.7% (4.6)    45.5%
             N=9           N=35                        N=2           N=15
First of all, we must bear in mind that we can usually obtain a value for both sensitivity and specificity for each student across the factorization nodes because there are 16 possible factorization nodes for which we have a post-test assessment, and unless they were all known or all unknown, we can compute both sensitivity and specificity. For common factoring knowledge, however, there is only one concept we are measuring, thus for each student we can compute only sensitivity or specificity, but not both. For example, for a student who knew the common factor concept on the post-test and for whom the model assessed the common factor node as known, we would have a sensitivity of 1, but no specificity measure. Thus, at the individual student level, we cannot compute an accuracy, because there are not two numbers to average. We do not know of a standard procedure for this situation, so to work around this problem, we compute either a sensitivity or a specificity for each student, average these across students, and report as our accuracy the mean of the average sensitivity and average specificity. We must bear in mind, however, that these are calculated across different students. The N in table 6.9 denotes the number of students to which the sensitivity or specificity measure applied. Again, the N is smaller for the during game play assessment, as this represents only students whose common factoring knowledge remained constant from pre-test to post-test and who played with the version of Prime Climb for which we are computing accuracy. The during game play accuracy was calculated using the algorithm outlined in figure 6.5. Of note are lines 5 and 11, on which we compare the model's assessment of the common factor node to the post-test assessment after every move in the interaction. This is because, unlike for the factorization nodes, the common factor knowledge is relevant for every move that the student makes.
Looking at table 6.9, we see that the breakdown between sensitivity and specificity remained the same during game play as at the end of game play for the new model. In the case of the old model, we see that although the model has high sensitivity at the end of game play, sensitivity and specificity are more similar during game play.

INPUTS:
    N = number of students playing version x
    log file for each student playing version x

ALGORITHM:
 1  for i = each student playing version x {
 2      testK_modelK = testK_modelUN = 0
 3      testUN_modelK = testUN_modelUN = 0
 4      if (CF node known on pre-test and post-test) {
 5          for each move in log file i {
 6              if (model(CFNode) >= CFTHRESHOLD)
 7                  testK_modelK++
 8              else
 9                  testK_modelUN++ } }
10      else if (CF node unknown on pre-test and post-test) {
11          for each move in log file i {
12              if (model(CFNode) < CFTHRESHOLD)
13                  testUN_modelUN++
14              else
15                  testUN_modelK++ } } end if
16      sensitivity_i = testK_modelK / (testK_modelK + testK_modelUN)
17      specificity_i = testUN_modelUN / (testUN_modelUN + testUN_modelK) }
18  Avg_Sens = (sum over i=1..N of sensitivity_i) / N
19  Avg_Spec = (sum over i=1..N of specificity_i) / N
20  Accuracy = (Avg_Sens + Avg_Spec) / 2

Figure 6.5: Algorithm for computing average sensitivity and average specificity across students on the common factor node, during game play.
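The procedure of figure 6.5 can be sketched in Python. This is an illustrative reimplementation, not the thesis code: the per-student data format, the model lookup, and the CF_THRESHOLD value below are assumptions for the example.

```python
# Sketch of the figure 6.5 procedure: each student contributes either a
# sensitivity or a specificity on the common factor (CF) node, and these
# are averaged across students. Hypothetical data format: each student is
# a dict with pre/post-test labels and the model's CF assessment per move.

CF_THRESHOLD = 0.5  # assumed value; the actual threshold is tuned in chapter 4

def cf_accuracy(students):
    sens, spec = [], []
    for s in students:
        known = [p >= CF_THRESHOLD for p in s["cf_assessments"]]
        if s["pre"] == "K" and s["post"] == "K":
            # CF known on both tests: student contributes a sensitivity
            sens.append(sum(known) / len(known))
        elif s["pre"] == "UN" and s["post"] == "UN":
            # CF unknown on both tests: student contributes a specificity
            spec.append(sum(not k for k in known) / len(known))
        # students whose CF knowledge changed between tests are skipped
    avg_sens = sum(sens) / len(sens) if sens else 0.0
    avg_spec = sum(spec) / len(spec) if spec else 0.0
    # per-student accuracy is undefined (only one measure per student),
    # so accuracy is the mean of the two averages, as in table 6.9
    return avg_sens, avg_spec, (avg_sens + avg_spec) / 2

students = [
    {"pre": "K", "post": "K", "cf_assessments": [0.9, 0.8, 0.4]},
    {"pre": "UN", "post": "UN", "cf_assessments": [0.2, 0.6]},
]
print(cf_accuracy(students))
```

Note that, as in the text, the reported accuracy averages quantities computed over different students, which is why it cannot be tested for statistical significance.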
The variance is quite high for both models after game play, and for the old model during game play. This indicates that the models are accurate for some students and inaccurate for others, rather than sometimes accurate and sometimes inaccurate for each student. Bear in mind that the variance is also quite high because there is only one point per student that we are assessing. Note that all of this assessment of accuracy was done with the optimal parameter settings which were chosen in chapter 4. The reader will recall that the new hinting al-gorithm presented in chapter 5 and used in the ablation study uses different thresholds whether the move is correct or incorrect. In the next section we turn our attention to the accuracy of the interventions when the model assessments are used in this way. Agen t accuracy For the purposes of this ablation study, we wish to answer the question of whether the new model was more accurate than the old model in the manner in which it was used in the ablation study. This study used the new hinting algorithm to determine when the agent would intervene (see section 5.2). This algorithm included a heuristic which selected the threshold for the model's assessments based on the correctness of the student's move. Thus, we repeat the assessment of accuracy during game play that was presented in the last section (figure 6.4), however rather than using two thresholds (FACTHRESHOLD and CFTHRESHOLD) to determine if the skill is known or unknown, we use the four thresholds that the agent uses (FAC-CorrectxHRESHOLDi F A C . w r o n g T H R E S H O L D , CF -CorrectxH RES HOLD, and C F jwrongxH RESHOLD) • The Chapter 6. 
The values of these thresholds were presented in table 5.3. This analysis gives us a measure of the accuracy of the agent's interventions. The results for the old model and new model using this measure of agent accuracy are shown in table 6.10 for the factorization nodes.

FACTORIZATION NODES

             Average      Average      Average
             sensitivity  specificity  accuracy
Old Model    94.0%        3.3%         49.4%
             (7.0)        (8.6)        (5.4)
             N=14         N=14         N=13
New Model    88.1%        77.5%        82.8%
             (7.4)        (32.1)       (13.4)
             N=17         N=17         N=17

Table 6.10: Sensitivity, specificity and accuracy during game play of the old model and new model for factorization nodes using the thresholds from the ablation study, calculated across students. The standard deviation is shown in brackets. N is the number of students used in the calculation.

Once again, we remind the reader that this table represents accuracy across only the numbers for which knowledge remained constant from pre-test to post-test, and thus may not be representative of the entire population of students and questions. Examining this table, we see that the old model has high sensitivity and low specificity, resulting in an accuracy on factorization nodes of around 50%. The new model shows a better balance between sensitivity and specificity, with a higher accuracy around 80%. The difference between the mean accuracy for the old model and the new model is significant (t=-9.265, p=0.000). Using this heuristic has increased the number of points classified as known by the model. In the case of the new model this causes large gains in sensitivity, and thus in accuracy overall. The old model, which was given the same threshold for correct and incorrect moves (see table 5.3), did not stand to gain as much from this heuristic. We can thus answer our original question by saying that in the course of this ablation study, the new model was more accurate than the old model across the factorization nodes when used with the new hinting algorithm.
FACTORIZATION NODES

Old Model
   Model        Test assessment
   Assessment   K        U        Total
   K            77.5%    17.7%    95.2%
   U            4.0%     0.8%     4.8%
   Total        81.5%    18.5%    100%

New Model
   Model        Test assessment
   Assessment   K        U        Total
   K            69.3%    5.3%     74.6%
   U            10.6%    14.9%    25.4%
   Total        79.8%    20.2%    100%

Table 6.11: Confusion matrices for the old model and new model across factorization nodes during game play using the thresholds used in the ablation study.

The reader may note that the standard deviation is quite high for the specificity calculation in the new model in table 6.10. Variance is often high when we do not have a lot of data. This is true of this calculation as well; the specificity for each student was based on having seen only 6.8 moves on average, as compared to the sensitivity, which was based on 24.0 moves on average. A second point of interest in table 6.10 is that although both models have a high sensitivity, only the new model has high specificity as well. In the context of the agent interventions, specificity represents the percentage of times that the agent judged a skill to be unknown, intervened, and that intervention was justified. To investigate further the percentage of times that the agent intervened, we look at the confusion matrices for the two models across students and all interaction points. These matrices are shown in table 6.11. The confusion matrix displays the percentage of data points which fall into each of the four possible categories: known on post-test and assessed as known by the model, known on post-test but assessed as unknown, not known on post-test but assessed as known, and not known on post-test and assessed as unknown.
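The four categories above are the cells of a 2x2 confusion matrix, from which sensitivity, specificity, and the precision of the agent's interventions can all be derived. A small sketch (the cell counts below are invented for illustration, not taken from table 6.11):

```python
# Confusion matrix cells for a "known"/"unknown" classifier.
# tk_mk: test known, model known;   tk_mu: test known, model unknown;
# tu_mk: test unknown, model known; tu_mu: test unknown, model unknown.

def matrix_metrics(tk_mk, tk_mu, tu_mk, tu_mu):
    sensitivity = tk_mk / (tk_mk + tk_mu)   # known skills correctly caught
    specificity = tu_mu / (tu_mu + tu_mk)   # unknown skills correctly caught
    # precision of interventions: of the points the model called unknown
    # (triggering a hint), the fraction that were truly unknown on the test
    precision = tu_mu / (tu_mu + tk_mu)
    return sensitivity, specificity, precision

# hypothetical counts
print(matrix_metrics(tk_mk=60, tk_mu=20, tu_mk=5, tu_mu=15))
```

Note the asymmetry: a model can have good sensitivity and specificity yet still trigger many unjustified hints if most of the population actually knows the skill, which is why precision is reported separately below.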
Each data point represents one interaction in which a student encountered a number whose assessment was constant on pre-test and post-test, and is compared to the model's assessment at that point (as described in the last section). Examining the last row of each confusion matrix in table 6.11, we see that the breakdown of known and unknown factorization nodes is approximately 80%:20% for both the old model and new model; thus the underlying student knowledge appears to be the same in both groups. However, looking at the last column in each matrix, we see that the old model judges nodes to be unknown 4.8% of the time, whereas the new model judges them to be unknown 25% of the time. This means that the agent using the new model intervenes much more often with factorization hints. Even if many of these hints are justified, this may have implications for the students' acceptance of the agent, as it is our experience that students do not like to be interrupted, even if the interruption is justified. As a final note, we can calculate from these matrices the percentage of justified factorization hints the agent would have given using these thresholds. This measure is the precision of the interventions, or the probability that a skill judged unknown by the model will in fact be unknown. Using the formula

    intervention_precision = true_unknown / all_model_unknown        (6.1)

we find that the intervention precision of the old model is only 17.4% (82.6% of hints are unjustified), whereas the intervention precision of the new model is 58.5% (41.5% of hints are unjustified). Once again, we conclude that in the course of this study, the agent's factorization interventions are more accurate for the new model than for the old model.

For the common factor node, we again only look at students for whom the common
factor assessment remained constant from pre-test to post-test. However, for these students, we look at the agent's assessment of the common factor node on every move, as the common factor concept was relevant at every point in the interaction. Table 6.12 shows the average sensitivity, average specificity, and accuracy for the common factor node, calculated using the algorithm in figure 6.5 with the thresholds from the new agent hinting strategy, as described above.

COMMON FACTOR NODE

             Average      Average
             sensitivity  specificity  Accuracy
Old Model    66.1%        25.2%        45.7%
             (51.0)       (37.3)
             N=3          N=11
New Model    96.2%        1.8%         49.0%
             (5.4)        (2.6)
             N=2          N=15

Table 6.12: Average sensitivity, average specificity and accuracy during game play of the old model and new model for the common factor node using the thresholds from the ablation study, calculated across all students. The standard deviation is shown in brackets. N, the number of students, is shown for the average sensitivity and average specificity.

Examining this table we see that the heuristic which uses the move's correctness to select the threshold has actually decreased the accuracy for the old model and only marginally increased it for the new model. In both cases the sensitivity increased, due to the fact that we are using a lower threshold for some nodes; however, this came at a cost to specificity, resulting in only a small overall change in accuracy.

We take these measures with a grain of salt, noting that the number of students (N) is low in each of the calculations, which may be why we see such high standard deviations, particularly in the old model. Looking more closely at the sensitivity and
specificity, we examine the confusion matrix for each model, shown in table 6.13.

COMMON FACTOR NODE

Old Model
   Model        Test assessment
   Assessment   K        U        Total
   K            14.5%    60.6%    75.1%
   U            5.6%     19.4%    24.9%
   Total        20.0%    80.0%    100%

New Model
   Model        Test assessment
   Assessment   K        U        Total
   K            10.9%    87.1%    98.0%
   U            0.3%     1.7%     2.0%
   Total        11.2%    88.8%    100%

Table 6.13: Confusion matrices for the old model and new model across the common factor node during game play using the thresholds used in the ablation study.

The old model judges the common factor concept to be unknown about 25% of the time, whereas the new model judges it to be unknown only about 2% of the time. This means that the agent intervenes with a common factor hint more often with the old model than the new model. Examining the precision of these interventions using equation 6.1, we find that 77.7% of the old model's common factor interventions are justified, and 84.2% of the new model's interventions are justified. Thus, although both models are quite inaccurate on the common factor node, many of the interventions that are given for common factors are justified.

As was mentioned in the last section, a likely cause of the low accuracy on the common factor node is that the prior probability of this node was quite different from the actual proportion of students who knew the common factor concept. The last row of each of the confusion matrices shows that the breakdown of students who knew the common factor concept versus those who did not is approximately 20%:80% for the old model and 10%:90% for the new model. As was stated earlier, the common factor node had a prior probability of 83% of being known. Our future work is to repeat this calculation of accuracy using the new generic prior probability for the common factor node, and see how this affects the accuracy.

Thus, we conclude that on the factorization nodes, the new model appears to be more accurate than the old model.
On the common factor node the two models were approximately the same, although more of the interventions by the new model were justified. We can therefore reject Hypothesis 1 and conclude that overall the new model was more accurate than the old model, and thus equally accurate student models were not the reason that there were no differences in learning gains observed in the course of this study. We now turn our attention to the second hypothesis.

             No Agent            Old Model           New Model
Pre-test     μ=4.15  s.d. 2.32   μ=4.86  s.d. 1.62   μ=4.53  s.d. 1.68
Post-test    μ=4.12  s.d. 1.69   μ=5.14  s.d. 1.34   μ=4.65  s.d. 1.91
Learning     μ=-0.04 s.d. 1.31   μ=0.29  s.d. 1.34   μ=0.12  s.d. 1.82

Table 6.14: Average pre-test score, post-test score, and learning gains by condition when attempting to replicate the test in the study described in [28]. Only questions that were in common between the two tests were included, and were marked according to the scheme outlined in [28]. The maximum score is 6.

6.2.2 Hypothesis 2: The test we used does not properly assess learning

Our second possible hypothesis for why there were no differences in learning observed amongst the three conditions is that, although there may have been significant differences between the three groups, the learning gains were not captured by the test which we were using to assess learning. Conati and Zhao [28] found marginally significant differences between a group playing a no agent version of Prime Climb and one playing with a version that included an agent using a model similar to our old model. As our experimental set-up was similar to that of Conati and Zhao, we expected to at least observe a difference in learning between the with agent groups and the no agent group. However, many aspects of the test used to assess learning were changed from the version used in the study described in [28], as outlined in section 6.1.1.
Thus, it may be that these changes affected our ability to detect learning in the students. Although we cannot go back and determine how the results would have been different if we had used the test from [28], we can make an effort to replicate their results as much as possible with the test that was administered in this study. In order to do this, we tried marking only questions on the current test which were identical to questions on the test administered in [28]. We also marked this shortened test using the marking method described in [28] (giving two marks for each correct response and deducting one mark for each incorrect response). The results of the pre-test, post-test, and learning for each condition on this modified test are shown in table 6.14. The maximum score on this modified test is 6. Performing an ANOVA on this measure of difference between pre-test and post-test score, however, revealed no significant differences between the three conditions (p=0.861), nor did a general linear model on post-test score using pre-test as a covariate (p=0.478).

Thus, we examine other possible problems with the test which was used to assess learning, and how these may have interfered with the test's ability to assess the learning that occurred.

Setting

Table 6.4 shows the average pre-test score, post-test score, and learning in each condition. We can see by examining this table that in all three conditions the students actually did worse on the post-test than the pre-test on average. One possible reason for the decline in test performance may have been the different test settings between pre-test and post-test. In order to reduce the length of time that each pair of students was out of the classroom, the entire class wrote the pre-test together, in their regular classroom with their teacher and the experimenter present. When writing the post-test they were in a separate room in only the presence of the experimenters and a peer.
It may be the case that without teacher presence students are less likely to put their full effort into answering questions, or that students were less able to concentrate in a non-classroom setting. Without a further study, we will not be able to isolate whether this was the cause of the decrease in test scores from pre-test to post-test.

Fatigue

A second possible reason for the decline in performance may have been a fatigue effect. The original test used in [28] had only ten questions, whereas ours had twenty-one.

             No Agent             Old Model            New Model
Pre-test     μ=15.38 s.d. 4.59    μ=17.79 s.d. 2.75    μ=17.35 s.d. 4.51
Post-test    μ=14.00 s.d. 7.51    μ=17.43 s.d. 3.23    μ=17.70 s.d. 3.60
Learning     μ=-1.38 s.d. 4.25    μ=-0.36 s.d. 1.34    μ=0.35  s.d. 1.32

Table 6.15: Average pre-test score, post-test score, and learning gains by condition when controlling for fatigue by marking only the first page of the pre-test and post-test. The maximum score is 20.

Students may become fatigued or bored during the post-test, resulting in lower post-test scores. To attempt to account for this problem, we tried marking only the first page of the pre-test and the corresponding questions on the post-test (appendix B), and used the difference between these scores as our measure of learning. The results of the pre-test, post-test, and learning for each condition on this modified test are shown in table 6.15. The maximum score on this modified test is 20. However, an ANOVA reveals no significant effects for condition on this measure of learning (p=0.195), nor does a general linear model on post-test score using pre-test as a covariate (p=0.255). This does not necessarily mean that there was no fatigue; students may give up on the test as soon as they see its length, before they even begin writing.
As was already mentioned in the previous section when we were analysing the accuracy of the models during game play, we noticed that our assessment of the common factor concept was not very accurate, and suggested that this may have been due to the fact that the questions about common factors appeared at the end of the test. The students in the ablation study did much worse overall on the common factor questions than they did on the factorization questions. The average assessment on the pre-test for the factorization concepts was 77.2%, and 85.4% on post-test. The average assessment on the pre-test for the common factor concept was 26.7%, and 22.2% on post-test. This is very different from the 83% knowledge of the common factor concept observed in the study in [28], and leads us to believe that students were answering randomly by the end of the test (as each question was worth more than one point, answering randomly would not necessarily give the students a score of 50% on the common factor concept). Thus, it is likely that fatigue played some role in the students' effort on the test. In fact, it was the observation of the experimenters that the students grew bored by the end of the test.

Ceiling effect

Another possible reason why we did not see improvement on the test from pre-test to post-test is that there was a ceiling effect for the test. What this means is that students performed so well on the pre-test that there was little room for improvement on the post-test. Across all of the students in all of the conditions, the average pre-test score was 24.2 out of a possible 30 marks. Given that the lowest possible grade on the test was -30 (as marks were taken away for all incorrect answers), the possible range of marks was from -30 to 30. An average pre-test score of 24.2 means that students were already obtaining 90.3% of the possible marks available on the test, thus there was only room for an improvement of 9.7%.
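The 90.3% figure follows from rescaling the average pre-test score onto the test's full range of marks:

```python
# Fraction of the available marks obtained, given a score range of [-30, 30].
def fraction_of_range(score, lo=-30.0, hi=30.0):
    return (score - lo) / (hi - lo)

# average pre-test score from the study
print(round(fraction_of_range(24.2) * 100, 1))        # 90.3
print(round((1 - fraction_of_range(24.2)) * 100, 1))  # 9.7 (headroom)
```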
This is another likely reason that we did not see a significant improvement on the post-test.

Test numbers not encountered in the game

Finally, the test that was administered may not have been targeting numbers for which we would expect to see improvement from pre-test to post-test. The first column of table 6.16 shows the percentage of students that knew the factorization of each number on the pre-test. We see that for 10 of the 16 numbers, more than 70% of the students already knew the number's factorization on the pre-test, thus there was a ceiling effect for these numbers. Of the students that did not know the number's factorization on pre-test, and thus stood to gain from encountering it in the game, the percentage of students that actually did see the number (moved to it or moved when the partner was on the number) is shown in the second column of table 6.16. We see that in 4 cases, less than 50% of these students even encountered the number during game play. The number 30 was never seen even once; thus, we cannot expect any students to have improved on this number's factorization on the post-test!

Number   % of all students that knew   % of those that did not know on pre-test
         the number on pre-test        that encountered the number at least
                                       once in the game
9        95.6%                         100.0%
11       84.4%                         100.0%
14       84.4%                         42.9%
15       93.3%                         66.7%
25       84.4%                         85.7%
27       80.0%                         100.0%
30       71.1%                         0.0%
31       84.4%                         85.7%
33       80.0%                         66.7%
36       62.2%                         47.1%
42       44.4%                         80.0%
49       73.3%                         33.3%
81       46.7%                         100.0%
88       46.7%                         50.0%
89       64.4%                         93.8%
97       46.7%                         83.3%

Table 6.16: Percentage of students that already knew the factorization of a test number on the pre-test, broken down by the numbers on the test, and the percentage of students that encountered the number during the game, of the students that did not know the factorization on pre-test.
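The second column of table 6.16 can be derived from the raw data with a computation along these lines; the student records below are invented for illustration, not taken from the study's log files:

```python
# For each test number: of the students who did NOT know its factorization
# on pre-test, what fraction encountered it at least once in the game?
# Hypothetical records: (number, knew_on_pretest, encounters_in_game)
records = [
    (30, True, 0), (30, False, 0), (30, False, 0),
    (42, False, 2), (42, False, 0), (42, True, 1),
]

def encounter_rate(records, number):
    not_knowers = [r for r in records if r[0] == number and not r[1]]
    if not not_knowers:
        return None  # everyone already knew this number on pre-test
    saw_it = sum(1 for r in not_knowers if r[2] > 0)
    return 100.0 * saw_it / len(not_knowers)

print(encounter_rate(records, 30))  # 0.0, like the number 30 in the study
print(encounter_rate(records, 42))  # 50.0
```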
We can break this down further and investigate the average number of times that each number was encountered in the game by students that did not know the number's factorization on the pre-test. These are broken down by condition and can be seen in table 6.17. This table shows that some of the numbers (14, 15, 30, 31, 36, 49, 88) were encountered only two or fewer times during the game. Thus, there were not many opportunities for learning the factorizations of these numbers during the game.

In summary, there are many aspects of the test - setting, fatigue effects, ceiling effects, and assessing inappropriate numbers - which may have contributed to the test's insensitivity in detecting learning in the students that played Prime Climb. Thus, it is possible that hypothesis 2 is one of the reasons that we were not able to prove our learning gains hypothesis. A further study would be required to isolate the effect that the test has on learning gains. We now move on to the third hypothesis for why differences in learning gains were not observed: learning has been obstructed by changes to the agent's interventions.

6.2.3 Hypothesis 3: Learning has been obstructed by changes to the agent's interventions

In order to explain our inability to replicate the results of Conati and Zhao [28] showing that students playing with a version of Prime Climb which has a pedagogical agent learn more than those that play with a version without an agent, we look for other differences between their study and ours, other than the changes to the pre-test and post-test which were explored in the discussion of hypothesis 2. One difference between our study and that of Conati and Zhao is that their model (our old model) did not contain a common factor node. We are using an old model with a common factor node to determine when to intervene. However, as was shown in section 4.4, the model with the common factor node does not differ in overall accuracy
from the "old" model used in [28], so we would not expect this to be the cause of no observed learning gains.

Test number   No Agent    Old Model   New Model
9             3.0, N=1    n/a, N=0    3.0, N=1
11            3.3, N=4    n/a, N=0    3.0, N=3
14            0.3, N=3    2.0, N=2    0.7, N=3
15            1.0, N=2    n/a, N=0    0.0, N=1
25            5.0, N=2    3.0, N=2    3.7, N=3
27            5.5, N=2    4.3, N=3    4.5, N=4
30            0.0, N=7    0.0, N=2    0.0, N=4
31            2.0, N=2    2.0, N=1    1.0, N=4
33            2.0, N=4    3.0, N=1    0.8, N=4
36            0.8, N=4    0.8, N=5    0.6, N=8
42            1.0, N=5    1.4, N=8    2.3, N=12
49            1.6, N=5    0.7, N=3    0.3, N=4
81            3.8, N=7    5.0, N=6    5.1, N=11
88            0.9, N=8    1.2, N=5    0.7, N=11
89            3.8, N=6    4.0, N=5    2.8, N=5
97            2.1, N=7    2.7, N=6    1.9, N=11

Table 6.17: The average number of times that each of the numbers on the post-test was encountered in the game by the students that did not know the number on pre-test. N is the number of students that make up the average. The averages are broken down by condition: no agent, old model, and new model.

A more significant difference between the two studies was the changes that we made to the agent's hinting style. One reason for not being able to prove our learning gains hypothesis is that learning may have been obstructed by interface issues which are unrelated to the student model or lack thereof. Thus, we turn to the log files to see if we can find evidence of how these changes may have impacted learning. The two main differences between the original and the new hinting style are:

• When to hint: how we determined when to hint affects the frequency of hints as well as what is hinted on each time; and

• How to hint: changes included the introduction of new hints, the length of hints, and the examples shown in dialogue boxes.

We discuss each of these changes in turn, and their implications for the old model and the new model conditions.
When to hint

One of the major changes in this new agent hinting style is that we have two types of hints, factors hints and common factors hints, and we chose whether to intervene with one of these based on the model's assessment of the relevant factorization nodes or the common factor node. In the previous agent hinting style, when to hint was determined in a different way (described in section 5.1). We wished to find out if our new algorithm for determining when to hint caused the two models to hint in a way that differentiated them and thus could potentially affect learning gains in different ways. As we saw in the analysis of the agent accuracy when investigating Hypothesis 1, the new model assessed the factorization nodes as being unknown more often than the old model. The reverse was true for the common factor node. We wished to see how this played out when used by the new agent hinting algorithm, as the algorithm selects only one of the two hints to be given at any particular point in time, even if both nodes are assessed as below threshold (see algorithm in figure 5.2).

                                         Old model     New model     Difference t-test
Average number of hints                  12.9 (14.7)   16.3 (5.5)    not significant
Average number of common factor hints    7.9 (13.4)    2.9 (0.8)     not significant
Average number of factors hints          4.9 (2.1)     13.5 (5.2)    p=0.000

Table 6.18: Average number of total hints, common factors hints and factors hints given per student in the old model and new model conditions. Standard deviations are shown in brackets.

Table 6.18 displays the average number of hints that were given during the ablation study, across all students, for each of the two conditions. There is no significant difference between the old model and the new model on the total number of hints (see the first row of table 6.18); both groups received between 13 and 16 hints on average.
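The "Difference t-test" column in table 6.18 reports independent-samples t-tests comparing the two conditions. A minimal Welch-style sketch of such a comparison is shown below; the per-student hint counts are invented for illustration, not the study's data, and a p-value would additionally require the t distribution with Welch-Satterthwaite degrees of freedom.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# invented per-student factors-hint counts for the two conditions
old_hints = [4, 5, 6, 4, 5]
new_hints = [12, 14, 13, 15, 13]
print(welch_t(old_hints, new_hints))  # strongly negative: new > old
```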
However, the standard deviation for the old model is much higher than the standard deviation for the new model. Breaking the number of hints down further into common factors and factors hints (see table 6.18), we see that the variability in the number of hints for the old model stems from the number of common factors hints (which has high variability), not from the number of factors hints. We looked to the log files to determine the source of this high variability in common factors hints. Examining the log files of students in the old model condition, we observed that two students in the old model condition received a common factor hint after almost every move they made. The old model had assessed their common factor knowledge very low after they fell repeatedly on the same numbers at the beginning of the interaction. When the two students finally overcame those numbers, they continued to fall frequently, and were never quite able to recover, although they did make some correct moves. One of these students was assessed on pre-test and post-test as knowing the common factor concept, the other as not knowing it. Other than these two students, most of the students in the old model condition received very few common factor hints at all. The average number of hints, common factor hints, and factors hints with these two students removed from the analysis are shown in table 6.19.

                                         Old model    New model    Difference t-test
Average number of hints                  7.6 (3.6)    16.3 (5.5)   p=0.000
Average number of common factor hints    3.0 (1.8)    2.9 (0.8)    not significant
Average number of factors hints          4.6 (2.0)    13.5 (5.2)   p=0.000

Table 6.19: Average number of total hints, common factors hints and factors hints given per student in the old model and new model conditions after we remove two students with outlying behaviour. Standard deviations are shown in brackets.
Without these two students, the actual numbers of common factor hints that were given are much more similar for the two models. There is no significant difference between the number of common factor hints given by the agent using the old model and the agent using the new model. However, examining table 6.19 we see that there is a significant difference between the number of factors hints given in each of the two conditions. Consistent with the finding that the new model assesses the factorization nodes as unknown more often than the old model, we see that the students in the new model condition received more factorization hints on average than students in the old model condition, and thus also more total hints on average. To give an idea of how often the agent was intervening, table 6.20 shows the average number of hints each student received on each level, as well as the percentage of moves which were followed by a hint. It is worth remembering that although the new model causes the agent to intervene more often with factors hints, a higher proportion of these interventions are justified. Thus, we see that the two models are causing the agent to behave differently in its interventions. With the agent intervening differently for the two conditions, we would have expected to see a difference between the two groups on learning gains,
           Old Model                            New Model
Mountain   Moves      Fact Hints   CF Hints     Moves      Fact Hints   CF Hints
1          7.3 N=15   0.9 (12.8%)  1.9 (25.7%)  7.5 N=17   0.6 (8.6%)   0.9 (11.7%)
2          7.0 N=15   0.9 (13.3%)  2.7 (37.9%)  7.9 N=17   3.2 (40.4%)  0.8 (10.5%)
3          10.3 N=14  1.2 (11.8%)  3.1 (29.9%)  10.8 N=17  2.4 (22.3%)  0.6 (5.4%)
4          10.1 N=14  0.6 (5.6%)   2.0 (19.7%)  13.8 N=16  4.1 (29.9%)  0.5 (3.6%)
5          15.2 N=11  2.8 (18.6%)  1.6 (10.8%)  15.4 N=11  3.9 (25.4%)  0.2 (1.2%)
6          9.9 N=8    0.5 (7.6%)   0.4 (3.8%)   10.0 N=5   1.8 (18.0%)  0.0 (0.0%)
7          4.3 N=4    0.0 (0.0%)   0.0 (0.0%)   4.0 N=1    0.0 (0.0%)   0.0 (0.0%)
8          7.0 N=4    1.0 (14.3%)  0.8 (10.7%)  2.0 N=1    0.0 (0.0%)   0.0 (0.0%)

Table 6.20: Average number of moves per student per mountain (N denotes the number of students who made it to that mountain) and the average number of common factor and factorization hints given per mountain. The numbers in brackets denote the percentage of moves for which a factorization or common factor hint was given. These calculations are across all of the old model and new model students.

particularly since the new model students received more justified hints. There must have been another barrier to learning which arose from the new hinting strategy, so we look next at what was actually contained in the hints that were given to the students.

How to hint

One of the major differences with the new hinting style is that the hints given were longer than those in the previous version. We would like to find out if students are taking the time to read the hints and examples carefully, because if they are not, then the hints will not help them. This is especially important because the hints were not spoken aloud in this study, as they were in [28], thus it was possible for students to ignore the hints altogether. Across all of the students in the two with agent conditions, the average time between a hint being presented and the student moving was 12.82 seconds (s.d.
4.22), averaged across all of the hints that the student saw. However, not all of this time is spent reading the hint and example; the student also spends some time thinking about where to move next. We also calculate the average time between moves for each student (9.14 sec on average, s.d. 3.02), and subtract this from her average time between a hint and a move, thus getting an estimate of the average amount of time each student spent reading a hint. The average time spent reading hints across students was 3.42 seconds (s.d. 2.62). There were no significant differences between time spent reading hints for the old model and new model groups as measured by an independent samples t-test (p=0.664). There was also no significant correlation between the time spent reading hints and learning (r=-0.276, p=0.132). The average adult reader can read 3.4 words per second [42]. With hints that were 22.5 words on average, adults should have been taking 6.62 seconds on average to read the hints. Thus, at only 3.42 seconds, it is conceivable that students were not taking the time to read the hints thoroughly and think about their meaning.

However, average reading times do not tell the whole story, as this measure represents an average time across all hints. We might expect a student to read hints more closely at the beginning, and less thoroughly the second time around.

                          factors stream    common factors stream
Avg. time cycle 1 (sec)   38.68 (6.36)      26.96 (8.31)
                          N=25              N=25
Avg. time cycle 2 (sec)   34.93 (7.85)      21.67 (3.51)
                          N=15              N=3
Avg. decrease             -3.75             3.00
significance              p=0.08            p=0.48
                          N=12              N=3

Table 6.21: Average time spent reading the hints in each stream (minus the focus hints) the first cycle through and the second cycle through. Reading times are not adjusted for moving time.
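The reading-time estimate described above (average hint-to-move gap minus average move-to-move gap) is straightforward to compute per student; a minimal sketch, with toy interval values rather than the study's log data:

```python
from statistics import mean

def estimated_reading_time(hint_move_gaps, move_gaps):
    """Estimate a student's average hint reading time (seconds).

    hint_move_gaps: intervals between a hint appearing and the next move.
    move_gaps: ordinary intervals between consecutive moves.
    The mean move interval is treated as thinking time and subtracted out.
    """
    return mean(hint_move_gaps) - mean(move_gaps)

# Toy intervals, not values from the study's logs:
print(estimated_reading_time([12.0, 14.0, 13.0], [9.0, 9.5, 8.5]))  # -> 4.0
```

The subtraction assumes move-planning time is roughly constant whether or not a hint was shown, which is the simplification made in the analysis above.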
To confirm this, we calculated the average time that a student spends between receiving a hint and moving, the first time through the cycle and the second time through the cycle, for each of the common factor stream of hints and the factors stream of hints. We do not subtract the average time between moves for these values, because we do not expect this value to be constant³. These results are shown in table 6.21. The calculated time for each stream does not include the level 1 focus hint because it is not repeated the second time through the cycle. We see from examining the results in table 6.21 that students spend less time between receiving a hint in the factors stream and moving the second time they have seen the hints. This result is marginally significant (p=0.08). Students also spend less time between receiving a hint in the common factors stream and moving the second time they have seen the hints. However, this difference between cycle 1 and cycle 2 is not significant (p=0.48).

³It may be that students spend more time between moves at the beginning of the game when they are learning the rules, and less time later on. Or it is possible that students spend less time between moves at the beginning of the game when the numbers are easy, and more time later. Although it would have been possible to mine this information from the log files by picking two particular moves to represent "the beginning of the game" and "later on", this would require an investigation of which moves are representative and at which point the difference occurs, thus we chose not to investigate it at this time.

We were also interested in finding out if students read particular hints more than others. Table 6.22 breaks down the average time spent reading each hint the first time and second time it was presented, averaged across all students. Note that some of these times may be misleading, as far fewer students saw many of the hints the second time around, as indicated by the N values.

                     common factors stream           factors stream
Hint                 Focus    Def. 1   Def. 2    Focus    Def. 1   Def. 2   Tool     Bottom
Words                20       34       32        19       26       27       25       24
Avg. time 1st view   13.56    13.63    13.36     12.03    13.57    13.00    11.67    10.64
(seconds)            (6.42)   (6.45)   (4.53)    (4.53)   (4.41)   (3.75)   (3.72)   (4.65)
                     N=27     N=27     N=25      N=30     N=30     N=29     N=27     N=24
Avg. time 2nd view   n/a      13.00    12.00     n/a      11.33    12.12    11.71    9.14
(seconds)                     (7.35)   (1.73)             (3.33)   (3.76)   (3.35)   (6.40)
                              N=4      N=3                N=18     N=17     N=17     N=21

Table 6.22: Average reading time across all students for each hint on the first time it is given and the second time it is given. Standard deviations are shown in brackets. N denotes the number of students used in the calculation. The number of words in each hint is also shown.

Using only students who saw each hint twice, we calculate differences between the time spent reading each hint using paired-samples t-tests between the mean reading time for hint x view 1 and hint x view 2. The only one of these t-tests that was significant was that for the factors stream hint definition 1 (p=0.029). It may be the case that by the first viewing of hint definition 2, students are already starting to tune out, although there are no statistically significant differences between the time reading hint definition 1 and hint definition 2 for either the common factors stream or the factors stream. There was no significant difference between the first time seeing definition 1 and the second time seeing definition 1 for the common factors stream, although it is worth noting that the N is smaller as well.

We do notice by examining table 6.22 that, as expected, students spend more time reading the hints that involve examples (definitions 1 and 2) than those that do not. However, the additional time spent reading these hints does not account for the longer length of these hints, and the fact that they contain mathematics examples that the student must think about. The definition hints are 10 words longer on average than the corresponding focus hints (see table 6.22), thus we would expect an average reader to spend approximately 3 seconds longer to read them, plus time to examine the example. However, examining table 6.22 we see that students are not taking this time, and thus are probably not reading the hints thoroughly.

By way of comparing the hints from this new hinting style with the hints given in the study described in [28], we present table 6.23, which provides us with an indication of how much each hint contributes to learning. For each hint we show the percentage of the total hints that were of this type. We also show the percentage of these hints that were followed by a correct move, as this is a form of improvement that can be gleaned from the hint.

old hinting strategy (study in [28]):
hint    % of total hints    % followed by correct action
1_1a    21%                 87%
1_1b    9.2%                83%
1_3     22.4%               65%
2_1     7.9%                100%
3_1     11.8%               N.A.

new hinting strategy (current ablation study):
hint          % of total hints    % followed by correct action
Focus (F)     6.0%                83.3%
Def 1 (F)     12.5%               75.3%
Def 2 (F)     11.7%               86.1%
Tool (F)      6.8%                83.3%
Focus (CF)    6.0%                81.1%
Def 1 (CF)    12.8%               59.5%
Def 2 (CF)    10.6%               55.4%
Bottom        33.6%               84.1%

Table 6.23: Comparison between old hinting strategy hints and new hinting strategy hints on the percentage of each hint type given and the percentage followed by a correct action.
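A paired-samples t statistic of the kind used for these view-1 versus view-2 comparisons can be computed directly; the per-student reading times below are invented for illustration, not values from table 6.22:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first_view, second_view):
    """Paired-samples t statistic on per-student reading times.
    Positive t means reading time dropped on the second viewing."""
    diffs = [a - b for a, b in zip(first_view, second_view)]
    # Mean difference divided by its standard error.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical reading times (seconds) for five students who saw a hint twice:
view1 = [13.6, 13.0, 14.2, 12.8, 13.9]
view2 = [11.7, 11.5, 12.0, 12.6, 11.9]
print(round(paired_t(view1, view2), 2))
```

A library routine such as scipy.stats.ttest_rel would also return the p-value; the statistic itself is just the mean difference over its standard error, as above.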
We see that the percentage of correct moves which followed each hint is similar for most of the hints in the two studies, with the exception of hint 1_3 ("Do you know why you are correct this time?") in the previous study, and hints definition 1 common factors and definition 2 common factors in the current study. It would be worth examining new ways to express each of these hints to see if they improve the number of correct moves which follow them. Conati and Zhao [28] also provide an analysis of the correlation that each hint has with learning. We performed such an analysis as well; however, none of our hints were significantly correlated with learning. This is probably due to the problems with the test which assesses learning, as was discussed in the last section.

We conclude this section by determining that although the two models cause the agent to behave differently in its hinting style towards the students, we have indications that the students are not taking the time to read the hints thoroughly, and thus the hints do not help them. Hypothesis 3 - learning has been obstructed by changes to the agent's hinting style - is likely one of the reasons that we did not observe differences in learning between the old model and the new model conditions in this study. Therefore, with respect to our original hypothesis, we have determined that both changes to the test used to determine learning (hypothesis 2) and changes to the agent interventions (hypothesis 3) are likely causes of why we failed to prove the learning gains hypothesis. To answer the question of whether or not a more accurate student model contributes to learning, we would need to re-run our study, controlling for both the test and the agent hinting style.

We now turn away from learning gains, and look to the students' perceptions of the agent and whether these differed between the old model and new model conditions.
6.2.4 Student perceptions of the agent

To examine how the students felt about the agent, we look to the questionnaires that the students and experimenters filled out after game play. We decided to collect information on the students' assessments of the agent only after running the study at the first school, thus we only have data from one school, resulting in 9 students in the new model condition, 10 in the old model condition, and 4 in the no agent condition.

In the no agent condition the responses to the question "If you played Prime Climb again, would you rather play a) with someone to help, or b) without other's help" are shown in table 6.24. We also show the results obtained by Conati and Zhao [28] with the same question.

Study                 a) with someone to help   b) without other's help   Total
present study         3                         1                         4
Conati & Zhao 2004    4                         4                         8
Total                 7                         5                         12

Table 6.24: Number of students in the no agent group that responded a) or b) to the question "If you played Prime Climb again, would you rather play a) with someone to help, or b) without other's help". Results are shown for both the study discussed here and that of Conati and Zhao [28].

These results are difficult to generalize with so few students, but if we pool the data from both studies it seems to indicate that about 50% of the students who did not have an agent to help during the game would have liked to receive some form of assistance. We do not know, however, how these students would feel towards the agent if such help were provided.

The average scores on each of the seven questions rated by the students in the two with-agent conditions from 1 (strongly disagree) to 5 (strongly agree) are shown in table 6.25, broken down by old model, new model, and overall.

Question                 Old Model     New Model     t-test    Overall
Q1: helpful              3.60 (0.22)   2.56 (0.34)   p=0.017   3.11 (0.99)
Q2: understands          3.00 (1.05)   2.67 (1.41)   n.s.      2.84 (1.21)
Q3: play better          2.80 (0.92)   2.44 (1.13)   n.s.      2.63 (1.01)
Q4: learn factorization  3.20 (0.92)   2.56 (1.13)   n.s.      2.89 (1.04)
Q5: too often            3.20 (1.48)   3.89 (1.05)   n.s.      3.53 (1.31)
Q6: not enough           1.50 (0.53)   1.56 (0.73)   n.s.      1.53 (0.61)
Q7: liked                3.60 (1.07)   3.11 (1.36)   n.s.      3.37 (1.21)

Table 6.25: Average responses to student questions on a Likert scale from 1 (strongly disagree) to 5 (strongly agree), grouped by condition and overall. The standard deviation appears in brackets. The difference between the old model and new model is tested with a t-test, where n.s. denotes not significant.

We have confidence that the students correctly interpreted each question because, in general, the responses given by the students correlate in ways that we would expect. Significant correlations are shown in table 6.26.

Correlation                                 r       p
Q1: helpful with Q4: learn factorization    0.757   0.000
Q1: helpful with Q7: liked                  0.658   0.002
Q3: play better with Q6: not enough         0.600   0.007
Q4: learn factorization with Q7: liked      0.732   0.000

Table 6.26: Significant Pearson correlations between student-answered questions about Merlin.

The more helpful the student finds Merlin, the more she thinks that he has helped her learn number factorization. The more helpful she finds Merlin, the more she likes him. The more she thinks Merlin helped her play better, the more she wished he would have intervened. And finally, the more the student thinks Merlin helped her learn number factorization, the more she likes him. Thus, there do not appear to be any conflicting ratings in table 6.25, so we feel confident that the students correctly interpreted the questions.

Looking at table 6.25, we see that in general the students think that the agent is intervening too often. Students rate Q5: too often at 3.53/5, and Q6: not enough at 1.53/5 across both groups. This could be rectified by further reducing the thresholds; however, it highlights yet again the trade-off between maintaining engagement by not intervening too often, and fostering learning.
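The values in table 6.26 are Pearson coefficients over the students' Likert responses; for illustration, a small self-contained computation on hypothetical ratings (not the study's questionnaire data):

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two response vectors."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical Likert ratings (1-5) for six students: Q1 "helpful" vs Q7 "liked".
helpful = [4, 3, 5, 2, 4, 1]
liked = [4, 3, 4, 2, 5, 2]
print(round(pearson_r(helpful, liked), 2))
```

A routine such as scipy.stats.pearsonr would additionally return the p-value used to decide which correlations count as significant in table 6.26.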
Although students don't like to be interrupted, it might be what's best for them. A decision-theoretic approach which combines a model of student learning with a model of student affect is a way around this issue.

Although at a glance it would appear from table 6.25 that the old model is rated higher than the new model, the only difference that is statistically significant is Q1: helpful (p=0.017); students in the old model condition find Merlin more helpful than students in the new model condition. We recall that the students in the new model condition received more hints on average than the students in the old model condition, and that more of these hints were justified. However, we also saw that students were not reading the hints thoroughly, perhaps because they did not find the hints helpful. If more of the students in the new model condition needed help and received a hint, but the hint was not helpful, we might expect more of these students to rate the agent as "unhelpful". This again points us to the need to improve the hints so that the students will read them and find them useful.

6.2.5 Experimenter questionnaires

The results of the experimenter questionnaires are shown in table 6.27. Because we were able to mine information about the magnifying glass usage and students' requests for more hints directly from the log files, we do not include questions 6 and 7, which addressed these.

Question                 Scaling                       Old          New          Overall
E1: listened             1 (always) - 4 (not at all)   2.70 (0.67)  2.56 (1.13)  2.63 (0.90)
E2: act w/out listening  1 (always) - 4 (not at all)   2.00 (0.82)  1.44 (0.88)  1.74 (0.87)
E3: like to listen       1 (no) - 5 (yes)              2.40 (0.97)  2.67 (1.22)  2.52 (1.07)
E4: help too much        1 (no) - 5 (yes)              1.70 (0.95)  2.78 (1.39)  2.21 (1.27)
E5: not enough help      1 (no) - 5 (yes)              1.90 (1.52)  1.22 (0.44)  1.58 (1.17)

Table 6.27: Average responses to experimenter questions, grouped by condition (old model and new model) and overall.
Of the measures shown in table 6.27, the only difference that is marginally significant is E4: help too much (p=0.072). The experimenters judged that in the case of the new model, the agent intervened too much. However, as we have covered in previous sections, this is likely because the new model intervenes more than the old model.

What is of more interest are the comments that the experimenters jotted beside the questions. In three of the new model cases, and one of the old model cases, the experimenter wrote a comment beside E4: help too much saying that the partner had rested on the number 81 a long time in level 3, and this was when the student received too many hints. Subsequent examination of the log files revealed that this was usually the case. The number 81 has a very low prior probability - 0.31 - and this number appears for the first time at a tricky part on the mountain. Thus, students receive hint after hint as they attempt to climb past this point while their partner is still on the number 81. Looking back to table 6.17, we see that the number 81 is encountered more times on average than any other number, probably for this very reason. It appears that this tricky part of the mountain around the number 81 is the area where the experimenters judged that the agent helped too much. One possible solution to this problem would be to not have the agent hint on the same number twice in x moves.

We now turn our attention to other interesting findings in the data which can be used to inform future design of the system.

6.2.6 Other mined student actions

In this section we look at other actions which were mined from the data of all of the students' actions. We pool the data together across the three conditions with the purpose of determining how students play Prime Climb and what improvements we could make to encourage students to use the features more effectively.

Magnifying glass use

Statistics for magnifying glass use are shown in table 6.28.

% of students that used the magnifying glass                   31.8%
average number of times magnifying glass used                  8.6 (12.5), N=14
% of tool hints that are followed by a magnifying glass use    0.1% (0.2), N=28

Table 6.28: Magnifying glass usage statistics including (i) the percentage of students that used the magnifying glass at least once, (ii) of the students that did use the magnifying glass, the average number of times that it was used, and (iii) the percentage of all of the tool hints given that were followed by the use of the magnifying glass. Standard deviations are shown in brackets. N denotes the number of students used in the calculation.

% of students that requested hints          18.2%
average number of times hints requested     2.0 (1.2), N=8
% of hints given that were requested        3.4% (7.6), N=44

Table 6.29: Statistics on hints requested by the students, including (i) the percentage of students that requested a hint at least once, (ii) of the students that requested hints, the average number of hints that were requested, and (iii) the percentage of all hints given that were requested by the student. Standard deviations are shown in brackets. N denotes the number of students used in the calculation.

In the first row of table 6.28 we see that only 31.8% of all students used the magnifying glass even once. This was despite the fact that the magnifying glass was pointed out and its use was demonstrated in the introduction given prior to game play, and despite it being clear from observing the students that they would have benefitted from using it. In the second row of table 6.28 we see that of the 14 students that did use the magnifying glass, each used it on average 8.6 times. Note that there is a high standard deviation, indicating that there was a lot of variability across students.
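Usage statistics of this kind can be mined from the interaction logs; the sketch below assumes a simplified event format (student id plus an action label), which is not the actual Prime Climb log schema:

```python
from collections import defaultdict

def magnify_stats(events):
    """From chronological (student_id, action) pairs, compute
    (i) the fraction of students who used the magnifying glass at least once,
    (ii) the fraction of tool hints immediately followed by a magnifying
    glass use. Action labels here are hypothetical, not the real log schema."""
    by_student = defaultdict(list)
    for sid, action in events:
        by_student[sid].append(action)
    users = 0
    tool_hints = followed = 0
    for actions in by_student.values():
        if "magnify" in actions:
            users += 1
        for i, action in enumerate(actions):
            if action == "tool_hint":
                tool_hints += 1
                if i + 1 < len(actions) and actions[i + 1] == "magnify":
                    followed += 1
    return users / len(by_student), (followed / tool_hints if tool_hints else 0.0)

events = [(1, "move"), (1, "tool_hint"), (1, "move"),
          (2, "tool_hint"), (2, "magnify"), (2, "move")]
print(magnify_stats(events))  # -> (0.5, 0.5)
```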
Thus, many students are not using the magnifying glass, despite the fact that it is a tool which allows the student to see the factorization of any number on the game board. We would like to encourage students to make more use of the tools available; however, thus far our prompting has fallen on deaf ears: only 0.1% of the hints given which prompt students to use the magnifying glass are followed by a use of the magnifying glass. We need to determine a better way to encourage students to make use of this tool, for example having Merlin use the magnifying glass for the student on a number that is stumping her.

Requested hints

Table 6.29 shows statistics on the hints requested by the students, that is to say the hints which arose from the student responding "YES" to the dialogue box question "Do you want another hint?". This table shows that only 18.2% of the students requested a hint even once. Of the students that did request hints, on average they requested only 2.0 hints. The requested hints make up only 3.4% of all of the hints that were given in the game. This is further evidence that the students do not find the hints helpful or worth taking the time to read.

6.2.7 Results conclusion

In this chapter we have presented the results from an ablation study in which students played with one of three versions of the Prime Climb game, with the goal of uncovering an answer to the question of whether a more accurate student model induces significantly more learning in students playing Prime Climb. Although we were able to conclude that the new model was more accurate than the old model, we were hindered in answering this question by problems with the test used to assess learning and by changes to the hints which caused students not to read them thoroughly.
In order to determine which of these two was the main cause of not being able to find learning gains in the students playing Prime Climb, we would need to run a study in which we isolated each of these two factors.

Our discussion has frequently come back to the issue of the trade-off between intervening when it is necessary, but not bothering the student too much. It is possible that in the study described in [28] the hints that were provided were unobtrusive enough to ensure that the students would read them; in contrast, our hints are too obtrusive, so the students ignore them. The long-term solution to the problem of balancing this trade-off is to devise a hinting strategy which combines the model of student learning with a model of student affect, as it appears to be very important to take affect into account.

Other findings from this study are that students are not making use of the magnifying glass or the opportunity to ask for more hints. Some suggested improvements to the agent's hinting style gleaned from our discussion of the data analysis include:

• Implement a heuristic in which the agent doesn't hint more than twice on the same number in x moves,

• Implement a rule which states that the agent should not intervene for at least x moves after the student has indicated that she does not want more help,

• Reduce the amount of text in the hints and examples,

• Break each definition/example pair down into two separate hints, or only show an example if the student requests it,

• Reduce the obtrusiveness of the dialogue boxes. For example, the "do you want more hints" dialogue box could just disappear after a period of time if the student does not respond, rather than requiring the student to select "NO", and

• Encourage students to use the magnifying glass by having Merlin hint by demonstrating its use on a number.
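The first two suggestions above amount to a simple gate in front of the agent's intervention decision; a sketch, with purely illustrative thresholds:

```python
def should_hint(number, hint_history, moves_since_refusal, x=5):
    """Gate an intervention: suppress the hint if the agent already hinted on
    this number twice among the last x hints, or if the student declined
    more help fewer than x moves ago. Thresholds are illustrative only,
    not values used in the study."""
    if hint_history[-x:].count(number) >= 2:
        return False
    if moves_since_refusal is not None and moves_since_refusal < x:
        return False
    return True

print(should_hint(81, [81, 81, 12], None))  # -> False (already hinted on 81 twice)
print(should_hint(81, [81, 12, 14], None))  # -> True
```

Such a gate would, for instance, stop the stream of repeated hints observed around the number 81 on mountain 3.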
Further follow-up research ideas that might be worth pursuing are:

• Implement a decision-theoretic agent that uses both a model of student learning and a model of student affect to determine its actions,

• Carry out a comparison study between the old agent hinting style and the new agent hinting style to see if this is the main cause of the lack of learning in the three conditions, and

• Carry out a study using the test from [28] to assess learning.

Thus, although we have not been able to prove our original hypothesis true, we have clear directions for where to go with this research.

Chapter 7

Conclusions and Future Work

7.1 Satisfaction of thesis goals

In this thesis we set out to achieve four goals. At this point we review those goals and whether we have achieved them.

G1: Use data from a user study to make changes to a model for student learning with an educational game, and assess empirically the improvement in accuracy brought about by these changes.

• In chapter 4 we described the incremental changes that we made to the original student model to improve its accuracy. Using data collected from a user study, we assessed the accuracy of the old model and each of the two new models: the new model with the common factor node and the new model without. We were able to improve the accuracy of the model from 50.8% with the old model to a high of 83.8% with individualized priors. We also compared the three models statistically using the area under the ROC curve metric.

Unfortunately this does not appear to generalize to the new data from the ablation study. One possible reason for not being able to achieve high accuracy is that many of the nodes across which we calculated accuracy did not begin with accurate priors. Our future work is to use the priors from the ablation study and determine how these improve the accuracy on the same data.

G2: Investigate the role that the various parameters and prior probabilities play in the model.
• This analysis was carried out in section 4.2.4 for the a, guess, edu_guess and max parameters and for the prior probabilities of the nodes. We found that the ideal parameter settings had high values for the guess parameters, confirming that students can be successful at playing the game without having the underlying knowledge. The ideal setting for the max parameter was low, indicating that the teacher-suggested relationship between factorization nodes is not as important as we originally anticipated. We found that the model is not very sensitive to small changes in the model parameters. With the model prior probabilities, we concluded that although individualized priors are best, generic priors also significantly improve the accuracy of the model over default priors. As we learned in chapter 6, it is important to begin with accurate priors.

G3: Assess the role that the student model plays in the learning outcomes achieved by the pedagogical agent's interventions.

• This was investigated in chapter 6. We learned that a model of student learning alone is not sufficient for effective pedagogical interventions. Despite the fact that many of the interventions were justified, students were still not taking the time to read the hints, and thus to learn from them. We conclude that a pedagogical agent must take both student knowledge and student affect into account when determining whether to intervene.

G4: Assess the role the model plays in students' assessment of the pedagogical agent and suggest improvements to the agent's interventions.

• We determined that the students playing with the more accurate student model deemed the agent to be less helpful; however, this was likely due to the hints that the agent was providing. We offered suggestions for changes to the agent's interventions in section 6.2.7.

7.2 Future Work

This thesis has defined some clear goals for future work.
Some avenues worth pursuing are listed in the three sections below for the model of student learning, determining learning gains playing Prime Climb, and longer-term projects.

7.2.1 Model of student learning

There are several ways that we may be able to improve the model's ability to accurately assess student knowledge.

1. Collect more data to refine the parameters in the conditional probability tables. As described in section 4.4, accuracy on the common factor node may be low because we have simplified the conditional probability tables for this node. We noted that there may be different types of guesses, which we have lumped together in the parameter guess because we did not have enough data to estimate the parameters in a more complex model. Collection of more data would allow us to refine this parameter. As well, as noted in the last chapter, collecting data to set generic priors for more of the factorization nodes may also improve accuracy.

2. Devise methods for setting priors for the factorization nodes when data is not available. It may not be possible to collect data to set the priors for all of the factorization nodes in the network; however, quick heuristics could be used to set the priors for nodes that we do not have data for, rather than using 0.5. For example, perhaps higher priors could be set for numbers that are divisible by 2 or 5, as these are factors that students often know, and very large numbers could be given low priors, etc. We may look for these patterns of student knowledge in either data that we have collected thus far, or by asking a domain expert (a 6th grade teacher) what is reasonable for students this age.

3. Model the agent's interventions: each time the agent provides a hint about a particular number, the student's knowledge of that number (or of the common factor concept) may increase if the student reads the hint.
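The prior-setting heuristics sketched in item 2 could look something like the following; the cut-offs and probabilities are purely illustrative guesses, not parameters fitted in this work:

```python
def heuristic_prior(n):
    """Rough prior that a 6th grader knows the factorization of n when no
    data is available. Cut-offs and values are illustrative only."""
    if n > 60:                     # very large numbers: low prior
        return 0.3
    if n % 2 == 0 or n % 5 == 0:   # familiar small factors
        return 0.7
    return 0.5                     # default when nothing else applies

print(heuristic_prior(81), heuristic_prior(10), heuristic_prior(21))  # -> 0.3 0.7 0.5
```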
There are also interesting investigations that can be made about the accuracy of the current student model, for example:

1. Determine how long it takes for the student model to become accurate. Explore methods of assessing student knowledge mid-way through the game using online quizzes to determine how well the student model performs during the game. We have done an initial investigation of this phenomenon; however, we were limited by the fact that we could only assess accuracy on data points that were constant from pre-test to post-test.

2. Investigate how much evidence the student model requires. Compare accuracy on nodes for which the model had direct evidence versus nodes for which it did not see direct evidence.

7.2.2 Learning gains with Prime Climb

The question of whether our more accurate student model can increase learning in students playing Prime Climb remains open. As was discussed in the last chapter, the ablation study would need to be repeated, accounting for the two confounding variables: the test which was used to assess learning and the agent hinting style.

7.2.3 Long term projects

One long-term project that has been mentioned often in this thesis is the introduction of a decision-theoretic mechanism for determining when to hint, which takes into account both a model of student learning and a model of student affect. This is the long-term goal of the research group to which the author belongs.

In this thesis we have presented a model of student learning for a rather simplistic educational game, Prime Climb. An open research question is whether this approach can be extended to modeling learners in more complex games. Does the approach scale up? What are the limiting factors?

7.3 Conclusion

In this thesis we have evaluated a model of student learning for an educational game, Prime Climb.
We made incremental improvements to this model and assessed its ac-curacy directly using a cross-validation approach and indirectly via an ablation study. We concluded by suggesting improvements to the agent interface for Prime Climb and suggesting directions for future research. 160 B i b l i o g r a p h y [1] S.E. Ainsworth, D.J. Wood, and C . C M a l l e y . There is more than one way to solve a problem: Evaluating a learning environment that supports the development of children's multiplication skills. Learning and Instruction, 8(2):141-157, 1998. [2] V . Aleven and K.R. Koedinger. Limitations of student control: Do students know when they need help? ITS '00: Proceedings of the 5th International Conference on Intelligent Tutoring Systems, pages 277-288, 2000. [3] V . Aleven, K . R . Koedinger, and K . Cross. Tutoring answer explanation fosters learning with understanding. In AIED '05: Proceedings of the 12th International conference on Artificial Intelligence in Education, pages 199-206, 1999. [4] J.R. Anderson, A.T. Corbett, K.R. Koedinger, and R. Pelletier. Cognitive tutors: Lessons learned. The Journal of the Learning Sciences, 4(2):167-207, 1995. [5] I. Arroyo, J .E. Beck, B.P. Woolf, C R . Beal, and K . Schultz. Macroadapting animalwatch to gender and cognitive differences with respect to hint interactiv-ity and symbolism. In ITS '00: Proceedings of the International conference on Intelligent tutoring systems, pages 574-583, 2000. [6] I. Arroyo and B.P. Woolf. Inferring learning and attitudes from a bayesian network of log file data. In AIED '05: Proceedings of the 12th International conference on Artificial Intelligence in Education, 2005. [7] P. Baffes and R. Mooney. Refinement-based student modeling and automated bug library construction. Journal of Artificial Intelligence in Education, 7(1):75-116, 1996. Chapter 7. Conclusions and Future Work 161 [8] R.S. Baker, A . T . Corbett, K . R . Koedinger, and A . Z . Wagner. 
Off-task behavior in the cognitive tutor classroom: when students "game the system". In CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 383-390, New York, NY, USA, 2004. ACM Press.

[9] M. Barnett, K. Squire, T. Higgenbotham, and J. Grant. Electromagnetism supercharged! In Proceedings of the International Conference of the Learning Sciences, 2004.

[10] C.R. Beal, J. Beck, D. Westbrook, M. Atkin, and P.R. Cohen. Intelligent modeling of the user in interactive entertainment. American Association for Artificial Intelligence Spring Symposium, 2002.

[11] C.R. Beal, W.L. Johnson, R. Dabrowski, and S. Wu. Individualized feedback and simulation-based practice in the Tactical Language Training System: An experimental evaluation. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, pages 747-749, 2005.

[12] J. Beck. Engagement tracing: using response times to model student disengagement. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, 2005.

[13] J. Beck and J. Sison. Using knowledge tracing to measure student reading proficiencies. In ITS '04: Proceedings of the International Conference on Intelligent Tutoring Systems, 2004.

[14] J.E. Beck, I. Arroyo, B.P. Woolf, and C.R. Beal. An ablative evaluation. In AIED '99: Proceedings of the International Conference on Artificial Intelligence in Education, pages 611-613, 1999.

[15] J.E. Beck, P. Jia, and J. Mostow. Assessing student proficiency in a reading tutor that listens. In UM '03: Proceedings of the International Conference on User Modeling, pages 323-327, 2003.

[16] J.E. Beck, M. Stern, and B. Woolf. Using the student model to control problem difficulty. In UM '97: Proceedings of the 7th International Conference on User Modeling, pages 277-288, 1997.

[17] B.S. Bloom.
The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13:4-16, 1984.

[18] R.R. Burton and J.S. Brown. An investigation of computer coaching for informal learning activities. In Intelligent Tutoring Systems. Academic Press, 1982.

[19] R. Chabay and B. Sherwood. The electricity project and the cT programming language. Technical Report 8910, Center for Design of Educational Computing, Carnegie Mellon University, 1989.

[20] N. Christoph, J. Sandberg, and B. Wielinga. Added value of task models and metacognitive skills on learning. In AIED '05 Workshop on Educational Games as Intelligent Learning Environments, 2005.

[21] Civilization. http://www.civ3.com/.

[22] C. Conati. How to evaluate models of user affect? In ADS '04: Tutorial and Research Workshop on Affective Dialogue Systems, pages 288-300, 2004.

[23] C. Conati and M. Klawe. Socially intelligent agents in educational games. In Socially Intelligent Agents - Creating Relationships with Computers and Robots, K. Dautenhahn, A. Bond, D. Canamero, and B. Edmonds, editors, 2002.

[24] C. Conati and J.F. Lehman. EFH-Soar: Modeling education in highly interactive microworlds. Lecture Notes in Artificial Intelligence: Advances in Artificial Intelligence, AI*IA, 1993.

[25] C. Conati and H. Maclaren. Data-driven refinement of a probabilistic model of user affect. In UM '05: Proceedings of the 10th International Conference on User Modeling, 2005.

[26] C. Conati and K. VanLehn. Toward computer-based support of meta-cognitive skills: A computational framework to coach self-explanation. International Journal of Artificial Intelligence in Education, 11, 2000.

[27] C. Conati and K. VanLehn. Using Bayesian networks to manage uncertainty in student modeling. Journal of User Modeling and User-Adapted Interaction, 12(4), 2002.

[28] C. Conati and X. Zhao.
Building and evaluating an intelligent pedagogical agent to improve the effectiveness of an educational game. In IUI '04: Proceedings of the International Conference on Intelligent User Interfaces, pages 6-13, 2004.

[29] A. Corbett, M. McLaughlin, and K.C. Scarpinatto. Modeling student knowledge: Cognitive tutors in high school and college. User Modeling and User-Adapted Interaction, 10:81-108, 2000.

[30] BioWare Corporation. http://www.bioware.com/.

[31] E. Croteau, N.T. Heffernan, and K.R. Koedinger. Why are algebra word problems difficult? Using tutorial log files and the power law of learning to select the best fitting cognitive model. In ITS '04: Proceedings of the International Conference on Intelligent Tutoring Systems, 2004.

[32] Honghua Dai and Gang Li. An improved approach for the discovery of causal models via MML. In PAKDD '02: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pages 304-315, London, UK, 2002. Springer-Verlag.

[33] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Technical report, Brown University, Providence, RI, USA, 1989.

[34] M. Druzdzel and L. van der Gaag. Building probabilistic networks: where do the numbers come from? In IJCAI '95 Workshop at the International Joint Conference on Artificial Intelligence, 1995.

[35] T. Fawcett. ROC graphs: Notes and practical considerations for data mining researchers. Technical report, Intelligent Enterprise Technologies Laboratory, HP Laboratories Palo Alto, USA, 2003.

[36] Elisabeth Goodridge. Java programming game is latest craze. InformationWeek, November 2001.

[37] A. Haldane, G. van Heijst, N. Shalgi, R. de Hoog, and T. de Jong. Is knowledge management just a game? The knowledge management interactive training system. Inside Knowledge, 4, April 2001.

[38] E. Horvitz. Principles of mixed-initiative user interfaces.
In CHI '99: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 159-166, New York, NY, USA, 1999. ACM Press.

[39] J. Fogarty, R.S. Baker, and S.E. Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In GI '05: Proceedings of Graphics Interface, pages 129-136, 2005.

[40] W.L. Johnson and C. Beal. Iterative evaluation of a large-scale, intelligent game for language learning. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, pages 290-297, 2005.

[41] W.L. Johnson, S. Marsella, and H. Vilhjalmsson. The DARWARS tactical language training system. In I/ITSEC '04: Proceedings of the Interservice/Industry Training, Simulation, and Education Conference, 2004.

[42] M. Just and P. Carpenter. The Psychology of Reading and Language Comprehension. Allyn and Bacon, 1986.

[43] D. Kelly and B. Tangney. Predicting learning characteristics in a multiple intelligence based tutoring system. In ITS '04: Proceedings of the International Conference on Intelligent Tutoring Systems, pages 679-688, 2004.

[44] M. Klawe. When does the use of computer games and other interactive multimedia software help students learn mathematics? NCTM Standards 2000 Technology Conference, Arlington, VA, 1998.

[45] M. Klawe. Computer games, education and interfaces: The E-GEMS project. In GI '99: Proceedings of Graphics Interface, 1999.

[46] K.R. Koedinger, J.R. Anderson, W.H. Hadley, and M.A. Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8:30-43, 1997.

[47] J. Lawry, R. Upitis, M. Klawe, A. Anderson, K. Inkpen, M. Ndunda, D. Hsu, S. Leroux, and K. Sedighian. Exploring common conceptions about boys and electronic games. Journal of Computers in Math and Science Teaching, 1993.

[48] J. Lee, K. Luchini, B. Michael, C.
Norris, and E. Soloway. More than just fun and games: assessing the value of educational video games in the classroom. In CHI '04: Extended Abstracts on Human Factors in Computing Systems, pages 1375-1378, New York, NY, USA, 2004. ACM Press.

[49] H. Leemkuil, T. de Jong, R. de Hoog, and N. Christoph. KM Quest: A collaborative internet-based simulation game. Simulation and Gaming, 34(1):89-111, 2003.

[50] H. Leemkuil and R. de Hoog. Is support really necessary within educational games? In AIED '05 Workshop on Educational Games as Intelligent Learning Environments, 2005.

[51] T.W. Malone and M.R. Lepper. Making learning fun: A taxonomy of intrinsic motivations for learning. In Aptitude, Learning and Instruction: Volume III, Conative and Affective Process Analyses, R.E. Snow and M.J. Farr, editors, 1987.

[52] M. Mayo and A. Mitrovic. Optimising ITS behaviour with Bayesian networks and decision theory. International Journal of Artificial Intelligence in Education, 12:124-153, 2001.

[53] A. McFarlane, A. Sparrowhawk, and Y. Heald. Report on the educational use of games: an exploration by TEEM of the contribution which games can make to the education process. Technical report, Cambridge, 2002.

[54] J. McGrenere. Design: Educational multi-player games, a literature review. Technical report, University of British Columbia, Vancouver, BC, Canada, 1996.

[55] T.N. Meyer, T.M. Miller, K. Steuck, and M. Kretschmer. A multi-year large-scale field study of a learner controlled intelligent tutoring system. Artificial Intelligence in Education, 50:191-198, 1999.

[56] E. Millan, C. Carmona, R. Sanchez, and J. Perez de-la Cruz. MITO: an educational game for learning Spanish orthography. In AIED '05 Workshop on Educational Games as Intelligent Learning Environments, 2005.

[57] R.J. Mislevy and D.H. Gitomer.
The role of probability-based inference in an intelligent tutoring system. User Modeling and User-Adapted Interaction, 5:253-282, 1996.

[58] T. Murray, K. Rath, B. Woolf, D. Marshall, M. Bruno, T. Dragon, K. Kohler, and M. Mattingly. Evaluating inquiry learning through recognition-based tasks. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, pages 515-522, 2005.

[59] A.E. Nicholson, T. Boneh, T.A. Wilkin, K. Stacey, L. Sonenberg, and V. Steinle. A case study in knowledge discovery and elicitation in an intelligent tutoring application. In UAI '01: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001.

[60] University of Canterbury. SQL-Tutor. http://www.cosc.canterbury.ac.nz/tanja.mitrovic/sql-tutor.html.

[61] University of Pittsburgh and United States Naval Academy. Andes physics tutor. http://www.andes.pitt.edu/.

[62] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[63] J.M. Randel, B.A. Morris, C.D. Wetzel, and B.V. Whitehill. The effectiveness of games for educational purposes: a review of the research. Simulation and Gaming, 25:261-276, 1992.

[64] E. Reiter, R. Robertson, and L.M. Osman. Lessons from a failure: Generating tailored smoking cessation letters. Artificial Intelligence, 144:41-58, 2003.

[65] M.D. Roblyer, J. Edwards, and M.A. Havrilik. Integrating Educational Technology into Teaching. Prentice-Hall Inc., Upper Saddle River, NJ, USA, 1997.

[66] S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Morgan Kaufmann, San Mateo, CA, USA, 1995.

[67] R. Schafer and T. Weyrath. Assessing temporally variable user properties with dynamic Bayesian networks. In UM '97: Proceedings of the 6th International Conference on User Modeling, pages 377-388, New York, NY, USA, 1997. Springer.

[68] V. Shute.
Smart evaluation: Cognitive diagnosis, mastery learning and remediation. In AIED '95: Proceedings of the International Conference on Artificial Intelligence in Education, pages 123-130, 1995.

[69] V.J. Shute. A comparison of learning environments: All that glitters... In Computers as Cognitive Tools, S.P. Lajoie and S.J. Derry, editors, pages 47-73, 1993.

[70] V.J. Shute and R. Glaser. A large-scale evaluation of an intelligent discovery world: Smithtown. Interactive Learning Environments, 1:51-77, 1990.

[71] V.J. Shute and J. Psotka. Intelligent tutoring systems: Past, present and future. In Handbook of Research on Educational Communications and Technology. Scholastic Publications, 1996.

[72] K. Squire. Replaying History: Learning World History through Playing Civilization III. PhD thesis, University of Indiana, 2004.

[73] K. Squire and S. Barab. Replaying history: Engaging urban underserved students in learning world history through computer simulation games. In Proceedings of the International Conference of the Learning Sciences, 2004.

[74] K. Stacey, E. Sonenberg, A. Nicholson, T. Boneh, and V. Steinle. A teaching model exploiting cognitive conflict driven by a Bayesian network. In UM '03: Proceedings of the International Conference on User Modeling, pages 352-362, 2003.

[75] J. Tan, C. Beers, R. Gupta, and G. Biswas. Computer games as intelligent learning environments: A river ecosystem adventure. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, pages 646-653, 2005.

[76] Unreal Technology. http://www.unrealtechnology.com/.

[77] Carnegie Mellon University. Project LISTEN. http://www-2.cs.cmu.edu/~listen/.

[78] K. VanLehn. Student modeling. In Foundations of Intelligent Tutoring Systems, M. Polson and J. Richardson, editors, pages 55-78. Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.

[79] K. VanLehn and Z. Niu.
Bayesian student modeling, user interfaces and feedback: A sensitivity analysis. International Journal of Artificial Intelligence in Education, 12:154-184, 2001.

[80] K. VanLehn, C. Lynch, K. Schulze, J.A. Shapiro, R. Shelby, L. Taylor, D. Treacy, A. Weinstein, and M. Wintersgill. The Andes physics tutoring system: Five years of evaluations. In AIED '05: Proceedings of the 12th International Conference on Artificial Intelligence in Education, pages 678-685, 2005.

[81] C. Webber. From errors to conceptions - an approach to student diagnosis. In ITS '04: Proceedings of the International Conference on Intelligent Tutoring Systems, 2004.

[82] S. Weibelzahl and G. Weber. Evaluating the inference mechanism of adaptive learning systems. In Peter Brusilovsky, Albert Corbett, and Fiorella de Rosis, editors, UM '03: Proceedings of the International Conference on User Modeling, pages 154-168, Berlin, 2003. Springer.

[83] D.J. Wood, J.D.M. Underwood, and P. Avis. Integrated learning systems in the classroom. Computers and Education, 33:91-108, 1999.

[84] X. Zhao. Adaptive support for student learning in educational games. Master's thesis, University of British Columbia, 2002.

[85] X. Zhou and C. Conati. Inferring user goals from personality and behavior in a causal model of user affect. In IUI '03: Proceedings of the International Conference on Intelligent User Interfaces, pages 211-218, 2003.

Appendix A

Test for original model

PRE / POST    MM / HM

Part A

Example: The factors of 10 are 2, 5, and 10 (don't include 1).

1. The factors of 2 are ____
2. The factors of 3 are ____
3. The factors of 4 are ____
4. The factors of 11 are ____
5. The factors of 15 are ____

Part B

Example: The prime factors of 10 are: (2) 3 4 (5) 7 none

6. The prime factors of 36 are: 2 3 4 5 7 none
7. The prime factors of 30 are: 2 3 4 5 7 none
8. The prime factors of 42 are: 2 3 4 5 7 none
9. The prime factors of 50 are: 2 3 4 5 7 none
10.
The prime factors of 81 are: 2 3 4 5 7 none

Part C

Example: The common factors of 12 and 18 are: (2) (3) 4 5 7 none

11. The common factors of 4 and 40 are: 2 3 4 5 7 none
12. The common factors of 15 and 11 are: 2 3 4 5 7 none
13. The common factors of 15 and 42 are: 2 3 4 5 7 none

Part D

14. Here is the factor tree of 105. [factor tree: 105 splits into 21 and 5; 21 splits into 3 and 7] What are the prime factors of 105?

15. Here is the factor tree of 50. [factor tree of 50] What are the prime factors of 50?

Appendix B

Pre-test

PRE-TEST    Student Number: ____

Factors
- circle all the factors of the number
- you may circle more than one answer
- if none of the factors are listed, circle "none"

Example: The factors of 50 are (2) 3 (5) 7 11 none. Because 2x25 = 50 and 5x10 = 50.
Example: The factors of 7 are 2 3 5 (7) 11 none. Because 7x1 = 7.

1) The factors of 15 are 2 3 5 7 11 none
2) The factors of 30 are 2 3 5 7 11 none
3) The factors of 25 are 2 3 5 7 11 none
4) The factors of 42 are 2 3 5 7 11 none
5) The factors of 14 are 2 3 5 7 11 none
6) The factors of 49 are 2 3 5 7 11 none
7) The factors of 9 are 2 3 5 7 11 none
8) The factors of 27 are 2 3 5 7 11 none
9) The factors of 11 are 2 3 5 7 11 none
10) The factors of 33 are 2 3 5 7 11 none
11) The factors of 31 are 2 3 5 7 11 none
12) The factors of 36 are 2 3 5 7 11 none
13) The factors of 81 are 2 3 5 7 11 none
14) The factors of 97 are 2 3 5 7 11 none
15) The factors of 89 are 2 3 5 7 11 none
16) The factors of 88 are 2 3 5 7 11 none

Common Factors
- circle all the common factors of the two numbers
- you may circle more than one answer
- if these numbers have no common factors, circle "none"

Example: 12 and 18 share (2) (3) 5 7 11 none as common factors. Because 2 is a factor of 12 and 18, and 3 is a factor of 12 and 18.

1) 15 and 30 share ____ as common factors. 2 3 5 7 11 none
2) 25 and 42 share ____ as common factors. 2 3 5 7 11 none
3) 14 and 49 share ____ as common factors. 2 3 5 7 11 none
4) 9 and 27 share ____ as common factors. 2 3 5 7 11 none
5) 11 and 33 share ____ as common factors. 2 3 5 7 11 none

Appendix C

Post-test for agent conditions

POST-TEST WITH AGENT    Student Number: ____

Questions - Please circle the answer that best suits you (1 = Strongly disagree, 5 = Strongly agree).

I think the agent Merlin was helpful in the game. 1 2 3 4 5
I think the agent Merlin understands when I need help. 1 2 3 4 5
The agent Merlin helped me play the game better. 1 2 3 4 5
The agent Merlin helped me learn number factorization. 1 2 3 4 5
The agent Merlin intervened too often. 1 2 3 4 5
The agent Merlin did not intervene enough. 1 2 3 4 5
I liked the agent Merlin. 1 2 3 4 5

Factors
- circle all the factors of the number
- you may circle more than one answer
- if none of the factors are listed, circle "none"

Example: The factors of 50 are (2) 3 (5) 7 11 none. Because 2x25 = 50 and 5x10 = 50.
Example: The factors of 7 are 2 3 5 (7) 11 none. Because 7x1 = 7.

1) The factors of 15 are 2 3 5 7 11 none
2) The factors of 30 are 2 3 5 7 11 none
3) The factors of 25 are 2 3 5 7 11 none
4) The factors of 42 are 2 3 5 7 11 none
5) The factors of 14 are 2 3 5 7 11 none
6) The factors of 49 are 2 3 5 7 11 none
7) The factors of 9 are 2 3 5 7 11 none
8) The factors of 27 are 2 3 5 7 11 none
9) The factors of 11 are 2 3 5 7 11 none
10) The factors of 33 are 2 3 5 7 11 none
11) The factors of 31 are 2 3 5 7 11 none
12) The factors of 36 are 2 3 5 7 11 none
13) The factors of 81 are 2 3 5 7 11 none
14) The factors of 97 are 2 3 5 7 11 none
15) The factors of 89 are 2 3 5 7 11 none
16) The factors of 88 are 2 3 5 7 11 none

Common Factors
- circle all the common factors of the two numbers
- you may circle more than one answer
- if these numbers have no common factors, circle "none"

Example: 12 and 18 share (2) (3) 5 7 11 none as common factors. Because 2 is a factor of 12 and 18, and 3 is a factor of 12 and 18.

1) 15 and 30 share ____ as common factors. 2 3 5 7 11 none
2) 25 and 42 share ____ as common factors. 2 3 5 7 11 none
3) 14 and 49 share ____ as common factors. 2 3 5 7 11 none
4) 9 and 27 share ____ as common factors. 2 3 5 7 11 none
5) 11 and 33 share ____ as common factors. 2 3 5 7 11 none

Appendix D

Post-test for no-agent condition

POST-TEST NO AGENT    Student Number: ____

If you play Prime Climb again, would you rather play:
[ ] With someone to help
[ ] Without other's help

Factors
- circle all the factors of the number
- you may circle more than one answer
- if none of the factors are listed, circle "none"

Example: The factors of 50 are (2) 3 (5) 7 11 none. Because 2x25 = 50 and 5x10 = 50.
Example: The factors of 7 are 2 3 5 (7) 11 none. Because 7x1 = 7.

1) The factors of 15 are 2 3 5 7 11 none
2) The factors of 30 are 2 3 5 7 11 none
3) The factors of 25 are 2 3 5 7 11 none
4) The factors of 42 are 2 3 5 7 11 none
5) The factors of 14 are 2 3 5 7 11 none
6) The factors of 49 are 2 3 5 7 11 none
7) The factors of 9 are 2 3 5 7 11 none
8) The factors of 27 are 2 3 5 7 11 none
9) The factors of 11 are 2 3 5 7 11 none
10) The factors of 33 are 2 3 5 7 11 none
11) The factors of 31 are 2 3 5 7 11 none
12) The factors of 36 are 2 3 5 7 11 none
13) The factors of 81 are 2 3 5 7 11 none
14) The factors of 97 are 2 3 5 7 11 none
15) The factors of 89 are 2 3 5 7 11 none
16) The factors of 88 are 2 3 5 7 11 none

Common Factors
- circle all the common factors of the two numbers
- you may circle more than one answer
- if these numbers have no common factors, circle "none"

Example: 12 and 18 share (2) (3) 5 7 11 none as common factors. Because 2 is a factor of 12 and 18, and 3 is a factor of 12 and 18.

1) 15 and 30 share ____ as common factors. 2 3 5 7 11 none
2) 25 and 42 share ____ as common factors. 2 3 5 7 11 none
3) 14 and 49 share ____ as common factors. 2 3 5 7 11 none
4) 9 and 27 share ____ as common factors. 2 3 5 7 11 none
5) 11 and 33 share ____ as common factors. 2 3 5 7 11 none

Appendix E

Observation sheet for no-agent condition

Student Number: ____

Observation sheet (for No Agent group)

1. Did the student try to look for help during the game?    Yes    No
2. Did the student use the magnifying glass?    No    Yes: 1 2 3 4 5
3. When did the student use the magnifying glass?    Before choosing a hex    After falling down    After agent suggestion

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051720/manifest

Comment

Related Items