UBC Theses and Dissertations
Trade-offs in data representations for learner models in interactive simulations Fratamico, Lauren 2015

Trade-offs in data representations for learner models in interactive simulations

by

Lauren Fratamico

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate and Postdoctoral Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

October 2015

© Lauren Fratamico, 2015

Abstract

Interactive simulations can foster student-driven, exploratory learning. However, students may not always learn effectively in these unstructured environments, so it would be advantageous to provide adaptive support to those who are not using the learning environment effectively. To achieve this, it is helpful to build a user model that can estimate the learner's trajectories and need for help during interaction. This is challenging, however, because it is hard to know a priori which behaviors are conducive to learning. It is particularly challenging in complex Exploratory Learning Environments (such as PhET's DC Circuit Construction Kit, used in this work) because of the large variety of ways to interact. To address this problem, we evaluate multiple representations of student interactions with the simulation that capture different amounts of granularity and feature engineering. We then apply the student modeling framework proposed in [1] to mine the student behaviors and classify learners. Our results indicate that the proposed framework extends to a more complex environment: we are able to successfully classify students and identify behaviors intuitively associated with high and low learning. We also discuss the trade-offs between the differing levels of granularity and feature engineering in the tested interaction representations in terms of their ability to evaluate learning and inform feedback.

[1] Samad Kardan and Cristina Conati. 2011. A Framework for Capturing Distinguishing User Interaction Behaviours in Novel Interfaces. Proceedings of the 4th International Conference on Educational Data Mining, 159–168.

Preface

This dissertation is an analysis of data collected during the user study described in Section 4 and is part of the larger PhET research project. A version of the research described in this thesis has been published as Conati, C., Fratamico, L., Kardan, S., & Roll, I. (2015). Comparing Representations for Learner Models in Interactive Simulations. In Proceedings of AIED 2015, 17th International Conference on Artificial Intelligence in Education, Springer.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Thesis Goals and Approach
  1.2 Contributions
  1.3 Outline
2 Related Work
3 The CCK Simulation
4 User Study
5 User Modeling Framework
6 Representing the User Actions
  6.1 Structured Representation of Action Events
  6.2 Generating Feature Sets for the Student-Modeling Framework
7 Evaluating Representations for Assessment and Support
  7.1 Quality of the Clusters
  7.2 Classification Accuracy
    7.2.1 Classification Accuracy at End of Interaction
    7.2.2 Classification Accuracy Over Time
  7.3 Usefulness for Providing Adaptive Support
8 Discussion
9 Conclusion and Future Work
References

List of Tables

Table 1. Summary of all families and the actions that comprise them.
Table 2. Summary of all feature sets explored. Bolded ones are those analyzed further in this thesis.
Table 3. Summary statistics for the clustering results.
Table 4. Classifier accuracy measures for different feature sets. Baseline is the accuracy of the most likely classifier.
Table 5. Confusion matrix for the FOAC_f feature set. Percentages displayed are percent of students in that category out of all students.
Table 6. Average accuracies for each classifier and its baseline.
Table 7. Classifier accuracy over time for each of our three feature sets. Bolded values highlight the time slices where the classifier is significantly beating the baseline.
Table 8. Number of distinct patterns for each of the feature sets. This table shows the breakdown of patterns coming from high learners and low learners.
Table 9. Sample patterns for each feature set (raw form and English description).

List of Figures

Figure 1. The DC Circuit Construction Kit (CCK) testbed. The left image shows a voltmeter testing the voltage difference across a light bulb. The right image shows the increased brightness of the light bulb and resulting fire after increasing the voltage of the battery.
Figure 2. The user modeling framework used in this thesis, highlighting the two main phases of Behavior Discovery and User Classification. The input is user interaction data and the output is the label of the new users interacting with the system.
Figure 3. The four layers of action-events: Outcome, Family, Action, and Component, along with the elements they contain and the frequency of each.
Figure 4. Over-time performance of each of the three feature sets' classifiers.
Figure 5. Over-time performance at classifying low learners for each of the three feature sets.

Acknowledgements

I owe enormous thanks to my supervisors Cristina Conati and Ido Roll for their guidance in my research analyses and for giving me many opportunities to publish my work. I would also like to thank Samad Kardan for his work in the initial processing of the PhET log files and for his continued advice after I took over the project. This research was supported in part by the Social Sciences and Humanities Research Council of Canada and the Moore Foundation.

1 Introduction

Interactive simulations are educational tools that allow students to engage in inquiry-based, exploratory learning by enabling them to experiment with concepts and processes that they may not yet have formally learned. Such learning might otherwise be limited to abstract concepts, or to real-world environments that require expensive setups to explore [10,11]. Due to the unstructured nature of these exploratory environments, students are not always able to learn effectively [1,16,19]: these environments often offer a wide variety of actions to explore and lack well-defined correct answers, which makes learning more complex and makes it hard for students to know whether they are learning correctly [25]. For students who are not using the simulations effectively, it would be beneficial to provide support that guides their interactions or promotes more effective behaviors.
There is increasing research in Intelligent Tutoring Systems (ITS) on endowing interactive simulations, and other types of Exploratory Learning Environments (ELEs from now on), with the ability to provide student-adaptive support to students who may not be learning effectively from these open-ended activities [2,6,8,10]. To provide this support, it is essential to build a user model that can capture the needs of the learners, provide help during interaction when the learner needs it, and provide the proper kind of help. Building models for unstructured environments like ELEs is challenging because it is hard to know a priori what behaviors characterize optimal learning.

Some previous work has dealt with this challenge by limiting the exploratory nature of the interaction [7,33]. In contrast, Kardan and Conati [13] proposed a student modeling framework that does not require the user's interactions to be restricted. This framework uses the logged actions of students working with an ELE to learn which behaviors should trigger help during interaction. Clustering is first used to identify groups of students who behave and learn similarly from the interaction. Association rule mining is then applied to extract the distinguishing interaction behaviors of each group of clustered students. These behaviors are used to classify new users of the system, trigger real-time adaptive interventions when needed, and form the basis for the hints provided during an adaptive intervention. This student modeling framework was successfully applied to provide adaptive support in the CSP applet, an ELE for a constraint satisfaction algorithm [12]. The part of the CSP applet used in [12] involves only a limited number of actions; thus, in [12], it was sufficient to use the raw student actions to represent behavior.
This simple representation did not scale up when we tried to apply the framework to the more complex simulation evaluated in this work. The PhET DC Circuit Construction Kit (CCK), seen in Figure 1, provides over a hundred types of interactions for users to explore concepts related to electricity, so a richer representation was needed. In [15], Kardan et al. proposed a multi-layer representation of action-events that includes information on individual actions (e.g., join), the components manipulated during those actions (e.g., light bulbs), the relevant family of actions (e.g., revise), and the observed outcome (e.g., changes to light intensity, or whether a fire was started). We also showed that clustering interaction behaviors based on this representation succeeds in identifying students with different learning outcomes in CCK.

1.1 Thesis Goals and Approach

In this thesis, we compare the multi-layered representations proposed in [15]. We also add representations that involve varied amounts of feature engineering; all features still capture user behaviors, but at different levels of specificity. Adding these representations to our investigation allows for a better understanding of how different amounts of feature engineering contribute to a data representation's ability to capture learner interactions in a complex environment. We also provide a comprehensive evaluation of these representations as the basis for applying the student modeling framework proposed in [13] to CCK. We define the evaluation in terms of: ability to identify learners with high or low learning gains; suitability for user modeling, i.e., classifying new students in terms of their learning performance as they interact with CCK; and potential to inform the content of adaptive support delivered during interaction.
We used these evaluation dimensions to compare alternative representations derived from the multi-layer structure in [15] and from our additional representations, which capture different aspects of interaction behaviors at different levels of granularity.

1.2 Contributions

The main contribution of this work is showing that the user modeling framework first introduced in [13] works in a complex ELE such as CCK. As desired, the framework is able to separate students based on their learning, achieve a high classification accuracy (both at the end of interaction and over time), and identify behaviors from the association rules that can be leveraged to design interactive support. This provides evidence that the framework may generalize further across representations. A secondary contribution is the discussion of trade-offs between the representations, both in terms of the impact of the various amounts of feature engineering used to construct their features and in terms of the evaluation dimensions that need to be considered when choosing the most suitable representation for assessing and supporting students during interaction.

1.3 Outline

The rest of this thesis is organized as follows. We first discuss related work in Section 2. Then, we describe the CCK simulation (Section 3) and the study used for collecting data (Section 4). Next, we summarize the user modeling approach that we used (Section 5) and present the different representations we evaluated and how the features within them were formed (Section 6). We then describe how we evaluate these representations (Section 7) based on pieces of the user modeling framework and present the evaluation results (Subsections 7.1 through 7.3). We conclude with a general discussion of the trade-offs between our representations, contributions, limitations, and future work (Sections 8 and 9).
2 Related Work

Many studies have shown the benefits of providing adaptive support in learning environments. Stamper et al. [31] compared the usage of two logic tutors, one with adaptive hints and one without, and found that students using the adaptive tutor completed significantly more logic problems than those in the non-adaptive condition. Others have also evaluated adaptive vs. non-adaptive systems (Najar et al. [21] in the context of an SQL tutor and Salden et al. [28] in the context of a physics ELE) and similarly found that learners in the adaptive condition learned significantly more than their peers. As previously mentioned, Kardan and Conati [12] have provided evidence of the effectiveness of adaptive support in an ELE; the framework they used to provide the content of the adaptive support and to determine when to intervene is the same framework used in this thesis. Based on these findings, we believe that our long-term goal of providing adaptive support to learners as they work with the simulation will indeed benefit them.

This thesis builds on our paper accepted to the 2015 AIED conference [4]. In that paper, we built a student model that used clustering to form groups of learners, mined the rules that define each of those groups (based on a method described in [15]), and used the rules to classify new users of the simulation. The behavior discovery, student modeling, and adaptive feedback portions of this approach each have challenges that have been addressed in different ways by other researchers. Clustering is a common approach in the field of Educational Data Mining for discovering groups of similar users. For example, Perera et al. [24] applied this strategy to find teams with strong or weak collaboration skills while working in a collaborative programming environment. Shih et al.
[30] used clustering to discover different student learning tactics in a geometry tutor. Romero et al. [6] applied clustering and pattern mining to discover how students use a web-based course. Others have used clustering to fit parameter settings for models in problem solving activities [8,23,32]: clustering is used to group users, and a model that assesses student knowledge is then fit to each group. Our work applies a similar strategy (k-means clustering) to group the different types of learners.

Two common strategies for modeling users in more open-ended exploratory interactions, like the one used in this work, are to rely on: 1) expert knowledge to identify behaviors, or 2) data mining to identify suitable feedback strategies. EXpresser, a simulation environment for learning algebra, provides feedback based on a set of strategies defined by experts [22]. In contrast, our work aims to learn the feedback strategies from data, as was done in [12]. A data-driven approach was also used in [17] to provide scaffolding to students using Betty's Brain, an environment designed to foster self-regulated learning by having students teach the relevant aspects of this meta-cognitive ability to an artificial student. One difference with our approach is that [17] relies on knowing a priori which students learned from Betty's Brain in order to group the students and mine the relevant feedback strategies of those groups. In contrast, our work groups learners based only on interaction data. The work in [29] relies on hand-coding log files to train the student models. In contrast, the approach successfully evaluated in [8], and adopted in this research, groups learners via clustering on their interaction behaviors alone (with little processing), without using additional information (such as test scores). Eagle et al. [5] also start from log interaction data with the goal of understanding how students learned in an interactive environment.
They defined Approach Maps to mine the different behaviors of two groups of students as they explore possible solution spaces in an ELE aimed at teaching propositional logic concepts. This approach relies on building a graph of the paths students took to solve the logic problem. Such a solution is infeasible for our interaction data because of the large number of interactions available and configurations possible in the CCK simulation.

Many of the challenges related to providing adaptive feedback in interactive simulations concern determining how to help a student once it has been identified that help is needed. One common approach is to limit the exploratory nature of the interaction, often in conjunction with providing hints that follow a procedural fashion. For instance, the simulations developed by Hussain et al. [9] provide feedback on how to behave in pre-defined cultural and language-related scenarios with a clear definition of correct answers and behaviors. The Chemistry VLab [2], on the other hand, is an ELE that does not limit the interaction, allowing for exploration of chemical reactions; it does, however, only provide help on the well-defined steps required to run a scientific experiment. Another type of support given to students in ELEs is Cognitive Tools. Cognitive Tools, such as hypothesis builders, aid students by helping them structure their learning process. They limit (and pre-label) the type of interaction, making it easier to interpret and evaluate by reducing the solution space. An example is Science ASSISTments [7], which scaffolds the inquiry process using cognitive tools so that it becomes more linear and tractable. This may hinder learning by limiting the exploratory nature of the simulation, so solutions like these are not optimal. In addition, PhET simulations attempt to stay close to an authentic inquiry environment, and thus prefer not to limit the exploratory nature of the environment.
In this work, we aim to support students without limiting their ability to explore freely in the learning environment, seeking a balance between guidance and exploration.

3 The CCK Simulation

The CCK simulation is part of PhET [34], a freely available and widely used suite of simulations on different science and math topics. Simulations in the PhET family are used over 45 million times a year, and CCK is the most popular among them. CCK includes 124 different types of interactions that allow learners to build and test DC circuits by connecting different components, including wires, light bulbs, resistors, batteries, and measurement instruments (see the components in the left image of Figure 1). The available interactions include adding, moving, joining, splitting, and removing components, as well as changing the attributes of components (such as voltage and resistance). Additional interactions relate to the interface (such as enabling different components) or the simulation itself (such as resetting the simulation). CCK provides animated responses reflecting the state of the circuits on the testbed. For example, when students add light bulbs to their circuit, as can be seen in Figure 1, changes in the current through the light bulb affect the amount of illumination. In addition, the magnitude of the current is visualized by the speed of the "bubbles" of electrons (shown by the blue dots inside the wires): a faster speed implies that more current is flowing through that wire. If the current through a part of the circuit gets too high, the electron bubbles become very fast and a fire starts. Notably, CCK is a tool, not an activity; instructors can use CCK with a variety of activities, most of which are given on paper outside the simulated environment. Our long-term goal is to help all students make optimal use of CCK by providing adaptive interventions when we detect suboptimal behavior.
To achieve that, the system should be able to assess the effectiveness of students' behaviors and provide explicit support to foster learning. While we do not know for certain that some learners using CCK are behaving suboptimally, we do know that some learners achieved a larger learning gain from working with the simulation than others, which may imply that those learning more are using the simulation in better ways. The system should be able to do three things: 1) determine the behaviors in the simulation that are conducive to learning, 2) classify new students working with the simulation, so that we can target those who are performing suboptimally, and 3) adaptively suggest behaviors that are more conducive to learning, so that we can help those performing suboptimally as soon as they are identified. While support may eventually be productive across activities and simulations, we first address just one typical activity in the CCK simulation: understanding how resistors work.

Figure 1. The DC Circuit Construction Kit (CCK) testbed. The left image shows a voltmeter testing the voltage difference across a light bulb. The right image shows the increased brightness of the light bulb and resulting fire after increasing the voltage of the battery.

4 User Study

Data used in this research were collected from 100 first-year physics students who participated in a laboratory user study described in [14,25]. The goal of this study was to understand how students use PhET simulations, both in terms of the mindset with which they approach the simulation (e.g., are they using it to build and test hypotheses, or just to memorize information about circuits) and in terms of the individual actions they carry out as they use it.
Students were given two activities that involved use of the simulation, a pre- and post-test to assess learning gains, and a survey that included background information about previous physics courses taken, self-reported strategies about how they typically use simulations like this one, and attitudinal factors about how successful they feel using these types of simulations and why [26]. The survey was used to assess the mindsets with which students approached the simulation. Interaction log data collected while the students completed two 25-minute activities were used to assess the individual actions taken while using the simulation.

The first activity, on the topic of light bulbs, had two different conditions of external scaffolding: half of the students were assigned to a high scaffolding condition and the other half to a low scaffolding condition. Those in the low scaffolding condition received only two high-level pieces of guidance: i) the general learning goal "investigate how light bulbs affect the behaviors of circuits" and ii) a general recommendation to explore several light bulbs on the same loop, on different loops, and a combination of the two. Students in the high scaffolding condition received the same learning goal and recommendation but, in addition, were given i) diagrams instructing them which circuits to build; ii) tables asking them to document the parameters of the different circuits; and iii) guiding questions asking them to reflect on and contrast the different circuits. The low scaffolding condition was modeled after the activities recommended for this environment by the PhET project team: activities that define specific learning goals but give only minimal guidance and instructions on how to make use of the simulation.
The second activity given to students in the study, on the topic of resistors, was identical for all learners: everyone received low scaffolding that included a learning goal, to "investigate how resistors affect the behaviors of circuits", and three guiding recommendations: to investigate what happens to the current and voltage when resistors with different resistances are used, to investigate circuits that include multiple resistors with different resistances in a variety of arrangements, and to explore the properties of different combinations of resistors with the same resistance. The students were expected to use CCK to help them explore the learning goal and these guiding recommendations.

In this thesis, we focus only on data from the second activity because it allows us to observe how all students in our study use the CCK tool when they are not guided by strict instructions or scaffolding, that is, in a more exploratory and self-guided manner. In addition, as this was their second activity with the simulation, all students were familiar with the CCK functionalities. Students were assessed on their conceptual knowledge of circuits before and after the activities, with the pre-test being a subset of the post-test. Notably, students who were in the high scaffolding condition in the first activity may have been primed in the second activity by having seen diagrams and tables that the other half of the students did not see. This allows us to test our model in an environment with a variety of prior experiences and exposures, as is often the case in educational settings. Indeed, our analysis was still able to pick out patterns of student behaviors that were related to learning gains and did not depend on the condition in activity 1. Students with perfect pre-test scores were not included in our dataset; this removed 3 students. All other students scored below 90% on the pre-test.
One additional student was removed because of logging errors caused by a malfunctioning computer. This left us with 96 students.

5 User Modeling Framework

As mentioned in the introduction, we aim to determine whether the user modeling framework for ELEs first proposed in [13] can be extended to a more complex environment. The framework consists of two main phases: Behavior Discovery and User Classification (Figure 2). This section provides a high-level description of the phases; a more complete description can be found in [13].

Figure 2. The user modeling framework used in this thesis, highlighting the two main phases of Behavior Discovery and User Classification. The input is user interaction data and the output is the label of the new users interacting with the system.

In Behavior Discovery, each user's interaction data are first pre-processed into feature vectors, as described in Section 6. Students are then clustered using these vectors so that we can identify users with similar interaction behaviors. k-means clustering was used for this, with k chosen automatically by optimizing three measures of clustering validity: C-index, Calinski-Harabasz [20], and silhouettes [27]. Because we want to target students who are underperforming, we need clusters that identify groups of students with statistically different learning outcomes. A one-way ANCOVA is conducted to determine whether there is a statistically significant difference between clusters in post-test score, controlling for pre-test. For the representations that show significant learning differences between clusters, we then identify the distinctive interaction behaviors in each cluster via association rule mining.
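Stepping back to the clustering step above, the k-selection procedure can be illustrated with a small sketch that picks k for k-means by maximizing the mean silhouette score on toy 2-D data. This is a simplified, hypothetical stand-in for the real feature vectors, and it uses only one of the three validity measures the framework combines:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, k, iters=25, seed=0):
    """Plain k-means: returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette score: (b - a) / max(a, b), averaged over points."""
    if len(set(labels)) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = [q for q, l in zip(points, labels) if l == lab and q is not p]
        a = sum(dist(p, q) for q in own) / len(own) if own else 0.0
        b = min(sum(dist(p, q) for q in grp) / len(grp)
                for c in set(labels) if c != lab
                for grp in [[q for q, l in zip(points, labels) if l == c]])
        total += (b - a) / max(a, b)
    return total / len(points)

# Two well-separated blobs standing in for the real feature vectors.
rng = random.Random(1)
points = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)]
          + [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(30)])
best_k = max(range(2, 6), key=lambda k: silhouette(points, kmeans(points, k)))
print(best_k)  # expected: 2, since the data form two clear clusters
```

In practice, a library implementation (e.g., scikit-learn's KMeans and silhouette_score) would replace this sketch, and the C-index and Calinski-Harabasz criteria would be evaluated alongside the silhouette before fixing k.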
Association rule mining extracts the common behavior patterns as class association rules of the form X → c, where X is a set of feature-value pairs and c is the predicted class label for the data points where X applies. For example, a pattern for low learners could be that they test with the voltmeter with frequency less than 0.075, meaning that, of all the actions they have performed, less than 7.5% have been tests with the voltmeter. During the association rule mining process, the values of features are discretized into bins [13]. In the case of the last example, the feature is divided into two bins, labeled lowest and highest, where the lowest bin (0 to 0.075) is assigned to low learners and the highest bin (0.075 to 1) is assigned to high learners.

In User Classification, the labeled clusters and the corresponding association rules extracted in Behavior Discovery are used to train a classifier student model. As new users interact with the system, they are classified by this rule-based classifier in real time into one of the identified clusters, based on a membership score that summarizes how well the user's behaviors match the association rules of each cluster. If they are showing more behaviors associated with one group than another, they are assigned to the former group. Thus, in addition to classifying students in terms of learning, this phase returns the specific association rules describing the learner's behaviors that caused the classification. These behaviors can then be used to trigger real-time interventions designed to encourage productive behaviors and discourage detrimental ones, as described in [11].

6 Representing the User Actions

6.1 Structured Representation of Action Events

As illustrated in Figure 2 and elaborated on in the previous section, clustering students based on their actions is the first step of our user modeling framework.
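The rule-based classification just described can be sketched as follows. The rules, feature names, and bin boundaries below are hypothetical illustrations in the spirit of the voltmeter example, and the membership score is simplified to a count of matched rules rather than the framework's exact scoring:

```python
# Each rule maps a set of (feature, bin) conditions to a cluster label.
RULES = [
    # hypothetical rules echoing the voltmeter example in the text
    ({"test_voltmeter_freq": "lowest"}, "low_learner"),
    ({"test_voltmeter_freq": "highest"}, "high_learner"),
    ({"revise_freq": "highest"}, "high_learner"),
]

BINS = {  # bin boundaries produced during rule mining (illustrative values)
    "test_voltmeter_freq": 0.075,
    "revise_freq": 0.20,
}

def discretize(features):
    """Map raw action frequencies into the mined lowest/highest bins."""
    return {f: ("highest" if v >= BINS[f] else "lowest")
            for f, v in features.items() if f in BINS}

def classify(features):
    """Assign the label whose rules the user's behavior matches best."""
    binned = discretize(features)
    scores, matched = {}, {}
    for conditions, label in RULES:
        if conditions and all(binned.get(f) == b for f, b in conditions.items()):
            scores[label] = scores.get(label, 0) + 1
            matched.setdefault(label, []).append(conditions)
    if not scores:
        return None, []
    best = max(scores, key=scores.get)
    # The matched rules explain *why* the user was classified this way,
    # which is what can drive the content of adaptive hints.
    return best, matched[best]

label, why = classify({"test_voltmeter_freq": 0.02, "revise_freq": 0.05})
print(label)  # low_learner: voltmeter testing frequency fell in the lowest bin
```

Returning the matched rules alongside the label mirrors the framework's key property: the classification is explainable, so each intervention can target the specific behavior that triggered it.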
This step requires a representation that captures important aspects of these actions. CCK logs three pieces of interaction information: the type of action, the component used, and the response of the physical model. However, within CCK, outcomes of actions depend on their context. For example, connecting a wire leads to different outcomes based on the state of the circuit (e.g., current will flow through the wire if the wire is part of a live circuit, but won’t flow otherwise). In addition, actions on one component often affect other components. For instance, changes to batteries can affect both existing light bulbs and the readings on attached measuring equipment. Last, the interpretation of an action depends on its context. For example, a user who connects one probe of a voltmeter, a testing instrument that measures voltage, is actively testing only if the other probe is also connected.  As described in [15], we created a structured representation that can capture these action-events, i.e., user actions and their relevant contextual information, at different levels of granularity. This representation contains four layers, shown in Figure 3 along with the elements they contain and their frequency of occurrence in our dataset. The differing levels of granularity are achieved by using multiple combinations of these four layers, as each layer captures a different level of information about the interaction with the simulation. The “Actions” layer describes the action that students took (e.g., add) and includes 25 different actions. “Components” describes the component manipulated by the action taken (e.g., wire) and includes 22 possible components. “Outcomes” captures what happens in the circuit after an action is performed.
There are 6 types of outcomes, including: None, Deliberate-measure (the value displayed on a measurement device is updated as a result of using it), Current-change (a change in the current occurred, reflected in the speed of movement of electrons), and Light-intensity-change (the brightness of a light bulb changes). It should be noted that an action-event may cause more than one outcome. Also, outcomes had to be associated with the relevant action-event during post-processing, as they are logged as independent system-events. For instance, we had to associate a “light intensity change” outcome with the user action that occurred just before it, possibly the joining of a wire. Our last layer, “Family”, denotes one of 8 general types of action. While Actions, Components, and Outcomes are logged directly by the simulation, the Family layer was defined by the researchers on this project. Its purpose was to abstract the Action and Component layers. For example, one of the families is Build, which captures all building actions done before the circuit is live (e.g., adding wires, joining light bulbs, removing resistors). The Family layer was defined via extensive conversation among the researchers on this project about which events are important to abstract to this level; we hoped that adding the Family layer would allow us to capture whether the specific actions used mattered or whether just the type of action being performed was enough to distinguish the learners. Other families include: Test (describes active measurements of the circuit using the measurement instruments), Organize (describes actions that re-arrange circuit components without making any structural changes, and thus have no outcomes), and Revise (describes all build actions that take place on an already live circuit, before the user resets the simulation). The actions making up each family can be seen in Table 1. The same action can belong to a different family depending on the context of the action.
For example, “join” could be in the Build family if a student is first building his circuit and in the Revise family if he is modifying an existing circuit. As an additional family of actions, we captured “Pauses”. We chose to add the Pause family to abstract the time information – perhaps it is not important that certain action-events took certain amounts of time, but instead that, as a whole, students are spending different amounts of time pausing to plan or reflect on their actions. Pause was defined as inactivity for longer than 15 seconds. We chose this threshold because, when plotting the frequency of pause lengths, 15 seconds marked the beginning of the long tail. That is, around this value, the rate of pauses is fairly insensitive to the specific parameter. This structured representation adds contextual information to the data. For example, the action-event current_change.revise.join.wire describes joining (Action) a wire (Component) that led to a current change (Outcome) when revising a circuit (Family). If a light bulb was also connected properly to the circuit, the action-event light_intensity.revise.join.wire would also occur at the same time, describing a second outcome of joining the wire. Figure 3 shows the frequency of each element in the action-event representation.

Figure 3. The four layers of action-events: Outcome, Family, Action, and Component, along with the elements they contain and the frequency of each.

While in [14] all 4 layers of the structure were used to represent action-events, subsets of the layers can represent events at different levels of granularity. The different combinations of layers also give rise to feature sets with different amounts of feature engineering. As mentioned, the Family layer was defined via extensive discussion among the authors, making it a completely engineered level. The Action and Component layers, on the other hand, require minimal feature engineering, as they are present in the logs.
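Assembling dotted action-event labels from the logged Action and Component information, with contextual Family derivation and Pause insertion, can be sketched as follows. The log field names, and the toy rule for deciding Build vs. Revise, are assumptions for illustration; the thesis's actual post-processing tracked richer circuit state:

```python
# Sketch: turn raw log entries into outcome.family.action.component labels
# and insert "pause" events for inactivity over 15 seconds.
PAUSE_THRESHOLD = 15.0  # seconds; where the pause-length long tail begins

def derive_family(action, circuit_live):
    """Toy context rule: build-type actions on a live circuit are Revise."""
    if action in {"startMeasure", "traceMeasure", "endMeasure"}:
        return "test"
    if action == "organizeWorkspace":
        return "organize"
    return "revise" if circuit_live else "build"

def to_action_events(log):
    """log: list of dicts with keys time, action, component, outcomes, live."""
    events, prev_t = [], None
    for entry in log:
        if prev_t is not None and entry["time"] - prev_t > PAUSE_THRESHOLD:
            events.append("pause")
        family = derive_family(entry["action"], entry["live"])
        # An action with several outcomes yields several action-events.
        for outcome in entry["outcomes"] or ["none"]:
            events.append(
                f"{outcome}.{family}.{entry['action']}.{entry['component']}")
        prev_t = entry["time"]
    return events
```

Joining a wire on a live circuit with a lit bulb thus yields both a current_change.revise.join.wire and a light_intensity.revise.join.wire event, as in the example above.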
Changes that were applied to these layers included the addition of two actions, joinX and traceMeasure, to describe creating 3-way junctions and tracing measurements along wires, respectively. When joining 2 or more components, “join” was logged as the action. We wanted to be able to distinguish the joining of just 2 components to create a single loop from the joining of more than 2 components to create more than one loop, and thus more complex circuits. We introduced joinX to serve this purpose, relabeling as joinX all join actions that involved more than 2 components. traceMeasure was introduced to allow us to distinguish different types of testing. Students either used their testing instruments to examine one location on the circuit at a time or moved the testing instrument over many parts of the circuit in quick succession (“tracing” over many components). The latter action was defined as traceMeasure. The Outcome layer required more modifications, such as adding the additional outcomes reading updated, deliberate measure, and light intensity. The only outcomes logged in the log files were current change, fire started, and whether there was a change in the reading on a testing instrument. We defined light intensity as also occurring when there was a current change and a live light bulb in the circuit. After the two activities were completed in the lab study, we had a few students walk through the actions they took in the simulation and their thought process behind them. Many alluded to observing the light bulbs to gauge changes to their circuits. Because of this, we felt it important to capture this outcome. An additional modification we made was to create a distinction between testing that occurred when a testing instrument was actively being used and testing when another component was actively being used but a testing instrument was connected to the circuit.
We defined deliberate measure (actively using a testing instrument) and reading updated (passively using a testing instrument) to distinguish these two types of testing. We also associated outcomes with the actions that caused them, as described above.

Build: add, changeResistance, changeVoltage, join, joinX, reverse, sliderEndDrag, switch
Revise: add, changeResistance, changeVoltage, join, joinX, remove, reverse, sliderEndDrag, split, switch
Extra: add, changeResistance, join, joinX, moreVoltsOption, organizeWorkspace, reverse, sliderEndDrag
Organize: organizeWorkspace
Test: endMeasure, playPause, startMeasure, traceMeasure
Interface: deiconified, disableComponent, enableComponent, exitSim, help, iconified, view
Reset: reset
Pause: pause

Table 1. Summary of all families and the abstracted actions that comprise them.

6.2 Generating Feature Sets for the Student-Modeling Framework

Each representation at the different levels of granularity can be used to generate different feature sets based on the types of information used to summarize the action-events for each user. These measures include frequency of the action-event, i.e., the proportion of each type of action-event over total action-events, as well as timing information, specifically the mean and standard deviation of the time spent before each action-event. We chose to use the time between the last action and the present action as the time for an action. In this way, we captured the time it took for the student to plan and carry out each action. When students took pauses, we kept their average time for that action, and treated the rest of the inactive period as a pause. For example, if a student took 21 seconds to attach a certain wire (longer than 15 seconds, hence including a pause), and their mean time before attaching wires is 2 seconds, then we relabeled 19 seconds as “pause” and kept 2 seconds (their mean) for attaching the wire.
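The feature computation just described can be sketched as follows. Event names are illustrative, and the exact definition of the per-action mean used for pause relabeling (here, the mean over sub-threshold instances) is an assumption:

```python
# Sketch: turn a user's (event, seconds-before-event) stream into
# frequency / mean-time / std-dev features, splitting long waits into
# (event at its mean time) + pause, as in the 21 s = 2 s + 19 s example.
from collections import defaultdict
from statistics import mean, pstdev

PAUSE_THRESHOLD = 15.0

def feature_vector(timed_events):
    """timed_events: list of (event_name, seconds_before_event)."""
    times = defaultdict(list)
    for name, t in timed_events:
        times[name].append(t)
    # Assumed: per-event mean over the non-pause (sub-threshold) instances.
    means = {name: mean([t for t in ts if t <= PAUSE_THRESHOLD] or ts)
             for name, ts in times.items()}
    final = defaultdict(list)
    for name, t in timed_events:
        if t > PAUSE_THRESHOLD:
            final["pause"].append(t - means[name])  # excess becomes a pause
            final[name].append(means[name])         # event keeps its mean time
        else:
            final[name].append(t)
    total = sum(len(ts) for ts in final.values())
    features = {}
    for name, ts in final.items():
        features[f"{name}_freq"] = len(ts) / total
        features[f"{name}_mean"] = mean(ts)
        features[f"{name}_sd"] = pstdev(ts)
    return features
```

For instance, three wire attachments taking 1 s, 3 s, and 21 s yield a mean of 2 s, so the 21 s instance is split into a 2 s attachment plus a 19 s pause.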
In [15], we described the performance of a feature set built on all 4 layers in Figure 3 and on all three summative measures when used to cluster students who learn similarly with CCK. Here, we generated feature sets that use different subsets of layers in the action-event structure, in order to investigate the effect of representation granularity both on generating meaningful clusters and on building effective user models and informing feedback that can improve student learning from CCK, as in [12]. For each representation, we also experimented with using only frequencies vs. adding time-related summative measures, for a total of 22 different feature sets. All feature sets investigated can be seen in Table 2. We also generated feature sets that do not have any feature engineering and use only information present in the log files. This feature set was similar to the one that only included the Outcome, Action, and Component layers in that it also did not include the Family layer, but the feature-engineered actions and outcomes were also removed, so that the action-events present are only the ones that exist in the log files. In addition, this feature set no longer includes the “pause” actions, as the authors defined the length of those. For this feature set, we also experimented with using only frequencies vs. adding time-related summative measures. This gave us an additional 2 feature sets.

Event Type | Statistical Measures | Engineered | Number of Features
Action | Frequency | Minimally | 25
Action | Frequency, Mean Time, Std. Dev. Time | Minimally | 75
Family | Frequency | Yes | 8
Family | Frequency, Mean Time, Std. Dev. Time | Yes | 24
Outcome | Frequency | Minimally | 6
Outcome | Frequency, Mean Time, Std. Dev. Time | Minimally | 18
Action, Component | Frequency | Minimally | 76
Action, Component | Frequency, Mean Time, Std. Dev. Time | Minimally | 228
Family, Component | Frequency | Yes | 44
Family, Component | Frequency, Mean Time, Std. Dev. Time | Yes | 132
Outcome, Family | Frequency | Yes | 21
Outcome, Family | Frequency, Mean Time, Std. Dev. Time | Yes | 63
Family, Outcome, Component | Frequency | Yes | 99
Family, Outcome, Component | Frequency, Mean Time, Std. Dev. Time | Yes | 297
Family, Action, Component | Frequency | Yes | 99
Family, Action, Component | Frequency, Mean Time, Std. Dev. Time | Yes | 297
Outcome, Action, Component | Frequency | Minimally | 207
Outcome, Action, Component | Frequency, Mean Time, Std. Dev. Time | Minimally | 621
Outcome, Family, Action | Frequency | Yes | 102
Outcome, Family, Action | Frequency, Mean Time, Std. Dev. Time | Yes | 306
Outcome, Family, Action, Component | Frequency | Yes | 211
Outcome, Family, Action, Component | Frequency, Mean Time, Std. Dev. Time | Yes | 633
Outcome, Action, Component | Frequency | No | 203
Outcome, Action, Component | Frequency, Mean Time, Std. Dev. Time | No | 609
Family Blocks | Frequency, Mean Duration, Mean Number of Actions, Outcome Qualifiers | Yes | 60

Table 2. Summary of all feature sets explored. Bolded ones are those analyzed further in this thesis.

Finally, we tested another representation, Blocks, that goes beyond actions as units of operation by grouping consecutive actions of the same type (similar to families) into entities called blocks. We have 6 different types of blocks, including Test (all actions related to using measurement devices), Construct (any action that changes the circuit before testing), Modify (any action that changes the circuit after testing it), and Reset (removing the whole circuit). Initially we tried clustering solely on basic summative features for each block, namely frequency, average duration, and average number of actions within, and no significant results were found. We thus added qualifiers to the blocks to add a finer-grained layer. These included specific features about each outcome within the block (for instance, frequency of light-intensity-changes within a construct block).
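Grouping consecutive actions into blocks with outcome qualifiers can be sketched as follows. The mapping of actions to block types, and the rule that circuit changes count as Modify once testing has occurred, are simplified assumptions for illustration:

```python
# Sketch: run-length group a user's actions into typed blocks, counting the
# outcomes observed inside each block as qualifiers.
def block_type(action, tested_yet):
    if action in {"startMeasure", "traceMeasure", "endMeasure"}:
        return "test"
    if action == "reset":
        return "reset"
    return "modify" if tested_yet else "construct"

def to_blocks(actions):
    """actions: list of (action, outcome-or-None). Returns summarized blocks."""
    blocks, tested = [], False
    for action, outcome in actions:
        btype = block_type(action, tested)
        tested = tested or btype == "test"
        if btype == "reset":
            tested = False  # a reset starts a fresh construct phase
        if not blocks or blocks[-1]["type"] != btype:
            blocks.append({"type": btype, "n_actions": 0, "outcomes": {}})
        blk = blocks[-1]
        blk["n_actions"] += 1
        if outcome:
            blk["outcomes"][outcome] = blk["outcomes"].get(outcome, 0) + 1
    return blocks
```

Each block can then be summarized by its frequency, duration, action count, and the per-outcome qualifier counts accumulated in its outcomes dictionary.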
In this way, Blocks represents information about the Family and Outcome layers of the action-events. We chose these two layers so that Blocks would remain a less granular representation, abstracting Action and Component again, while also adding sequential information that was not present in the other feature sets. As previously mentioned, the first step in the user modeling framework is to see if we are able to identify groups that differ in terms of the amount they learned while working with the simulation. For each of the 25 feature sets described above (and listed in Table 2), we first clustered the students using k-means clustering. We then tested whether there was a statistically significant difference in learning gain between the clusters. Only the following 4 feature sets generated clusters of students with statistically different learning gains (more on this process can be found in the following section). Feature sets with only frequency as a summative measure are denoted by _f, while feature sets that include frequency and time information are denoted by _fms.

1) FOAC_f: Set including all action-event elements (Family, Outcome, Action, Component) with frequency information (211 features)
2) FAC_f: Same as the first feature set, but without the Outcome layer (99 features)
3) OAC_f: Same as the first feature set, but without the Family layer (207 features)
4) Raw_OAC_fms: Same as the third feature set, but with less feature engineering, as described above, and with time-related summative measures (609 features: 203 action-events x 3 types of summative measures)

Interestingly, the three feature sets that have some amount of feature engineering include only information on action frequency, indicating that summative statistics capturing how much time students spend before actions do not contribute to identifying different learning outcomes. This can be explained by the fact that we capture significant inactivity before actions via pauses.
The Raw_OAC_fms feature set, on the other hand, does not have pause information, and so it seems to need the timing information to form meaningful clusters. In addition, we had defined the Family layer to abstract the Action and Component layers. However, we were not able to find a significant feature set that included the Family layer without the Action and Component layers. This suggests that, in this case, the specific actions taken by the users provide us with more distinguishing information than the abstracted Family layer does – it is the individual actions that make a difference, not the general type of behavior.

7 Evaluating Representations for Assessment and Support

Based on the user modeling framework described in the previous section, the three measures we use to evaluate the four feature sets described in Section 6 are: (i) Quality of the generated clusters, measured by the effect size of the difference in learning performance between students in the different clusters. That is, how informative are the resulting clusters with regard to learning gains? (ii) Classification accuracy of user models trained on the obtained clusters, evaluated both at the end of interaction and over time throughout interaction. That is, how accurately can the model assign students to their respective clusters? (iii) Usefulness of the generated association rules in identifying behavior patterns that can be used to design and trigger support to students. That is, do the clusters provide sufficient information for meaningful feedback? In this section, we evaluate the significant feature sets on these three metrics.

7.1 Quality of the Clusters

Table 3 shows the outcome of clustering on the four feature sets, with one row per cluster, as defined by the optimal number of clusters identified. Each cluster is labeled based on the learning performance of the users in that cluster.
This learning performance is computed as the mean post-test score for the users in the cluster, corrected for pre-test score; these are reported in the fourth and fifth columns of Table 3, respectively. The third column reports the cluster size. Clusters that contained only one member (singletons) resulted in that member being removed as an outlier and the clustering algorithm being rerun. This impacted only the Raw_OAC_fms feature set, reducing the number of students from 96 to 92. The last two columns report the p-value and effect size of the difference in learning performance among the clusters found for each feature set, obtained via an ANCOVA on the post-test scores, controlling for pre-test. Thus, a larger effect size suggests a representation that better separates students with different learning levels.

Feature Set | Cluster Label | #Members | Average Post-test Score | Average Pre-test Score | p-value | Effect Size (partial eta squared)
FOAC_f | High | 61 | .609 | .465 | .013 | .065
FOAC_f | Low | 35 | .511 | .470
FAC_f | High | 67 | .596 | .455 | .048 | .041
FAC_f | Low | 29 | .534 | .494
OAC_f | High | 66 | .609 | .463 | .007 | .076
OAC_f | Low | 30 | .509 | .475
Raw_OAC_fms | Very High | 3 | .840 | .513 | .003 | .122
Raw_OAC_fms | High | 67 | .595 | .475
Raw_OAC_fms | Low | 22 | .489 | .445

Table 3. Summary statistics for the clustering results.

All feature sets except for Raw_OAC_fms generated two clusters, identifying groups of students with high vs. low learning. Raw_OAC_fms generated 3 clusters. Post-hoc pairwise comparison showed that all three clusters had statistically different learning performances¹ from each of the other two. Effect sizes of the difference in learning performance varied for the different feature sets. Raw_OAC_fms achieves the best score, with a medium effect size [3], due to its three clusters having better discrimination power to capture learning differences. Of the feature sets with 2 clusters, effect size ranges from small (for FAC_f) to medium-small (for OAC_f and FOAC_f).
FAC_f performs worse than the other three, suggesting that Outcome may be an important separator between high and low learners, as FAC_f is the only feature set without Outcome. Interestingly, OAC_f achieves the highest effect size of the two-cluster feature sets, showing that the addition of slightly more feature-engineered information (the Family layer) reduced the ability to detect differences in learning between the two clusters using clustering.

It should be noted that one of the clusters in Raw_OAC_fms, the one corresponding to a very high level of learning, included only 3 students. Thus, we will only focus on the two larger clusters in Raw_OAC_fms for the rest of our analysis.

¹ Statistical significance is based on p-level < 0.05 throughout this thesis unless otherwise specified.

7.2 Classification Accuracy

Classification accuracy was evaluated in two ways. First, we looked at just the accuracy at the end of interaction, after all interaction data had been seen for a new user. Since it is critical to be able to help users as they are working with the simulation, and not after they have finished interacting, we did this only to weed out any classifiers that did not perform well even after seeing all the data. For the ones that did, we evaluated classification accuracy over time. This allowed us to see at which point during the interaction the classifier could reliably start to classify students. This would be the time at which an adaptive environment could begin to provide hints to the user. We first look at classification accuracy at the end of interaction, then classification accuracy over time.

7.2.1 Classification Accuracy at End of Interaction

For each of the four feature sets, a classifier user model is trained on the generated clusters, using 8-fold nested cross-validation to set the model’s parameters and find its cross-validated accuracy at the end of the interaction (when the classifier is trained on the complete interaction logs).
Table 4 reports the classification performance of each classifier in terms of overall accuracy, class accuracy for high and low learners, and kappa scores. The table also reports, for each classifier, the accuracy of a baseline majority classifier that assigns all students to the largest cluster found with the corresponding feature set in the previous evaluation phase (see Table 3). As Table 4 shows, the accuracy of the majority classifier for each feature set is different. We use kappa scores as one of our performance measures because they account for agreement by chance, thus providing a fair comparison among classifiers with different baselines. At the same time, kappa scores are hard to interpret in terms of a model’s practical effectiveness. Thus, we also discuss classifier performance in terms of whether their overall accuracy is statistically better than each corresponding baseline. In order to do so, a one-sample t-test was conducted to compare classifier accuracy and baseline accuracy for each feature set (4 total). The overall accuracies for three of the action-event sets, FOAC_f, FAC_f, and OAC_f, are significantly above their respective baselines, with kappa scores ranging from 0.564 to 0.702, indicating that our user-modeling framework can effectively classify students working with CCK with three of the feature sets. The Raw_OAC_fms feature set, on the other hand, does not significantly beat its baseline, has the lowest overall accuracy, the lowest kappa score, and has extremely imbalanced class accuracies. It is also worse than the OAC_f feature set on all measures, signifying that even some minimal feature engineering is helpful for more accurately classifying students. Because Raw_OAC_fms does not beat its baseline after all data have been observed, we exclude it from all remaining evaluation, as it cannot be used to classify students better than chance.

Feature Set | Baseline % | Overall Accuracy % (Std. Dev.) | High Learner Class Accuracy % | Low Learner Class Accuracy % | Kappa
FOAC_f | 65.3 | 86.5 (8.8) | 91.8 | 77.1 | .702
FAC_f | 69.8 | 83.3 (5.9) | 85.1 | 72.4 | .564
OAC_f | 68.8 | 84.4 (9.4) | 90.9 | 70.0 | .626
Raw_OAC_fms | 75.3 | 79.8 (10.6) | 88.1 | 54.5 | .439

Table 4. Classifier accuracy measures for different feature sets. Baseline is the accuracy of the majority classifier.

The feature set based on the most detailed representation, FOAC_f, is superior to the other 3 sets. In particular, although its overall accuracy of 86.5% is higher than but comparable to FAC_f and OAC_f, its kappa score reaches 0.7 compared to the second-best value of 0.626 for OAC_f, and it is better than OAC_f at classifying low learners (77% vs. 70% accuracy for this class), with on-par performance for high learners.² This indicates that the additional level of representation added by the Family layer is beneficial for classifier accuracy when all information (Action, Outcome, Component) is leveraged. Also, of the three feature-engineered feature sets, the two that include Outcome show higher accuracy compared with FAC_f, suggesting that the outcomes of students’ actions, rather than the actions themselves, are most beneficial for identifying low vs. high learners. A high classification accuracy is critical to providing effective adaptive support to students. Ideally we would only give interventions to low-learning students, so as to give help to the students who need it without interrupting those who don’t. In general, all our classifiers have better accuracy at classifying high learners than low learners, mostly due to the fact that there are more high learners in the training data.

² We don’t perform a formal statistical analysis in this section comparing the accuracies of our classifiers against each other because that will be done as part of the subsequent analysis of accuracies over the whole course of the interaction.
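The kappa scores above correct overall accuracy for agreement by chance. A minimal sketch of the computation, applied to the FOAC_f confusion matrix reported in Table 5 below; note that the thesis's kappa values are averaged over cross-validation runs, so this single-matrix value does not match the .702 in Table 4:

```python
def cohens_kappa(matrix):
    """matrix[i][j]: number of students with true class i predicted as j."""
    n = sum(sum(row) for row in matrix)
    p_observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of marginal probabilities, summed over classes.
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# FOAC_f confusion matrix (Table 5): rows = clustered HL/LL, cols = classified HL/LL.
foac = [[55, 10], [13, 18]]
n = sum(map(sum, foac))                         # 96 students
false_alarm_pct = round(100 * foac[0][1] / n, 1)  # HL classified as LL: 10.4
missed_low_pct = round(100 * foac[1][0] / n, 1)   # LL classified as HL: 13.5
```

The same matrix also yields the false positive and false negative percentages discussed next.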
Looking at the false positive and negative rates, in the case of the FOAC_f classifier, for instance (Table 5), only 10.4% of all students were clustered as high learners but classified as low learners. These are the students who would receive interventions when they do not need them, and who could potentially be distracted by this. On the other hand, only 13.5% of all students are low learners not classified as such. Thus, the system will miss the chance to help these students when they need it. Still, even for low learners, the percentage of misclassified students is small, indicating that our approach has good potential for improving the pedagogical effectiveness of the CCK simulation overall. Table 5 shows the confusion matrix for the FOAC_f classifier. The confusion matrices of the other two classifiers that significantly beat baseline are similar.

FOAC_f | Classified HL | Classified LL
Clustered HL | 55 (57.3%) | 10 (10.4%)
Clustered LL | 13 (13.5%) | 18 (18.8%)

Table 5. Confusion matrix for the FOAC_f feature set. Percentages displayed are percent of students in that category out of all students.

7.2.2 Classification Accuracy Over Time

The accuracies reported in the previous section relate to the performance of the classifier after seeing all the data for one student, at the end of the interaction. While this information could still be leveraged for providing the student with adaptive summative feedback, or for personalized instructions the next time a student uses the system, we also want to know whether our classifiers can be used for providing adaptive interventions during a specific session with CCK. Thus, for each of the three feature sets that beat the baseline at the end of interaction (which excludes Raw_OAC_fms), we calculated their accuracy over time.
Namely, for each feature set we trained and evaluated different classifiers on incremental portions of interaction data (time slices), simulating what happens when a new user gets classified in real-time while using CCK. Since the students worked with CCK for about 25 minutes in our study, we calculated the data points at cumulative 10% time slices (each slice including ~2.5 minutes more interaction data than the time slice before it). For example, the 50% time slice includes all actions taken up to half of the way through the interaction (~15 minutes). We used percent of interaction observed instead of minutes of interaction for ease of presentation, given that each student worked with the PhET simulation for a different amount of time (average length = 24.7 minutes, standard deviation = 4.3 minutes). We aim to answer the following questions with the analysis of the computed over-time accuracies presented in the rest of this section:

1) Which classifiers significantly outperform their corresponding baselines in terms of accuracy over time?
2) At which specific time slices do the classifiers begin to significantly outperform the baseline?
3) Which classifier has the best over-time performance?

The over-time accuracy of each of the FOAC_f, OAC_f, and FAC_f feature sets is shown in Figure 4.

Figure 4. Over-time performance of each of the three feature sets’ classifiers (accuracy %, from 50 to 90, plotted against percent of interaction observed, from 10 to 100).

To assess which of the three classifiers outperforms its baseline (not reported in Figure 4), we first ran a 2-way repeated measures ANOVA for each classifier, with classification accuracy as the dependent measure. The factors in the ANOVAs were classifier type (with 2 levels: baseline or our classifier) and time slice (with 10 levels: one level for each 10% of interaction data observed).
This gives us 20 cells (2 classifier types x 10 time slices). Each cell had 10 data points, one data point for the accuracies of each of 10 runs of 8-fold cross-validation classification with different random seeds. 8-fold cross-validation was used in previous sections and was chosen so that we would have even sized folds for our 96 users. For the ANOVAs, we chose to use the accuracies from 10 runs of the cross-validation instead of the accuracies of each fold in one run of the cross-validation so that we could get a more accurate standard deviation between the data points. There was high variation amongst the accuracies of the folds due to uneven splits of user types into different folds. For each of the three ANOVAs, Mauchly’s Test of Sphericity indicated that the assumption of sphericity had been violated, and therefore, a Greenhouse-Geisser correction was used. There was a significant interaction between classifier type and time slice for each of the three feature sets: FOAC_f (F(3.833,34.494) =  82.579, p < .001, η² = .902), FAC_f (F(4.715,42.438) = 60.117, p < .001, η² = .870), and OAC_f (F(4.206,37.855) =  56.045, p < .001, η² = .862). Because there were interaction effects, we ran simple main effects analysis to determine which classifier significantly beat its baseline. We found a significant effect for classifier type for each of the three feature sets: FOAC_f (F(1,9) =  864.794, p < .001, η² = .990), FAC_f (F(1,9) =  14.192, p < .01, η² = .612), and OAC_f (F(1,9) =  550.159, p < .001, η² = .984). This indicates that each classifier outperforms its baseline in terms of accuracy over time (the means for each are shown in Table 6). There was also a significant effect for time slice for each of the three feature sets: FOAC_f (F(3.833,34.494) =  82.579, p < .001, η² = .902), FAC_f (F(4.715,42.438) = 60.117, p < .001, η² = .870), and OAC_f (F(4.206,37.855) =  56.045, p < .001, η² = .862).  
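The partial eta squared values reported for these ANOVAs follow directly from the F statistics and degrees of freedom via the standard conversion η²p = F·df1 / (F·df1 + df2), which can be checked against the values above:

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Standard conversion: eta_p^2 = F*df1 / (F*df1 + df2)."""
    return f_value * df_effect / (f_value * df_effect + df_error)

# Simple main effect of classifier type for FOAC_f: F(1,9) = 864.794
assert round(partial_eta_squared(864.794, 1, 9), 3) == 0.990
# Simple main effect of classifier type for OAC_f: F(1,9) = 550.159
assert round(partial_eta_squared(550.159, 1, 9), 3) == 0.984
# Interaction for FOAC_f: F(3.833, 34.494) = 82.579
assert round(partial_eta_squared(82.579, 3.833, 34.494), 3) == 0.902
```

These very large effect sizes accompany the significant classifier-type and time-slice effects just reported.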
This indicates, not surprisingly, that classifier accuracy is significantly impacted by how much data the classifier has seen from the user.

Feature Set | Classifier Accuracy % | Baseline Accuracy %
FOAC_f | 73.9 | 63.5
FAC_f | 71.0 | 69.8
OAC_f | 76.5 | 68.8

Table 6. Average accuracies for each classifier and its baseline.

FOAC_f (baseline 63.5%):
Time Slice | Classifier Accuracy % (SD) | t-value | p-value
10% | 63.9 (1.5) | .758 | .468
20% | 68.0 (4.0) | 3.582 | .006
30% | 67.0 (4.3) | 2.536 | .032
40% | 68.3 (2.4) | 6.462 | .000
50% | 69.1 (2.1) | 8.432 | .000
60% | 73.5 (2.6) | 12.399 | .000
70% | 79.8 (1.9) | 27.842 | .000
80% | 81.9 (2.2) | 27.006 | .000
90% | 81.9 (2.4) | 24.567 | .000
100% | 85.2 (3.0) | 22.738 | .000

FAC_f (baseline 69.8%):
Time Slice | Classifier Accuracy % (SD) | t-value | p-value
10% | 64.3 (3.5) | -4.984 | .001
20% | 59.8 (4.1) | -7.805 | .000
30% | 69.6 (2.9) | -0.240 | .816
40% | 64.4 (2.1) | -8.057 | .000
50% | 69.2 (2.2) | -0.931 | .376
60% | 69.1 (3.3) | -0.708 | .497
70% | 76.4 (1.9) | 10.880 | .000
80% | 75.8 (2.5) | 7.649 | .000
90% | 79.8 (2.9) | 10.845 | .000
100% | 81.9 (3.7) | 10.373 | .000

OAC_f (baseline 68.8%):
Time Slice | Classifier Accuracy % (SD) | t-value | p-value
10% | 68.4 (1.0) | -1.160 | .276
20% | 70.4 (1.8) | 2.866 | .019
30% | 78.3 (2.4) | 12.328 | .000
40% | 72.3 (3.1) | 3.591 | .006
50% | 77.2 (2.8) | 9.632 | .000
60% | 73.6 (1.7) | 8.990 | .000
70% | 83.9 (2.9) | 16.329 | .000
80% | 77.1 (2.6) | 10.266 | .000
90% | 79.9 (2.4) | 14.571 | .000
100% | 84.1 (2.4) | 20.472 | .000

Table 7. Classifier accuracy over time for each of our three feature sets. Bolded values highlight the time slices where the classifier is significantly beating the baseline.

In order to ascertain, in particular, how much the available data impact a classifier’s ability to perform better than its baseline, we conducted a one-sample t-test to compare each classifier’s accuracy with its baseline. One t-test was run for each of the time slices for each of the classifiers (30 total t-tests).
Table 7 shows the outcome of the t-tests. Reported effects are significant at the p < .05 level after all tests are corrected for multiple comparisons using the Holm-Bonferroni correction. We see that FAC_f does not significantly beat the baseline until 70% of the interaction has been observed. However, OAC_f and FOAC_f are able to consistently and significantly beat their baselines after only 30% and 40% of the data has been observed, respectively. In addition, they are never statistically worse than the baseline, implying that they could be used at any point during the interaction to classify the students no worse than baseline.

Knowing whether a classifier can beat its baseline establishes whether the classifier has any predictive value at all. However, beating the baseline says little about the value of the classifier in practice, especially if, as in our case, one wants to compare classifiers with different baselines that might be easier or harder to beat. To assess which of the three classifiers performs best in terms of raw accuracy over time, we ran a 2-way repeated measures ANOVA with classification accuracy as the dependent measure, and factors consisting of classifier (with 3 levels: FOAC_f, FAC_f, and OAC_f) and time slice (with 10 levels: one for each 10% of interaction data observed). There was a significant effect of classifier type (F(2,18) = 78.513, p < .001, η² = .897) and of time slice (F(9,81) = 164.094, p < .001, η² = .948). There was also a significant interaction between classifier type and time slice (F(18,162) = 11.632, p < .001, η² = .564). The post-hoc analysis for the classifier type main effect shows that all three classifiers are statistically different from each other at the p < .01 level (after all tests are corrected for multiple comparisons using the Holm-Bonferroni correction).
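The Holm-Bonferroni step-down procedure used for these corrections is simple to implement; the following is a minimal sketch of the standard textbook procedure, not the analysis code actually used in the thesis:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm-Bonferroni: return True where H0 is rejected.

    The i-th smallest of m p-values (0-based rank i) is compared against
    alpha / (m - i); testing stops at the first non-rejection.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are retained too
    return reject

# With three tests: .01 survives (.01 <= .05/3) but .03 fails (.03 > .05/2),
# so .04 is never tested and is retained as well.
print(holm_bonferroni([0.01, 0.04, 0.03]))  # [True, False, False]
```

The same function applies to the 30 baseline comparisons of Table 7 and to the pairwise post-hoc comparisons, with the appropriate family of p-values in each case.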
The ordering of the feature sets, from best to worst in terms of classifier accuracy over time, is: OAC_f, FOAC_f, FAC_f. This may seem to contradict what we found in the last section, where FOAC_f was the classifier able to classify students with the highest accuracy. However, OAC_f performs better over time because it outperforms FOAC_f in many of the earlier time slices. Figure 4 and Table 7 show that OAC_f, the best performing classifier over time, can also be used the earliest. Starting after only 20% of the interaction has been observed (roughly 5 minutes in), OAC_f correctly classifies at least 70% of the students. The other two feature sets do not achieve such high accuracy until about 60% (FOAC_f) and 70% (FAC_f) of the interaction has been observed. In addition, in the last section we did not perform statistical tests comparing the classifiers to each other, so we could not say with certainty that FOAC_f was significantly the best. In both sections we did see that Outcome adds information that allows for better overall classification of the students; in this section, both FOAC_f and OAC_f beat the baseline at a much earlier point of the interaction than FAC_f.

Given that our intervention targets low-performing students, it is important to evaluate specifically the accuracy over time for that cluster. As plotted in Figure 5, after 20% of the interaction we are able to classify only 23% of the students in this cluster successfully. No feature set consistently exceeds 60% accuracy on the low-learner class until 50% of the interaction has been observed.

Figure 5. Over-time performance of each of the three feature sets at classifying low learners.
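The per-cluster accuracy plotted in Figure 5 is simply the recall of the low-learner class, computed separately at each time slice. A minimal sketch of that computation (the function name and label strings are ours, purely illustrative):

```python
def class_accuracy(y_true, y_pred, target="LL"):
    """Fraction of students whose true label is `target` that are
    classified correctly, i.e. the recall of that class."""
    hits = sum(t == p == target for t, p in zip(y_true, y_pred))
    total = sum(t == target for t in y_true)
    return hits / total if total else 0.0

# Hypothetical labels for four students at one time slice:
print(class_accuracy(["LL", "LL", "HL", "LL"],
                     ["LL", "HL", "HL", "LL"]))  # 2 of 3 LL students correct
```

Running this once per time slice, over the predictions made from each partial interaction log, yields the curves of Figure 5.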
7.3 Usefulness for Providing Adaptive Support

In this section we focus on the third criterion we leverage to evaluate our feature sets, namely the usefulness of the association rules uncovered in the corresponding clusters for designing and triggering support to students. Association rules identify behavioral patterns that are representative of what students in a given cluster do with CCK (see [14] for a discussion of how patterns are derived from rules). These patterns are useful if they are associated with low (or high) learning performance that can inform adaptive interventions. Specifically, if a student is classified as a "Low Learner" (LL) at any given point of working with CCK, adaptive interventions can be provided to discourage the LL patterns she is showing and to encourage the HL patterns she is not showing.

The number of identified patterns varies to some degree among feature sets, ranging from 15 in OAC_f to 17 in FAC_f to 23 in FOAC_f, showing that the most complex representation captures finer-grained variations in learner behaviors. With more distinct patterns, we can deliver a wider variety of interventions to the students. More distinct patterns also mean more behaviors that can trigger an intervention, possibly allowing us to intervene more often during a student's interaction. The breakdown of patterns by high and low learners is shown in Table 8 and example patterns are shown in Table 9. We observe that in all feature sets LL produce more distinct patterns than HL, even though there are fewer learners in the LL group.
The LL patterns largely describe engaging infrequently in productive behaviors, whereas the HL patterns describe engaging infrequently in unproductive behaviors.

Feature Set   HL distinct patterns   LL distinct patterns   Total distinct patterns
FOAC_f        11                     12                     23
FAC_f         6                      11                     17
OAC_f         6                      9                      15
Table 8. Number of distinct patterns for each of the feature sets, broken down by patterns coming from high learners and low learners.

FOAC_f, HL:
- None.Build.add.lightBulb_f = Low [When building, they added light bulbs resulting in no outcome with low frequency]
- light_intensity.Revise.split.junction_f = Low [When revising, they split junctions resulting in changes to light intensity with low frequency]
FOAC_f, LL:
- deliberate_measure.Test.startMeasure.voltmeter_f = Low [When testing, they used the voltmeter with low frequency]
- pause_f = Low [They paused with low frequency]
FAC_f, HL:
- Build.add.lightBulb_f = Low [When building, they added light bulbs with low frequency]
FAC_f, LL:
- Build.changeResistance.resistor_f = Low [When building, they changed the resistance of resistors with low frequency]
- Revise.changeResistance.resistor_f = Low [When revising, they changed the resistance of resistors with low frequency]
- Test.endMeasure.voltmeter_f = Low [When testing, they used the voltmeter with low frequency]
- pause_f = Low [They paused with low frequency]
OAC_f, HL:
- light_intensity.join.wire_f = Low [They joined wires resulting in light intensity changes with low frequency]
OAC_f, LL:
- deliberate_measure.traceMeasure.nonContactAmmeter_f = Low [They used the non-contact ammeter by tracing with low frequency]
- fire_started.changeResistance.resistor_f = Low [They changed the resistance of resistors resulting in a fire with low frequency]
Table 9.
Sample patterns for each feature set (raw form and English description).

While the patterns produced by the three feature sets varied, we identified 4 trends that each occurred in at least two feature sets. This shows that our general approach for behavior discovery is able to uncover core behaviors that are stable across representations. The following patterns are presented in their raw form for brevity; the English descriptions can be found in Table 9.

One of these trends relates to use of the voltmeter and ammeter (to measure the voltage difference between, and the current through, different parts of the circuit, respectively). Intuitively, using testing devices is an effective behavior for understanding the changes in a circuit. We observed that LL perform testing with low frequency. This pattern occurs for LL in all feature sets: FOAC_f (deliberate_measure.Test.startMeasure.voltmeter_f = Low), FAC_f (Test.endMeasure.voltmeter_f = Low), and OAC_f (deliberate_measure.traceMeasure.nonContactAmmeter_f = Low).

The next trend we identified is also only present in LL patterns and relates to changing the resistance of resistors. This pattern is consistent with experimenting with a range of resistances, as suggested by the activity, and is an effective behavior for understanding how resistors work. We see that LL engage in this pattern infrequently; the trend is observed in 2 feature sets: FAC_f (Build.changeResistance.resistor_f = Low and Revise.changeResistance.resistor_f = Low) and OAC_f (fire_started.changeResistance.resistor_f = Low). While it may seem counterintuitive that starting fires is productive, fires only occur in a circuit when the current through a component reaches a very high value. This happens when that segment of the circuit has relatively little resistance and a very high voltage.
By causing a fire after changing resistances, students are likely experimenting with extremes of resistance and gaining an understanding of their effect.

The next trend we observed relates to the frequency of pausing (possibly to plan, reflect, and take notes) and, as with the two before, it is only present in LL patterns. Specifically, LL show patterns of pausing infrequently, indicating that they are not taking adequate time to best leverage the learning activity. This trend is identified in 2 feature sets: FOAC_f (pause_f = Low) and FAC_f (pause_f = Low).

The last trend relates to the addition of light bulbs and changes in light intensity. HL both add light bulbs infrequently and make infrequent changes to light intensity. Since the goal of the students while working with the sim was to understand how resistors work in circuits, light bulbs were likely distractors at best, and possibly interfered with observing the behavior of other resistors (light bulbs are a type of resistor too). We see the light-bulb-adding behavior associated with HL in both the FAC_f feature set (Build.add.lightBulb_f = Low) and the FOAC_f feature set, where it is further qualified with the outcome (None.Build.add.lightBulb_f = Low), as shown in Table 9. An outcome of None is expected in this case because it is the only outcome that can occur when adding a component: no current flows through a newly added component until both of its ends are joined to other components. We also see HL making changes to light intensity with low frequency in both FOAC_f (light_intensity.Revise.split.junction_f = Low) and OAC_f (light_intensity.join.wire_f = Low). There are many ways to create changes in light intensity in a circuit; almost any change made to the circuit can cause one.
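Operationally, triggering an intervention from these mined patterns amounts to checking a student's current discretized feature levels against the LL rules of her cluster. A sketch of that matching step, using FAC_f pattern names from Table 9 (the trigger logic itself is our illustration, not the thesis's implementation):

```python
# LL patterns mined for FAC_f (feature -> level associated with low learning)
LL_PATTERNS = {
    "Test.endMeasure.voltmeter_f": "Low",
    "Build.changeResistance.resistor_f": "Low",
    "pause_f": "Low",
}

def triggered_hints(student_levels, patterns=LL_PATTERNS):
    """Features whose current level matches an LL pattern; each match
    could trigger a hint encouraging the corresponding behavior."""
    return [feat for feat, level in patterns.items()
            if student_levels.get(feat) == level]

# A student who tests often but rarely pauses would get a pausing hint:
print(triggered_hints({"Test.endMeasure.voltmeter_f": "High",
                       "pause_f": "Low"}))  # ['pause_f']
```

The symmetric check against HL patterns (behaviors the student is *not* showing) would identify which productive behaviors to encourage.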
In summary, the trends uncovered by association rules in the different feature sets indicate that it is productive to: i) frequently use testing devices, ii) frequently change the resistance of resistors, iii) frequently pause, and iv) infrequently use light bulbs and change the light intensity.

Next, we evaluate the usefulness of these patterns for informing adaptive support. One criterion for doing so is the level of detail at which the support can be provided. Naturally, this depends on the granularity of the corresponding features in the different representations. Thus, behaviors in FOAC_f give the most contextual information and can be used to give students feedback with regard to the outcome of desired actions, what to do to achieve that outcome in terms of a high-level behavior, and how to achieve it using specific actions and components. For example, a hint relating to the "frequently change the resistance of resistors" trend could deliver a variety of levels of support depending on the representation. In FAC_f, the LL rule is "Revise.changeResistance.resistor_f = Low". With this representation, a hint could tell students to revise more (what to do generally), and then give the specific suggestion of doing so by changing the resistance of resistors (how to do it). It is missing the layer of feedback related to outcomes and is therefore not able to emphasize the desired outcome the student should attain. This could be a drawback if certain outcomes are better to achieve than others. For example, it may be more useful for students to have a testing instrument attached to the circuit when they are changing the resistance, so that they can observe changes in the reading of the testing device. Or it may be better for them to create fires, as this likely means they are experimenting with extreme resistances.
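Because FOAC_f feature names encode all four layers (Outcome.Family.Action.Component), a hint generator can peel them apart to produce feedback at each level of detail. A small sketch under that naming convention (the hint wording is ours and purely illustrative):

```python
def hint_levels(pattern):
    """Split an FOAC_f feature name into its four layers and return
    hints of narrowing specificity: outcome, then family, then action."""
    outcome, family, action, component = pattern.removesuffix("_f").split(".")
    return [
        f"Try to achieve more '{outcome}' outcomes.",
        f"Do this while you {family.lower()} your circuit.",
        f"Specifically, try to {action} the {component}.",
    ]

hints = hint_levels("fire_started.Revise.changeResistance.resistor_f")
print(hints[0])  # Try to achieve more 'fire_started' outcomes.
```

In practice, the template strings would be replaced by pedagogically worded messages, but the layered structure of the feature name is what makes the progression possible.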
In contrast, the OAC_f representation is able to provide the desired outcome but not the level about what to do in terms of high-level behaviors. The related LL rule for OAC_f is "fire_started.changeResistance.resistor_f = Low". With this representation, we can tell the students to start more fires (outcome), and then to specifically do this by changing the resistance of resistors (how to do it). We cannot provide information about which high-level behaviors to perform (e.g., should they be revising or building their circuit while doing this?). For hints coming from the FOAC_f representation, we would be able to provide all levels of detail.

The richer level of detail available in the FOAC_f representation lends itself well to providing sequences of hints with narrowing specificity (a well-established approach to hint provision in ITS [18]). For instance, if the LL pattern fire_started.Revise.changeResistance.resistor_f = Low existed for FOAC_f, a first-level hint could tell the student the outcome to try to achieve ("Try to observe more fires."); then, if needed, a second-level hint could suggest the family (what to do at a high level: "Do this by revising more."), followed by a hint on how to do it in terms of a specific action and component ("Do this by changing the resistance of the resistors."). The OAC_f and FAC_f feature sets do not support this hint progression.

8 Discussion

In this thesis, we evaluated the different representations on three qualities: i) ability to distinguish types of learners; ii) ability to classify students so that the right students can be targeted for assistance; and iii) ability to inform adaptive support. Our representations differ in their level of granularity and amount of feature engineering.
From these representations, we identified a trade-off between suitability to provide support and quality of the clusters: hints generated by the most complex representation, FOAC_f, would only be able to target over 70% of students correctly after 60% of the interaction, but can give the most detailed support and provide the largest number of hints. On the other hand, the representation with less feature engineering, OAC_f, generates rules that come from higher quality clusters and can target the correct students sooner (after only 20% of the interaction), albeit with fewer hints and slightly less detailed support (missing the general "what to do" level described by the Family layer, though, as mentioned, in most cases this layer can be inferred). The Raw_OAC_fms feature set was the best at identifying different types of learners, but performed no better than baseline at classifying students. Because it performs poorly at the key task of classifying students, we feel that it is not a suitable representation for offering assistance to students, even though it required the least feature engineering. The fourth feature set, FAC_f, is also unsuitable, as it does not provide classification that beats the baseline until very late in the interaction. An experimental evaluation of our two most promising feature sets (FOAC_f and OAC_f) is required to see how the trade-off between suitability to provide support and quality of clusters impacts the effectiveness of interventions in an adaptive version of CCK. Thus, generating different adaptive versions of CCK based on the classifiers and behavior patterns identified in this thesis is one of the next steps of this research.
9 Conclusion and Future Work

In this research, we aimed to provide a comprehensive evaluation of the student modeling framework proposed in [13] when applied to multi-layer representations of student interactions with CCK. We evaluated the representations in terms of their ability to identify learners with high or low learning gains, their suitability for user modeling (i.e., the ability to classify new students in terms of their learning performance as they work with the simulation), and their usefulness for informing the content of adaptive support during interaction.

The results presented above provide evidence of the generality of the user-modeling framework we used for our evaluation. This framework had already been successfully applied to modeling students and providing support in a rather simple simulation for an AI algorithm [12]. Here we showed that it can transfer to a more complex ELE such as CCK, in that it is able to successfully classify student learning throughout the interaction (all classifiers were significantly better than baseline at some point during interaction, and one classifier was able to classify over 70% of students correctly after just 20% of the interaction data was observed) and to identify interaction behaviors intuitively associated with more or less effective learning.

One of the next steps of this research is investigating how to design real-time hints that can foster the productive patterns and discourage the others, as we did in [12], eventually supporting learning. A key goal of this work is to foster productive interactions with the CCK simulation as students work with it. By providing adaptive hints while they are interacting, we can hopefully encourage them to perform more productive behaviors and fewer detrimental ones. Another step of future work is to further test the generality of this modeling framework by applying it to another simulation of the PhET family.
This will allow us to identify productive patterns across simulations and domains and bring us closer to addressing the challenge of a general modeling framework for interactive simulations. In addition, we would like to try different types of feature sets on this PhET simulation. This research focused on variations of action-event features; however, we would also like to explore other representations. Some we have already begun to explore include sophistication, further block qualifiers, sequences, and hand-picked features. These representations are described below.

Sophistication: Sophistication begins to address the notion of the complexity of the circuit built. Because this is an exploratory learning environment and there is no way to define the "correctness" of a circuit, this representation would allow us to explore whether learners who perform better build different types of circuits than their lower scoring peers. For example, with this representation, we would explore whether there is a difference between using series or parallel circuits, or whether more of one type of component in a circuit leads to better learning.

Block Qualifiers: For this research, we only qualified blocks by outcome (see the section on Representing the User Actions). This was done to keep blocks as a fairly high-level representation. We were surprised that we did not see results for this level of qualification, because this representation captured information about two layers in our hierarchy and added information about the sequential nature of the actions. Since we did not see any results for blocks qualified with outcome alone, we would like to further explore blocks by qualifying them with other layers of the action-event hierarchy, like actions or components. This would increase the granularity captured by the blocks. Other qualifiers could include sophistication items like types of circuits built, numbers of components added, or complexity of testing performed.
Sequences: Another representation we would like to explore is sequences. Sequences would allow us to evaluate how students progress through interacting with the exploratory learning environment. This was somewhat captured with blocks; however, sequences would allow us to extract more information about which behaviors follow others. At a granular level, this could be sequences of actions or of the complexity of circuits built. This would allow us to see whether certain progressions of actions or circuit building are most effective (e.g., is it best to start with a simple series circuit before progressing to a more complex parallel one when trying to understand resistors?). At a higher level, this could be sequences of families. This would allow us to determine whether series of types of actions (e.g., "Build -> Test -> Revise") lead to more or less effective learning, giving us a language of inquiry of the learners.

Hand Picked: In this research we explored the differences among feature sets with different levels of feature engineering. With the hand-picked feature set, we would select, with the help of physics teachers, the behaviors we think would lead to good or bad learning. Depending on the types of behaviors chosen, this could be a highly feature-engineered set and would allow us to compare whether a fully human-selected feature set leads to better performance on the measures outlined in this thesis.

In conclusion, the research presented in this thesis achieved two higher-level goals. First, we were able to extract valuable information about how students learn from interaction log files. And, second, we were able to show that the student-modeling framework used in this work is an effective way to extract this information.

References
1. Alfieri, L., Brooks, P.J., Aldrich, N.J., and Tenenbaum, H.R. 2011. Does discovery-based instruction enhance learning? Journal of Educational Psychology 103, 1–18.
2. Alexander Borek, Bruce McLaren, Michael Karabinos, and David Yaron. 2009. How Much Assistance Is Helpful to Students in Discovery Learning? Learning in the Synergy of Multiple Disciplines, 4th European Conference on Technology Enhanced Learning, Springer Berlin / Heidelberg, 391–404.
3. J. Cohen. 1988. Statistical power analysis for the behavioral sciences.
4. Cristina Conati, Lauren Fratamico, Samad Kardan, and Ido Roll. 2015. Comparing Representations for Learner Models in Interactive Simulations. Artificial Intelligence in Education, Springer International Publishing, 74–83.
5. Michael Eagle and Tiffany Barnes. 2014. Exploring differences in problem solving with data-driven approach maps. Educational Data Mining 2014.
6. Enrique García, Cristóbal Romero, Sebastián Ventura, and Carlos de Castro. 2008. An architecture for making recommendations to courseware authors using association rule mining and collaborative filtering. User Modeling and User-Adapted Interaction 19, 1-2, 99–132.
7. Janice D. Gobert, Michael A. Sao Pedro, Ryan S. J. d. Baker, Ermal Toto, and Orlando Montalvo. 2012. Leveraging Educational Data Mining for Real-time Performance Assessment of Scientific Inquiry Skills within Microworlds. JEDM - Journal of Educational Data Mining 4, 1, 111–143.
8. Yue Gong, Joseph E. Beck, and Carolina Ruiz. 2012. Modeling Multiple Distributions of Student Performances to Improve Predictive Accuracy. User Modeling, Adaptation, and Personalization, Springer Berlin Heidelberg, 102–113.
9. Talib S. Hussain, Bruce Roberts, Ellen S. Menaker, et al. 2009. Designing and developing effective training games for the US Navy. The Interservice/Industry Training, Simulation & Education Conference (I/ITSEC), NTSA.
10. T. de Jong and W. R. van Joolingen. 1998. Scientific Discovery Learning with Computer Simulations of Conceptual Domains. Review of Educational Research 68, 179–201.
11. T. de Jong, M. C. Linn, and Z. C. Zacharia. 2013. Physical and Virtual Laboratories in Science and Engineering Education. Science 340, 6130, 305–308.
12. Samad Kardan and Cristina Conati. 2015. Providing Adaptive Support in an Interactive Simulation for Learning: An Experimental Evaluation. Proceedings of CHI 2015.
13. Samad Kardan and Cristina Conati. 2011. A Framework for Capturing Distinguishing User Interaction Behaviours in Novel Interfaces. Proceedings of the 4th International Conference on Educational Data Mining, 159–168.
14. Samad Kardan and Cristina Conati. 2013. Evaluation of a Data Mining Approach to Providing Adaptive Support in an Open-Ended Learning Environment: A Pilot Study. AIED 2013 Workshops Proceedings Volume 2: Scaffolding in Open-Ended Learning Environments (OELEs), 41–48.
15. Samad Kardan, Ido Roll, and Cristina Conati. 2014. The Usefulness of Log Based Clustering in a Complex Simulation Environment. In Intelligent Tutoring Systems, Stefan Trausan-Matu, Kristy Elizabeth Boyer, Martha Crosby and Kitty Panourgia (eds.). Springer International Publishing, 168–177.
16. Paul A. Kirschner, John Sweller, and Richard E. Clark. 2006. Why Minimal Guidance During Instruction Does Not Work: An Analysis of the Failure of Constructivist, Discovery, Problem-Based, Experiential, and Inquiry-Based Teaching. Educational Psychologist 41, 75–86.
17. Krittaya Leelawong and Gautam Biswas. 2008. Designing Learning by Teaching Agents: The Betty's Brain System. International Journal of Artificial Intelligence in Education 18, 3, 181–208.
18. M. R. Lepper, M. Woolverton, D. L. Mumme, and J.-L. Gurtner. 1993. Motivational techniques of expert human tutors: Lessons for the design of computer-based tutors. In Computers as Cognitive Tools, 75–105.
19. Manolis Mavrikis, Sergio Gutierrez-Santos, Eirini Geraniou, and Richard Noss. 2012. Design requirements, student perception indicators and validation metrics for intelligent exploratory learning environments. Personal and Ubiquitous Computing 17, 8, 1605–1620.
20. Glenn W. Milligan and Martha C. Cooper. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50, 2, 159–179.
21. Amir Shareghi Najar, Antonija Mitrovic, and Bruce M. McLaren. 2014. Adaptive Support versus Alternating Worked Examples and Tutored Problems: Which Leads to Better Learning? In User Modeling, Adaptation, and Personalization. Springer, 171–182.
22. Richard Noss, Alexandra Poulovassilis, Eirini Geraniou, et al. 2012. The design of a system to support exploratory learning of algebraic generalisation. Computers and Education, 63–81.
23. Zachary A. Pardos, Shubhendu Trivedi, Neil T. Heffernan, and Gábor N. Sárközy. 2012. Clustered knowledge tracing. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 405–410.
24. D. Perera, J. Kay, I. Koprinska, K. Yacef, and O.R. Zaiane. 2009. Clustering and Sequential Pattern Mining of Online Collaborative Learning Data. IEEE Transactions on Knowledge and Data Engineering 21, 6, 759–772.
25. Ido Roll, Vincent Aleven, and Kenneth R. Koedinger. 2010. The Invention Lab: Using a Hybrid of Model Tracing and Constraint-Based Modeling to Offer Intelligent Support in Inquiry Environments. In Intelligent Tutoring Systems, Vincent Aleven, Judy Kay and Jack Mostow (eds.). Springer Berlin Heidelberg, 115–124.
26. Ido Roll, N. Yee, and A. Cervantes. 2014. Not a magic bullet: the effect of scaffolding on knowledge and attitudes in online simulations. International Conference of the Learning Sciences, 879–886.
27. Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 1, 53–65.
28. Ron J. C. M. Salden, Vincent A. W. M. M. Aleven, Alexander Renkl, and Rolf Schwonke. 2009. Worked Examples and Tutored Problem Solving: Redundant or Synergistic Forms of Support? Topics in Cognitive Science 1, 1, 203–213.
29. Michael A. Sao Pedro, Ryan S. J. d. Baker, Janice D. Gobert, Orlando Montalvo, and Adam Nakama. 2011. Leveraging machine-learned detectors of systematic inquiry behavior to estimate and predict transfer of inquiry skill. User Modeling and User-Adapted Interaction 23, 1, 1–39.
30. Benjamin Shih, Kenneth R. Koedinger, and Richard Scheines. 2010. Unsupervised Discovery of Student Strategies. Proceedings of the 3rd International Conference on Educational Data Mining, 201–210.
31. John Stamper, Michael Eagle, Tiffany Barnes, and Marvin Croy. 2013. Experimental Evaluation of Automatic Hint Generation for a Logic Tutor. International Journal of Artificial Intelligence in Education, 3–17.
32. Shubhendu Trivedi, Zachary A. Pardos, and Neil T. Heffernan. 2011. Clustering students to generate an ensemble to improve standard test score predictions. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 377–384.
33. Giles Westerfield, Antonija Mitrovic, and Mark Billinghurst. 2013. Intelligent Augmented Reality Training for Assembly Tasks. In Artificial Intelligence in Education, H. Chad Lane, Kalina Yacef, Jack Mostow and Philip Pavlik (eds.). Springer Berlin Heidelberg, 542–551.
34. Carl E. Wieman, Wendy K. Adams, and Katherine K. Perkins. 2008. PhET: Simulations That Enhance Learning. Science 322, 5902, 682–683.

