
Analysis of the Student’s Model and the Intervention Mechanism in Prime Climb

Alireza Davoodi¹
¹ Department of Computer Science, University of British Columbia, davoodi@cs.ubc.ca

Abstract. This study investigates how well the student’s model used in Prime Climb, an adaptive educational game, assesses a student’s number factorization skills and supports the delivery of tailored, personalized hints during the student’s interaction with the game. To this end, two measures of effectiveness for the intervention mechanism in Prime Climb, namely hint precision and hint recall, were defined and calculated. In addition, the precision, sensitivity and specificity of the student’s model during game-play were quantified. Moreover, the effects of different prior probability settings on these measures were examined. The results provide insights into the limitations and deficiencies of the student’s model in Prime Climb.

Keywords: Student’s model, Dynamic Bayesian Network, precision, recall, sensitivity, specificity, prior probabilities

1 Introduction

Educational games differ from traditional video games in that they help people gain desired knowledge and skills while playing. The effectiveness of the game aspects of an educational game depends on how well the game keeps the player in affective states known to influence learning, such as motivation. From a pedagogical point of view, an educational game also needs to embed the pedagogical content in the game’s scenario and narrative in such a way that the player’s interactions with the game eventually result in learning gains. While there is promising evidence that educational games keep players in affective states such as motivation and engagement, there is little solid evidence that they help players learn the target knowledge and skills. Adaptive educational games, a sub-domain of educational games, aim to support tailored interactions with the player during game-play and have been proposed as an alternative to the one-size-fits-all approach used in designing non-adaptive educational games. An adaptive educational game uses a student’s model to assess the current level of the student’s relevant knowledge during game-play. Such a game not only needs to represent the educational concepts through the players’ interactions with the game, but also needs to provide the players with precise and timely support. The performance of an educational game in providing personalized support that meets the student’s needs depends heavily on how accurately the student’s model assesses the player’s current level of the desired knowledge or skill. Prime Climb is an adaptive educational game that helps students practice number factorization (decomposing a number into its factors) while participating in a 2-player game.
In this game, the player and her partner pair up numbers that do not share a common factor in order to climb a mountain. The main interaction of a player with Prime Climb consists of moving from one location on a mountain of numbers to another until she reaches the top of the mountain. Each location on a mountain either represents a number or is blocked. The other possible forms of interaction with the game are attending to the given hints and using a tool called the magnifying glass, which shows the factor tree of a number when the student selects the tool and clicks on a number on the mountain. Prime Climb uses a probabilistic student’s model, a Dynamic Bayesian Network (DBN), to track and assess the student’s number factorization knowledge during the interaction. The pedagogical agent embedded in the game provides the student with adaptive hints when, according to the student’s model, the student needs such interventions. As with any adaptive educational game, the success of Prime Climb in helping the student learn number factorization depends on how accurately the student’s model evaluates the level of relevant knowledge and skills and on how well the game provides support to the student.

The objective of this report is threefold. First, we summarize the simulations carried out to improve the accuracy of the student’s model. To this end, a data-driven approach was applied to refine the parameters of the student’s model, which define the conditional probability tables of the nodes in the DBN designed for Prime Climb. Second, we report on the performance of the pedagogical agent in providing adaptive interventions to the students during game-play, using two measures of intervention performance called hint precision and hint recall. Finally, we examine the accuracy of the student’s model in assessing the current level of number factorization knowledge. The rest of this manuscript is organized as follows.
Section 2 briefly summarizes the results of the data-driven refinement of the student’s model parameters. Section 3 discusses the analysis of the intervention mechanism used in Prime Climb. Section 4 describes the effects of different prior probability settings on the hinting mechanism in Prime Climb. Section 5 analyzes the performance of the student’s model in evaluating the level of factorization knowledge during the interaction. Section 6 summarizes the statistical analysis of the effect of different prior probability settings on the student’s model. Section 7 presents preliminary results on the analysis of the pre-test and post-test. Finally, Section 8 outlines future work.

2 Data-Driven Model Refinement

Prime Climb uses a parametric probabilistic student’s model, a Dynamic Bayesian Network (DBN), to track the evolution of the student’s number factorization and common factor skills while the student interacts with the game. Creating a DBN essentially involves three steps:
1. Determining the random variables and their domains.
2. Specifying the connections among the random variables.
3. Parameterizing the model by specifying the conditional probability tables (CPTs) of the random variables and, if available, the prior probabilities.

In Prime Climb, an expert-driven approach was used to define the random variables, their domains and the connections among them. In such an approach, a domain expert determines the variables and their connections based on her own intuition and experience. In contrast, a data-driven mechanism was used to find the optimal parameter setting for the CPTs of the nodes in the network. In such a method, the values of the relevant parameters are calculated from sample training datasets. The student’s model in Prime Climb uses four parameters. These parameters specify how the evidence (a correct or wrong movement) propagates in the student’s model and represent the probability of a student making a correct or wrong movement in a specific situation, given that she knows or does not know the relevant number factorization. The parameters are as follows:
• Guess: The probability that the student makes a correct movement although she does not have the skill required for that movement.
• Edu-Guess: Short for Educational-Guess, the probability that the student makes a correct movement when she has partially acquired the knowledge required for that movement.
• Slip: The probability that the student makes a wrong movement although, according to the student’s model, she knows the corresponding skill.
• Max: Determines how the evidence on a skill propagates to other relevant skills.

In addition, as mentioned earlier, constructing a DBN involves assigning initial probabilities, known as prior probabilities, to the random variables. In Prime Climb, three prior probability settings are considered: 1) Population, 2) User-specific and 3) Generic. We elaborate on these prior probability settings and the model’s parameters in the following subsections.

2.1 Sensitivity of the Model to Parameters
Given the structure (nodes and connections) of the DBN in Prime Climb, a more appropriate set of model parameters allows the model to track and assess the evolution of the desired skills (factorization and common factor) more precisely during game-play and, at the end of the interaction, to yield posterior probabilities for the skills’ corresponding nodes that accurately predict the students’ relevant knowledge after game-play. To find the best set of parameters, a comprehensive range of values between 0 and 1 was selected for each parameter. We then used a Receiver Operating Characteristic (ROC) curve to find the pair of sensitivity and specificity that gives the highest accuracy and the best balance between the two. A ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) as a discrimination threshold varies. As our measure of accuracy, we chose accuracy = (sensitivity + specificity)/2. Sensitivity is the true positive rate, the percentage of known skills that the model classifies as known. Specificity is the true negative rate, the percentage of unknown skills that the model classifies as unknown. A simulator was developed to simulate the interactions of 45 students with Prime Climb. Table 1 shows the optimal values found for the model’s parameters. In the next subsection we report the values of specificity, sensitivity and accuracy for the different prior probability settings.
Table 1: Optimal values of the model’s parameters

Parameter    Value
Guess        0.6
Edu-Guess    0.7
Slip         0.1
Max          0.2
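The accuracy measure defined in Section 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the simulator used in the study: the function name `balanced_accuracy`, the 0.5 decision threshold and the toy posterior values are all hypothetical.

```python
def balanced_accuracy(posteriors, known, threshold=0.5):
    """Return (sensitivity, specificity, accuracy) for one threshold.

    posteriors: model probabilities that each skill is known.
    known: ground-truth booleans (for example, post-test outcomes).
    threshold: assumed cut-off at or above which a skill counts as known.
    """
    tp = fn = tn = fp = 0
    for p, is_known in zip(posteriors, known):
        predicted_known = p >= threshold
        if is_known and predicted_known:
            tp += 1
        elif is_known:
            fn += 1
        elif predicted_known:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # accuracy = (sensitivity + specificity) / 2, as defined in the text
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Hypothetical posteriors and post-test outcomes for six skills.
posteriors = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
known = [True, True, True, False, False, False]
sens, spec, acc = balanced_accuracy(posteriors, known)
```

Sweeping `threshold` over a grid and recording each (1 - specificity, sensitivity) pair would trace out the points of a ROC curve.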

2.2 Sensitivity of the Model to Prior Probabilities of the Factorization Nodes
To investigate the effect of different sets of prior probabilities on the accuracy of the student’s model, three prior probability settings were considered:
1. Population: In the population prior setting, the prior probabilities of number factorization knowledge are determined from the average performance of a group of subjects on the corresponding skills in the pre-test session, which is conducted before the interaction with Prime Climb. The pre-test aims to capture the student’s level of knowledge of number factorization and common factors before the interaction.
2. Generic: In the generic prior setting, the prior probabilities of a student having the number factorization skills are all set to 0.5.
3. User-Specific: In the user-specific prior setting, the probability of a student knowing the factorization of a number is determined by the student’s performance on the relevant skill in the pre-test. A prior probability of 0.9 is set for the node representing the student’s factorization knowledge of a number if the student answered the number’s corresponding question in the pre-test correctly, and a prior probability of 0.1 is set if the answer was wrong.

Figure 1 illustrates the population prior and the generic prior used in Prime Climb. As shown in Figure 1, the population of subjects used for calculating the population prior has fairly high prior factorization knowledge of the numbers appearing on the pre-test.

Figure 1: Average of prior probabilities in different prior settings
[Plot “Prior Probabilities”: y-axis, probabilities from 0 to 1; x-axis, numbers on pre and post tests (9, 15, 25, 14, 33, 31, 36, 27, 30, 49, 11, 97, 89, 42, 81, 88); series: Population, Generic]
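The three prior settings described in Section 2.2 can be written down as a short Python sketch. The values 0.5, 0.9 and 0.1 come from the text; the function names, the dictionary layout and the reading of “average performance” as the fraction of students who answered correctly are assumptions made for illustration.

```python
GENERIC_PRIOR = 0.5        # generic setting (from the text)
USER_KNOWN_PRIOR = 0.9     # user-specific, correct pre-test answer
USER_UNKNOWN_PRIOR = 0.1   # user-specific, wrong pre-test answer


def generic_priors(numbers):
    """Every factorization node starts at 0.5."""
    return {n: GENERIC_PRIOR for n in numbers}


def user_specific_priors(pretest):
    """pretest maps number -> True/False for one student."""
    return {n: USER_KNOWN_PRIOR if correct else USER_UNKNOWN_PRIOR
            for n, correct in pretest.items()}


def population_priors(pretests):
    """pretests is a list of per-student dicts (number -> True/False).

    Assumption: the population prior of a number is the fraction of
    students who answered its pre-test question correctly.
    """
    numbers = pretests[0].keys()
    return {n: sum(1 for t in pretests if t[n]) / len(pretests)
            for n in numbers}


# Hypothetical pre-test results for two students on two numbers.
pretests = [{9: True, 15: False}, {9: True, 15: True}]
pop = population_priors(pretests)            # {9: 1.0, 15: 0.5}
usr = user_specific_priors(pretests[0])      # {9: 0.9, 15: 0.1}
gen = generic_priors([9, 15])                # {9: 0.5, 15: 0.5}
```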

We simulated the interactions of 46 students with Prime Climb using each prior setting with different sets of model parameters, as described in Section 2.1. Table 2 summarizes the results. The maximum accuracy, 0.755 (see Table 2), was obtained using the population-based prior setting.

Table 2: Summary of the simulation results for different prior probabilities settings

Prior Setting    Accuracy    Specificity    Sensitivity
Population       0.755       0.737          0.779
User-specific    0.713       0.648          0.77
Generic          0.684       0.773          0.612

A possible drawback of relying on this result and using the population prior probabilities in future studies with different students is that future subjects might have a lower level of number factorization knowledge than the sample group used to refine the student’s model parameters. The model might then initially overestimate the student’s knowledge and not perform as well as expected. On the other hand, the user-specific prior setting is expected to represent the student’s prior number factorization knowledge more specifically than the other two settings. Yet, according to the results in Table 2, using user-specific prior probabilities resulted in the model with the lowest specificity among the three settings.

3 Hint Precision and Recall

An effective intervention strategy in an adaptive educational game ensures pedagogical effectiveness by providing suitable tailored support when required, without intervening so often that it harms the user’s engagement in the game. The intervention mechanism in Prime Climb provides different types of hints during the student’s interaction with the game. The hinting strategy uses the student’s model’s assessment of the student’s number factorization and common factor knowledge during game-play to provide adaptive support in the form of hints on unknown skills. To decide when to intervene, the hinting strategy uses four thresholds: 1) Fact-CorrectMove, 2) Fact-WrongMove, 3) CF-CorrectMove and 4) CF-WrongMove. The first two thresholds determine the values used to evaluate a number factorization (Fact) skill as known or unknown after a correct and a wrong movement, respectively. Similarly, the last two thresholds are used to assess the common factor (CF) skill as known or unknown immediately after a correct or wrong movement. A human-adjusted approach was applied to find an original setting for these four thresholds. After some initial values had been chosen for each threshold, some graduate students played the game, and their reports on the timing of the hints were used to adjust the initial values. Table 3 shows the final values selected for each threshold.
Table 3: The thresholds used in the hinting algorithm in Prime Climb

Threshold           Final value
Fact-CorrectMove    0.5
Fact-WrongMove      0.8
CF-CorrectMove      0.1
CF-WrongMove        0.5

Algorithm 1 shows how these thresholds are used in the intervention mechanism in Prime Climb to decide when and on what skill to provide hints.

Algorithm 1: The hinting strategy in Prime Climb
    // Initializing variables
    if (Player made a correct move) {
        fact_unknown = (playerBelief < fact_correctMoveHintThreshold ||
                        partnerBelief < fact_correctMoveHintThreshold);
        cf_unknown = (cfBelief < cf_correctMoveHintThreshold);
    } else { // Player made a wrong move
        fact_unknown = (playerBelief < fact_wrongMoveHintThreshold ||
                        partnerBelief < fact_wrongMoveHintThreshold);
        cf_unknown = (cfBelief < cf_wrongMoveHintThreshold);
    }
    // When and what skill to hint on
    if (cf_unknown && (!fact_unknown)) {
        Hint on Common Factor Skill
    } else if (fact_unknown && (!cf_unknown)) {
        Hint on Factorization Skill
    } else if (cf_unknown && fact_unknown) {
        Hint on Common Factor and Factorization alternately
    }
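For readers who prefer executable code, Algorithm 1 can be transcribed into Python roughly as follows. The threshold values are those of Table 3; the function name `choose_hint`, the string return values and the `last_hint` argument used to alternate between the two skills are assumptions, since the report does not say how the alternation is tracked.

```python
# Threshold values taken from Table 3.
FACT_CORRECT_MOVE = 0.5
FACT_WRONG_MOVE = 0.8
CF_CORRECT_MOVE = 0.1
CF_WRONG_MOVE = 0.5


def choose_hint(correct_move, player_belief, partner_belief, cf_belief,
                last_hint=None):
    """Return 'factorization', 'common_factor' or None (no hint)."""
    if correct_move:
        fact_threshold, cf_threshold = FACT_CORRECT_MOVE, CF_CORRECT_MOVE
    else:
        fact_threshold, cf_threshold = FACT_WRONG_MOVE, CF_WRONG_MOVE

    # A factorization skill is unknown if either number falls below threshold.
    fact_unknown = (player_belief < fact_threshold
                    or partner_belief < fact_threshold)
    cf_unknown = cf_belief < cf_threshold

    if cf_unknown and not fact_unknown:
        return "common_factor"
    if fact_unknown and not cf_unknown:
        return "factorization"
    if cf_unknown and fact_unknown:
        # Assumed alternation rule: switch away from the previous hint type.
        if last_hint == "common_factor":
            return "factorization"
        return "common_factor"
    return None
```

For example, after a wrong move with beliefs 0.6 (player's number), 0.9 (partner's number) and 0.6 (common factor), the sketch returns a factorization hint, because 0.6 is below the Fact-WrongMove threshold of 0.8.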

From a pedagogical perspective, it is essential to provide the student with “correct” support when she needs it. A “correct” support is given on the correct skill when required and is presented with helpful context in a way that encourages the student to attend to it. Because the intervention mechanism in Prime Climb uses a real-time assessment of the student’s knowledge to determine when and on what to provide help, the effectiveness of the mechanism depends on how accurately the student’s model tracks and assesses the evolution of the desired skills. To investigate how well the hinting strategy and the student’s model provide tailored support to the student during the interaction, two measures of performance are defined: 1) Hint Precision and 2) Hint Recall. In general, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Analogously, hint precision is defined as the fraction of given hints that are justified, and hint recall is defined as the fraction of required hints (justified plus missed) that are actually given to the student.

An intervention is called justified if it is given at the correct time and on the right skill. In contrast, an unjustified intervention is presented to the student when it is not required or expected by the student. If the intervention strategy fails to provide a justified intervention, a justified intervention is said to have been missed. Finally, when no intervention is given and none is required, the intervention mechanism has “correctly not given” the hint. Given this terminology, hint precision and hint recall are defined by the following equations.
Equation 1: Hint precision

\[
\text{Hint Precision} = \frac{\text{Number of justified hints}}{\text{Number of justified hints} + \text{Number of unjustified hints}}
\]

Equation 2: Hint recall

\[
\text{Hint Recall} = \frac{\text{Number of justified hints}}{\text{Number of justified hints} + \text{Number of missed hints}}
\]
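Equations 1 and 2 translate directly into code. The sketch below is illustrative: the function names are hypothetical, and returning 0.0 when the denominator is zero is an assumed convention. The example counts are the justified, unjustified and missed hints reported for the population prior in Table 5.

```python
def hint_precision(justified, unjustified):
    """Equation 1: justified / (justified + unjustified)."""
    given = justified + unjustified
    return justified / given if given else 0.0  # assumed convention for 0/0


def hint_recall(justified, missed):
    """Equation 2: justified / (justified + missed)."""
    required = justified + missed
    return justified / required if required else 0.0  # assumed convention


# Counts from Table 5 (population prior): J = 122, UJ = 108, M = 343.
precision = hint_precision(122, 108)
recall = hint_recall(122, 343)
print(f"{precision:.2f} {recall:.2f}")  # prints "0.53 0.26"
```

These are the values of 0.53 and 0.26 quoted for the population prior in Section 3.1.1.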

3.1 Simulation of the intervention mechanism using the original threshold setting
To calculate the hint precision and hint recall in Prime Climb, data from the interactions of 45 students in grades 5 and 6 with Prime Climb were used to simulate the hinting strategy with the original threshold settings (see Table 3). To this end, we initialized the student’s model with each of the prior probability settings and used the optimal model parameter setting (see Table 1). Since there is no ground truth on how the student’s number factorization and common factor knowledge evolve during the interaction with Prime Climb, in calculating the hint precision and hint recall we only considered the movements in which the player’s number, the partner’s number, or both keep the same score from pre-test to post-test. Each movement made by the student involves two numbers: 1) the player’s number and 2) the partner’s number. The player’s number is the number to which the player has just moved, while the partner’s number is the number on the mountain to which the partner had previously moved. Every number the students moved to during game-play was assigned a label based on the student’s performance on that specific number in the pre-test and post-test. We used 5 labels to represent the status of the numbers from pre-test to post-test:
1. KK: Stands for Known-Known and shows that the number was known to the student in both the pre-test and the post-test (the student answered the number’s corresponding question correctly in both tests).
2. UU: Stands for Unknown-Unknown and shows that the number was unknown to the student in both the pre-test and the post-test.
3. KU: Stands for Known-Unknown and shows that the student answered the number’s corresponding question correctly in the pre-test and wrongly in the post-test.
4. UK: Stands for Unknown-Known and shows that the student gave a wrong answer to the number’s corresponding question in the pre-test and a correct answer in the post-test.
5. NAP: The number does not appear on the tests.

Given the above terminology, the types of hints are defined based on the status of the numbers on which the hints are given:
• Justified hint: A hint given on a number with status UU.
• Unjustified hint: A hint given on a number with status KK.
• Missed hint: The hinting mechanism fails to provide a hint on a number with status UU.
• CorrectlyNotGiven hint: The hinting mechanism correctly refrains from providing a hint on a number with status KK.
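The labelling scheme and the four hint types can be summarized in a small Python sketch. The function names and the use of `None` for a number that does not appear on the tests are assumptions made for illustration; the status codes and hint types follow the definitions above.

```python
def number_status(pre_correct, post_correct):
    """Return KK, UU, KU, UK or NAP for one number.

    pre_correct / post_correct: True or False for the test answers,
    or None when the number does not appear on the tests (assumed encoding).
    """
    if pre_correct is None or post_correct is None:
        return "NAP"
    return ("K" if pre_correct else "U") + ("K" if post_correct else "U")


def hint_type(status, hint_given):
    """Classify one (status, hint) pair; only UU and KK can be judged."""
    if status == "UU":
        return "justified" if hint_given else "missed"
    if status == "KK":
        return "unjustified" if hint_given else "correctly_not_given"
    return None  # KU, UK and NAP are excluded from the analysis


# A number answered wrongly on both tests that received a hint.
example = hint_type(number_status(False, False), hint_given=True)
```

In this example `example` is `"justified"`, since the number has status UU and a hint was given on it.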

In calculating hint precision and hint recall, it was assumed that a student should receive a hint following a movement that involves at least one number with status UU and should never receive a hint on a number with status KK. For each set of prior probabilities, the total numbers of the different types of hints were calculated and the confusion matrix was constructed. Table 4 shows the structure of the confusion matrix calculated for the intervention mechanism in Prime Climb. For instance, in this confusion matrix, an unjustified hint is a hint given on a number that is known to the student according to the student’s pre-test and post-test scores but unknown according to the student’s model.
Table 4: Structure of the confusion matrix for the intervention mechanism

                 Model assessment of student knowledge
Pre-Post Test    Unknown                  Known
Known            Unjustified hint (UJ)    Correctly Not Given (CN)
Unknown          Justified hint (J)       Missed hint (M)

3.1.1 Simulation of the Intervention Mechanism Using Population Prior Settings

As previously discussed, three types of prior probabilities settings are used in Prime Climb to initialize the student’s model. Table 5 represents the confusion matrix for the hinting mechanism in Prime Climb when the population prior setting was used. The result was based on using the original thresholds for the hinting strategy (see Table 3) and optimal model’s parameters (see Table 1).
Table 5: Confusion matrix (number of raw data points and [percentages]) when the population prior is used

                 Model assessment of student knowledge (Population-based Prior)
Pre-Post Test    Unknown             Known               Total
Known            108 [12.3%] (UJ)    306 [34.8%] (CN)    414 [47.1%]
Unknown          122 [13.9%] (J)     343 [39.0%] (M)     465 [52.9%]
Total            230 [26.2%]         649 [73.8%]         879 [100%]

Given Equations 1 and 2, the hint precision and hint recall are 0.53 and 0.26, respectively. Both values are low, which means that initializing the student’s model with the population prior probabilities and using the model as the basis for providing tailored support could result in many unjustified interventions (almost 47% of all interventions), which could discourage the student from benefiting from the provided support. It could also result in many missed hints (about 74% of the time the model fails to provide a justified hint), which could negatively affect the students’ learning gains. To find out which situations during game-play contribute most to lowering the hint precision and hint recall, we investigated further. All the movements made by the students were extracted from the log files, and each movement was assigned a label consisting of the status of the player’s number in the pre-test and post-test followed by the status of the partner’s number in the pre-test and post-test. A number’s status has the format XY, where X indicates whether, based on the pre-test result, the student knows (K) or does not know (U) the factorization of the number, and Y indicates the same based on the post-test result. If a number does not appear in the pre-test and post-test, it is assigned the status NAP. For instance, in the status UK-NAP, UK is the status of the player’s number and shows that the factorization of the number was unknown to the student in the pre-test and known in the post-test, while NAP is the status of the partner’s number and means that the number does not appear in the pre-test and post-test. Then all the movements that could receive unjustified or justified hints were extracted. Figure 2 illustrates the frequencies of the movements relevant to the hints. As depicted, 3.95% of the time the model underestimates the students’ known number factorization knowledge. In contrast, 64% of the time when a justified hint was required, no hint was given to the students, an indication of a high rate of overestimation of unknown number factorization skills. In addition, “at least” 22.8% of the time (a lower bound, since we could not judge hints given on numbers with status NAP) the model succeeded in providing a justified hint to the student.

Figure 2: Frequency (raw# and percentages) proportion of each hint types to its relevant possible movements for the population prior

Next, all the movements that could receive unjustified hints were extracted. There are 9 types of movements on which unjustified hints might be given; Figure 3 shows the labels of these 9 movement types.

Figure 3: Frequency (raw# and percentages) of the unjustified hints for each movement type

Figure 4: Frequency (raw# and percentages) of the missed hints for each movement type

Figures 3, 4 and 5 give a more detailed analysis of all the movements relevant to the hints. Figure 3 shows all the movement statuses that are relevant to unjustified hints. Next to each movement status, the raw number and percentage of unjustified hints given on that movement type are shown. For instance, 40 unjustified hints were given on movements with status KK-KK, which amounts to 11.5% of all movements with status KK-KK. Figures 4 and 5 give similar information for the missed and justified hints. As shown in Figure 4, in at least 50% of all the relevant movements the model failed to provide a hint, so a justified hint was missed. In addition, Figure 5 shows the low rate of justified hints given for each relevant movement status.

Figure 5: Frequency (raw# and percentages) of the justified hints for each movement type

We can conclude from Figures 2-5 that the hinting strategy succeeds in not giving many unjustified hints on the numbers for which the student’s model has population prior knowledge, although, as mentioned before, almost 47% of the given hints are unjustified. The hinting strategy also has trouble giving justified hints: there are too many missed hints, meaning that the student’s model overestimates the student’s factorization knowledge on numbers with status UU. This deficiency could hinder the learning gains that tailored help during the interaction with Prime Climb is meant to produce. The next subsection discusses the effect of initializing the model with the generic prior probabilities on hint precision and hint recall.

3.1.2 Simulation of the Intervention Mechanism Using Generic Prior Setting

To further investigate the effect of the prior probability settings on the hint precision and hint recall, a similar process was carried out on the model initialized with the generic prior probabilities. In the generic prior setting, the prior probabilities of all numbers on the mountains are set to 0.5 regardless of how the student scored on that specific number in the pre-test. The confusion matrix of the intervention mechanism based on the generic prior is shown in Table 6. As calculated using Equations 1 and 2 from the counts in Table 6, the hint precision and hint recall are 0.308 and 0.363, respectively, when the generic prior setting is used. Figure 6 shows the frequencies of all the relevant movements as well as the frequencies and percentages of the corresponding hints. The results show an increase in the frequency of unjustified hints and a decrease in the frequency of missed hints. A detailed statistical analysis and comparison is given in Section 4. The results for the generic prior probabilities suggest that lowering the prior probabilities could result in a higher rate of underestimation of known skills and a lower rate of overestimation of unknown skills.

Table 6: Confusion matrix when the generic prior is used

                 Model assessment of student knowledge (Generic-based Prior)
Pre-Post Test    Unknown             Known               Total
Known            379 [34.4%] (UJ)    257 [23.3%] (CN)    636 [57.7%]
Unknown          169 [15.3%] (J)     297 [27.0%] (M)     466 [42.3%]
Total            548 [49.7%]         554 [50.3%]         1102 [100%]

Figure 6: Frequency (raw# and percentage) proportion of each hint types to its relevant possible movements for the generic prior probabilities

Figure 7: Frequency (raw# and percentages) of the unjustified hints for each movement type

Figures 7, 8 and 9 illustrate all the movement statuses relevant to the unjustified, missed and justified hints, respectively. Figure 7 shows a low rate of unjustified hints on the movements, although almost 70% of all the given hints are unjustified (see the confusion matrix in Table 6). Furthermore, as shown in Figure 8, the student’s model failed to provide a justified hint in at least 30% of the movements of each relevant status. Figure 9 shows all relevant statuses, the raw frequency of each movement type, and the raw frequency and percentage of justified hints given on each status. We conducted a similar study using the user-specific prior setting, as discussed in the next subsection.

Figure 8: Frequency (raw# and percentages) of the missed hints for each movement type

Figure 9: Frequency (raw# and percentages) of the justified hints for each movement type

3.1.3 Simulation of the Intervention Mechanism Using User-specific Prior Settings

In the user-specific prior setting, the prior probabilities of the numbers appearing in the pre-test and post-test are determined by the student’s performance on each number’s corresponding question in the pre-test. In other words, if the student answered a number’s corresponding question in the pre-test correctly, the prior probability of the number is set to 0.9, and to 0.1 otherwise. Clearly, the prior probability of a known number in the user-specific prior setting is higher than the same number’s prior probability in the generic and population prior settings. To investigate the effect of initializing the student’s model with the user-specific prior probabilities, we conducted a simulation similar to those described in the two previous subsections. Table 7 shows the confusion matrix of the intervention mechanism when the user-specific prior setting is used.
Table 7: Confusion matrix when the user-specific prior is used

                 Model assessment of student knowledge (User-Specific-based Prior)
Pre-Post Test    Unknown             Known               Total
Known            79 [8.7%] (UJ)      315 [34.8%] (CN)    394 [43.5%]
Unknown          468 [51.7%] (J)     44 [4.8%] (M)       512 [56.5%]
Total            547 [60.4%]         359 [39.6%]         906 [100%]

When the user-specific prior is used, the hint precision and hint recall are 0.856 and 0.91, respectively. These results show a considerable improvement in hint precision and hint recall over the results obtained with the population and generic priors. Figure 10 shows the raw frequencies of all the movements relevant to the hints as well as the raw frequencies and percentages of the hints. As shown in Figure 10, the student’s model initialized with the user-specific prior probabilities succeeded in providing a justified hint on 87.3% of the relevant movements. The rates of unjustified and missed hints are also low.

Figure 10: Frequency (raw# and percentage) proportion of each hint types to its relevant possible movements when the user-specific prior probabilities are used

Figures 11, 12 and 13 show, respectively for the unjustified, missed and justified hints, all the relevant statuses, the frequencies of each status’s corresponding movements, and the frequencies and percentages of the hints. Figure 11 shows that the highest rate of unjustified hints occurs on movements with status KK-KK. This could indicate that the students made many wrong movements involving numbers with status KK. It could also indicate that the slip parameter is not well adjusted (see Section 2). The highest rate of missed hints occurs on movements with status UU-KK. Moreover, Figure 13 shows a high rate of justified hints on each relevant movement type. The next section presents a statistical comparison of the results.

Figure 11: Frequency (raw# and percentages) of the unjustified hints for each movement type

Figure 12: Frequency (raw# and percentages) of the missed hints for each movement type

Figure 13: Frequency (raw# and percentages) of the justified hints for each movement type

4 Comparison and Results of Hint Precision and Hint Recall

Table 8 summarizes the results of simulating the intervention mechanism using the three different prior settings. In the simulation, the players made 8666 movements in total. The intervention mechanism used in Prime Climb provides support on two skills: 1) the number factorization skill, i.e. the knowledge of factorizing a number into its factors, and 2) the common factor skill, i.e. the concept of two numbers having at least one factor in common. The results show that, on average, more than one hint is given for every three movements made by the player during game-play.
Table 8: General statistics on the total number and [percentage] of hints using different prior probability settings

Prior setting     Number of hints    Factorization hints    Common Factor hints
Population        3344 [38.6%]       3256                   88
User-Specific     3807 [43.9%]       3721                   86
Generic           3561 [41.1%]       3510                   51
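
The percentages in Table 8 follow directly from the hint counts and the 8666 movements in the simulation. A short check, with illustrative variable names, also confirms that each setting gives more than one hint per three movements:

```python
TOTAL_MOVES = 8666  # total movements in the simulation (from the text)

# (factorization hints, common factor hints) per prior setting, from Table 8.
hints = {
    "population": (3256, 88),
    "user-specific": (3721, 86),
    "generic": (3510, 51),
}

rates = {}
for setting, (factorization, common_factor) in hints.items():
    total = factorization + common_factor
    rates[setting] = total / TOTAL_MOVES
    print(f"{setting}: {total} hints, {rates[setting]:.1%} of movements")
```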

In the following subsections, the effects of initializing the student’s model with the three prior probability settings are compared with respect to the total number of given hints, the total number of justified hints, the total number of unjustified hints, the total number of missed hints and the total number of correctly not-given hints. In all the comparisons, we first conducted a test of homogeneity of variances; whenever the assumption of homogeneity of variance was violated, the Welch test followed by the Games-Howell post-hoc test was applied instead of the traditional single factor ANOVA.
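
For readers unfamiliar with the Welch test, the sketch below computes Welch's F statistic and its degrees of freedom using only the Python standard library. It is a generic textbook formulation, not the analysis script used in this study; the function name `welch_anova` and the sample data are hypothetical. The p-value would then be read from an F distribution with (df1, df2) degrees of freedom, for example with `scipy.stats.f.sf` if SciPy is available.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    groups: list of lists of numbers, one list per prior setting.
    Each group needs at least two values and a non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Precision weights n_i / s_i^2, based on the sample variance.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    w_sum = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / w_sum

    between = sum(w * (m - grand_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / w_sum) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Hypothetical per-student counts for three prior settings.
f_stat, df1, df2 = welch_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(round(f_stat, 4), df1, round(df2, 2))
```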

4.1 Total Number of Given Hints
We found no statistically significant difference in the total number of hints given among the different groups of prior settings (F(2,132)=1.32, p=0.270203>0.05). On average, each student made 193 movements (std.: 53) during interaction with Prime Climb. Table 9 presents the mean and standard deviation of the total number of given hints for the different prior settings. Figure 14 illustrates the average number of hints given to each player during the interaction with Prime Climb under the different prior settings. Figure 15 compares the total hints given to each student under the different prior settings.

Figure 14: Average number of total hints given to each player

Figure 15: Total number of hints given to each player (student)

Table 9: Mean and Standard Deviation of the total number (# raw data point) of given hints

Prior setting          Mean     Standard Deviation
Population Prior       74.31    27.5
User Specific Prior    84.6     32.71
Generic Prior          79.13    29.66

4.2 Number of Given Justified Hints
Using the Welch test we found a statistically significant difference on the total number of given justified hints, among the different groups of prior settings (p<0.05). Table 10 represents the means, standard deviation and total number of justified hints for each prior probabilities setting. Also, Table 11 represents the results of the Games-Howell post-hoc test. (“*” indicates the significant difference)
Table 10: Descriptive statistics on total number of justified hints

Prior setting          Mean    Standard Deviation    Total number of given justified hints
Population Prior       1.83    2.65                  55
User Specific Prior    13.8    10.841                414
Generic Prior          3.67    5.80                  110

Table 11: Games-Howell Post-hoc test result (Dependent variable: total number of justified hints)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .649
User-specific          Population                        .000              *
User-specific          Generic                           .002              *
Generic                Population                        .649
Generic                User-specific                     .002              *

The results showed no significant difference between the population and the generic prior probabilities settings on the total number of justified hints. In contrast, there is a statistically significant difference between the user-specific and population settings as well as between the user-specific and generic settings with respect to the total number of justified hints. Figures 16 and 17 respectively illustrate the average number of given justified hints and the total number of justified hints given to each student.
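
The pairwise statistic behind a Games-Howell comparison can be sketched as follows. This is the standard formulation (an unequal-variance t statistic expressed on the studentized-range scale, with Welch-Satterthwaite degrees of freedom), not code from this study; the function name and sample data are hypothetical. Turning q into a p-value additionally requires the studentized range distribution for the number of groups being compared, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def games_howell_pair(a, b):
    """Return (mean difference, q statistic, degrees of freedom) for one pair."""
    # Per-group squared standard errors s^2 / n.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    diff = mean(a) - mean(b)
    # Studentized-range form of the unequal-variance t statistic.
    q = abs(diff) / sqrt(va + vb) * sqrt(2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, q, df


# Hypothetical per-student counts for two prior settings.
diff, q, df = games_howell_pair([1, 2, 3], [3, 4, 5])
print(diff, round(q, 4), round(df, 2))
```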

4.3 Number of Given Unjustified Hints
The Welch test showed that there was a statistically significant difference on the total number of given unjustified hints among the different groups of prior settings (p<0.05). Table 12 shows the descriptive statistics on the total number of unjustified hints. Table 13 presents the results of the Games-Howell post-hoc test.

Figure 16: Average number of given justified hints

Figure 17: Total number of given justified hints to each student

Table 12: Descriptive statistics on total number of unjustified hints

Prior setting          Mean    Standard Deviation    Total number of given unjustified hints
Population Prior       2.64    3.34                  116
User Specific Prior    1.73    2.63                  76
Generic Prior          7.93    7.41                  349

Table 13: Games-Howell Test (Dependent variable: total number of unjustified hints)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .546
Population             Generic                           .002              *
User-specific          Population                        .546
User-specific          Generic                           .000              *
Generic                Population                        .002              *
Generic                User-specific                     .000              *

The results showed a significant difference between the generic prior probabilities setting and both the population and user-specific prior probabilities settings on the total number of unjustified hints. There is no statistically significant difference between the user-specific and population prior probabilities settings on the total number of unjustified hints. Figures 18 and 19 respectively illustrate the average number of given unjustified hints and the total number of unjustified hints given to each student.

Figure 18: Average number of given unjustified hints

4.4 Number of Missed Hints
The Welch test showed that there was a statistically significant difference on the total number of missed hints, among different groups of prior settings (p<0.05). Table 14 shows the descriptive statistics on the total number of missed hints. Table 15 represents the results of the Games-Howell post-hoc test. The results showed no significant difference on the total number of missed hints between the generic and population prior probabilities settings while there existed a significant difference between the user-specific prior probabilities setting and the population and generic prior probabilities settings on the total number of missed hints.

Figure 19: Total number of unjustified hints

Table 14: Descriptive statistics on total number of missed hints

Prior setting          Mean     Standard Deviation    Total number of missed hints
Population Prior       10.47    8.50                  314
User Specific Prior    1.37     2.73                  41
Generic Prior          9.2      7.48                  276

Table 15: Games-Howell test results (Dependent variable: total number of missed hints)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .770
User-specific          Population                        .000              *
User-specific          Generic                           .000              *
Generic                Population                        .770
Generic                User-specific                     .000              *

Figures 20 and 21 respectively illustrate the average number of missed hints and the total number of missed hints for each student under the different prior probabilities settings.

Figure 20: Average number of missed hints

Figure 21: Total number of missed hints for each student

4.5 Number of Correctly Not-Given Hints
No significant difference between the total number of correctly not-given hints was found using a single factor ANOVA test (F(2,129)= 0.034 , p>0.05). Table 16 shows the descriptive statistics on the total number of correctly not-given hints. Figures 22 and 23 respectively illustrate the average number of correctly not-given hints and total number of correctly not-given hints for each student in different prior probabilities settings.

Table 16: Descriptive statistics on total number of correctly not-given hints

Prior setting          Mean    Standard Deviation    Total number of correctly not-given hints
Population Prior       6.95    8.24                  306
User Specific Prior    7.16    8.30                  315
Generic Prior          5.84    7.483                 257

Figure 22: Average number of correctly not given hints

4.6 Hint Precision
The Welch test showed that there was a statistically significant difference on the hint precision, among different groups of prior settings (p<0.05). Table 17 represents the results of the Games-Howell test and Table 18 shows the descriptive statistics on the hint precision. The results showed no significant difference between the population and the generic probabilities settings on hint precision. On the contrary there was a statistically significant difference between the user-specific prior setting and the population and the generic prior probabilities settings.
Table 17: Games-Howell post-hoc test result (Dependent variable: hint precision)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .001              *
Population             Generic                           .682
User-specific          Population                        .001              *
User-specific          Generic                           .000              *
Generic                Population                        .682
Generic                User-specific                     .000              *

Table 18: Descriptive statistics on the hint precision

Prior setting          Mean      Standard Deviation
Population Prior       50.79%    43.07
User Specific Prior    85.2%     23.59
Generic Prior          41.7%     40.98

Figures 23 and 24 respectively illustrate the average hint precision and the hint precision of each student.

Figure 23: Average hint precision for the different prior settings

Figure 24: Total hint precision of each student for the different prior settings

4.7 Hint Recall
The Welch test showed that there was a statistically significant difference on the hint recall among the different groups of prior settings (p<0.05). Table 19 shows the descriptive statistics on the hint recall and Table 20 gives the results of the Games-Howell post-hoc test. The results showed no statistically significant difference between the population and the generic prior probabilities settings, while there was a statistically significant difference between the user-specific prior probabilities setting and both the population and the generic prior probabilities settings.
Table 19: Descriptive statistics on the hint recall

Prior setting          Mean      Standard Deviation
Population Prior       21.27%    25.31
User Specific Prior    93.96%    12.01
Generic Prior          26.44%    29.33

Table 20: Games-Howell test results (Dependent variable: hint recall)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .746
User-specific          Population                        .000              *
User-specific          Generic                           .000              *
Generic                Population                        .746
Generic                User-specific                     .000              *

Figures 25 and 26 respectively illustrate the average hint recall and the hint recall of each student.

Figure 25: Average hint recall for the different prior settings

Figure 26: Total hint recall of each student for the different prior settings

4.8 Thresholds Refinement in the Hinting Mechanism
As discussed in the previous section, an expert-based approach was used to find the optimal thresholds used in the intervention mechanism. Alternatively, a data-driven approach can be utilized to determine threshold values that possibly result in higher hint precision and hint recall. Similar to the student’s model parameter refinement discussed in Section 2, a set of values for the Fact-CorrectMove and Fact-WrongMove thresholds was examined, and the hint precision and hint recall were calculated for each value. We defined another measure of performance, called accuracy = (hint precision + hint sensitivity)/2, where hint sensitivity is the hint recall. Figures 27, 29 and 31 illustrate how the hint precision and hint recall change as the value of the Fact-WrongMove threshold varies while the Fact-CorrectMove threshold holds its original value (i.e. 0.5), for the population, generic and user-specific prior probabilities settings, respectively. Subsequently, Figures 28, 30 and 32 plot the changes in hint precision, hint recall and accuracy with respect to different values of the Fact-CorrectMove threshold while the Fact-WrongMove threshold holds its optimal value, i.e. the value resulting in the highest hint precision and hint recall. The thresholds which resulted in the highest hint precision and hint recall are marked in the figures. Tables 21 and 22 summarize the optimal thresholds and the total number, average and standard deviation of the given hints for all prior probabilities settings.
Table 21: Summary of hinting strategy’s thresholds refinement

Prior setting    CF-Correct Move    CF-Wrong Move    Hint Precision    Hint Recall
Population       0.72               0.8              55.2%             56.2%
Generic          0.88               0.76             40.6%             94.2%
User-specific    0.68               0.44             92.8%             95.1%

Table 22: Descriptive statistics of the hinting strategy’s thresholds refinement

Prior setting    Total number of given hints    Average number of given hints    Std. number of given hints
Population       6703                           148                              34
Generic          8024                           178                              45
User-specific    6556                           145                              34
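
The two-stage threshold refinement described in this subsection can be sketched as follows. The sketch rests on several assumptions that are not taken from the Prime Climb code: the record layout, the function names `hint_metrics` and `refine`, the rule that a hint is given when the model's probability falls below the relevant threshold, and the use of accuracy as the selection criterion in both stages. The records are hypothetical and serve only to make the example runnable.

```python
def hint_metrics(records, t_correct, t_wrong):
    """Return (precision, recall, accuracy) of hinting for one threshold pair.

    records: iterable of (prob_known, move_correct, truly_unknown) tuples.
    """
    justified = unjustified = missed = 0
    for prob_known, move_correct, truly_unknown in records:
        threshold = t_correct if move_correct else t_wrong
        hint = prob_known < threshold  # assumed hinting rule
        if hint and truly_unknown:
            justified += 1
        elif hint:
            unjustified += 1
        elif truly_unknown:
            missed += 1
    given, needed = justified + unjustified, justified + missed
    precision = justified / given if given else 0.0
    recall = justified / needed if needed else 0.0
    return precision, recall, (precision + recall) / 2


def refine(records, grid, t_correct_start=0.5):
    """Sweep the wrong-move cut-off first, then the correct-move cut-off."""
    best_wrong = max(
        grid, key=lambda t: hint_metrics(records, t_correct_start, t)[2])
    best_correct = max(
        grid, key=lambda t: hint_metrics(records, t, best_wrong)[2])
    return best_correct, best_wrong


# Hypothetical records, for illustration only.
records = [
    (0.30, False, True), (0.70, False, True), (0.90, False, False),
    (0.85, False, False), (0.60, True, False), (0.40, True, True),
]
grid = [i / 10 for i in range(1, 10)]  # candidate thresholds 0.1 ... 0.9
best_correct, best_wrong = refine(records, grid)
print(best_correct, best_wrong, hint_metrics(records, best_correct, best_wrong))
```

A joint grid search over both thresholds would be a natural alternative; the sequential sweep is shown because it mirrors the order in which the figures explore the two thresholds.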

Figure 27: FACT-WrongMove threshold refinement for the population prior probabilities

Figure 28: FACT-CorrectMove threshold refinement for the population prior probabilities

Figure 29: FACT-WrongMove threshold refinement for the generic prior probabilities

Figure 30: FACT-CorrectMove threshold refinement for the generic prior probabilities

Figure 31: FACT-WrongMove threshold refinement for the user-specific prior probabilities

Figure 32: FACT-CorrectMove threshold refinement for the user-specific prior probabilities

5 Model Precision and Sensitivity

In Sections 3 and 4, two measures of the effectiveness of the intervention (hinting) mechanism in Prime Climb, hint precision and hint recall, were calculated. In contrast, the main objective of the current section is to quantify the ability of the student’s model to detect the level of number factorization skills in the player during the interaction with Prime Climb. As in the calculation of the hint precision and hint recall, since there is no ground truth on how the number factorization knowledge evolves during the interaction of the student with the game from the pre-test to the post-test, only numbers with the same score in the pre-test and post-test were considered. To this end, four measures were defined: 1) model positive precision, 2) model negative precision, 3) model sensitivity and 4) model specificity. Before formulating these measures, some terminology needs to be defined. In the following definitions, “a known/unknown factorization skill” refers to a factorization skill on which the student keeps the same score from the pre-test to the post-test, having correctly/wrongly answered the skill’s corresponding question in both the pre-test and the post-test.

• True-Positive: The student’s model correctly assesses a known factorization skill as known to the student during the game-play.
• False-Positive: The student’s model fails to assess an unknown factorization skill as unknown to the student during the game-play.
• True-Negative: The student’s model correctly assesses an unknown factorization skill as unknown to the student during the game-play.
• False-Negative: The student’s model fails to assess a known factorization skill as known to the student during the game-play.

Given the above definitions, model positive precision, model negative precision, model sensitivity and model specificity are formulated as follows:

Equation 3: Model Positive Precision

$$\text{model positive precision} = \frac{\#\text{ of True Positive}}{\#\text{ of True Positive} + \#\text{ of False Positive}}$$

Equation 4: Model Negative Precision

$$\text{model negative precision} = \frac{\#\text{ of True Negative}}{\#\text{ of True Negative} + \#\text{ of False Negative}}$$

Equation 5: Model Sensitivity

$$\text{model sensitivity} = \frac{\#\text{ of True Positive}}{\#\text{ of True Positive} + \#\text{ of False Negative}}$$

Equation 6: Model Specificity

$$\text{model specificity} = \frac{\#\text{ of True Negative}}{\#\text{ of True Negative} + \#\text{ of False Positive}}$$
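
Equations 3-6 translate directly into code. The sketch below uses a hypothetical function name, `model_measures`, and hypothetical counts chosen only to make the arithmetic easy to follow.

```python
def model_measures(tp, fp, tn, fn):
    """Compute the four model measures of Equations 3-6 from raw counts."""
    return {
        "positive_precision": tp / (tp + fp),  # Equation 3
        "negative_precision": tn / (tn + fn),  # Equation 4
        "sensitivity": tp / (tp + fn),         # Equation 5
        "specificity": tn / (tn + fp),         # Equation 6
    }


# Hypothetical counts, for illustration only.
measures = model_measures(tp=8, fp=2, tn=6, fn=4)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```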

5.1 Simulation of Interactions of the Users with Prime Climb
Log files of the interactions of 45 students in grades 5 and 6 with Prime Climb were parsed to simulate the movements the students made during the game-play. In total, 8666 movements were extracted from the log files. A post-processing filter was then applied to exclude the movements in which neither the player’s number nor the partner’s number keeps the same score from the pre-test to the post-test, leaving 3203 movements with at least one number having the same status in the pre-test and post-test. The movements were classified into 16 possible groups on the basis of the status of the player’s number and the partner’s number in the pre-test and post-test. Figure 33 shows the percentage frequency distribution of the statuses. Overall, there are 3083 (84.6%) and 559 (15.4%) data points which represent numbers with status KK and UU, respectively.
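
The filtering and grouping step above can be sketched as follows. The status codes KU and UK (score changed between the tests) are assumed by analogy with the KK and UU codes used in the text, and the function name `is_relevant` is hypothetical.

```python
from itertools import product

# Pre-test/post-test status of a number: K = known, U = unknown.
STATUSES = ("KK", "KU", "UK", "UU")
STABLE = {"KK", "UU"}  # same score in the pre-test and the post-test

# All 16 (player status, partner status) groups.
groups = list(product(STATUSES, repeat=2))


def is_relevant(player_status, partner_status):
    """Keep a movement if at least one number kept its pre-test score."""
    return player_status in STABLE or partner_status in STABLE


relevant_groups = [g for g in groups if is_relevant(*g)]
print(len(groups), len(relevant_groups))  # 16 12
```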

Figure 33: Percentage frequency of each movement types in total movements made by the players

The objective of the simulation was to evaluate how accurately the model could assess the level of factorization knowledge for the numbers with status KK and UU after each movement. To this end, we used two thresholds, namely FACT-CorrectMove and FACT-WrongMove. The former is the cut-off for evaluating a number factorization skill as known (above the threshold) or unknown (below the threshold) after a correct movement, and the latter is the cut-off for evaluating a number factorization skill as known (above the threshold) or unknown (below the threshold) after a wrong movement. The initial values used for these two thresholds were 0.5 (for FACT-CorrectMove) and 0.8 (for FACT-WrongMove). These values are identical to the original values used for the thresholds in the hinting strategy (see Table 3). We counted the number of True-Positive, True-Negative, False-Positive and False-Negative cases over all the students, formed the confusion matrix and used Equations 3-6 to calculate the model positive precision, model negative precision, model sensitivity and model specificity. The structure of the confusion matrix is shown in Table 23.
Table 23: Structure of the confusion matrix

Model assessment of          Pre-Post Test
student knowledge            Known                  Unknown
Unknown                      False Negative (FN)    True Negative (TN)
Known                        True Positive (TP)     False Positive (FP)
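
The way a single model assessment is mapped onto a cell of Table 23 can be sketched as follows. The function name `classify`, its inputs, and the treatment of a probability exactly equal to the threshold (counted as unknown here) are assumptions; only the two initial threshold values come from the text.

```python
from collections import Counter

FACT_CORRECT_MOVE = 0.5  # cut-off after a correct movement (initial value)
FACT_WRONG_MOVE = 0.8    # cut-off after a wrong movement (initial value)


def classify(prob_known, move_correct, status):
    """Return 'TP', 'FP', 'TN' or 'FN' for one assessed number.

    prob_known: the model's probability that the skill is known.
    status: 'KK' (known in both tests) or 'UU' (unknown in both tests).
    """
    threshold = FACT_CORRECT_MOVE if move_correct else FACT_WRONG_MOVE
    assessed_known = prob_known > threshold  # ties count as unknown (assumption)
    if status == "KK":
        return "TP" if assessed_known else "FN"
    return "FP" if assessed_known else "TN"


# Hypothetical observations: (probability, move correct?, status).
observations = [(0.7, True, "KK"), (0.7, False, "KK"),
                (0.7, True, "UU"), (0.3, False, "UU")]
confusion = Counter(classify(*obs) for obs in observations)
print(dict(confusion))
```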

5.1.1 Population Prior Setting

The confusion matrix in Table 24 gives the percentages of TruePositive, FalsePositive, TrueNegative and FalseNegative when the population prior probabilities were used. The results showed a low percentage, (3.4/11.4)%, for the TrueNegative, which lowers the model specificity and the model negative precision. In contrast, there is a fairly high percentage, (76.7/88.6)%, for the TruePositive. Table 25 gives the values of the four measures. The model initialized with the population prior probabilities therefore has difficulty accurately evaluating the “unknown” factorization skills while doing a fairly good job of assessing the “known” factorization skills. In other words, the student’s model overestimates the student’s number factorization knowledge for those numbers which are unknown to the student. Figure 34 illustrates the percentages of the elements of the confusion matrix for each status.
Table 24: Confusion Matrix (# of raw data points and [percentages]) for the population priors

Model assessment of student knowledge    Pre-Post Test
(Population-based Prior)                 Known                 Unknown             Total
Unknown                                  291 [8.0%] (FN)       123 [3.4%] (TN)     414 [11.4%]
Known                                    2792 [76.7%] (TP)     434 [11.9%] (FP)    3226 [88.6%]
Total                                    3083 [84.7%]          557 [15.3%]         3640 [100%]

Table 25: Summary of the results on the model analysis for the population priors setting

Prior Setting: Population-based

Measures                    Values
Model Positive Precision    0.866
Model Negative Precision    0.298
Model Sensitivity           0.906
Model Specificity           0.221

Figure 34: Frequency (%) of each elements of the confusion matrix for each possible relevant status

5.1.2 Generic Prior Setting

Table 26 shows the confusion matrix when the generic prior probabilities were used to initialize the student’s model. Similar to the population prior probabilities setting, a low percentage of (4.7/15.3)% for the TrueNegative indicates that the model with the generic prior probabilities has difficulty detecting the factorization skills that are “unknown” to the student during the game-play; consequently, a low model negative precision and a low model specificity are expected. The student’s model performed best at evaluating the “known” skills as “known”, (68.4/84.7)%. Table 27 gives the values of the four measures: model positive precision, model negative precision, model sensitivity and model specificity. Figure 35 illustrates the percentages of the elements of the confusion matrix (TruePositive, FalsePositive, TrueNegative, FalseNegative) for each relevant status.
Table 26: Confusion Matrix (# of raw data points and [percentages]) for the generic priors

Model assessment of student knowledge    Pre-Post Test
(Generic-based Prior)                    Known                 Unknown             Total
Unknown                                  592 [16.3%] (FN)      171 [4.7%] (TN)     763 [21.0%]
Known                                    2491 [68.4%] (TP)     386 [10.6%] (FP)    2877 [79.0%]
Total                                    3083 [84.7%]          557 [15.3%]         3640 [100%]

Table 27: Summary of the results on the model analysis for the generic priors setting

Prior Setting: Generic-based

Measures                    Values
Model Positive Precision    0.866
Model Negative Precision    0.225
Model Sensitivity           0.808
Model Specificity           0.307

Figure 35: Frequency (%) of the elements of the confusion matrix for each relevant status

5.1.3 User-specific Prior Setting

Table 28 gives the confusion matrix when the user-specific prior probabilities are used to initialize the student’s model in Prime Climb. The low percentages of the FalseNegative (4.3/84.7)% and FalsePositive (2.1/82.5)% and the high percentages of the TrueNegative (13.2/17.5)% and TruePositive (80.4/84.7)% provide evidence that the student’s model initialized with the user-specific prior probabilities performs well in assessing the “known” skills as “known” and the “unknown” skills as “unknown”. Table 29 gives the values of the four measures: model positive precision, model negative precision, model sensitivity and model specificity. Figure 36 illustrates the percentages of the elements of the confusion matrix for each relevant status.
Table 28: Confusion Matrix (# of raw data points and [percentages]) for the user-specific prior setting

Model assessment of student knowledge    Pre-Post Test
(User-specific-based Prior)              Known                 Unknown             Total
Unknown                                  156 [4.3%] (FN)       480 [13.2%] (TN)    636 [17.5%]
Known                                    2927 [80.4%] (TP)     77 [2.1%] (FP)      3004 [82.5%]
Total                                    3083 [84.7%]          557 [15.3%]         3640 [100%]

Table 29: Summary of the results on the model analysis for the user-specific priors setting

Prior Setting: User-specific-based

Measures                    Values
Model Positive Precision    0.975
Model Negative Precision    0.755
Model Sensitivity           0.95
Model Specificity           0.862
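
As a cross-check, the four measures can be recomputed from the raw counts in the confusion matrices of Tables 24, 26 and 28 using Equations 3-6. The variable names below are illustrative. The recomputed values agree with the reported ones to within about 0.001, a gap small enough to be a rounding effect.

```python
# (TP, FP, TN, FN) counts taken from Tables 24, 26 and 28.
counts = {
    "population": (2792, 434, 123, 291),
    "generic": (2491, 386, 171, 592),
    "user-specific": (2927, 77, 480, 156),
}

# Reported values (positive precision, negative precision,
# sensitivity, specificity) from Tables 25, 27 and 29.
reported = {
    "population": (0.866, 0.298, 0.906, 0.221),
    "generic": (0.866, 0.225, 0.808, 0.307),
    "user-specific": (0.975, 0.755, 0.95, 0.862),
}

recomputed = {}
for setting, (tp, fp, tn, fn) in counts.items():
    recomputed[setting] = (
        tp / (tp + fp),  # Equation 3: model positive precision
        tn / (tn + fn),  # Equation 4: model negative precision
        tp / (tp + fn),  # Equation 5: model sensitivity
        tn / (tn + fp),  # Equation 6: model specificity
    )
    values = ", ".join(f"{v:.3f}" for v in recomputed[setting])
    print(f"{setting}: {values}")
```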

Figure 36: Frequency (%) of the elements of the confusion matrix for each relevant status

6 Comparison of the Model’s Performance for Different Prior Probabilities

In the previous section we showed how the student’s model performs when the different prior probabilities settings are used to initialize it. In this section, the effects of using different prior probabilities settings on the number of TruePositive, FalsePositive, TrueNegative and FalseNegative cases and on the model positive precision, model negative precision, model sensitivity and model specificity are compared statistically.

6.1 Total Number of True-Negative
The Welch test showed that there was a statistically significant difference on the TrueNegative, among the different groups of prior settings (p<0.05). Table 30 shows the descriptive statistics on the TrueNegative. Table 31 represents the results of subsequent GamesHowell test. The results showed that there was no statistically significant difference on the total number of TrueNegative between the population and generic prior probabilities while there was a statistically significant difference between the user-specific prior probabilities setting and the other two settings.
Table 30: Descriptive statistics on True Negative

Prior setting          Mean    Standard Deviation
Population Prior       4.1     5.58
User Specific Prior    16      13.24
Generic Prior          5.7     8.08

Table 31: Games-Howell test result (Dependent variable: True Negative)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .648
User-specific          Population                        .000              *
User-specific          Generic                           .002              *
Generic                Population                        .648
Generic                User-specific                     .002              *

6.2 Total Number of False-Negative
The Welch test showed that there was a statistically significant difference on the FalseNegative among the different groups of prior settings (p<0.05). Table 32 shows the descriptive statistics on the total number of FalseNegative. Table 33 presents the results of the Games-Howell test. The results showed a statistically significant difference between every pair of the three prior probabilities settings on the total number of FalseNegative.
Table 32: Descriptive statistics on False Negative

Prior setting          Mean     Standard Deviation
Population Prior       6.61     6.15
User Specific Prior    3.54     4.87
Generic Prior          13.45    10.45

Table 33: Games-Howell test result (Dependent variable: False-Negative)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .030              *
Population             Generic                           .001              *
User-specific          Population                        .030              *
User-specific          Generic                           .000              *
Generic                Population                        .001              *
Generic                User-specific                     .000              *

Figure 37: Average of TrueNegative for the different prior probabilities settings

Figure 38: Total TrueNegative for each student

Figures 39 and 40 respectively illustrate the average number of FalseNegative and the total FalseNegative for each student for each prior probabilities setting.

Figure 39: Average of FalseNegative for the different prior probabilities settings

Figure 40: Total number of FalseNegative for each student when different prior probabilities were used

6.3 Comparison of Total Number of True Positive
Following a non-significant difference between the variances of the three prior probabilities settings using the test of homogeneity of variance (Levene statistics), a traditional single factor ANOVA showed that there is no statistically significant difference on the total number of TruePositive among different groups of prior settings (F(2,129)= 0.63 ,p= 0.531758>0.05). Table 34 represents the mean and standard deviation of total TruePositive for the different settings.

Figures 41 and 42, respectively illustrate the average of the TruePositive and the total number of TruePositive of each student for different prior probabilities settings.
Table 34: Descriptive statistics on True Positive

Prior setting          Mean     Standard Deviation
Population Prior       63.45    42.91
User Specific Prior    66.52    43.48
Generic Prior          56.61    40.24

Figure 41: Average of TruePositive

Figure 42: Total number of TruePositive of each student for each prior probabilities setting

6.4 Comparison of Total Number of False Positive
The Welch test showed that there was a statistically significant difference on the FalsePositive, among different groups of prior settings (p<0.05). Table 35 shows the descriptive statistics for the FalsePositive. Table 36 gives the result of the Games-Howell post-hoc test. The results showed that there was no statistically significant difference on FalsePositive between the population and the generic prior probabilities settings. Moreover, there was a statistically significant difference on FalsePositive between the user-specific prior probabilities setting and the other two settings. Figures 43 and 44 illustrate the average of the FalsePositive and the total number of FalsePositive of each student for each prior probabilities setting.
Table 35: Descriptive statistics on False Positive

Prior setting          Mean     Standard Deviation
Population Prior       14.46    12.97
User Specific Prior    2.56     4.65
Generic Prior          12.86    11.23

Table 36: Games-Howell test results (Dependent variable: False Positive)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .866
User-specific          Population                        .000              *
User-specific          Generic                           .000              *
Generic                Population                        .866
Generic                User-specific                     .000              *

Figure 43: Average of the FalsePositive

Figure 44: The total number of FalsePositive of each student for each prior probabilities setting

6.5 Comparison of Model Positive Precision
The Welch test showed that there was a statistically significant difference on the model positive precision among the different groups of prior probabilities settings (p<0.05). Table 37 shows the descriptive statistics on the model positive precision. Table 38 presents the result of the Games-Howell post-hoc test. The results showed no statistically significant difference on the model positive precision between the population and generic prior probabilities settings. Furthermore, there was a statistically significant difference on the model positive precision between the user-specific prior probabilities setting and the other two settings. Figures 45 and 46 respectively illustrate the average (in percentage) of the model positive precision and the model positive precision for each student under the different prior probabilities settings.
Table 37: Descriptive statistics on Model Positive Precision

Prior setting          Mean     Standard Deviation
Population Prior       84.68    20.73
User Specific Prior    96.87    8.78
Generic Prior          84.57    20.61

Table 38: Games-Howell test results (Dependent variable: Model Positive Precision)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .002              *
Population             Generic                           1.000
User-specific          Population                        .002              *
User-specific          Generic                           .002              *
Generic                Population                        1.000
Generic                User-specific                     .002              *

Figure 45: Average of the model positive precision

Figure 46: model positive precision for each student for different prior probabilities settings

6.6 Comparison of the Model Negative Precision
The Welch test showed that there was a statistically significant difference on the model negative precision among the different groups of prior settings (p<0.05). Table 39 shows the descriptive statistics on the model negative precision. Table 40 presents the results of the Games-Howell test. The results showed no statistically significant difference on the model negative precision between the population and generic settings. In contrast, there was a statistically significant difference between the user-specific prior probabilities setting and the other two settings. Figures 47 and 48 respectively illustrate the average (in percentage) of the model negative precision and the model negative precision for each student under each prior probabilities setting.
Table 39: Descriptive statistics on Model Negative Precision

Prior setting          Mean     Standard Deviation
Population Prior       38.73    38.52
User Specific Prior    75.91    26.45
Generic Prior          32.94    36.25

Table 40: Games-Howell test results (Dependent variable: Model Negative Precision)

Prior Probabilities    Comparison Prior Probabilities    p-value (Sig.)    Significant (*: Yes)
Population             User-specific                     .000              *
Population             Generic                           .821
User-specific          Population                        .000              *
User-specific          Generic                           .000              *
Generic                Population                        .821
Generic                User-specific                     .000              *

Figure 47: Average of the Model Negative Precision

Figure 48: Model Negative Precision for each student and each prior probabilities setting

6.7 Comparison of the Model Sensitivity
The Welch test showed a statistically significant difference on the model sensitivity, among different groups of prior settings (p<0.05). Table 41 shows the descriptive statistics on the model sensitivity. Table 42 gives the result of the Games-Howell test. The results showed that there existed a statistically significant difference among all settings of the prior probabilities. Figures 49, 50 illustrate the average (in percentage) of the model sensitivity and the total model sensitivity for each student and for each prior probabilities setting.
Table 41: Descriptive statistics on Model Sensitivity

Prior Setting         Mean    Standard Deviation
Population Prior      90.5    6.82
User Specific Prior   95.26   5.45
Generic Prior         80.93   9.95

Table 42: Games-Howell test result (Dependent variable: Model Sensitivity)

Games-Howell Test

Prior Probabilities   Comparison Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            User-specific                    .002             *
Population            Generic                          .000             *
User-specific         Population                       .002             *
User-specific         Generic                          .000             *
Generic               Population                       .000             *
Generic               User-specific                    .000             *

Figure 49: Average (%) of the model sensitivity

Figure 50: Model sensitivity for each student for each prior probabilities setting
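
The pairwise comparisons in Tables 40, 42 and 44 come from the Games-Howell post-hoc test. As a rough sketch of what that test computes for each pair of settings, the code below derives the studentized-range statistic q and the Welch-Satterthwaite degrees of freedom. The function name `games_howell_stats` and the sample values are hypothetical; the p-values reported in the tables additionally require the studentized range distribution, which is not evaluated here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def games_howell_stats(groups):
    """Return {(name_a, name_b): (q, df)} for every pair of named samples."""
    results = {}
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        # Squared standard error contributed by each group (s^2 / n).
        va = variance(a) / len(a)
        vb = variance(b) / len(b)
        # Welch t statistic for the pair.
        t = (mean(a) - mean(b)) / sqrt(va + vb)
        # Welch-Satterthwaite approximation of the degrees of freedom.
        df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
        # Games-Howell refers q = |t| * sqrt(2) to the studentized range.
        results[(name_a, name_b)] = (abs(t) * sqrt(2), df)
    return results


if __name__ == "__main__":
    # Hypothetical per-student sensitivities (%), not the study's data.
    samples = {
        "population": [88.0, 92.0, 95.0, 84.0, 93.0],
        "user_specific": [97.0, 99.0, 90.0, 94.0, 96.0],
        "generic": [75.0, 83.0, 90.0, 70.0, 86.0],
    }
    for pair, (q, df) in games_howell_stats(samples).items():
        print(pair, f"q = {q:.3f}, df = {df:.2f}")
```

Each q value would be compared against the studentized range distribution for three groups and the pair's own degrees of freedom.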

6.8 Comparison of Model Specificity
A Welch test found a statistically significant difference in model specificity among the groups of prior settings (p<0.05). Table 43 gives the mean (in percentage) and standard deviation of the model specificity for the different groups of prior probabilities. Table 44 shows the results of the Games-Howell post-hoc test. The results show that there was no significant difference between the population and generic prior probabilities settings on model specificity, while there existed a statistically significant difference in model specificity between the user-specific setting and each of the other two prior probabilities settings. Figures 51 and 52 illustrate the average (in percentage) of the model specificity and the total model specificity for each student for the different prior probabilities settings, respectively.
Table 43: Descriptive statistics on Model Specificity

Prior Setting         Mean    Standard Deviation
Population Prior      19.39   24.03
User Specific Prior   91.67   14.15
Generic Prior         24.06   27.53

Table 44: Games-Howell test result (Dependent variable: Model Specificity)

Games-Howell Test

Prior Probabilities   Comparison Prior Probabilities   p-value (Sig.)   Significant (*: Yes)
Population            User-specific                    .000             *
Population            Generic                          .764
User-specific         Population                       .000             *
User-specific         Generic                          .000             *
Generic               Population                       .764
Generic               User-specific                    .000             *

Figure 51: Average (%) of model specificity

Figure 52: Model specificity for each student for the different prior probabilities settings
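
The four model measures compared across Sections 6.5 to 6.8 can all be read off a single confusion matrix. The sketch below assumes the standard definitions, with "positive" meaning that the model assesses the skill as known; the exact definitions used in this study are given in earlier sections, so the function `model_metrics` and its mapping onto the paper's terms should be treated as an assumption.

```python
def model_metrics(tp, fp, tn, fn):
    """Return the four measures (in %) from confusion-matrix counts.

    tp/fp: model says "known" and the student does / does not know the skill.
    tn/fn: model says "unknown" and the student does not / does know it.
    """
    def ratio(numerator, denominator):
        # Undefined ratios (no cases in the denominator) are reported as NaN.
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "positive_precision": ratio(tp, tp + fp),
        "negative_precision": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Hypothetical counts for one student, not taken from the study.
    for name, value in model_metrics(tp=8, fp=2, tn=6, fn=4).items():
        print(f"{name}: {value:.1f}%")
```

Under these definitions a model that almost always answers "known" will show high sensitivity together with low specificity, which is the pattern reported for the population and generic settings in Tables 41 and 43.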

7 Preliminary Analysis on Pre-Post Tests

As shown in Section 4, the original values for the thresholds in the hinting mechanism in Prime Climb resulted in low hint precision and hint recall under the population and generic prior settings. In contrast, initializing the student’s model with the user-specific prior probabilities resulted in high hint precision and hint recall. Moreover, as discussed in Section 3, in measuring the hint precision and hint recall we had to consider solely the movements involving at least one number (the player’s number or the partner’s number) which appears on the pre-test and post-test and for which the student gives the same answer to the number’s corresponding question in both tests. Under this constraint, only a small fraction of all hints given to the student (Mean: 29.3%, Std: 9.96%) could be considered in calculating the hint precision and hint recall. This could negatively affect the values of hint precision and hint recall when the user-specific prior is used, because the prior probabilities are only set for the nodes in the BN whose corresponding numbers appear on the pre-test and post-test; the prior probabilities of the other nodes are set to 0.5, which equals the prior probability used in the generic prior setting. To investigate the possibility of such a negative effect, we calculated some preliminary descriptive statistics on the numbers appearing on the pre-test and post-test. Table 45 lists the numbers that appear most frequently in the movements and whether or not they appear on the pre-test and post-test (Y: Yes, N: No).
Table 45: Numbers (15-top) with highest frequency of appearance in the movements

Number   Frequency   In pretest?
17       713         N
25       644         Y
76       578         N
4        554         N
27       515         Y
40       498         N
81       463         Y
89       439         Y
99       412         N
97       407         Y
96       391         N
19       373         N
37       366         N
31       345         Y
9        325         Y

It follows that more than 50% (8 out of 15) of the most frequently visited numbers do not appear on the pre-test and post-test. Tables 46 and 47 show the most frequently visited numbers in the correct and wrong movements, respectively. 60% (9 out of 15) of the most frequently visited numbers involved in the correct movements do not appear on the pre-test and post-test. The situation is worse for the wrong movements (73%, 11 out of 15).
Table 46: Numbers with highest frequency of visit in the correct movements

Number   Frequency   In pretest?
17       713         N
25       580         Y
76       492         N
89       439         Y
27       427         Y
4        420         N
97       407         Y
81       399         Y
19       367         N
37       362         N
31       345         Y
13       310         N
99       308         N
40       304         N
71       283         N

Table 47: Numbers with highest frequency of visit in the wrong movements

Number   Frequency   In pretest?
40       194         N
57       145         N
18       143         N
96       142         N
4        134         N
15       108         Y
99       104         N
36       100         N
50       95          N
21       94          N
9        91          Y
33       89          Y
27       88          Y
76       86          N
69       74          N
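
The shares quoted for Tables 45 to 47 can be re-derived directly from the "In pretest?" column. The short sketch below does so with the Y/N flags transcribed from the three tables; the helper name `share_not_in_pretest` is an illustrative choice.

```python
def share_not_in_pretest(flags):
    """Return (count, percentage) of numbers flagged 'N' (not on the tests)."""
    missing = sum(1 for flag in flags.values() if flag == "N")
    return missing, 100.0 * missing / len(flags)


# Number -> "In pretest?" flag, transcribed from Tables 45, 46 and 47.
ALL_MOVES = {17: "N", 25: "Y", 76: "N", 4: "N", 27: "Y", 40: "N", 81: "Y",
             89: "Y", 99: "N", 97: "Y", 96: "N", 19: "N", 37: "N", 31: "Y",
             9: "Y"}
CORRECT_MOVES = {17: "N", 25: "Y", 76: "N", 89: "Y", 27: "Y", 4: "N", 97: "Y",
                 81: "Y", 19: "N", 37: "N", 31: "Y", 13: "N", 99: "N",
                 40: "N", 71: "N"}
WRONG_MOVES = {40: "N", 57: "N", 18: "N", 96: "N", 4: "N", 15: "Y", 99: "N",
               36: "N", 50: "N", 21: "N", 9: "Y", 33: "Y", 27: "Y", 76: "N",
               69: "N"}

if __name__ == "__main__":
    for label, table in [("all", ALL_MOVES), ("correct", CORRECT_MOVES),
                         ("wrong", WRONG_MOVES)]:
        count, pct = share_not_in_pretest(table)
        print(f"{label} movements: {count} of {len(table)} ({pct:.1f}%)")
```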

8 Conclusion and Future work

This manuscript reports on the refinement of the student’s model parameters and on the analysis of the intervention mechanism and the student’s model used in Prime Climb. The highest accuracy in predicting the students’ performance on the post-test, conducted after the students interacted with Prime Climb, was 75.5%, obtained with the population prior setting. It was also found that the hint precision and hint recall were very low when the population and generic prior settings were used. In contrast, these values were high when the user-specific prior setting was used, and there was a significant difference in the total number of justified, unjustified and missed hints between the user-specific prior probabilities setting and the other two settings, while in all cases (except for the total number of correctly not-given hints) there was no significant difference between the population and the generic prior probabilities settings. Furthermore, the student’s model initialized with the user-specific prior probabilities setting resulted in higher model positive precision, model negative precision, model sensitivity and model specificity. As for future work, we would like to concentrate on the situations which negatively affect the model specificity and model negative precision and investigate whether they follow specific patterns. The other focus will be on finding the most appropriate time to intervene, since it was shown that, although the student is interrupted frequently during the interaction with the game and provided with hints, the hint precision and hint recall are very low when the population and generic prior probabilities are used to initialize the model.