
Chemistry Education Research and Practice

PAPER

Characterizing change in students’ self-assessments of understanding when engaged in instructional activities

Jenna Tashiro,*a Daniela Parga,b John Pollard a and Vicente Talanquer a

a Department of Chemistry and Biochemistry, University of Arizona, Tucson, AZ 85721, USA
b Department of Physiology, University of Arizona, Tucson, AZ 85721, USA

Cite this: DOI: 10.1039/d0rp00255k

Received 22nd August 2020
Accepted 16th March 2021
DOI: 10.1039/d0rp00255k
rsc.li/cerp

Students’ abilities to self-assess their understanding can influence their learning and academic performance. Different factors, such as performance level, have been shown to relate to student self-assessment. In this study, hierarchical linear modeling was used to identify factors and quantify their effects on the changes observed in chemistry students’ self-assessed understanding when engaging in instructional activity. This study replicates and expands on previous findings regarding performance by showing that the worse students performed on a task, the more likely they were to lower their self-assessed understanding after that activity. Task difficulty was found to be a significant effect on change in students’ self-assessments, with students being more likely to lower their self-assessed understanding after a more difficult task and raise it following an easier task, independent of performance. Perceived comparative understanding (how students thought they compared to their surrounding peers) was also found to be a significant effect. Students who later reported their understanding to be lower than their peers, as compared to those who later reported their understanding to be about the same as their peers, were observed to have lowered their self-assessed understanding. Actual comparative performance (the difference in performance of the student relative to their surrounding peers), gender, and feedback were not found to be significant effects on change in students’ self-assessed understanding. The results of this investigation may inform instructors on how their instructional decisions differentially impact changes in students’ judgements about their understanding.

Introduction

Students’ abilities to self-assess their understanding have consistently been shown to affect learning and academic performance (Wang et al., 1990). Self-assessment during a learning situation is a critical aspect of monitoring cognition which, together with planning and evaluation, are considered essential components of regulation of cognition, a subcomponent of metacognition (Schraw et al., 2006). A variety of studies in recent years have provided solid evidence of the significant impact that metacognition has on student learning and academic achievement (Ohtani and Hisasaka, 2018).

Different authors have linked the effects of self-assessment on academic performance to differences in study behavior. For example, in deciding what topics to focus on during study time, students self-assess their level of understanding of the targeted topics (Thomas et al., 2016). In deciding how long to spend studying something, students self-assess their learning compared to what the desired level of understanding may be (Dunlosky and Thiede, 1998).

Research in many content areas, including chemistry, has shown that many students, especially lower performing students, tend to over-estimate what they know or have learned (Kruger and Dunning, 1999; Austin and Gregory, 2007; Bell and Volckmann, 2011; Mathabathe and Potgieter, 2014; Pazicni and Bauer, 2014; Hawker et al., 2016; Sharma et al., 2016). Flawed self-assessment can impact academic performance. For example, chemistry students’ overconfidence negatively affects student performance when not corrected, especially when that overconfidence was developed during the learning process (Mathabathe and Potgieter, 2014). Long-term consequences of flawed self-assessment, such as the dangers associated with a health professional’s inability to assess their competence, have been cited as reasons to focus on it while students are learning (Austin and Gregory, 2007).

Researchers have looked at different factors that may affect student self-assessment of understanding. Some of the studied factors are personal or characteristic of the student, such as gender (Hawker et al., 2016; Kim et al., 2016), while others are contextual or characteristic of the classroom, such as task difficulty or the level and nature of feedback provided (Butler et al., 2008; Thomas et al., 2016).

Although research on the factors that affect self-assessment can be found in a variety of fields, there is no comprehensive study in chemistry education. Much of the existing research has focused on the analysis of self-assessment related to summative assessment events rather than during the learning process. Student engagement in instructional activities in the classroom not only facilitates cognitive development, but also impacts the way students perceive their own understanding. Completion of instructional tasks can help students realize that they may not know as much as they thought, affecting their study behavior and leading to improved performance.

Based on what areas were perceived to need additional research, this exploratory, theory-building study was designed to characterize how engaging in instructional activity relates to general chemistry students’ self-assessments of understanding. The major findings of this investigation are summarized in this paper. The central goal was to provide a comprehensive characterisation of student factors and classroom factors that correlate to changes in students’ self-assessments of understanding when engaging in classroom activities. This analysis is important because instructional decisions may have unintended consequences on students’ self-assessments of understanding that instructors must recognize when planning their lessons (Rickey and Stacy, 2000; Sargeant et al., 2007).

Self-assessment of understanding

Self-assessment of understanding is an important aspect of metacognition. Flavell (1979) provided the first comprehensive theoretical framework for metacognition and defined it as “knowledge and cognition about cognitive phenomena,” or colloquially, “thinking about thinking.” Metacognition is often thought of as similar to self-efficacy; however, Moores et al. (2006) distinguished between these two constructs by proposing that self-efficacy refers to belief in one’s ability to perform, while metacognition involves judgment of one’s level of performance.

Schraw and Moshman (1995) thought of metacognition as having two major components: knowledge of cognition and regulation of cognition, also referred to as metacognitive skilfulness. Knowledge of cognition refers to different types of knowledge, such as what someone knows about themselves as a learner, different learning strategies, and when to use them. Regulation of cognition comprises different skills, such as planning, monitoring, and evaluation. Several studies in chemistry education have investigated the effect of different instructional interventions on the development of these metacognitive skills (Cooper et al., 2008; Cooper and Sandi-Urena, 2009; Sandi-Urena et al., 2012).

Nelson and Narens (1990) proposed a mechanism for metacognition in which information flows from the object-level to a meta-level via monitoring, and flows back to the object-level via control. These authors characterized metacognitive monitoring using terms like judgments, feeling-of-knowing, and confidence, suggesting that student ability for self-assessment of understanding is a monitoring activity. They noted, however, that using self-assessment as a measure of metacognitive monitoring is problematic, as an individual’s self-assessment of understanding can be distorted.

Distortions in self-assessment of understanding were most notably studied by Kruger and Dunning (1999). These researchers found that lower performers tended to overestimate their ability while higher performers often underestimated their ability. The participants estimated their ability following tasks on various topics outside of the normal classroom experience. These results, often referred to as the ‘Dunning and Kruger effect,’ were also found when topic-relevant tasks were performed in groups.

Studies focused on metacognitive monitoring have been conducted in different domains. For example, undergraduate senior-level pharmacy students of all performance levels were found to overestimate their clinical knowledge and communication skills (Austin and Gregory, 2007). Similarly, Sharma et al. (2016) found 74% of undergraduate medical students overestimated their grades on a physiology test, but their self-grading had a high correlation with the instructors’ grading, indicating that although students overestimated, the general trend was that as self-grading increased, so did the instructors’ grading.

In chemistry education, several studies have looked into the ability of students to self-assess their understanding in testing situations (Bell and Volckmann, 2011; Mathabathe and Potgieter, 2014; Pazicni and Bauer, 2014). Students have been found to be generally overconfident in their understanding, with higher performing students being less overconfident. This trend was observed in general chemistry courses when confidence was measured at the specific question level using both an 11-point scale with their score on a low-stakes quiz as the performance measure (Mathabathe and Potgieter, 2014), and a 3-point scale on a knowledge survey with their score on a high-stakes final exam as the performance measure (Bell and Volckmann, 2011). Similar results were reported when undergraduate general chemistry students predicted their exam grades (Hawker et al., 2016). Looking at comparative performance percentiles following exams, general chemistry students were found to overestimate at the lower performance levels and underestimate at the higher performance levels (Pazicni and Bauer, 2014).

Although individual performance is the factor most frequently accounted for in these types of studies, there has been inconsistent reporting on how influential it really is. A review of self-assessment studies found a low average correlation between performance and self-assessment measures (Mabe III and West, 1982), while other studies have reported moderate (Bol and Hacker, 2001) and high (Sharma et al., 2016) correlations. Kruger and Dunning (1999) did note that the effect of performance using simple regression was not a statistically significant predictor of the self-assessment percentiles. However, Pazicni and Bauer (2014) found performance to be a statistically significant predictor of the accuracy of self-assessment, defined by the difference between performance and self-assessment.

In addition to performance, it is widely accepted that other factors could relate to student self-assessment of understanding and therefore are worthy of investigation (Veenman et al., 2006). For example, Carvalho and Moises (2009) have identified personal, task-related, and environmental factors that may affect metacognitive monitoring. Personal factors include gender and cognitive ability; task-related factors include different characteristics of the activity individuals are asked to complete; and environmental factors refer to characteristics of the learning environment.
The effect of gender on self-assessment has been investigated in a number of studies. Kruger and Dunning (1999) did not find gender to significantly impact comparative self-assessment measures for different types of tasks on various subjects. Similar results were reported by Kim et al. (2016), but a study in the domain of chemistry found gender to be of significance under some conditions, although its impact had a small effect size (Hawker et al., 2016).

Feedback is a task-related or classroom factor that can be direct or indirect. Direct feedback is provided to individual students based on their performance, while indirect feedback results from the comparison of a student’s performance to that of others. Direct feedback can be provided by indicating what questions are right or wrong or by posting the correct answers. Graded tasks provide direct feedback via the grade received. In a study based on two general knowledge survey tasks, lack of feedback combined with lower confidence in the first task resulted in a greater likelihood of changing a correct answer to an incorrect answer on the second task (Butler et al., 2008). Receiving direct immediate feedback in the first task resulted in correct answers being unlikely to change in the second task, independent of confidence level.

Kruger and Dunning (1999) looked at the relationship between peer evaluation and self-assessment. Following an initial post-task self-assessment, students evaluated their peers’ work. Lower performing students were less accurate when evaluating their peers’ performance than the higher performing students. Lower performing students did not improve their self-assessment accuracy following peer evaluation, but higher performing students re-evaluated themselves closer to their actual percentile rankings. The evaluation of peer performance is an example of indirect feedback, as it provides insights into an individual’s performance by comparison. In general, there is a lack of research in chemistry education on the effect of feedback, direct or indirect, on students’ self-assessed understanding.

Task difficulty is another classroom factor that affects self-assessment of understanding. It has been shown that when actual or perceived task difficulty increases, self-assessment decreases in a variety of topics (Kruger and Dunning, 1999; Burson et al., 2006; Thomas et al., 2016). Kim et al. (2016) found that task difficulty mediated the impact of performance on self-assessed ability.

Environmental factors, such as culture, have also been assessed in previous research. While many of the studies regarding self-assessment of understanding have been conducted at universities in the United States, several studies have also been conducted cross-culturally with similar results (Mathabathe and Potgieter, 2014; Kim et al., 2016).

Methods

Goal and research questions

The goal of this exploratory research study was to provide a comprehensive characterization of factors that relate to changes in students’ self-assessments of understanding following engagement in instructional activity in a chemistry classroom. For the factors under analysis, the following research questions were sought to be answered:

1. Is the effect of the factor on the change in students’ self-assessments of their understanding statistically significant?
2. For a factor that is a statistically significant effect, what is the size of that effect, i.e., its practical significance?

Participants and research setting

Study participants were undergraduate students enrolled in the second semester of the introductory general chemistry courses for science and engineering majors (regular and honors students) at the University of Arizona in the US. The curricular model for these courses emphasizes the development of chemical thinking to answer questions of relevance in the world (Talanquer and Pollard, 2010). Active and collaborative learning are essential components of the classroom experience in this curriculum. Students in these courses engaged in the tasks/activities analysed in this study as part of the regular tasks/activities of the course; however, only data for students who gave consent were used in the analysis. Participating students in each course section received no incentive for their participation. This study was approved by the Institutional Review Board at the University of Arizona.

Data were collected in three course sections as summarized in Table 1 (N = 407; 62% female, 38% male). Data were collected on four occasions or sessions about evenly distributed throughout the semester for each course section. Participants with incomplete or missing self-reports of their self-assessed understanding for a data collection session were excluded from the study for that data collection session only. In two later data collection sessions, the instructor asked the students additional metacognitive questions. Due to the concern that those additional questions could have influenced the students’ self-assessments, the data collected in those sessions were not included in this study.

Table 1 Data information

            Section 1       Section 2       Section 3
N           221             158             28
Type        Regular         Honors          Regular
Semester    Spring          Spring          Summer
Gender      Female 67.0%    Female 55.1%    Female 67.9%
            Male 33.0%      Male 44.9%      Male 32.1%

Measurement instruments

Tool for self-assessment of understanding. This research study was designed to identify factors that affect students’ self-assessments of their understanding and quantify their impact. In order to accomplish this goal, an instrument was developed that asked students to complete such self-assessment using a Likert-like scale and report that rating through an online quiz site. In particular, students were asked to rank their understanding of a given topic using a five-category scale, as presented in Table 2.

Table 2 Tool for self-assessment of understanding

Score  Title     Description
1      Lost      I am unsure of the definitions and meanings of the key concepts (I just guess)
2      Novice    I know the definitions and meanings of the key concepts, but not how to apply them (I can’t get the right answers)
3      Middle    I know how to apply the key concepts, but not explain them (I hope for multiple choice)
4      Advanced  I know how to use the key concepts to explain things (I’m great with the free response questions)
5      Master    I think about how these key concepts apply to things we haven’t talked about

Informed by Bloom’s Taxonomy (1956) and the Blooming Biology Tool (Crowe et al., 2008), the scale moves from lack of understanding (Lost) and lower level understanding like remembering (Novice) through to higher levels like applying (Middle), explaining (Advanced), and synthesising (Master).

As with any measurement, especially with self-reporting, measuring may have its own effect. Because this study was interested in what changes happened during classroom activities, and not as a result of having students self-reflect or the addition of any metacognitive intervention or training, an attempt was made to limit the amount of self-reflection being asked of the students. Participants were asked to report their self-assessed level of understanding before and after the performance task.

Dunning et al. (1989) looked at the role of trait ambiguity in the inaccuracy of self-assessments of ability. In the development of the research tool, “understanding” was considered a reasonably ambiguous trait given that different students could use different criteria in their self-assessment. To reduce such ambiguity, criteria were provided for assessment in the form of descriptions of different levels of understanding that were specific to the types of activities that students experienced (see Table 2).

To validate the research instrument, a pilot study was conducted and the results were compared to those previously reported. This pilot study was conducted in the year prior to this investigation. Pilot study participants (N = 65) were also undergraduate students enrolled in the second semester of the introductory general chemistry courses for science and engineering majors. Data were collected with the tool for self-assessment of understanding on two occasions throughout a semester. During each occasion, students were asked to use the tool before and after a performance task. These tasks were in-class quizzes similar to the knowledge surveys used by Bell and Volckmann (2011) but administered through an online quizzing site. At the end of the semester, student performance was evaluated using a standardized general chemistry conceptual exam from the American Chemical Society.

Pilot study participants with missing data were excluded from the analysis, giving a total of 48 participants in the first occasion and 53 in the second. As a comparison to prior research was being made, the data were represented and analysed using the same types of plots developed to analyse equivalent data in those prior studies, shown in Fig. 1 and 2. These plots compare self-assessment measures to performance measures grouped by performance quantiles. This analytical approach was outlined by Bell and Volckmann (2011), who used a 3-point scale in their study. Students were separated into performance tertiles based on their final exam scores. The post-task self-assessment of understanding scores (5-point scale) were scaled up to make them comparable to the 100-point scale used in the final exam. The results from the first pilot study data collection, shown in Fig. 1, are similar to those presented by Bell and Volckmann (2011). The trend of overestimation in the two lower performance quantiles and a more accurate estimation in the higher performance quantiles is similar to that seen in studies that used a variety of measures or tools for self-assessment, such as a Likert-like confidence scale (Mathabathe and Potgieter, 2014), expected grades (Hawker et al., 2016), and analytical checklists (Austin and Gregory, 2007). The results from the second pilot study data collection, shown in Fig. 2, are similar to the results of the “Dunning and Kruger effect” (Kruger and Dunning, 1999) using a measure of self-assessed comparative percentile rankings of ability compared to actual performance of the given task. These results, which have been replicated using the same measures as Kruger and Dunning (Pazicni and Bauer, 2014; Kim et al., 2016), still indicate an overestimation in the lower performance quantiles, but an underestimation in the higher performance quantile.

Fig. 1 First data collection of pilot study for instrument validation. Scaled self-assessment of understanding scores and final exam scores sorted by final exam performance tertiles.

Fig. 2 Second data collection of pilot study for instrument validation. Scaled self-assessment of understanding scores and final exam scores sorted by final exam performance tertiles.

Data collection. A flow chart of the data collection procedure and the data collected are presented in Fig. 3. Data were collected in different classrooms using the following procedure: Students were presented with the tool for self-assessment of understanding and asked to rate their current level of understanding on a given topic.


They were then given a performance task (quiz) to complete individually. This in-class quiz consisted of several multiple-choice questions pertaining to the topic being addressed at that point in the course. Following the completion of this task, students were again presented with the tool for self-assessment of understanding and asked to rate their level of understanding based on their perceived performance. Subsequently, study participants received a second task to complete collaboratively with the other students at their table. This collaborative task was an in-class activity based on a problem-based open-ended prompt. Following the completion of this collaborative task, students completed a post-collaborative task survey. In this survey, they were asked to rate their understanding compared to the other members of their group. All students completed their self-assessments of understanding and the performance task on an on-line quizzing site, www.socrative.com. Students were given the collaborative tasks on paper.

Fig. 3 Flow chart for data collection. The data collection procedure is shown in the series of arrowed boxes in the center of the figure. Measurements collected at different points in the data collection procedure, or regarding different aspects of the data collection process, are shown as white boxes.

Dependent measure. The dependent measure or predicted variable in the analysis, the “change in self-assessed understanding” (ΔSAU), was calculated by subtracting the pre-task self-assessment of understanding score from the post-task self-assessment of understanding score. Positive ΔSAU scores are thus indicative of an increase in self-assessed understanding following a given task and negative ΔSAU scores indicative of a decrease. The greater the absolute value of the ΔSAU score, the bigger the change. For example, the theoretical minimum ΔSAU score would be −4, representing a change from initially self-reporting a “master” level understanding and then post-task self-reporting a “lost” level of understanding. The theoretical range would be 8. The actual observed minimum ΔSAU was −3 and the observed maximum ΔSAU was +3, giving an actual range of 6 for the ΔSAU scores. If thought of as categories, there would be seven categories ranging from −3 to +3, including a category for a score of zero (representing no change). Descriptive statistics for the self-assessment of understanding variables are shown in Table 3.

Table 3 Descriptive statistics for self-assessed understanding variables

                     Self-assessed understanding
                     Pre-task   Post-task   Change (ΔSAU)
Mean                 3.19       3.18        −0.01
Standard deviation   0.75       0.95        0.8
Median               3          3           0
Minimum              1          1           −3
Maximum              5          5           +3
Range                4          4           6
Skew                 0.22       0.41        0.3
Kurtosis             0.09       0.34        0.53
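As a concrete illustration of this calculation, the sketch below shows how ΔSAU could be computed in R (the environment the authors report using); the data frame and column names are hypothetical placeholders rather than the study’s actual variables.

```r
# Hypothetical pre- and post-task ratings on the 1-5 scale defined in Table 2.
responses <- data.frame(
  student  = c("s01", "s02", "s03"),
  pre_sau  = c(4, 3, 2),
  post_sau = c(3, 3, 4)
)

# Change in self-assessed understanding: post-task minus pre-task rating.
# Positive values indicate an increase and negative values a decrease;
# the theoretical range is -4 to +4, and the observed range was -3 to +3.
responses$delta_sau <- responses$post_sau - responses$pre_sau
```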
Independent measures. Independent measures of variables (factors) potentially relating to ΔSAU scores can be classified into two main groups: measures of student variables and measures of classroom variables. Each of the different types of measures analysed in this study is described in the following paragraphs.

Measures of student variables. Student variables were operationalized through the following measures:

Measure of gender. In line with previous studies, “gender” was used to differentiate participants into two main categories (male, female). It is important to recognize, however, the limitations of this binary categorization in light of modern interpretations of gender identity.

Measure of initial self-assessed understanding. “Initial self-assessed understanding” was defined as the group-mean centred pre-task self-assessed understanding score.

For example, an initial self-assessed understanding of one would indicate that the student had a pre-task self-assessment score one unit above the class average on that day. Positive values for initial self-assessment indicate that students pre-task self-assessed themselves higher than the class average. Negative values for initial self-assessment indicate that students pre-task self-assessed themselves lower than the class average. An initial self-assessed understanding of zero indicates that the level a student pre-task self-assessed their understanding to be was equal to the class average.

Measure of student performance. “Student performance” was defined as the group-mean centred score the student received on the performance task. The student’s original task score was converted to be on a 0–100 point scale to ensure that the assigned score was independent of the number of questions per task. With this transformation, a one unit increase in a student performance score indicated that a student received a score one percent above the average for that class on that task. Positive student performance scores indicate higher than average performance and negative scores indicate lower than average performance.
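A minimal sketch of how both continuous student measures could be group-mean centred by data collection session in R; the data frame, column names, and the percentage conversion of task scores are assumptions made for illustration.

```r
# 'dat' is assumed to hold one row per student per session with columns:
# session, pre_sau (1-5 rating), raw_score and max_score for the performance task.
dat$performance <- 100 * dat$raw_score / dat$max_score   # rescale every task to 0-100

# Group-mean centre within each session: a value of +1 then means one rating unit
# (or one percentage point) above that session's class average.
dat$initial_sau_c <- dat$pre_sau     - ave(dat$pre_sau,     dat$session)
dat$performance_c <- dat$performance - ave(dat$performance, dat$session)
```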
Measure of perceived comparative understanding. Perceived comparative understanding was measured as a categorical variable corresponding to the student responses in the post-collaborative task survey when asked to rate their understanding of the given subject compared to the other members of their group. Their rating could be “higher”, “lower”, or “about the same”.

Measure of actual comparative performance. Actual comparative performance was measured by the difference between the performance score of a student and the average of their collaborative group. Students with performance scores within plus or minus ten percentage points of the collaborative group average were categorized as having “about the same actual comparative performance”. Students with performance scores more than ten percentage points below the collaborative group average were categorized as having “lower actual comparative performance”. Students with performance scores more than ten percentage points above the collaborative group average were categorized as having “higher actual comparative performance”.

Measures of classroom variables. Classroom variables were operationalized through the following measures:

Measure of feedback. Feedback was assessed as a categorical variable: either “no feedback” for tasks that did not give any feedback to the student or “immediate feedback” for tasks that showed the correct answer after students completed a task.

Measure of task difficulty. Task difficulty was measured by taking the average performance score of all students across semesters who completed that task. The scores were converted to be on a 0–100 point scale such that all tasks were on the same scale independently of how many questions were asked. Tasks were then separated into tertiles based on these scaled average performance scores. Additionally, the tasks were given to faculty who taught general chemistry at the University of Arizona. Faculty ranked the tasks from easiest to most difficult and those rankings were also separated into tertiles. Faculty rankings were in alignment with the categorization based on the scaled average performance scores, with the tasks faculty ranked into the bottom tertile for difficulty also being in the bottom tertile of the scaled average performance scores, and so forth. Tasks were categorized in order of difficulty: “low task difficulty,” “moderate task difficulty,” and “high task difficulty.”
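One way the tertile categorization could be implemented is sketched below; the vector of scaled task means and the use of quantile-based cut points are illustrative assumptions.

```r
# 'task_means' is a hypothetical vector of scaled (0-100) average performance scores,
# one value per task; lower class-average performance implies a harder task.
breaks <- quantile(task_means, probs = c(0, 1/3, 2/3, 1))
task_difficulty <- cut(task_means, breaks = breaks, include.lowest = TRUE,
                       labels = c("high task difficulty",      # lowest average scores
                                  "moderate task difficulty",
                                  "low task difficulty"))      # highest average scores
```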
Data analysis

In this exploratory study, data were analysed using statistical modelling to build a theory detailing the factors that relate to changes in students’ self-assessments and quantifying those relationships.

Hierarchical linear modelling (HLM). As data were collected from students in multiple course sections on several occasions, the data were considered nested, or able to be divided by subunits of student, data collection session, and course section. Nested data present two main issues that may need to be accounted for when using such data. One, not all subunits can be assumed to be the same. For instance, sections of the same course cannot be assumed to be the same. And two, variables within a subunit cannot be assumed to be independent from variables between subunits. For example, student factors cannot be assumed to be independent from classroom factors that varied between data collection sessions. Hierarchical Linear Modelling (HLM) accounts for the variance between subunits and the interdependencies of the predictor variables or factors (Raudenbush and Bryk, 2002; Woltman et al., 2012). If these issues are of significance, HLM is recommended to account for data nesting and allow for generalisation over all subunits (Theobald, 2018). To determine if the need to account for these issues was of significance, three likelihood ratio tests (LRTs) were performed and three intraclass correlation coefficients (ICCs) were calculated.

The LRTs gave an indication of the statistical significance of accounting for the issues of nested data (Peugh, 2010). An LRT compares two statistical models, in this case a basic HLM and a simpler linear regression model. The comparison or “model fit” determines if the given data fit the larger model significantly better than the smaller one. If there was a statistically significant difference in model fit, the need to account for the nesting of data was supported. The first LRT used students as the subunit for the larger model, the second LRT used data collection sessions as the subunit for the larger model, and the third LRT used course sections as the subunit for the larger model.

For each possible subunit, an unconditional means model (UCM) using restricted maximum likelihood (REML) estimation was conducted with a fixed intercept and a random intercept for each subunit.

The ICCs gave an indication of the practical significance of accounting for the variance between subunits (Lorah, 2018). The ICC (ρ) was calculated as the random intercept variance divided by the sum of the random intercept variance and the residual variance (Raudenbush and Bryk, 2002; Lorah, 2018; Theobald, 2018). The ICC, in this case, was the percentage of the variance in the change in students’ assessments of their understanding (ΔSAU) that was explained by the between-subunit differences. The higher the ICC, the greater the practical significance of accounting for the nesting of data by use of HLM. When the ICC is above 10%, it is considered non-trivial and is evidence of the need to account for the nesting of data by use of a multilevel method such as HLM (Lee, 2000).
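The sketch below shows how an unconditional means model, its likelihood ratio test against a single-level regression, and the ICC could be computed in R. The paper does not name the mixed-modelling package; lme4 is used here as one common option, and all object names are hypothetical.

```r
library(lme4)

# Fixed-intercept-only model (simple linear regression).
m_lm  <- lm(delta_sau ~ 1, data = dat)

# Unconditional means model: fixed intercept plus a random intercept per session.
m_ucm <- lmer(delta_sau ~ 1 + (1 | session), data = dat, REML = TRUE)

# Likelihood ratio test of the nested models (anova() refits the mixed model with ML).
anova(m_ucm, m_lm)

# ICC: random intercept variance / (random intercept variance + residual variance).
vc  <- as.data.frame(VarCorr(m_ucm))
icc <- vc$vcov[vc$grp == "session"] / sum(vc$vcov)
```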
The first UCM model, which allowed for consideration of nesting within students (random intercept effect with student as the subunit), did not show a statistically significant difference in model fit from the linear regression model (fixed intercept only), with LRT χ²(1) = 3.1 × 10⁻⁷, p = 0.9996. Less than 0.1% of the ΔSAU variance was explained by the between-student differences (ρ = 1.12 × 10⁻⁸). Given no significant improvement in model fit and negligible ΔSAU variation between students, accounting for the nesting of data within students was not needed.

The second UCM model, which allowed for consideration of nesting within data collection sessions (random intercept effect with data collection session as the subunit), did show a statistically significant difference in model fit from the linear regression model (fixed intercept only), with LRT χ²(1) = 24.4, p < 0.0001. 11.6% of the ΔSAU variance was explained by the between-session differences (ρ = 0.116). The significant improvement to model fit and the ΔSAU variation between sessions supported the need to account for the nesting of data within sessions. Given this, data analysis moved forward with a two-level HLM with session as the subunit.

The third UCM model, which allowed for consideration of nesting within course sections (random intercept effect with course section as the subunit), did not show a statistically significant difference in model fit from the linear regression model (fixed intercept only), with LRT χ²(1) = 0.47, p = 0.4922. 0.6% of the ΔSAU variance was explained by the between-course-section differences (ρ = 0.00581). Given no significant improvement in model fit and negligible ΔSAU variation between sections, accounting for the nesting of data within course sections was not needed.

R statistical computing software was used to analyse the data applying HLM (Wickham and Henry, 2018).

Consideration of the dependent variable scale. The tool for self-assessment of understanding uses a 5-category Likert-like scale, creating ordinal data. In the strictest interpretation of the transformations or calculations that are allowed with ordinal data, change scores like the dependent variable, ΔSAU, and parametric data analysis such as HLM should not be used (Stevens, 1946, 1958). Ordinal data from Likert-like scales have long been treated as interval and analysed by parametric means, with some arguing that doing so leads to “the wrong conclusion” (Jamieson, 2004) and others arguing that it has not resulted in researchers “getting the wrong answer” (Norman, 2010). Though some adhere to a more black and white interpretation of the rules regarding the use of ordinal data in parametric analyses (Kuzon et al., 1996; Jamieson, 2004), others disagree with a rules-based approach to what statistical methods are allowed or not allowed, promoting situational evaluation of data (Velleman and Wilkinson, 1993; Bacchetti, 2002). General trends in the research suggest that it is acceptable to analyse ordinal data with parametric methods when the data are normally distributed (Olsson, 1979; Gaito, 1980; Muthén and Kaplan, 1985; Bauer and Sterba, 2011; Grilli and Rampichini, 2012), the scale has a minimum of five to seven categories (Muthén and Kaplan, 1985; Norman, 2010; Bauer and Sterba, 2011; Grilli and Rampichini, 2012), there is a large sample size (Muthén and Kaplan, 1985; Dolan, 1994), and the other key assumptions for the given analysis are assessed (Gaito, 1980; Carifio and Perla, 2008). Some additionally argue that when ordinal data are combined by summing or with change scores, the data can be treated as interval, just as binary correct–incorrect data are combined on a multiple choice exam to create an overall score that is treated as interval (Carifio and Perla, 2008; Norman, 2010).

In this study, a large sample size was used, ΔSAU would be a 7-category scale if treated ordinally (observed range of −3 to +3), ΔSAU is normally distributed, and the other key assumptions for HLM were assessed and are reported in the following subsections. As such, in this study’s characterization of the change in students’ self-assessed understanding (ΔSAU), the ordinal self-assessment data were treated as interval. For example, with the methods used, a change (ΔSAU = −1) from initially self-reporting an “Advanced” level understanding and then post-task self-reporting a “Middle” level of understanding is assumed to be the same as a change (ΔSAU = −1) from initially self-reporting a “Middle” level of understanding and then post-task self-reporting a “Novice” level of understanding. Additionally, a two-level decrease in self-assessment (ΔSAU = −2) would be treated as twice as large as any single-level decrease in self-assessment (ΔSAU = −1).

Assumptions of HLM. The following key assumptions of HLM were assessed (Raudenbush and Bryk, 2002; Cohen et al., 2003; Ferron et al., 2008; Woltman et al., 2012):

Normality of the dependent variable. Descriptive statistics for the self-assessed understanding variables are shown in Table 3. The dependent/predicted variable of change in students’ assessments of their understanding (ΔSAU) was found to be normally distributed. Both skew and kurtosis were within a more than acceptable range of ±1 (Kim, 2013). The normality of ΔSAU for each session was additionally evaluated with box plots (Albers, 2017).

Multicollinearity. Multicollinearity of the independent continuous variables was assessed by calculation of variance inflation factors (VIF). VIF values greater than 10 are considered indicators of multicollinearity (Craney and Surles, 2002). The pre-task self-assessed understanding scores displayed multicollinearity with student performance scores, with VIF values ranging from 15.28 to 43.29. To address this issue, the pre-task self-assessed understanding scores and student performance scores were group-mean centred by session (Cohen et al., 2003). The resulting VIF values, calculated using the group-mean centred variables, ranged from 1.00 to 1.01, indicating that the multicollinearity was accounted for.
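A sketch of how this check could be run; computing VIFs from an ordinary regression containing the continuous predictors, via the car package’s vif(), is one common approach, and the variable names are the hypothetical ones used in the earlier sketches.

```r
library(car)   # provides vif(); values greater than 10 suggest multicollinearity

# Uncentred pre-task self-assessment and performance scores.
vif(lm(delta_sau ~ pre_sau + performance, data = dat))

# Group-mean centred versions of the same predictors; VIFs should drop to about 1.
vif(lm(delta_sau ~ initial_sau_c + performance_c, data = dat))
```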
Homoscedasticity. Following the completion of model building, both student residuals and classroom zetas showed constant variance (homoscedasticity) in graphical analysis. Student standardized residuals were plotted against predicted scores for each section subunit. Additionally, the classroom variable of task difficulty was plotted against intercept zetas and slope zetas.

Normality of residuals and zetas. Following completion of model building, both student residuals and classroom zetas were found to be normally distributed through graphical analysis. Student residuals were evaluated by boxplots for each section, and classroom intercept zetas and slope zetas were evaluated by quantile–quantile (qq) plots.

Analytical methods for research question #1. Factors were loaded in a stepwise manner (forward selection) into a series of models. Details of each model are presented in Table 6. When a factor was loaded, a series of statistical tests was performed to determine the result of the first research question: whether that factor was a statistically significant effect on change in students’ assessments of their understanding (ΔSAU). For all categorical factors, dummy coding was used, and the reference variable was reported. For certain factors, additional tests were performed for use in furthering the discussion of that factor, for example, testing a model where the factor was loaded without other variables (not stepwise). For factors that were found to be statistically significant, the possible dependence of the relationship of that factor and ΔSAU on the other significant factors was determined by testing interaction variables. For factors that were found to not be statistically significant, interaction variables were also tested; however, none were found to be statistically significant and they are not reported. Details of the methods used for the model building progression can be found in Appendix 1: model building.

Analytical methods for research question #2. To determine practical significance in this study is to speak to how meaningful the different factors are in their characterization of the outcome variable, change in students’ self-assessed understanding (ΔSAU). “Effect size” is a broad term used for measures that indicate the size, magnitude or strength, of the effect(s), that is, the relationship between the effect(s) and the outcome variable. Typically, effect sizes are standardized in some way to allow for comparison of effects within a study and/or between studies. Currently, there is no generally agreed upon method or statistic for assessing practical significance or calculating effect size for use with hierarchical linear modelling (HLM) (Ferron et al., 2008; Peugh, 2010; Selya et al., 2012; Luo and Azen, 2013; Lorah, 2018). An appropriate effect size measure must account for all other variables (fixed effects) because the modelling is multivariate, and must also account for variance between subunits (random effects) because the modelling is multilevel (Selya et al., 2012; Lorah, 2018). Based on the goals of this study, two methods for assessing practical significance were chosen. Details and calculations of the two methods used for the analysis of practical significance can be found in Appendix 2: practical significance.

First, practical significance was assessed by evaluation and comparison of standardized and partially standardized fixed effect coefficients. Fixed effect coefficients indicate the magnitude of the relationship, but do not indicate the strength of the relationship or how correlated the factor is to ΔSAU. This is similar to how, with a simple linear regression or trendline, the slope of the line indicates the magnitude of the relationship while the R² indicates the strength of the relationship or how correlated the variables are. A standardized coefficient can generally be interpreted as the amount of observed change, in standard deviations, of students’ self-assessed understanding per standard deviation change in the fixed effect across all subunits, accounting for all other variables. A partially standardized coefficient for a category can be interpreted as the difference in observed change, in standard deviations, of students’ self-assessed understanding between the category and the reference category across all subunits, accounting for all other variables. Using standardized and partially standardized coefficients allows for comparison of effect sizes for factors with different units and observed ranges, while also accounting for all other variables and variance between subunits (Snijders and Bosker, 2012; Lorah, 2018).

Second, practical significance was assessed by evaluation and comparison of residual variances using a local effect size measure, Cohen’s f². This analysis speaks to how much the factor characterizes the change in students’ self-assessed understanding. Cohen’s f² accounts for all other variables and variance between subunits, unlike the more commonly known effect size measure Cohen’s d (Selya et al., 2012; Lorah, 2018). Change in students’ self-assessed understanding was observed to vary. Cohen’s f² can be interpreted as the proportion of that variance in students’ self-assessed understanding accounted for by one factor relative to the proportion of unexplained variance in students’ self-assessed understanding across all subunits, accounting for all other variables. For Cohen’s f², the effect size or practical significance can be considered small above 0.02, medium above 0.15, and large above 0.35 (Cohen, 1977; Selya et al., 2012; Lorah, 2018).
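Since the authors’ exact formulas are given in their Appendix 2, the sketch below only illustrates one common way both measures could be obtained with lme4: standardizing the outcome and continuous predictors before refitting, and computing Cohen’s f² from the residual variances of models fitted with and without the factor of interest (this ratio is equivalent to the Selya et al., 2012 formulation when R² is computed from residual variances). Model terms and names are hypothetical.

```r
library(lme4)

# (1) Standardized coefficients: rescale the outcome and continuous predictors to
#     unit variance, then refit; fixed-effect coefficients are then in SD units,
#     and dummy-coded categorical terms give partially standardized coefficients.
dat$delta_sau_z   <- as.numeric(scale(dat$delta_sau))
dat$initial_sau_z <- as.numeric(scale(dat$initial_sau_c))
dat$performance_z <- as.numeric(scale(dat$performance_c))

# (2) Cohen's f^2 for one factor (here an illustrative 'task_difficulty' term),
#     from the residual variances of models with and without that factor.
m_with    <- lmer(delta_sau_z ~ initial_sau_z * performance_z + task_difficulty +
                    (performance_z | session), data = dat, REML = FALSE)
m_without <- update(m_with, . ~ . - task_difficulty)
f2 <- (sigma(m_without)^2 - sigma(m_with)^2) / sigma(m_with)^2
# Interpretation: small above 0.02, medium above 0.15, large above 0.35.
```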
Results

Results for research question #1

The following results from model building are presented to answer the first research question: Is the effect of the factor on the change in students’ self-assessments of their understanding statistically significant? The main findings for each factor can be found in italics, followed by the evidence and reasoning to support those claims.

The statistical models discussed in this section were labelled by number, type of effect (fixed (f) or fixed and random (f&r)), and the method of estimation (maximum likelihood (ML) or restricted maximum likelihood (REML)). Table 6 details what factors were in each model.
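To make this labelling concrete, the sketch below shows how the fixed-effect-only and fixed-plus-random versions of a model could be fitted and compared for the first factor; the lme4 syntax and object names are assumptions for illustration, not the authors’ code.

```r
library(lme4)

# UCM (no predictor), Model 1.f (fixed effect only) and Model 1.f&r
# (fixed effect plus a random slope for the session subunit), fitted with ML.
m_ucm <- lmer(delta_sau ~ 1             + (1 | session),             data = dat, REML = FALSE)
m_1f  <- lmer(delta_sau ~ initial_sau_c + (1 | session),             data = dat, REML = FALSE)
m_1fr <- lmer(delta_sau ~ initial_sau_c + (initial_sau_c | session), data = dat, REML = FALSE)

anova(m_ucm, m_1f)   # does adding the fixed effect improve model fit?
anova(m_1f, m_1fr)   # does the random slope improve fit further?
# As described in the text, REML fits (REML = TRUE) would be used when the
# comparison involves models that differ only in their random effects.
```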
Initial self-assessed understanding. The student factor of initial self-assessed understanding was found to be a significant effect on change in students’ assessments of their understanding.

The addition of initial self-assessed understanding, as measured by the group-mean centred pre-task self-assessed understanding score, as a fixed effect in Model 1.f.ML showed a statistically significant improvement in model fit from the smaller unconditional means (UCM) model, with likelihood ratio test (LRT) χ²(1) = 60.0, p < 0.0001. As evidenced by the improved model fit, the fixed effect of initial self-assessed understanding was found to be a significant effect on change in students’ self-assessments of their understanding (ΔSAU).

The addition of both the fixed and random effects in Model 1.f&r.ML showed a statistically significant improvement in model fit from the smaller UCM model (without either the fixed or random effects), with LRT χ²(3) = 61.1, p < 0.0001. The addition of the random effect in Model 1.f&r.REML did not show a statistically significant improvement in model fit from the smaller Model 1.f.REML (with only the fixed effect and not the random effect), with LRT χ²(2) = 1.2, p = 0.5579. As such, the addition of initial self-assessed understanding as a random effect (random slope for each session subunit) to account for between-session differences in the relationship of initial self-assessed understanding and ΔSAU was determined to not be needed. As the addition of the fixed effect of initial self-assessed understanding improved model fit, model building was continued with Model 1.f.ML and Model 1.f.REML.

Gender. The student factor of gender was not found to be a significant effect on change in students’ assessments of their understanding.

The addition of gender as a categorical-variable fixed effect (females as the reference variable) in Model 2.f.ML did not show a statistically significant improvement in model fit from the smaller Model 1.f.ML, with LRT χ²(1) = 0.02, p = 0.8948. As model fit did not improve, gender (fixed effect) was determined to not be a significant effect on change in students’ self-assessments of their understanding (ΔSAU) when controlling for their initial self-assessment. With no improvement to model fit, model building was continued with Model 1.f.ML and Model 1.f.REML.

Student performance. The student factor of student performance was found to be a significant effect on change in students’ assessments of their understanding.

The addition of student performance, as measured by the group-mean centred performance task score, as a fixed effect in Model 3.f.ML showed a statistically significant improvement in model fit from the smaller Model 1.f.ML, with χ²(1) = 126.6, p < 0.0001. Student performance, as a fixed effect, was found to be a significant effect on the change in students’ self-assessed understanding (ΔSAU) when controlling for their initial self-assessment, as evidenced by the improvement in model fit, and was therefore said to be a factor that relates to ΔSAU.

The addition of both the fixed and random effects in Model 3.f&r.ML showed a statistically significant improvement in model fit from the smaller Model 1.f.ML (without either the fixed or random effects), with LRT χ²(3) = 135.8, p < 0.0001. The addition of the random effect in Model 3.f&r.REML also showed a statistically significant improvement in model fit from the smaller Model 3.f.REML (with only the fixed effect and not the random effect), with LRT χ²(2) = 9.5, p = 0.0088. As such, the addition of student performance as a random effect (random slope for each session subunit) to account for between-session differences in the relationship of student performance and ΔSAU was determined to be needed. As the addition of both the fixed and random effects of student performance improved model fit, model building was continued with Model 3.f&r.ML and Model 3.f&r.REML.

Interaction variables of student performance. The relationship between student performance and change in students’ assessments of their understanding was found to be dependent on initial self-assessed understanding.

The addition of the interaction variable for initial self-assessed understanding and student performance in Model 4.f.ML showed a statistically significant improvement in model fit from the smaller Model 3.f&r.ML, with LRT χ²(1) = 5.5, p = 0.0185. The addition of both the fixed and random effects in Model 4.f&r.ML did not show a statistically significant improvement in model fit from the smaller Model 3.f&r.ML (without either the fixed or random effects), with LRT χ²(8) = 9.3, p = 0.3154. The addition of the random effect in Model 4.f&r.REML also did not show a statistically significant improvement in model fit from the smaller Model 4.f.REML (with only the fixed effect and not the random effect), with LRT χ²(7) = 4.3, p = 0.7407. As such, the addition of the interaction variable as a random effect (random slope for each session subunit) to account for between-session differences in the relationship of the interaction variable and ΔSAU was determined to not be needed. As the addition of the fixed effect of the interaction variable improved model fit, model building was continued with Model 4.f.ML and Model 4.f.REML.

Perceived comparative understanding. The student factor of perceived comparative understanding was found to be a significant effect on change in students’ assessments of their understanding.

The addition of perceived comparative understanding as fixed effects, with the categorical variable of reporting about the same perceived comparative understanding as the reference variable, in Model 5.f.ML showed a statistically significant improvement in model fit from the smaller Model 4.f.ML, with χ²(2) = 14.3, p = 0.0008. The categorical variables of perceived comparative understanding, as fixed effects, were found to be significant effects on change in students’ self-assessments of their understanding (ΔSAU) when controlling for all other variables loaded prior in the model building progression, as evidenced by the improved model fit.

The addition of both the fixed and random effects in Model 5.f&r.ML did not show a statistically significant improvement in model fit from the smaller Model 4.f.ML (without either the fixed or random effects), with LRT χ²(9) = 14.9, p = 0.0939.

The addition of the random effects in Model 5.f&r.REML also did not show a statistically significant improvement in model fit from the smaller Model 5.f.REML (with only the fixed effects and not the random effects), with LRT χ²(7) = 0.89, p = 0.9964. As such, the addition of perceived comparative understanding as random effects (random slopes for each session subunit) to account for between-session differences in the relationships of perceived comparative understanding and ΔSAU was determined to not be needed. As the addition of the fixed effects of perceived comparative understanding was determined to improve model fit, model building was continued with Model 5.f.ML and Model 5.f.REML.

Interaction variables of perceived comparative understanding. The relationships between the perceived comparative understanding variables and change in students’ assessments of their understanding were found to be independent of the other variables that also relate to that change.

The addition of the interaction variable between lower perceived comparative understanding and initial self-assessed understanding in Model 6.f.ML did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with LRT χ²(1) = 3.4, p = 0.0667. The addition of the interaction variable between higher perceived comparative understanding and initial self-assessed understanding in Model 7.f.ML also did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with LRT χ²(1) = 0.001, p = 0.9707. The addition of the interaction variable for lower perceived comparative understanding and student performance in Model 8.f.ML did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with LRT χ²(1) = 0.003, p = 0.9534. The addition of the interaction variable for higher perceived comparative understanding and student performance in Model 9.f.ML also did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with LRT χ²(1) = 1.7 × 10⁻⁵, p = 0.9967. Interaction variables of perceived comparative understanding were therefore determined to not be significant predictors of ΔSAU when controlling for all other variables. As such, model building was continued with Model 5.f.ML and Model 5.f.REML.

Actual comparative performance. The student factor of actual comparative performance was found to not be a significant effect on change in students’ assessments of their understanding.

The addition of actual comparative performance as fixed effects, with the categorical variable of having about the same comparative performance as the reference variable, in Model 10.f.ML did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with χ²(2) = 1.0, p = 0.6038.

An additional LRT was performed to evaluate if the lack of significance of actual comparative performance was due to the order of the modelling in the analysis or, in other words, to see if significance could be found for actual comparative performance if the model was not controlling for perceived comparative understanding. Actual comparative performance variables were added as fixed effects and the perceived comparative understanding variables were removed in Model 11.f.ML. Model 11.f.ML, however, did not show a statistically significant improvement in model fit from the smaller Model 4.f.ML (without either fixed effect), with LRT χ²(2) = 1.3, p = 0.5346. Actual comparative performance variables, as fixed effects, were therefore found to not be significant effects on change in students’ self-assessed understanding (ΔSAU) when controlling for all other variables (including or not including perceived comparative understanding), due to the lack of improvement in model fit. Model building was continued with Model 5.f.ML and Model 5.f.REML.

Feedback. The classroom factor of giving feedback regarding performance was found to not be a significant effect on change in students’ assessments of their understanding.

The addition of feedback as a fixed effect, with the categorical variable of “no feedback” as the reference variable, in Model 12.f.ML did not show a statistically significant improvement in model fit from the smaller Model 5.f.ML, with χ²(1) = 0.95, p = 0.3296.

An additional LRT was performed to evaluate if the lack of significance of giving feedback was due to the order of the modelling in the analysis or, in other words, to see if significance could be found if the model was not controlling for all other variables. Model 13.f.ML was conducted with feedback as the only fixed effect. Model 13.f.ML, however, did not show a statistically significant improvement in model fit from the smaller UCM model (without any fixed or random effects), with LRT χ²(1) = 3.2, p = 0.0758. Giving feedback (fixed effect) was therefore not found to be a significant effect on change in students’ self-assessed understanding (ΔSAU) either when controlling for all other variables or when not controlling for other variables, as evidenced by the lack of improvement in model fit. Model building was continued with Model 5.f.ML and Model 5.f.REML.

Task difficulty. The classroom factor of task difficulty was found to be a significant effect on change in students’ assessments of their understanding.

The addition of task difficulty as fixed effects, with the categorical variable of moderate task difficulty as the reference variable, in Model 14.f.ML showed a statistically significant improvement in model fit from the smaller Model 5.f.ML, with LRT χ²(2) = 14.1, p = 0.0009. As evidenced by an improvement in model fit, the task difficulty variables as fixed effects were found to be significant effects on change in students’ self-assessed understanding (ΔSAU). As the addition of the fixed effects of task difficulty improved model fit, model building concluded with Model 14.f.ML and Model 14.f.REML.
An additional LRT was performed to evaluate if the lack of The final model for change in students’ assessments of their
significance of actual comparative performance was due to the understanding (DSAU), Model 14.f.REML, is presented in
order of the modelling in the analysis or in other words, to Table 4. The final model includes all factors that were found
see if significance could be found for actual comparative to be significant effects on DSAU. For fixed effects presented in
performance if the model was not controlling for perceived the final model, along with the unstandardized coefficients (B),
comparative understanding. Actual comparative performance the standardized coefficients (b) and Cohen’s f 2 values are

Chem. Educ. Res. Pract. This journal is © The Royal Society of Chemistry 2021
View Article Online

Chemistry Education Research and Practice Paper

also reported. For each of the factors, the following results of the final model are presented to answer the second research question: For a factor that is a statistically significant effect on the change in students' self-assessments of their understanding, what is the size of that effect, its practical significance? The main findings of the effect size measures and interpretations of the practical significance of the correlative relationships between the factors (predictor variables) and DSAU (predicted variable) are shown in italics.

Initial self-assessed understanding. As shown in Table 4, the unstandardized coefficient (B = −0.533) for initial self-assessed understanding generally indicates that students that initially self-reported their understanding to be lower than average were more likely to increase their self-assessment after the task, and those that initially self-reported their understanding to be higher than average were more likely to decrease it.

This result is indicative of a regression towards the mean, and the inclusion of initial self-assessed understanding in the model helps control for this effect (Allison, 1990). Given that initial self-assessed understanding and student performance were found to have a significant interaction, the individual interpretation of the coefficient, standardized or not, speaking to the magnitude or size of the effect of initial self-assessed understanding on change in students' self-assessed understanding (DSAU), is limited to only students with average performance (student performance score = 0). A full examination of the magnitude or size of the fixed effects with interactions on DSAU is more complex and should include the interaction variable.

The standardized coefficient (b = −0.480) for initial self-assessed understanding indicates that, for students with average performance, for each standard deviation higher they were than the average in initial self-assessed understanding, the observed change in their self-assessment following the task (DSAU) was a decrease of 0.480 standard deviations, and for each standard deviation lower they were than the average, the observed change in DSAU was an increase of 0.480 standard deviations, across all subunits and accounting for all other variables. The strength of that relationship between initial self-assessed understanding and DSAU was found to be strong; initial self-assessed understanding was found to have a large effect size with Cohen's f² = 0.39.

Student performance. For student performance, B = 0.022, generally indicating that students that performed worse on the task were more likely to lower their self-assessed understanding after that task, and those that performed better were more likely to raise their self-assessment.

Again, due to the interaction with initial self-assessed understanding, the individual evaluation of effect size for student performance is limited to only students with average initial self-assessed understanding (initial self-assessed understanding = 0). For student performance, b = 0.441, indicating that, for students with average initial self-assessed understanding, for each standard deviation higher they were than average in task performance, the observed change in self-assessed understanding following that task (DSAU) was an increase of 0.441 standard deviations, and for each standard deviation lower than average in task performance, the observed change in DSAU was a decrease of 0.441 standard deviations, on average across all subunits and accounting for all other variables. The strength of that relationship between student performance and DSAU was found to be negligible; student performance was found to have a negligible effect size with Cohen's f² < 0.01.

Initial self-assessed understanding and student performance. The coefficient for the interaction of initial self-assessed understanding and student performance (B = 0.006) is indicative of the degree to which students' initial self-assessments affect the effect that student performance has on change in self-assessed understanding (the slope or coefficient for student performance).

This result generally indicates that students that initially assessed their understanding to be higher were more likely to have their performance have a bigger effect on the observed change in their self-assessed understanding (DSAU). The strength of the interaction effect on DSAU was found to be negligible; the interaction variable was found to have a negligible effect size with Cohen's f² < 0.01.

Table 4 HLM model for change in self-assessed understanding (DSAU)

Predictor (fixed effects)                                  B        SE      95% CI             b        Cohen's f²
(Intercept)                                                −0.014   0.055   (−0.121, 0.094)    —        —
Initial self-assessed understanding                        −0.533   0.043   (−0.617, −0.449)   −0.480   0.39
Student performance                                         0.022   0.003   (0.016, 0.029)      0.441   <0.01
Perceived comparative understanding                                                                      0.03
  Higher                                                    0.147   0.109   (−0.068, 0.362)     0.184   —
  Lower                                                    −0.428   0.121   (−0.666, −0.190)   −0.535   —
Task difficulty                                                                                          0.25
  Higher                                                   −0.627   0.102   (−0.952, −0.303)   −0.784   —
  Lower                                                     0.190   0.069   (−0.029, 0.409)     0.238   —
Interaction of initial self-assessed understanding
  and student performance                                   0.006   0.003   (0.001, 0.011)      0.086   <0.01

Component (random effects)     Variance   SD        95% CI (SD)
(Intercept) session            0.00136    0.03691   (0.0002, 5.6780)
Student performance            0.00004    0.00624   (0.0025, 0.0155)
Error                          0.34246    0.58520   (0.5457, 0.6276)

LL = −379.2, AIC = 782.5, BIC = 830.3.
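As a minimal sketch of how a mixed model of this general form could be specified in software (this is not the authors' analysis code, and the column names dSAU, initial_sau, performance, pcu, task_difficulty, and session are assumptions), the Python/statsmodels version would look roughly as follows.

```python
# Sketch of a mixed model matching the structure reported in Table 4:
# fixed effects for initial self-assessment, performance, their interaction,
# perceived comparative understanding, and task difficulty, plus a random
# intercept and a random slope for performance by session. Column names are
# hypothetical; this is illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")  # hypothetical long-format data, one row per student-task

formula = (
    "dSAU ~ initial_sau * performance"
    " + C(pcu, Treatment(reference='about_same'))"
    " + C(task_difficulty, Treatment(reference='moderate'))"
)

model = smf.mixedlm(
    formula,
    data=df,
    groups=df["session"],       # session as the grouping (subunit) variable
    re_formula="~performance",  # random intercept plus random slope for performance
)
fit = model.fit(reml=True)      # REML estimates, as used for the reported final model
print(fit.summary())
```

An ML fit of the same formula (fit(reml=False)) would correspond to the versions used for the likelihood ratio comparisons described above.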


To illustrate the effects, an interaction plot was constructed using the standardized coefficients and is shown in Fig. 4. The combination of the fixed effects of initial self-assessed understanding, student performance, and the interaction on the outcome variable, DSAU, on average across all subunits and accounting for all other variables, is shown. By accounting for all other variables, the plot looks at the cumulative, but isolated (above and beyond the other factors), effect on DSAU for different variable combinations; the plot does not show the average DSAU for the different variable combinations.

Fig. 4 Standardized interaction plot for student performance and initial self-assessment. Observed changes, in standard deviations, of students' self-assessed understanding (DSAU) accounted for by per standard deviation changes in student performance, initial self-assessment, and the interaction of student performance and initial self-assessment, on average across all subunits and accounting for all other variables. Combinations of initial self-assessment group and student performance group are indicated, with groups said to be "accurate initial assessments" highlighted in green, "overestimated initial assessments" highlighted in red, and "underestimated initial assessments" highlighted in blue.

Initial self-assessment and student performance were each divided into the categories of low, average, and high. "Low" is designated as being one standard deviation or more below average and "high" is designated as one standard deviation or more above average.

The negative relationship between initial self-assessed understanding and DSAU is observed graphically by the different lines shown in Fig. 4. For each performance level, moving upwards from the high to the average to the low initial self-assessment lines shows the contribution of the effect – as initial self-assessment decreases, DSAU increases. The positive relationship between student performance and DSAU is graphically represented by the positive slopes of the three lines and shows the contribution of the effect – as student performance increases, DSAU increases. The interaction of the fixed effects is observed in the graphical representation by the differences in the slopes of the lines. The slope indicates the magnitude of the relationship between student performance and DSAU. Moving from the low to average to high initial self-assessment lines shows the contribution of the effect – as initial self-assessment increases, the relationship between student performance and DSAU increases, with slopes of 0.36, 0.44, and 0.53, respectively.

With initial self-assessment and student performance each grouped into low, average, and high categories, there are nine combinations of initial self-assessment and student performance. The three combinations that match, for example the group of students that had low initial self-assessment and then low performance, are said to have "accurate initial assessments". Highlighted in green in Fig. 4, these three groups showed little to no change in self-assessment following the task (all DSAU ≤ 0.15) accounted for by their initial assessment and performance.

The three combinations with a mismatch, with initial self-assessment higher than performance, are groupings of students said to have "overestimated initial assessments". Highlighted in red in Fig. 4, these three groups that initially overestimated themselves all showed a decrease in self-assessment accounted for by those effects. The group that overestimated by having average initial self-assessment followed by low performance had an about equal decrease in self-assessment following the task (DSAU = −0.44) as the group that overestimated by having high initial self-assessment followed by average performance (DSAU = −0.48). The group that overestimated the most, with high initial self-assessment followed by low performance, had a decrease in self-assessment after the task (DSAU = −1.01) almost twice as large as the other two groups that overestimated themselves less.

The three combinations with the initial self-assessment lower than performance are said to be "underestimated initial assessments" and are highlighted in blue in Fig. 4. These three groups that initially underestimated themselves all showed an increase in self-assessment accounted for by those effects. The group that underestimated themselves by having average initial self-assessment followed by high performance had an about equal increase in self-assessment following the task (DSAU = +0.48) as the group that underestimated themselves by having low initial self-assessment followed by average performance (DSAU = +0.44). The group that underestimated themselves the most, with low initial self-assessment followed by high performance, reported they understood more after the task than they originally reported, with DSAU = +0.84, a change twice that of the other two groups that underestimated themselves less.

Perceived comparative understanding. For perceived comparative understanding, B = 0.147 for higher perceived comparative understanding and B = −0.428 for lower perceived comparative understanding. Generally, a positive coefficient for higher perceived comparative understanding and a negative coefficient for lower perceived comparative understanding indicate that students who perceived their understanding to be higher than those of their peers were more likely to increase their self-assessed understanding and those that perceived their understanding to be lower were more likely to lower their self-assessment.
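As a check on the interaction-plot values quoted above, the short sketch below recomputes the nine group values from the standardized coefficients in Table 4 (−0.480, 0.441, and 0.086). It is an illustrative reconstruction, not the authors' plotting code, and the plotting details are assumptions.

```python
# Illustrative reconstruction of the standardized interaction plot: predicted
# DSAU (in SD units) from the standardized coefficients for initial
# self-assessment, performance, and their interaction (values from Table 4).
import matplotlib.pyplot as plt

b_init, b_perf, b_int = -0.480, 0.441, 0.086      # standardized coefficients
levels = {"low": -1.0, "average": 0.0, "high": 1.0}  # +/- 1 SD groupings

for init_label, init in levels.items():
    perf_values = [-1.0, 0.0, 1.0]
    dsau = [b_init * init + b_perf * p + b_int * init * p for p in perf_values]
    print(init_label, [round(v, 2) for v in dsau])
    plt.plot(perf_values, dsau, marker="o", label=f"{init_label} initial self-assessment")

plt.xlabel("Student performance (SD units)")
plt.ylabel("Predicted DSAU (SD units)")
plt.legend()
plt.show()
```

The printed rows reproduce the values quoted in the text (for example, high initial self-assessment with low performance gives −1.01, and low initial self-assessment with high performance gives +0.84), and the three line slopes are 0.36, 0.44, and 0.53.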


As this is a categorical factor, any coefficient indicates the total isolated impact of the factor and not a per unit or per standard deviation impact, as was so for the previous continuous variables. For perceived comparative understanding, b = 0.184 for higher perceived comparative understanding and b = −0.535 for lower perceived comparative understanding. This indicates that students that later reported a higher or lower comparative understanding, as compared to those students that reported about the same comparative understanding, were observed to have had an increase of 0.184 standard deviations or a decrease of 0.535 standard deviations, respectively, in self-assessed understanding (DSAU), across all subunits and accounting for all other variables. The strength of the relationship between perceived comparative understanding and DSAU was found to be small; perceived comparative understanding was found to have a small effect size with Cohen's f² = 0.03.

Task difficulty. For task difficulty, B = −0.627 for high task difficulty and B = 0.190 for low task difficulty. Generally, a negative coefficient for higher task difficulty and a positive coefficient for lower task difficulty indicate that students were more likely to lower their self-assessed understanding after a more difficult task and raise it following an easier task.

For task difficulty, b = −0.784 for higher task difficulty and b = 0.238 for lower task difficulty. This indicates that students were observed, across all subunits and accounting for all other variables, to have lowered their self-assessed understanding by 0.784 standard deviations after tasks categorized as being more difficult, as compared to students performing tasks categorized as having average difficulty, and to have raised their self-assessed understanding by 0.238 standard deviations after tasks categorized as being less difficult, as compared to students performing tasks categorized as having average difficulty. The strength of the relationship between task difficulty and DSAU was found to be moderate; task difficulty was found to have a moderate effect size with Cohen's f² = 0.25.

Results for follow-up questions

Several follow-up questions were assessed following the completion of the initial analysis for the two primary research questions. The follow-up questions and results are presented below.

A follow-up analysis was performed to address the question: Is gender a statistically significant effect on how students assessed their understanding at a given point, as opposed to how students changed their self-assessments? The addition of gender as a categorical variable fixed effect (females as the reference variable) in a hierarchical linear model with session as the subunit, predicting the initial self-assessed understanding score pre-task, showed a statistically significant improvement in model fit (χ²(1) = 4.21, p = 0.0401). As evidenced by improved model fit, gender was found to be a significant effect on initial self-assessment. A positive coefficient for the male category generally indicates that male students were more likely to have a higher self-assessment of their understanding (pre-task). As model building for the pre-task self-assessment outcome variable was limited and, as such, no other variables were controlled for, only the sign of the coefficient is being reported.

Descriptive statistics were used to address the second follow-up question: Is perceived comparative understanding an accurate representation of actual comparative performance? Descriptive statistics for perceived comparative understanding and actual comparative performance are presented in Table 5. Overall, 12% of students reported having lower understanding compared to their peers, 75% reported about the same, and 14% reported higher. The percentages of students from each category of actual comparative performance and perceived comparative understanding are also presented. For example, of the students that were categorized as having lower actual comparative performance, 18% of them were said to be accurate by reporting that they perceived themselves as having lower understanding compared to their peers, 67% inaccurately said they had about the same level of understanding as their peers, and 16% even more inaccurately said that they had higher understanding than their peers. Of the students that were categorized as having higher actual comparative performance, 20% of them reported perceiving themselves as having a higher level of understanding, and were therefore said to be accurate.

Table 5 Descriptive statistics for perceived comparative understanding and actual comparative performance

Perceived comparative    Overall           Actual comparative performance
understanding            frequency (%)     Lower (%)    About the same (%)    Higher (%)
Lower                    12                18           12                    8
About the same           75                67           80                    72
Higher                   14                16           9                     20
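A table of this shape can be produced directly from the two categorical responses; the sketch below shows one way to do it with pandas. The file name and the columns perceived and actual are assumptions, not the study's actual variable names.

```python
# Sketch of building a Table 5-style summary: overall frequencies of perceived
# comparative understanding, plus the distribution of perceived categories
# within each actual-comparative-performance category. Column names are hypothetical.
import pandas as pd

df = pd.read_csv("comparative_responses.csv")

order = ["Lower", "About the same", "Higher"]
overall = df["perceived"].value_counts(normalize=True).reindex(order) * 100

within_actual = (
    pd.crosstab(df["perceived"], df["actual"], normalize="columns")
    .reindex(index=order, columns=order) * 100
)

summary = pd.concat({"Overall (%)": overall, "Within actual (%)": within_actual}, axis=1)
print(summary.round(0))
```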
Discussion and implications

The central goal of this research study was to assess if and how student and classroom factors relate to change in students' self-assessed understanding as a result of participation in classroom activities. By assessing student factors, how students are differentially impacted by participation in in-class tasks can be discussed. By assessing classroom factors, awareness of how instructional decisions impact students' self-assessed understanding can increase, and evidence to support those decisions can be provided. The following discussion and implications are organized by sorting the main findings into three categories:

1. Contributions to the large body of research regarding the effect of student performance on self-assessment of understanding.

2. Discussion of findings that provide further evidence regarding the relationship, or lack thereof, for factors that have had inconclusive effects in previous research: gender, feedback, and task difficulty.

3. Novel findings regarding factors that have not previously been assessed in the literature: perceived and actual comparative performance.

Insights on the effects of student performance

While student performance has often been linked to student self-assessment of understanding, it has not always been shown to
have a significant relationship. Using simple regression, Kruger and Dunning (1999) did not find task performance to be a statistically significant factor for self-assessment at a given time, specifically post-task self-assessment. In contrast, Pazicni and Bauer (2014) found test performance to be a statistically significant factor for self-assessment accuracy, as measured by the difference in post-task comparative self-assessment and comparative performance, using hierarchical linear modelling (HLM). In this study, also using HLM, task performance was found to be a statistically significant factor for self-assessment change, as measured by the difference in pre-task and post-task self-assessment.

The overestimation of ability by lower performing students has been well documented within the domain of chemistry (Bell and Volckmann, 2011; Mathabathe and Potgieter, 2014; Pazicni and Bauer, 2014; Hawker et al., 2016), within other science domains (Austin and Gregory, 2007; Sharma et al., 2016), and outside of science (Kruger and Dunning, 1999; Butler et al., 2008; Kim et al., 2016; Thomas et al., 2016). The pilot study that was conducted in the development of the self-assessment of understanding tool yielded similar results.

In this investigation, three student groups were designated as having overestimated initial assessments due to their initial self-assessment categorization being higher than their performance categorization. For all three of these groups, the combination of their initial self-assessment and performance accounted for a decrease in self-assessment following the performance task, with the group that most overestimated their initial assessment showing the greatest decrease, as shown in Fig. 3. These results indicate that some of those that initially overestimated their ability made adjustments towards greater accuracy in their self-assessment following the instructional task, and the more inaccurate they were initially, the more likely they were to adjust, or to adjust to a greater degree. Even though this regulation (pre-task to post-task) was observed, lower performing students may still, on average, exhibit the Dunning–Kruger effect by overestimating their ability post-task. This could possibly be due to the level of adjustment not being great enough to fully correct the inaccuracy of their initial overestimation. It could also be that although enough students make adjustments for the average change in self-assessment to be significant, enough students do not make adjustments for the post-task self-assessment to be, on average, accurate.

Lower performing students that overestimate their ability are often the target of interventions, as the misalignment between perception and reality is seen as detrimental to learning (Schraw et al., 2006; Pazicni and Bauer, 2014). Some researchers have discussed the negative emotional impact of overestimation on students (Jones et al., 2012). Nevertheless, some authors have argued that overestimation may not be entirely detrimental, as people might be more eager to climb a mountain if they are unaware of how far they are from the top (Pajares, 1996; Zimmerman, 2000; Ehrlinger, 2008). The results of this study suggest that, independently of planned interventions, participation in in-class tasks affects students' self-assessments of understanding. Regardless of whether an instructor aims to correct the inaccuracy of self-estimation made by students or not, they should recognize the effects that performance on classroom work has on their students' perceptions of understanding.

The analysis of interaction between variables indicated that when students initially self-assessed higher understanding, their performance on that task had a greater effect on how they changed their self-assessment. This finding implies that overconfident students are more affected by poor performance in an activity than underconfident students are affected by their positive performance (i.e., it is easier to correct overconfidence than under-confidence).

Insights on the effect of other variables (inconclusive in previous research)

Gender. Previous research has been inconclusive on the effect of gender on self-assessment (Kruger and Dunning, 1999; Hawker et al., 2016; Kim et al., 2016). In this study, gender was not found to significantly relate to change in students' assessments of their understanding as a result of engagement in class activities. Although females and males may assess their performance differently at any given time, the findings of this study indicate that it cannot be assumed that these groups differ in how they change their self-assessments following an in-class activity. The follow-up model to evaluate if the students involved in this study differed in how they assessed their understanding by gender found that gender was a significant predictor of self-assessed understanding at a given point in time (pre-task). This result provides further evidence regarding the relationship between gender and self-assessment, suggesting that females and males may differ in their self-assessments of performance and understanding, but not necessarily in how their self-assessments change as a result of working on a classroom task.

Feedback. Previous research has linked feedback to performance and learning (Salmoni et al., 1984; Butler et al., 2008); however, research on the impact of feedback on self-assessment is limited. Pazicni and Bauer (2014) highlighted the small effects that receiving direct feedback regarding performance (e.g., grades) has on the accuracy of student self-assessment over the course of a semester. Kruger and Dunning (1999) also showed the limited effects of indirect feedback from peer evaluation on students' abilities to assess themselves.

Given existing research that links poor performance to inaccurate self-assessment, it would be reasonable to assume that lower performing students do not know that they are performing poorly (Kruger and Dunning, 1999). If these students were completely unaware, it would be beneficial to inform them about their actual level of performance to help them better regulate their self-assessment. Given this line of reasoning, it could be hypothesised that the observed changes in self-assessment, showing regulation toward greater accuracy, made by students that initially overestimated their ability would only occur if the students were told how they performed. However, these changes were observed regardless of whether or not feedback indicating whether answers were right or wrong was given. Providing immediate and direct feedback regarding performance was found to not be a significant effect on change in students' self-assessed understanding. This may
seem counterintuitive at first but, given these results, it can be speculated that though students are at least somewhat inaccurate in their self-assessment at any given point in time, they do have enough awareness of their performance that "right or wrong" feedback is neither necessary for the amount of regulation observed to occur nor initiates regulation above and beyond what is accounted for by their initial self-assessment and performance.

Task difficulty. One could expect that as task difficulty increases, self-assessment of understanding would decrease, as students will struggle more to complete the activity (Kruger and Dunning, 1999; Burson et al., 2006; Thomas et al., 2016). This result was corroborated in this investigation, where task difficulty was found to have a significant relationship with change in self-assessed understanding, with observed decreases in students' self-assessments with higher task difficulty and observed increases with lower task difficulty. These results indicate that instructors may impact their students' perceptions of their understanding by changing the level of difficulty of in-class activities.

A previous study reported a dependency between task difficulty and performance (Kim et al., 2016). In this work, however, that dependency was not found. Additionally, the relationship between initial self-assessment and change in self-assessment was also found to be independent of task difficulty.

Recall that the observed change in self-assessment accounted for by initial self-assessment and performance showed little change for students that were designated as initially accurate in their self-assessment, while for students that initially overestimated or underestimated themselves there was observed regulation, decreasing those inaccuracies. The lack of significant interaction effects on change in students' self-assessments of these variables with task difficulty indicates that, for example, the students that initially overestimated themselves did not have to be given a difficult task for the observed change in self-assessment to occur that was accounted for by initial self-assessment and performance. So, having a difficult task was not a necessary condition for the observed decreases in inaccuracies.

The lack of significant interaction effects on change in students' self-assessments of initial self-assessment, performance, and task difficulty also suggests that when more difficult tasks are given, the general decrease in self-assessment that is observed may give the appearance of correction in overestimation. However, the observed decreases may not be reflective of low performance, but rather reflective of task difficulty, across the board for all performance levels. This would make it seem like lower performing students that were overestimating themselves were doing so less and becoming more accurate when given harder tasks. Accuracy in self-assessment, though, means self-assessment that is representative of ability or performance, not representative of task difficulty. So ideally, performance would independently and accurately predict self-assessment (at any given time) with an effect size measure that indicates a strong relationship. And, if there were an inaccuracy, the change in self-assessment would be a full correction of, and only in response to, that inaccuracy, not a response to task difficulty. So, for students that initially overestimated themselves, the observed change in self-assessment accounted for by higher task difficulty, though a decrease, cannot truly be thought of as a decrease in inaccuracy, as it is not representative of a change in response to performance.

The lack of significant interaction effects on change in students' self-assessments of initial self-assessment, performance, and task difficulty also indicates that students did not have to perform poorly on a difficult task to observe the decrease in self-assessment accounted for by the task being difficult. The results indicate that engagement in more difficult tasks decreased self-assessments for students of all performance levels. Again, this suggests that students that initially overestimated their ability would lower their self-assessments and seemingly become more accurate. This also suggests that students that were initially accurate would also lower their self-assessments, possibly causing them to then underestimate their ability. Lastly, this suggests that students that initially underestimated their ability would do so even more. Using the standardized coefficient results of this study, an illustrative example can be made. For students that had the greatest initial underestimations of ability (low initial self-assessment and high performance), there was an observed increase in self-assessment, indicating regulation towards greater accuracy, that was almost equal to the observed average decrease in student self-assessment accounted for by task difficulty. So, in this study, students that initially underestimated themselves the most, after completing a difficult task as compared to a task with average difficulty, were observed, on average, to have little cumulative change in their self-assessment, leaving them to continue to underestimate themselves greatly. Focus and decisions made by instructors to correct overestimation can impact not only those students, but all students, and this should be kept in mind.
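To make the "almost equal" claim concrete, the two standardized effects quoted above can simply be combined; the short sketch below does the arithmetic using the coefficients from Table 4 (an illustration only, not the authors' calculation).

```python
# Net standardized change for the most-underestimating group on a difficult task:
# the underestimation correction (about +0.84 SD) combined with the
# high-task-difficulty effect (-0.784 SD vs. average difficulty).
underestimation_correction = 0.480 + 0.441 - 0.086   # low initial self-assessment, high performance
high_difficulty_effect = -0.784                       # higher vs. average task difficulty
print(round(underestimation_correction + high_difficulty_effect, 2))  # ~0.05 SD net change
```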
Comparing the standardized coefficients for lower task difficulty and higher task difficulty, having higher task difficulty appears to impact self-assessment over three times as much. This suggests that for tasks of low difficulty, the possible opposite differential impacts are not as great. For example, the decrease accounted for by high task difficulty (as compared to average task difficulty) would cancel out almost all of the increase accounted for by initial self-assessment and performance for those that initially underestimated themselves. Whereas, the increase accounted for by low task difficulty (as compared to average task difficulty) would cancel out only some of the decrease accounted for by initial self-assessment and performance for those that initially overestimated themselves. That is to say, the correction by students that initially overestimated themselves would appear smaller when given an easier task, but would still be present.

The suggested conclusions presented in this study, which are evidenced by the magnitude of the coefficients, standardized or not, are limited by the range of task difficulty. For a full examination of task difficulty, a non-categorical measure with a large range of difficulties is recommended. The magnitude of change in self-assessment observed with varied task difficulty is worthy of further research, as the results of this study suggest that the use of more difficult instructional tasks may cause the appearance of correction in overestimation and block correction in underestimation. The appearance of correction in overestimation is suggested by the observed decreases in self-assessment reflective of

task difficulty independent of performance or accuracy. Blocking The correlative relationship could suggest that students that
correction in underestimation is suggested by the observed lowered their self-assessment beyond what was accounted for
decreases in self-assessment accounted for by task difficulty by initial self-assessment, performance, and task difficulty,
cancelling out the observed increases accounted for by initial were then more likely to report having a lower understanding
self-assessment and performance. Further, the results of this than their peers. If perceived comparative understanding is
study indicate the importance of controlling for task difficulty something that is not easily impacted and was present prior to
when looking at self-assessment. and during the data collection, that perception may have
The effect size or strength of the relationship between task influenced the changes in self-assessment. Further research
difficulty and change in students’ self-assessments was found would need to be conducted to explore these possibilities. The
to be moderate with only initial self-assessment being found relative magnitude of the standardized coefficients being as
to have a stronger relationship. The effect size or strength of the high as they were and the strength of the correlative relation-
relationship between performance and change in students’ ship being found to be small, but not negligible suggest that
self-assessments was found to be negligible. Factors other than how students perceive their understanding as compared to
initial self-assessment and performance with greater effect their peers is worthy of further research that could answer
sizes and larger standardized coefficients, like task difficulty questions the results this study simply cannot speak to.
Published on 22 March 2021. Downloaded on 5/15/2021 3:47:49 PM.

suggest the high influence of instructional decisions differentially Actual comparative performance was not found to signifi-
impacting students’ self-assessments. cantly relate to change in students’ assessments of their under-
standing following an individual task. The results this study for
Insights on the effects of comparative understanding variables actual comparative performance and perceived comparative
Many researchers have used comparative measures of under- understanding triggered the consideration of possible challenges
standing like percentile rankings and categorical comparisons for instructors as it is easier to assess how a student compares to
as dependent variables (Dunning et al., 1989; Kruger and their peers by looking at their performance (scores/grades) than it
Dunning, 1999; Mathabathe and Potgieter, 2014; Pazicni and is to assess how a student thinks they compare to their peers and
Bauer, 2014; Hawker et al., 2016). Fewer studies have asked these may not be aligned.
students to self-assess their understanding in an absolute In follow-up results, evaluation was performed to assess how
manner as the dependent variable (Bell and Volckmann, 2011). accurately perceived comparative understanding matched
Research in chemistry education lacks studies that evaluate the actual comparative performance. As shown in Table 5, most
effect on self-assessment of understanding of comparative mea- students in our sample (74%) reported that they had about the
sures of understanding as independent variables. The findings of same level of understanding as that of their peers. Of the
this study contribute in this area by providing insights into the students that were categorized as having lower actual comparative
relationship between how students think their understanding performance, only 18% accurately reported thinking that they
compares to that of their peers and self-assessment, and the understood less than their peers, 67% inaccurately reported about
relationship between how they actually differ in performance the same understanding as their peers, and 16% overestimated
from their peers and self-assessment. themselves even more by reporting having higher understanding
Findings of this study may provide evidence that suggest than their peers. Of the students that were categorized as having
possible causal relationships and/or hypothesize possible causal higher actual comparative performance, only 20% accurately
relationships worthy of further research; however, only correlative thought they had higher understanding, 72% inaccurately
relationships and what the observed changes were accounted for reported about the same understanding as their peers, and
by the different effects can be indicated by the results of this 8% underestimated themselves even more by reporting having
study. In the sequence of events that occurred during this study’s lower understanding than their peers.
data collection, the measure for perceived comparative under- Further research is suggested to investigate if similar results
standing was the only measure that occurred after the pre-task would be found if exploring the change in students’ self-
and post-task self-assessment measures used to calculate the assessments following a collaborative task. A study looking at
change in self-assessed understanding. Causality cannot be the change in self-assessment following a collaborative task could
implied by a sequence of events or measurements. Nor can a have possible instructional implications regarding the formation
sequence of measurements imply a lack of causality, as what is of small collaborative groups in chemistry classrooms. Instructors
being measured may not change over time. Even so, a causal may want to reflect on the use of performance measures to assign
relationship between change in students’ self-assessed under- students to groups, if for instance, they were attempting to make
standing and perceived comparative understanding is not being decisions that could impact self-assessment of understanding.
suggested. Results from this study indicate that students who Having lower performing students work with higher performing
later reported their understanding to be lower than their peers, as students on in-class collaborative activities may or may not impact
compared to those who later reported their understanding to be students’ self-assessed understanding as one might initially think.
about the same as their peers, were observed to have lowered their The results this study lead to the recommendation for further
self-assessed understanding across all subunits and accounting research to see if students’ self-assessments following a colla-
for all other variables and vice versa for students who later borative activity are impacted by perceptions of comparative
reported their understanding to be higher than their peers. understanding and actual comparative performance


Limitations of study

This study's ability to make causal inferences regarding student factors is limited, as they are not randomly assigned independent variables. Because the student variables are trait variables, only correlative conclusions can be made regarding their relationships.

The goal of this study was to present a characterization of the changes in self-assessed understanding that were observed following an instructional activity. This study's ability to generalize the findings was limited by several factors. First, the statistical methods used have the potential for generating false positives. Second, this study was conducted at a single university with a student population that may differ from those in other settings. Further testing is recommended with different populations, conditions, and analytical methods to provide additional evidence for the generalizability of the characterization, ensuring it is not situational to the given population, conditions, or analyses of this study.

This study focused on change in self-assessment of understanding. This research thus focused on a small, but important, aspect of metacognition. Veenman (2006, 2011, 2017) has emphasized that studies like ours that are centred on a single metacognitive skill like monitoring, although important, may not provide enough information to design effective instructional interventions.

Conflicts of interest

There are no conflicts to declare.

Appendix 1: model building

Model building was conducted based on the best practice of evaluating the inclusion of fixed effects and random effects by statistically significant improvement to model fit (Peugh, 2010; Snijders and Bosker, 2012; Theobald, 2018). The p-values for likelihood ratio tests (LRTs) were reported and are indicative of the significance of the effect on DSAU, as evidenced by the statistical significance of the improvement in model fit.

To determine if a factor was a significant effect on DSAU, a series of LRTs was performed, comparing a model with the factor as an effect to a model without such a factor (three to four models were conducted per factor) (Peugh, 2010; Snijders and Bosker, 2012). For a summary of the fixed and random effects used in the model building progression, see Table 6. A fixed-slope model (FSM) using full maximum likelihood (ML) estimation (Model j.f.ML) and a FSM using restricted maximum likelihood (REML) estimation (Model j.f.REML) were completed with the previous model and the addition of the factor (j) as a fixed effect (f). The addition of the fixed effect added a fixed slope across all session subunits to the model equation, which quantified the overall effect of the factor on the change in students' self-assessments of their understanding (DSAU) overall, or across all sessions. An unconditional slopes model (USM) using ML estimation (Model j.f&r.ML) and a USM using REML estimation (Model j.f&r.REML) were conducted with the previous model and the addition of the factor as both fixed and random effects (f&r). The addition of the random effect added a random slope for each session subunit to the model equation that allowed for differences in the factor's relationship with DSAU between sessions. If the between-session variation of the factor's relationship with DSAU is significant, the random effect should be included in the model (Peugh, 2010; Theobald, 2018). With the addition of the random slopes, the fixed slope still quantified the overall effect of the factor on DSAU, but controlled for, or removed, the impact of between-session differences in that factor's relationship with DSAU.

For each factor considered, a series of LRTs comparing the models was performed. First, an LRT was performed comparing that factor's Model j.f.ML to the previous ML-estimated model to determine if the addition of the fixed effect produced a statistically significant improvement to the model fit. If there was a statistically significant improvement in model fit, the factor as a fixed effect was determined to be a significant predictor of DSAU and therefore said to relate to DSAU. If the fixed effect was not found to significantly predict DSAU, the remaining LRTs were performed, but not reported. Second, an LRT was performed comparing Model j.f&r.ML to the previous ML-estimated model to determine if the addition of both the fixed and random effects produced a significant improvement to the model fit. Third, an LRT was performed comparing Model j.f&r.REML to Model j.f.REML to determine if the addition of the factor as a random effect was needed, as indicated by a statistically significant improvement to the model fit. REML estimation was used for the third LRT as the only difference between the two models was a random effect (Raudenbush and Bryk, 2002; Peugh, 2010; Theobald, 2018).
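The LRT comparisons described above can be reproduced with most mixed-model software; the sketch below shows the general pattern in Python with statsmodels (ML estimation for a fixed-effect comparison) under assumed column names. It illustrates the procedure rather than reproducing the authors' code.

```python
# Sketch of one LRT step (fixed effect added): compare nested ML-estimated
# mixed models and compute the chi-squared p-value from the log-likelihoods.
# Column names (dSAU, performance, session) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("responses.csv")

smaller = smf.mixedlm("dSAU ~ 1", df, groups=df["session"]).fit(reml=False)
larger = smf.mixedlm("dSAU ~ performance", df, groups=df["session"]).fit(reml=False)

lr_stat = 2 * (larger.llf - smaller.llf)   # likelihood ratio statistic
df_diff = 1                                # one added fixed-effect parameter
p_value = stats.chi2.sf(lr_stat, df_diff)
print(f"chi2({df_diff}) = {lr_stat:.2f}, p = {p_value:.4f}")
```

The same pattern applies to the second and third comparisons, with the degrees of freedom set to the number of added parameters and REML fits used when only a random effect differs.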
Dummy coding

Factors that were categorical variables were "dummy coded" using the following methods from Pedhazur (1997). If a factor (C) had three categories (C1, C2, and C3), two vectors (V1 and V2) would be added that matched the description of two of those three categories. The category without a matching vector, in this example C3, would be the reference category and reported as the "reference variable". For those in C1, V1 = 1 and V2 = 0. For those in C2, V1 = 0 and V2 = 1. And for those in C3, V1 = 0 and V2 = 0.

The following model building methods for the categorical variable factors were based on methods and recommendations from previous research (Cohen, 1991; Nezlek, 2008; Darlington and Hayes, 2016). To determine if a categorical factor (C) was a significant effect on DSAU, the series of LRTs was performed comparing a model with V1 and V2 as variables to a model without those variables. This method tests the statistical significance of the factor (all categories) in a single step by model fit. This method was used because the alternative method, having each category of a factor individually added in model building to test each as a fixed effect, would mathematically result in the issue of singularity. To avoid this issue, a reference category must be selected. Reference categories were selected based on the interpretability of the results. For example, if a factor had three categories – below average, about average, and above average – about average would be selected as the reference
Table 6 Summary of model building

Estimation methods (ML or REML) are not shown. ‡ Additional effect that improved model fit. + Effect included in the model. − Additional effect that did not improve model fit. Highlighted boxes signify the difference in that model from that prior in model building. a Categorical variable.
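As an illustration of the dummy-coding scheme described in Appendix 1 (one indicator vector per non-reference category, with the reference category coded as all zeros), the short sketch below builds V1 and V2 for a three-category factor in Python; the data and names are hypothetical.

```python
# Dummy coding a three-category factor C (below average / about average / above average)
# with "about average" as the reference category: V1 and V2 indicator vectors.
import pandas as pd

df = pd.DataFrame({"C": ["below average", "about average", "above average", "about average"]})

dummies = pd.get_dummies(df["C"])
df["V1"] = dummies["below average"].astype(int)   # 1 only for below-average rows
df["V2"] = dummies["above average"].astype(int)   # 1 only for above-average rows
# "about average" rows get V1 = 0 and V2 = 0, making it the reference category.
print(df)
```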

category. This allows for the below average category to logically describe what occurs below the reference of the average. The model building progression summarized in Table 6 would therefore result in a final model (Table 4) that included all the factors identified as significant effects on DSAU.

Appendix 2: practical significance

Standardized and partially standardized fixed effect coefficients

This analysis speaks to how big of an effect the factor is, or how much change is observed in students' self-assessments when that factor varies. Coefficients of fixed effects indicate the relationship between the variables – as the factor changes, how does DSAU change across all subunits, accounting for all other variables. If a fixed effect is also a random effect, the coefficient still indicates the relationship between the variables, but is on average across all subunits, accounting for all other variables.

Standardized fixed effect coefficients can be calculated for factors that are continuous variables without significant interactions with other fixed effects by multiplying the fixed effect coefficient by the standard deviation of the factor and dividing
by the standard deviation of the outcome variable (Snijders and Bosker, 2012). Standardized fixed effect coefficients for factors that are continuous variables with significant interactions with another fixed effect cannot be calculated in this manner, however. Following Friedrich's (1982) procedure for variables with interactions, a standardized model was used. z-Scores for the factors, and the cross-product of the z-scores as the interaction variable, were substituted into the final model for the unstandardized variables, and the z-score for the outcome variable was substituted in for the unstandardized outcome variable, DSAU. The fixed effect coefficients, without manipulation, from this model are the standardized fixed effect coefficients for the variables with interactions and the interaction variable. A more detailed procedure and analysis of the method can be found in Friedrich (1982) and Aiken and West (1991).

Partially standardized fixed effect coefficients were calculated for each category (other than the reference category) of factors that were dummy coded categorical variables by dividing the fixed effect coefficient for that category by the standard deviation of the outcome variable (Lorah, 2018).
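A small sketch of these three calculations (simple standardization, Friedrich's z-score refit for models with interactions, and partial standardization for dummy-coded categories) is given below in Python. It is illustrative only; the file, column names, and example coefficient values are assumptions.

```python
# Sketch of the three standardization approaches described above (hypothetical names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")

# (1) Simple standardization for a continuous factor with no interaction terms:
#     beta = B * SD(factor) / SD(outcome)
B_factor = 0.022                                   # example unstandardized coefficient
beta_simple = B_factor * df["performance"].std() / df["dSAU"].std()

# (2) Friedrich (1982): refit the final model on z-scores; the coefficients of the
#     refitted model are the standardized coefficients, interaction included.
def z(s):
    return (s - s.mean()) / s.std()

df["z_dSAU"] = z(df["dSAU"])
df["z_init"] = z(df["initial_sau"])
df["z_perf"] = z(df["performance"])
std_fit = smf.mixedlm("z_dSAU ~ z_init * z_perf", df, groups=df["session"],
                      re_formula="~z_perf").fit(reml=True)
print(std_fit.params)

# (3) Partial standardization for a dummy-coded category: divide by SD(outcome) only.
B_category = -0.428                                # example coefficient for one category
beta_partial = B_category / df["dSAU"].std()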
Cohen's f²

Practical significance was additionally assessed by calculating the local effect size measure, Cohen's f², a function of proportional variance. The values for the proportion of variance accounted for by a set of fixed effects were calculated as a function of residual variances (Selya et al., 2012):

R_i² = (V_null − V_i) / V_null

where V_null is the residual variance of a model without any fixed effects and V_i is the residual variance of a model with the set of fixed effects, i, with both V_null and V_i calculated holding the variance accounted for by random effects constant.

For a given fixed effect, A, that did not have significant interactions with other fixed effects, Cohen's f² was calculated as a measure of effect size (Cohen, 1977; Selya et al., 2012; Lorah, 2018):

f² = (R_SA² − R_S²) / (1 − R_SA²)

where A is the given fixed effect, S is the set of all statistically significant fixed effects included in the final model excluding A, R_SA² is the proportion of variance accounted for by the set of fixed effects in the final model (S and A), and R_S² is the proportion of variance accounted for by S.

For a given fixed effect, B, that did have a significant interaction with another fixed effect, Cohen's f² was calculated as a measure of effect size (Cohen, 1977):

f² = (R_SB² − R_S²) / (1 − R_SBI²)

where B is the given fixed effect, I is the set of interaction variables of the given fixed effect, S is the set of all statistically significant fixed effects included in the final model excluding B and I, R_SBI² is the proportion of variance accounted for by the set of fixed effects in the final model (S, B, and I), R_SB² is the proportion of variance accounted for by S and B, and R_S² is the proportion of variance accounted for by S.

For a given interaction variable, I, between fixed effects j and k, Cohen's f² was calculated as a measure of effect size (Cohen, 1977):

f² = (R_jkI² − R_jk²) / (1 − R_jkIS²)

where I is the given interaction between fixed effects j and k, S is the set of all statistically significant fixed effects included in the final model excluding I, j, and k, R_jkIS² is the proportion of variance accounted for by the set of fixed effects in the final model (j, k, I, and S), R_jkI² is the proportion of variance accounted for by I, j, and k, and R_jk² is the proportion of variance accounted for by j and k.
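These quantities follow directly from the residual variances of the fitted models; the sketch below implements the proportion-of-variance formula and the f² formula for a fixed effect without interactions. The numeric inputs are example values only, not results from this study.

```python
# Sketch: proportion of variance from residual variances (Selya et al., 2012)
# and Cohen's f^2 for a fixed effect A without interactions.
def r_squared(v_null, v_model):
    """R_i^2 = (V_null - V_i) / V_null, holding random effects constant."""
    return (v_null - v_model) / v_null

def cohens_f2(r2_with, r2_without):
    """f^2 = (R_SA^2 - R_S^2) / (1 - R_SA^2)."""
    return (r2_with - r2_without) / (1 - r2_with)

# v_null, v_S, v_SA would be the residual variances of the null model, the model
# with the other significant fixed effects (S), and the model adding effect A.
v_null, v_S, v_SA = 0.52, 0.40, 0.35   # example numbers, not study values
f2_A = cohens_f2(r_squared(v_null, v_SA), r_squared(v_null, v_S))
print(round(f2_A, 3))
```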
Acknowledgements

We would like to thank Dr Monica Erbacher for the guidance and support she provided in the statistical analysis of the data and Alex Nathe, an undergraduate researcher, for her contributions to this work.

References

Aiken L. S. and West S. G., (1991), Multiple Regression: Testing and Interpreting Interactions, Sage Publications.
Albers M. J., (2017), Graphically representing data, in Introduction to Quantitative Data Analysis in the Behavioral and Social Sciences, John Wiley & Sons Inc., pp. 63–85.
Allison P. D., (1990), Change scores as dependent variables in regression analysis, Sociol. Methodol., 20, 93–114.
Austin Z. and Gregory P. A. M., (2007), Evaluating the accuracy of pharmacy students' self-assessment skills, Am. J. Pharm. Educ., 71(5).
Bacchetti P., (2002), Peer review of statistics in medical research: The other problem, BMJ, 324(7348), 1271–1273.
Bauer D. J. and Sterba S. K., (2011), Fitting multilevel models with ordinal outcomes: Performance of alternative specifications and methods of estimation, Psychol. Methods, 16(4), 373–390.
Bell P. and Volckmann D., (2011), Knowledge surveys in general chemistry: Confidence, overconfidence, and performance, J. Chem. Educ., 88(11), 1469–1476.
Bloom B., (1956), Taxonomy of educational objectives; the classification of educational goals, 1st edn, New York, Longmans, Green.
Bol L. and Hacker D. J., (2001), A comparison of the effects of practice tests and traditional review on performance and calibration, J. Exp. Educ., 69(2), 133–151.
Burson K. A., Larrick R. P. and Klayman J., (2006), Skilled or unskilled, but still unaware of it: How perceptions of difficulty drive miscalibration in relative comparisons, J. Person. Soc. Psychol., 90(1), 60–77.


Butler A. C., Karpicke J. D. and Roediger H. L., (2008), Correcting a metacognitive error: Feedback increases retention of low-confidence correct responses, J. Exp. Psychol.: Learn., Mem., Cogn., 34(4), 918–928.

Carifio J. and Perla R., (2008), Resolving the 50-year debate around using and misusing Likert scales, Med. Educ., 42(12), 1150–1152.

Carvalho F. and Moises K., (2009), Confidence judgments in real classroom settings: Monitoring performance in different types of tests, Int. J. Psychol., 44(2), 93–108.

Cohen A., (1991), Dummy variables in stepwise regression, Am. Stat., 45(3), 226–228.

Cohen J., (1977), F tests of variance proportions in multiple regression/correlation analysis, in Statistical Power Analysis for the Behavioral Sciences, Elsevier, pp. 407–453.

Cohen J., Cohen P., West S. G. and Aiken L. S., (2003), Applied multiple regression/correlation analysis for the behavioral sciences, 3rd edn.

Cooper M. M. and Sandi-Urena S., (2009), Design and validation of an instrument to assess metacognitive skillfulness in chemistry problem solving, J. Chem. Educ., 86(2), 240–245.

Cooper M. M., Sandi-Urena S. and Stevens R., (2008), Reliable multi method assessment of metacognition use in chemistry problem solving, Chem. Educ. Res. Pract., 9(1), 18–24.

Craney T. A. and Surles J. G., (2002), Model-dependent variance inflation factor cutoff values, Qual. Eng., 14(3), 391–403.

Crowe A., Dirks C. and Wenderoth M. P., (2008), Biology in bloom: Implementing Bloom's taxonomy to enhance student learning in biology, CBE – Life Sci. Educ., 7(4), 368–381.

Darlington R. and Hayes A., (2016), Regression analysis and linear models: Concepts, applications, and implementation, Guilford Publications.

Dolan C. V., (1994), Factor analysis of variables with 2, 3, 5 and 7 response categories: A comparison of categorical variable estimators using simulated data, Br. J. Math. Stat. Psychol., 47(2), 309–326.

Dunlosky J. and Thiede K. W., (1998), What makes people study more? An evaluation of factors that affect self-paced study, Acta Psychol., 98, 37–56.

Dunning D., Meyerowitz J. A. and Holzberg A. D., (1989), Ambiguity and self-evaluation: The role of idiosyncratic trait definitions in self-serving assessments of ability, J. Pers. Soc. Psychol., 57(6), 1082–1090.

Ehrlinger J., (2008), Skill level, self-views and self-theories as sources of error in self-assessment, Soc. Pers. Psychol. Compass, 2(1), 382–398.

Ferron J. M., Hogarty K. Y., Dedrick R. F., Hess M. R., Niles J. D. and Kromrey J. D., (2008), Reporting results from multilevel analyses, in Multilevel Modeling of Educational Data, O'Connell A. and McCoach D. B. (ed.), Quantitative Methods in Education and the Behavior Sciences: Issues, Research, and Teaching, Information Age Publishing Inc., pp. 391–426.

Flavell J. H., (1979), Metacognition and cognitive monitoring: A new area of cognitive-developmental inquiry, Am. Psychol., 34(10), 906–911.

Friedrich R. J., (1982), In defense of multiplicative terms in multiple regression equations, Am. J. Pol. Sci., 26(4), 797–833.

Gaito J., (1980), Measurement scales and statistics: Resurgence of an old misconception, Psychol. Bull., 87, 564–567.

Grilli L. and Rampichini C., (2012), Multilevel models for ordinal data, in Modern Analysis of Customer Surveys: with Applications using R, Kenett R. S. and Salini S. (ed.), John Wiley & Sons, Ltd, pp. 391–411.

Hawker M. J., Dysleski L. and Rickey D., (2016), Investigating general chemistry students' metacognitive monitoring of their exam performance by measuring postdiction accuracies over time, J. Chem. Educ., 93(5), 832–840.

Jamieson S., (2004), Likert scales: How to (ab)use them, Med. Educ., 38(12), 1217–1218.

Jones H., Hoppitt L., James H., Prendergast J., Rutherford S., Yeoman K. and Young M., (2012), Exploring students' initial reactions to the feedback they receive on coursework, Biosci. Educ., 20(1), 3–21.

Kim H.-Y., (2013), Statistical notes for clinical researchers: Assessing normal distribution (2) using skewness and kurtosis, Restor. Dent. Endod., 38(1), 52–54.

Kim Y.-H., Kwon H., Lee J. and Chiu C.-Y., (2016), Why do people overestimate or underestimate their abilities? A cross-culturally valid model of cognitive and motivational processes in self-assessment biases, J. Cross-Cult. Psychol., 47(9), 1201–1216.

Kruger J. and Dunning D., (1999), Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments, J. Pers. Soc. Psychol., 77(6), 1121–1134.

Kuzon W. M., Urbanchek M. G. and McCabe S., (1996), The seven deadly sins of statistical analysis, Ann. Plast. Surg., 37(3), 265–272.

Lee V. E., (2000), Using hierarchical linear modeling to study social contexts: The case of school effects, Educ. Psychol., 35(2), 125–141.

Lorah J., (2018), Effect size measures for multilevel models: Definition, interpretation, and TIMSS example, Large-scale Assess. Educ., 6(8).

Luo W. and Azen R., (2013), Determining predictor importance in hierarchical linear models using dominance analysis, J. Educ. Behav. Stat., 38(1), 3–31.

Mabe III P. and West S., (1982), Validity of self-evaluation of ability: A review and meta-analysis, J. Appl. Psychol., 67(3), 280–296.

Mathabathe K. C. and Potgieter M., (2014), Metacognitive monitoring and learning gain in foundation chemistry, Chem. Educ. Res. Pract., 15(1), 94–104.

Moores T. T., Chang J. C.-J. and Smith D. K., (2006), Clarifying the role of self-efficacy and metacognition as indicators of learning: Construct development and test, DATA BASE Adv. Inform. Syst., 37, 125–132.

Muthén B. and Kaplan D., (1985), A comparison of some methodologies for the factor analysis of non-normal Likert variables, Br. J. Math. Stat. Psychol., 38(2), 171–189.

Nelson T. O. and Narens L., (1990), Metamemory: A theoretical framework and new findings, in Psychology of Learning and Motivation, Bower G. (ed.), Academic Press, pp. 125–173.
Nezlek J. B., (2008), An introduction to multilevel modeling for social and personality psychology, Soc. Pers. Psychol. Compass, 2(2), 842–860.

Norman G., (2010), Likert scales, levels of measurement and the "laws" of statistics, Adv. Health Sci. Educ., 15(5), 625–632.

Ohtani K. and Hisasaka T., (2018), Beyond intelligence: A meta-analytic review of the relationship among metacognition, intelligence, and academic performance, Metacogn. Learn., 13(2), 179–212.

Olsson U., (1979), On the robustness of factor analysis against crude classification of the observations, Multivar. Behav. Res., 14(4), 485–500.

Pajares F., (1996), Self-efficacy beliefs in academic settings, Rev. Educ. Res., 66(4), 543–578.

Pazicni S. and Bauer C. F., (2014), Characterizing illusions of competence in introductory chemistry students, Chem. Educ. Res. Pract., 15(1), 24–34.

Pedhazur E., (1997), Multiple Regression in Behavioral Research: Explanation and Prediction, 3rd edn, Harcourt Brace College Publishers.

Peugh J. L., (2010), A practical guide to multilevel modeling, J. Sch. Psychol., 48(1), 85–112.

Raudenbush S. and Bryk A., (2002), Hierarchical linear models: applications and data analysis methods, 2nd edn, Sage.

Rickey D. and Stacy A. M., (2000), The role of metacognition in learning chemistry, J. Chem. Educ., 77(7), 915.

Salmoni A., Schmidt R. and Walter C., (1984), Knowledge of results and motor learning: A review and critical reappraisal, Psychol. Bull., 95(3), 355–386.

Sandi-Urena S., Cooper M. and Stevens R., (2012), Effect of cooperative problem-based lab instruction on metacognition and problem-solving skills, J. Chem. Educ., 89(6), 700–706.

Sargeant J., Mann K., Sinclair D., van der Vleuten C. and Metsemakers J., (2007), Challenges in multisource feedback: Intended and unintended outcomes, Med. Educ., 41(6), 583–591.

Schraw G. and Moshman D., (1995), Metacognitive theories, Educ. Psychol. Rev., 7(4), 351–371.

Schraw G., Crippen K. J. and Hartley K., (2006), Promoting self-regulation in science education: Metacognition as part of a broader perspective on learning, Res. Sci. Educ., 36(1–2), 111–139.

Selya A., Rose J., Dierker L., Hedeker D. and Mermelstein R., (2012), A practical guide to calculating Cohen's f 2, a measure of local effect size, from PROC MIXED, Front. Psychol., 3.

Sharma R., Jain A., Gupta N., Garg S., Batta M. and Dhir S., (2016), Impact of self-assessment by students on their learning, Int. J. Appl. Basic Med. Res., 6(3), 226–229.

Snijders T. and Bosker R., (2012), Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, 2nd edn, Sage.

Stevens S. S., (1946), On the theory of scales of measurement, Sci., New Series, 103(2684), 677–680.

Stevens S. S., (1958), Measurement and man, Science, 127(3295), 383–389.

Talanquer V. and Pollard J., (2010), Let's teach how we think instead of what we know, Chem. Educ. Res. Pract., 11(2), 74–83.

Theobald E., (2018), Students are rarely independent: When, why, and how to use random effects in discipline-based education research, CBE – Life Sci. Educ., 17(rm2).

Thomas R. C., Finn B. and Jacoby L. L., (2016), Prior experience shapes metacognitive judgments at the category level: The role of testing and category difficulty, Metacogn. Learn., 11(3), 257–274.

Veenman M. V. J., (2011), Learning to self-monitor and self-regulate, in Handbook of Research on Learning and Instruction, Routledge.

Veenman M. V. J., (2017), Assessing metacognitive deficiencies and effectively instructing metacognitive skills, Teach. Coll. Rec., 119(130307), 1–20.

Veenman M. V. J., Van Hout-Wolters B. H. A. M. and Afflerbach P., (2006), Metacognition and learning: Conceptual and methodological considerations, Metacogn. Learn., 1, 3–14.

Velleman P. F. and Wilkinson L., (1993), Nominal, ordinal, interval, and ratio typologies are misleading, Am. Stat., 47(1), 65–72.

Wang M. C., Haertel G. D. and Walberg H. J., (1990), What influences learning? A content analysis of review literature, J. Educ. Res., 84(1), 30–43.

Wickham H. and Henry L., (2018), tidyr: easily tidy data with 'spread()' and 'gather()' functions: R package version 0.8.1., retrieved from https://CRAN.R-project.org/package=tidyr.

Woltman H., Feldstain A., MacKay J. C. and Rocchi M., (2012), An introduction to hierarchical linear modeling, TQMP, 8(1), 52–69.

Zimmerman B. J., (2000), Self-efficacy: An essential motive to learn, Contemp. Educ. Psychol., 25(1), 82–91.