The Common European Framework of Reference (CEFR) was empirically derived from experienced teachers' intuitive
judgements, and although it may seem adequate, assessment theorists argue that its rating scale was structured without a
performance-data-driven approach. The objective of this study was therefore to gauge the rating scale functioning of five
CEFR oral assessment criteria (overall impression, range, accuracy, fluency and coherence), based on ratings awarded by
intermediate ESL learners during self- and peer assessments of their speaking skills. Before learners embarked on awarding
ratings to their own performance as well as their peers', they were trained to understand and apply the criteria and the
corresponding ratings using the benchmarked videos provided on the official CEFR website. Findings from the rater
measurement report suggest that learners were within the productive measurement range based on their infit and outfit mean-
squares, and the reliability of their ratings was 0.91. However, only four out of six rating categories were utilized
across all five CEFR oral assessment criteria, as illustrated by the category probability curves for each criterion. The
implications of the findings are discussed in view of learner performance and classroom instruction.
study are discussed in the concluding section.

2. LITERATURE REVIEW

In describing scale development and the construction of descriptors, two empirical studies on the application of the CEFR are frequently reported. These studies are generally referred to as the 'DIALANG project' and the 'ALTE can-do statements'. The DIALANG project was initiated for diagnostic purposes, and the main approach taken by the system was learner-oriented [7]. The system operates on self-assessment, language tests and feedback; it is available in fourteen European languages and easily accessible via the Internet. The DIALANG system does not issue any certificate, as it encourages learners to take greater control over their learning and thereby promotes autonomous learning. The DIALANG self-assessment scales were first developed qualitatively and later calibrated with 304 subjects in order to statistically determine the difficulty level of the items and the corresponding statements, so that a functioning scale could be constructed. Through this procedure, scales in reading, writing and listening were developed. Although statements on speaking were chosen at the beginning of the DIALANG project, speaking was not included in the system for reasons unknown.

Another validation study of the CEFR was conducted by the Association of Language Testers in Europe (ALTE) and is known as the 'ALTE can-do statements'. The aim of this long-term research project is 'to develop and validate a set of performance-related scales, describing what learners can actually do in the foreign language' [8]. The scales consist of approximately 400 can-do statements which have been translated into 12 languages. In the empirical validation of these statements, nearly ten thousand respondents completed the distributed questionnaires, and the analysis was conducted using the Rasch measurement model. The results of the analysis were later linked to the ALTE examinations, and subsequently a five-level system of the ALTE framework was conceptualized. These five levels corresponded broadly to A2 (Waystage), B1 (Threshold), B2 (Vantage), C1 (Effective Operational Proficiency) and C2 (Mastery) of the CEFR, while the Breakthrough level (A1) was still in progress (to the researchers' knowledge at the time of writing). Compared to the DIALANG project, the ALTE skill levels cover all four language skills, with listening and speaking combined as one.

Although these two studies described rating scale development rigorously and empirically based on the CEFR template, samples of learner performance were absent from the construction process. The DIALANG project shared a similar aim with this study, namely to focus on learners as agents of assessment and learning; unfortunately, the speaking section was never developed in the system to fulfil this aim. In contrast, the ALTE framework developed a five-level rating scale which includes speaking and listening skills; however, those scales were based on can-do statements derived from completed questionnaires. Therefore, the researchers intend to fill this gap by gauging intermediate ESL learners' utilization of the ratings prescribed in the CEFR oral assessment criteria during their self- and peer assessment (SAPA) practice.

3. METHODOLOGY

Prior to SAPA practice, thirteen ESL learners with an intermediate level of English proficiency voluntarily participated in rater training. During rater training, participants were trained by one of the researchers to (i) understand the CEFR descriptors and ratings through a jigsaw activity and (ii) apply their understanding of the CEFR criteria while watching the DVDs available on the official CEFR website (http://www.ciep.fr/). In the jigsaw activity [9], participants slotted missing descriptors into the relevant grids of the CEFR oral assessment criteria so that they would better understand the criteria. This activity was essential to SAPA practice in order to maintain the validity of the ratings awarded later. After that, one of the researchers discussed the matching descriptors and their corresponding ratings based on the comprehensive manual provided on the website. Once the researchers were satisfied with participants' understanding of the CEFR descriptors, viewing of the sample DVDs began. While viewing, participants rated the speakers by applying the CEFR six-level rating scale from A1 (basic) to C2 (proficient) on overall impression, range, accuracy, fluency and coherence. Although the CEFR oral assessment criteria include a specific grid describing interaction, this criterion was not included in the practice in order to control for interlocutor effects [10].

To check participants' accuracy and consistency before SAPA practice, their ratings during rater training were first analyzed with Winsteps, a Rasch measurement software package. Two participants (S11 and S06) were found to have exceeded the range for productive measurement (0.5-1.5) in the Rasch model [11]; therefore, their ratings were excluded from the SAPA practice analysis. During SAPA practice, each participant uploaded three videos of their speeches to a private YouTube channel, totalling thirty-nine videos over the twelve weeks of the practice. Each video uploaded for each assessment session - three assessment sessions in total - was assessed by the participants themselves and by their peers based on the five CEFR oral assessment criteria. Participants used the six-level rating scale (A1-C2) in each assessment session. At the end of the study, participants' ratings were analyzed using Many-Facet Rasch Measurement (MFRM) to gauge the rating scale functioning of each criterion. The findings are reported in the next section.

4. FINDINGS

Prior to reporting the rating scale functioning of the CEFR oral assessment criteria, the participant (rater) measurement report is displayed in Table 1. The table shows the severity measure and the infit and outfit mean-squares of the eleven participants in this study. In the MFRM analysis, the severity measure refers to a rater's average tendency to assign lower (severe) or higher (lenient) scores than expected, taking into consideration the scores given by other raters to the same performance [12]. Infit mean-square is sensitive to the internal consistency of raters, while outfit mean-square flags outliers in the data. For productive measurement in the MFRM analysis, the acceptable range is between 0.5 and 1.5. Based on Table 1, only S8 exceeded the values for productive measurement (infit: 1.91 and outfit: 1.93). Compared to the other participants, S8 could be considered as having lower internal consistency and behaving differently than expected during SAPA practice. However, since the infit and outfit mean-squares did not exceed 2.0 [13], S8 remained in the rating scale analysis, as these values did not distort the measurement.

Table 1. Participant (Rater) Measurement Report

Rater   Severity Measure   Infit MnSq   Outfit MnSq
S7           -1.46            0.91          1.01
S5           -0.34            0.60          0.57
S3           -0.26            0.90          0.93
S9           -0.15            1.18          1.14
S10          -0.15            0.64          0.67
S12           0.02            0.98          1.06
S4            0.05            0.88          0.84
S1            0.30            1.04          0.95
S13           0.49            0.83          0.82
S2            0.57            1.08          0.96
S8            0.93            1.91          1.93

S.D.: 0.59; Separation: 3.20; Reliability: 0.91

From Table 1, it is evident that six of the eleven participants were more severe in awarding their ratings. Severity in awarding ratings is indicated by a positive severity measure; the severe raters were S12, S4, S1, S13, S2 and S8, with values ranging from 0.02 to 0.93 logits. Amongst the five lenient raters, S7 was the most lenient during SAPA practice, at -1.46 logits. These measures imply that participants had nearly equal numbers of severe and lenient raters rating their videos on the private YouTube channel.

The MFRM analysis also revealed that raters were statistically categorized into more than three levels of rating performance, with a separation value of 3.20. This indicates that raters displayed about three distinct ways of rating themselves and their peers. Since the separation value was greater than 2, the reliability of the measurement was high, at 0.91.

With the exception of S8, the infit mean-squares of the ten raters were between 0.60 and 1.18. This suggests that almost all raters were internally consistent when they rated themselves as well as their peers. One explanation for this internal consistency could be the rater training provided before SAPA practice commenced. This finding is consistent with many studies which reported that rater training is crucial in ensuring rater consistency during rating practice [14]. Again, with the exception of S8, the outfit mean-squares of the ten raters were between 0.57 and 1.14. This indicates that raters were able to award the expected ratings for participants' performances, despite the relative perceptions of very good or very bad performance as adjusted by the MFRM algorithm.

The rater measurement report thus implies that the intermediate ESL learners were generally able to rate consistently and accurately. One probable reason for these findings could be the jigsaw activity previously conducted with the participants. When participants were engaged in slotting the correct descriptors into their matching CEFR levels, their understanding of the oral assessment criteria was enhanced and, consequently, they became more focused while rating the speakers. In addition, the discussion after the activity helped these participants to clarify any misconceptions in their understanding of the criteria. By engaging participants in the discussion, the validity of the ratings awarded was maintained, since participants' understanding of the criteria was translated into fair ratings which directly described the speakers' performance.

Besides the rater measurement report, the MFRM analysis also gauged the rating scale functioning of each criterion observed during SAPA practice. Raters' rating tendencies for overall impression, range, accuracy, fluency and coherence were illustrated through category probability curves. In each category probability curve, the horizontal axis is the participants' proficiency scale and the vertical axis shows the probability of the ratings used during SAPA practice. If a category was observed more often during practice, it shows a higher probability curve. Figures 1 to 5 present the category probability curves for each CEFR criterion observed during SAPA practice.

Fig. 1. Category probability curves for overall impression
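The fit statistics and reliability reported above follow standard Rasch definitions: outfit is the unweighted mean of squared standardized residuals, infit is the information-weighted mean-square, and reliability relates to separation as G²/(1 + G²). The sketch below is a minimal illustration of these formulas, not the Winsteps or Facets implementation; the observed/expected values in the tests are toy data.

```python
def outfit_mnsq(obs, exp, var):
    # Outfit: unweighted mean of squared standardized residuals,
    # z^2 = (x - E)^2 / Var; sensitive to outlying unexpected ratings.
    return sum((x - e) ** 2 / w for x, e, w in zip(obs, exp, var)) / len(obs)

def infit_mnsq(obs, exp, var):
    # Infit: information-weighted mean-square, sum of squared residuals
    # divided by the sum of variances; sensitive to inlying misfit.
    return sum((x - e) ** 2 for x, e in zip(obs, exp)) / sum(var)

def reliability(separation):
    # Rasch (separation) reliability: R = G^2 / (1 + G^2).
    return separation ** 2 / (1 + separation ** 2)

# The study's separation value of 3.20 reproduces the reported reliability:
print(round(reliability(3.20), 2))  # -> 0.91
```

This also explains why a separation above 2 is taken as evidence of high reliability: G = 2 already gives R = 4/5 = 0.80.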
Adv. Sci. Lett. X, XXX–XXX, 2015 RESEARCH ARTICLE
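The category probability curves in Figures 1 to 5 can in principle be computed from the Andrich rating scale model that underlies MFRM: the probability of category k is proportional to the exponential of the cumulative sum of (theta - delta - tau_j) over the first k thresholds. The sketch below is illustrative only; the threshold values are hypothetical, not the estimates from this study.

```python
import math

def category_probs(theta, delta, thresholds):
    # Andrich rating scale model: probability of each rating category
    # k = 0..K, given person measure theta, criterion difficulty delta,
    # and K Andrich thresholds tau_1..tau_K (all in logits).
    logits = [0.0]  # empty cumulative sum for the lowest category
    cum = 0.0
    for tau in thresholds:
        cum += theta - delta - tau
        logits.append(cum)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical thresholds for a six-category (A1-C2) scale:
taus = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = category_probs(theta=0.5, delta=0.0, thresholds=taus)
```

A rating category is considered well utilized when its curve peaks somewhere along the proficiency scale, i.e. it is the most probable category for some range of theta; a category whose curve never peaks corresponds to the underutilized categories reported in this study.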