
RESEARCH ARTICLE XXXXXXXXXXXXXXXXX

Copyright © 2015 American Scientific Publishers
All rights reserved
Printed in the United States of America

Advanced Science Letters, Vol. XXXXXXXXX

CEFR Rating Scale: Scaling its Functionality through ESL Learners' Self- and Peer Assessments

Mardiana Idris1, Abdul Halim Abdul Raof2*

1Faculty of Education, Universiti Teknologi Malaysia, 81310 Skudai, Malaysia
2Language Academy, Universiti Teknologi Malaysia, 81310 Skudai, Malaysia

The Common European Framework of Reference (CEFR) rating scale was empirically derived from experienced teachers' intuitive judgements, and although this may seem adequate, assessment theorists point to the absence of a performance data-driven approach in the structuring of the scale. The objective of this study was therefore to gauge the rating scale functioning of five CEFR oral assessment criteria (overall impression, range, accuracy, fluency and coherence), based on ratings awarded by intermediate ESL learners during self- and peer assessments of their speaking skills. Before the learners awarded ratings to their own performance and that of their peers, they were trained to understand and apply the criteria and the corresponding ratings using the benchmarked videos provided on the official CEFR website. Findings from the rater measurement report suggest that the learners were within the productive measurement range based on their infit and outfit mean-squares, and the reliability of their ratings was 0.91. However, only four of the six rating categories were used across all five CEFR oral assessment criteria, as illustrated by the category probability curve for each criterion. The implications of these findings are discussed in view of learner performance and classroom instruction.

Keywords: CEFR rating scale, self-assessment, peer-assessment, oral proficiency

1. INTRODUCTION

The Common European Framework of Reference (CEFR) has been widely utilized in Europe and beyond its borders, mainly because its descriptors and rating scales were empirically derived from 2000 descriptors of 30 available scales [1] by experienced teachers. However, this 'common currency' [2,3] for describing learners' level of attainment in language proficiency has also garnered some criticism. One contention pertains to the scaling structure, which originated mainly from experienced teachers' intuitions rather than from learners' performance [4]. In addition, the lack of studies on the CEFR scaling structure drawing on samples of learner performance could be due to the CEFR being treated as a standardized assessment by policy makers rather than as a framework or 'heuristic model' [4]. Consequently, this diminishes 'diversity and experimentation' [5] in research on learners' real application of the CEFR rating scales. In fact, many studies on the CEFR [6] rating scales revolve around reliability and validity concerns, mostly in standardized assessment contexts.

Therefore, the objective of this study was to gauge the rating scale functioning of five CEFR oral assessment criteria (overall impression, range, accuracy, fluency and coherence) during self- and peer assessment (SAPA) practice of ESL learners' speaking ability. The contribution of this study lies in offering an understanding of ESL learners' application of the CEFR analytical descriptors in tandem with the rating scale functioning. The paper is organized as follows. In section 2, studies on related scaling development and the construction of descriptors are reviewed. Section 3 describes the methodology of the study. Findings are presented in section 4 and, finally, the implications of this

*Email Address: m-halim@utm.my

study are discussed in the concluding section.

2. LITERATURE REVIEW

In describing scaling development and the construction of descriptors, two empirical studies on the application of the CEFR are frequently reported. These studies are generally referred to as the 'DIALANG project' and the 'ALTE can-do statements'. The DIALANG project was initiated for diagnostic purposes, and the main approach taken by the system was learner-oriented [7]. The system operates on the basis of self-assessment, language tests and feedback; it is available in fourteen European languages and easily accessible via the Internet. The DIALANG system does not issue any certificate, as it encourages learners to take greater control over their learning and consequently promotes autonomous learning. The DIALANG self-assessment scales were first developed qualitatively and later calibrated with 304 subjects in order to statistically determine the item difficulty level of the corresponding statements so that a functioning scale could be constructed. From this procedure, scales in reading, writing and listening were developed. Although statements on speaking were chosen at the beginning of the DIALANG project, speaking was not included in the system for reasons unknown.

Another validation study of the CEFR was conducted by the Association of Language Testers in Europe (ALTE) and is known as the 'ALTE can-do statements'. The aim of this long-term research project is 'to develop and validate a set of performance-related scales, describing what learners can actually do in the foreign language' [8]. The scales consist of approximately 400 can-do statements which have been translated into 12 languages. In the empirical validation of these statements, nearly ten thousand respondents completed the distributed questionnaires, and the analysis was conducted using the Rasch measurement model. Results of the analysis were later linked to the ALTE examinations and, subsequently, a five-level system of the ALTE framework was conceptualized. These five levels corresponded broadly to A2 (Waystage), B1 (Threshold), B2 (Vantage), C1 (Effective Operational Proficiency) and C2 (Mastery) of the CEFR, while the breakthrough level (A1) is still in progress (to the researchers' knowledge at the time of writing). Compared to the DIALANG project, the ALTE skill levels cover all four language skills, with listening and speaking combined as one.

Although these two studies described the development of the rating scales rigorously and empirically based on the CEFR template, samples of learner performance were absent from the construction process. The DIALANG project shared a similar aim with this study, namely to focus on learners as agents of assessment and learning. Unfortunately, the speaking section was not developed in the system to fulfil this aim. In contrast, the ALTE framework developed a five-level rating scale which includes speaking and listening skills. However, the scales were based on 'can-do statements' derived from completed questionnaires. Therefore, the researchers intend to fill this gap by gauging intermediate ESL learners' utilization of the ratings prescribed in the CEFR oral assessment criteria in their SAPA practice.

3. METHODOLOGY

Prior to SAPA practice, thirteen ESL learners with an intermediate level of English proficiency voluntarily participated in rater training. During the training, participants were trained by one of the researchers to (i) understand the CEFR descriptors and ratings through a jigsaw activity and (ii) apply their understanding of the CEFR criteria while watching the DVDs available on the official CEFR website (http://www.ciep.fr/). In the jigsaw activity [9], participants slotted missing descriptors into the relevant grids of the CEFR oral assessment criteria so that they would better understand the criteria. This activity was essential to SAPA practice in order to maintain the validity of the ratings awarded later. After that, one of the researchers discussed matching descriptors and their corresponding ratings based on the comprehensive manual provided on the website. Once the researchers were satisfied with participants' understanding of the CEFR descriptors, viewing of the sample DVDs began. While viewing, participants rated the speakers by applying the CEFR six-level rating scale, from A1 (basic) to C2 (proficient), on overall impression, range, accuracy, fluency and coherence. Although the CEFR oral assessment criteria include a specific grid describing interaction, this criterion was not included in the practice in order to control for interlocutor effects [10].

To control for participants' accuracy and consistency during SAPA practice, their ratings during rater training were first analyzed with Winsteps, a Rasch measurement software package. Two participants (S11 and S06) were found to have exceeded the range for productive measurement (0.5-1.5) in the Rasch model [11]; their ratings were therefore excluded from the SAPA practice analysis. During SAPA practice, each participant uploaded three videos of their speeches to a private YouTube channel, totalling thirty-nine videos over the twelve weeks of the practice. Each uploaded video in each of the three assessment sessions was assessed by the participants themselves and by their peers based on the five CEFR oral assessment criteria, using the six-level rating scale (A1-C2). At the end of the study, participants' ratings were analyzed using Many-Facet Rasch Measurement (MFRM) to gauge the rating scale functioning for each criterion. The findings are reported in the next section.

4. FINDINGS

Prior to reporting the rating scale functioning of the CEFR oral assessment criteria, the participant (rater) measurement report is displayed first in Table 1. The table shows the severity measure and the infit and outfit mean-squares of the eleven participants in this study. In the MFRM analysis, the severity measure refers to raters' average tendency to assign lower (severe) or higher (lenient)

scores than expected, taking into consideration the scores given by other raters to the same performance [12]. The infit mean-square is sensitive to the internal consistency of raters, while the outfit mean-square signals outliers in the data. For the MFRM analysis, the acceptable range for productive measurement is between 0.5 and 1.5. Based on Table 1, only S8 exceeded this range (infit: 1.91; outfit: 1.93). Compared to the other participants, S8 could be considered to have lower internal consistency and to have behaved differently than expected during SAPA practice. However, since the infit and outfit mean-squares did not exceed 2.0 [13], S8 was retained in the rating scale analysis, as this did not distort the measurement.

Table 1. Participant (Rater) Measurement Report

Rater   Severity Measure   Infit MnSq   Outfit MnSq
S7      -1.46              0.91         1.01
S5      -0.34              0.60         0.57
S3      -0.26              0.90         0.93
S9      -0.15              1.18         1.14
S10     -0.15              0.64         0.67
S12      0.02              0.98         1.06
S4       0.05              0.88         0.84
S1       0.30              1.04         0.95
S13      0.49              0.83         0.82
S2       0.57              1.08         0.96
S8       0.93              1.91         1.93

S.D.: 0.59; Separation: 3.20; Reliability: 0.91

From Table 1, it is evident that six of the eleven participants were more severe in awarding their ratings. Severity is indicated by a positive severity measure, identified here for S12, S4, S1, S13, S2 and S8, with values ranging from 0.02 to 0.93 logits. Among the five lenient raters, S7 was the most lenient during SAPA practice, at -1.46 logits. These measures imply that participants faced an almost equal number of severe and lenient raters when their videos on the private YouTube channel were rated.

The MFRM analysis also revealed that the raters were statistically categorized into more than three levels of rating performance, with a separation value of 3.20. This indicates that the raters displayed about three distinct ways of rating themselves and their peers. Since the separation value was greater than 2, the reliability of the measurement was high, at 0.91.

With the exception of S8, the infit mean-squares of the ten raters were between 0.60 and 1.18. This suggests that almost all raters were internally consistent when they rated themselves and their peers. One explanation for this internal consistency could be the rater training provided before SAPA practice commenced. This finding is consistent with many studies reporting that rater training is crucial in ensuring rater consistency during rating practice [14]. Again, with the exception of S8, the outfit mean-squares of the ten raters were between 0.57 and 1.14. This indicates that the raters were able to award the expected ratings for participants' performances, despite relative perceptions of very good or very bad performance as adjusted by the MFRM algorithm.

The rater measurement report thus implies that intermediate ESL learners were generally able to rate consistently and accurately. One probable reason for these findings is the jigsaw activity previously conducted with the participants. When participants were engaged in slotting the correct descriptors into their matching CEFR levels, their understanding of the oral assessment criteria was enhanced and, consequently, they became more focused while rating the speakers. In addition, the discussion after the activity helped these participants to clarify any misconceptions about their understanding of the criteria. By engaging participants in the discussion, the validity of the ratings awarded was maintained, since participants' understanding of the criteria translated into fair ratings that directly described the speakers' performance.

Besides the rater measurement report, the MFRM analysis also gauged rating scale functioning for each criterion observed during SAPA practice. Raters' rating tendencies for overall impression, range, accuracy, fluency and coherence are illustrated through category probability curves. In each category probability curve, the horizontal axis is the participants' proficiency scale and the vertical axis shows the probability of the ratings used during SAPA practice. If a category was observed more often during practice, it shows a higher probability curve. Figures 1 to 5 are the category probability curves for each CEFR criterion observed during SAPA practice.

Fig. 1. Category probability curves for overall impression
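As a quick arithmetic check, the summary statistics in the rater measurement report hang together: with rater separation G, Rasch reliability is G^2 / (1 + G^2), and screening the infit/outfit mean-squares against the productive range of 0.5-1.5 flags only S8. The following is a minimal sketch using the values from Table 1; it is an illustration of the standard formulas, not the Facets software used in the study.

```python
# Infit and outfit mean-squares per rater, copied from Table 1.
fit = {
    "S7":  (0.91, 1.01), "S5":  (0.60, 0.57), "S3":  (0.90, 0.93),
    "S9":  (1.18, 1.14), "S10": (0.64, 0.67), "S12": (0.98, 1.06),
    "S4":  (0.88, 0.84), "S1":  (1.04, 0.95), "S13": (0.83, 0.82),
    "S2":  (1.08, 0.96), "S8":  (1.91, 1.93),
}

# Raters falling outside the productive measurement range of 0.5-1.5.
misfits = [r for r, (infit, outfit) in fit.items()
           if not (0.5 <= infit <= 1.5 and 0.5 <= outfit <= 1.5)]
print(misfits)  # -> ['S8']

# Rater separation G and Rasch reliability R are linked by R = G^2 / (1 + G^2).
G = 3.20
R = G**2 / (1 + G**2)
print(round(R, 2))  # -> 0.91, matching the reported reliability
```

The recomputed reliability of 0.91 from the separation of 3.20 confirms that the two reported indices are consistent with each other.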

Figures 1 to 5 are displayed concurrently in order to identify the pattern of participants' category use for each criterion. Evidently, categories B2 and C1 were observed more often in all five CEFR criteria, since these two categories have the higher probability curves. Although the CEFR oral assessment criteria employ a six-level rating scale, participants used only four ratings during SAPA practice, namely B1, B2, C1 and C2. This implies that participants rated themselves and their peers as performing at intermediate and advanced levels, based on the descriptors specified in the CEFR oral assessment criteria. In fact, none of the participants awarded A1 or A2 (beginner level) in either assessment.

Fig. 2. Category probability curves for range

From these probability curves, it appears that the rating scale was functioning for only four of the six CEFR levels. This would be unacceptable when developing or refining an instrument for a standardized test. However, since the purpose of this study was to measure the rating scale functioning of the CEFR oral assessment criteria based on intermediate ESL learners' SAPA practice, these findings were expected, as no beginner learners were recruited in this study. Most importantly, since all four curves are visibly peaked, all four categories were utilized meaningfully by the raters.

Interestingly, the probability curve of C1 (low advanced level) in assessing speakers' accuracy was more prominent than the other categories. This contrasts with the ratings awarded for the other criteria, which centered on B2 (high intermediate). It may suggest that participants were more sensitive to this criterion compared to overall impression, range, fluency and coherence.

Fig. 3. Category probability curves for accuracy
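The shape of the category probability curves in Figures 1 to 5 follows from the Rasch-Andrich rating scale model, in which the probability of each category at a given ability level is governed by cumulative ability-threshold differences; categories that are never observed (here A1 and A2) cannot be calibrated and show no peak. A minimal sketch with hypothetical Andrich threshold values (the study's actual thresholds are not reported) illustrates the computation:

```python
import math

def category_probs(theta, difficulty, thresholds):
    """Rasch-Andrich rating scale model: probability of each rating
    category at ability level theta, given Andrich thresholds tau_1..tau_m
    (hypothetical values here) and an item/criterion difficulty."""
    # Cumulative logits sum_{j<=k} (theta - difficulty - tau_j), with tau_0 = 0.
    logits = [0.0]
    total = 0.0
    for tau in thresholds:
        total += theta - difficulty - tau
        logits.append(total)
    exps = [math.exp(x) for x in logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# Six CEFR levels A1..C2 correspond to five thresholds between categories.
taus = [-4.0, -2.0, 0.0, 2.0, 4.0]   # hypothetical, evenly spaced
labels = ["A1", "A2", "B1", "B2", "C1", "C2"]

# At a mid-range ability level, the B2 curve peaks highest.
probs = category_probs(theta=1.0, difficulty=0.0, thresholds=taus)
print(labels[probs.index(max(probs))])  # -> B2
```

Sweeping theta across the ability range and plotting each category's probability reproduces curves of the kind shown in the figures; a category awarded rarely or never contributes no observations to the threshold estimates and thus never dominates the plot.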
Fig. 4. Category probability curves for fluency

Fig. 5. Category probability curves for coherence

5. CONCLUSIONS

Findings of this study suggest that intermediate ESL learners were generally able to rate consistently and accurately during SAPA practice. In gauging the functionality of the CEFR rating scales, participants utilized only four of the six ratings prescribed, with the majority of ratings concentrated between the high intermediate (B2) and low advanced (C1) levels of the CEFR oral assessment.

The employment of intermediate ESL learners may have contributed to the findings of this study: all performances were rated as above beginner level by the participants. The researchers believe that if participants of beginner, intermediate and advanced levels of English proficiency were employed, the findings on CEFR rating scale functioning might differ. However, recruiting beginner participants for SAPA practice might compromise the validity and reliability of the ratings, since such participants may face challenges in understanding and applying the descriptors of the oral assessment criteria. Although only four levels were utilized by the participants, the presence of C1 and C2 (advanced level) in the probability curves may suggest that these participants were able to discriminate each speaker's performance and award ratings accordingly. The issue of central tendency, an indication of raters overusing the middle categories of a rating scale [15], did not arise, as the ratings were not uniformly spread in all criteria. This

suggests that novice raters, similar to the participants in this study, are somewhat able to understand and apply the descriptors of an analytical scale [15].

The implications of this study are situated within learner performance and classroom instruction. Apart from contributing to the developing CEFR scaling structure, the samples of learner performance stored in the private YouTube channel can be viewed as evidence of speaking practice beyond the classroom walls. With the implementation of 21st century learning skills, the utilization of YouTube as a platform for SAPA practice can be construed as achieving the aim of applying media and technology skills in teaching and learning. In terms of classroom instruction, the findings of this study imply that ESL learners are capable of assessing their own as well as their peers' oral proficiency. Therefore, when learners are entrusted with the role of key assessor of their learning endeavor, it encourages assessment as learning [16] to be practiced by teachers. In an assessment-as-learning context, learners are primed to be autonomous in their learning through the application of self- and peer assessments. Similarly, this assessment paradigm was envisioned by the Council of Europe in the initial development of the CEFR.

ACKNOWLEDGMENTS

This work was supported in part by sponsorship from the Ministry of Education, Malaysia.

REFERENCES

[1] G. Fulcher, Testing Second Language Speaking, Longman, London (2003).
[2] N. Figueras, The Impact of the CEFR. ELT Journal, 66(4) (2012), 477-485.
[3] N. Van Huy, M. Obaidul Hamid, Educational Policy Borrowing in a Globalized World: A case study of the Common European Framework of Reference for languages in a Vietnamese university. English Teaching: Practice & Critique, 14(1) (2015), 60-74.
[4] G. Fulcher, The Reification of the Common European Framework of Reference (CEFR) and Effect-driven Testing. Advances in Research on Language Acquisition and Teaching, (2010), 15-26.
[5] A. Davies, in Encyclopaedia of Language and Education, edited by E. Shohamy, Springer, New York (2008), Vol. 7, pp. 429-443.
[6] Council of Europe, Common European Framework of Reference for Languages: Learning, teaching and assessment. Cambridge University Press, Cambridge (2001).
[7] S. Luoma, M. Tarnanen, Creating a Self-Rating Instrument for Second Language Writing: From Idea to Implementation. Language Testing, 20(4) (2003), 440-465.
[8] Council of Europe, Common European Framework of Reference for Languages: Learning, teaching and assessment. Cambridge University Press, Cambridge (2009).
[9] H. Ibberson, An Investigation of Non-Native Learners' Self-Assessment of the Speaking Skill and Their Attitude towards Self-Assessment. Unpublished PhD thesis, University of Essex, United Kingdom (2012).
[10] L. Davis, The Influence of Interlocutor Proficiency in a Paired Oral Assessment. Language Testing, 26(3) (2009), 367-396.
[11] T. Eckes, Introduction to Many-Facet Rasch Measurement: Analyzing and evaluating rater-mediated assessments, Lang, Frankfurt (2011).
[12] C.M. Myford, E.W. Wolfe, Detecting and Measuring Rater Effects Using Many-Facet Rasch Measurement: Part II. Journal of Applied Measurement, 5(2) (2004), 189-227.
[13] J.M. Linacre, Facets (Version 3.70). Winsteps.com. Retrieved from http://www.winsteps.com/facets.htm (2012).
[14] L. Davis, The Influence of Training and Experience on Rater Performance in Scoring Spoken Language. Language Testing (2015).
[15] C. Harsch, G. Martin, Comparing Holistic and Analytic Scoring Methods: Issues of validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3) (2013), 281-307.
[16] L.M. Earl, Assessment as Learning: Using Classroom Assessment to Maximize Student Learning (Second Edition), Corwin, California (2013).

