Current Issues in Language Evaluation, Assessment and Testing

Edited by
All rights for this book reserved. No part of this book may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without
the prior permission of the copyright owner.
Abstract
This study examined item-level data from fifty 30-item cloze tests that
were randomly administered to university-level examinees from Japan (N
= 2,298) and Russia (N = 5,170). A single 10-item anchor cloze test was
also administered to all students. The analyses investigated differences
between the two nationalities in terms of both classical test theory (CTT)
item analysis and multifaceted Rasch analysis (the latter allowed us to
estimate test-taker ability and item difficulty measures and fit statistics
simultaneously across 50 cloze tests separately and combined for the two
nationalities). The results indicated that considerably larger proportions of
items functioned well in the Rasch item analyses than in the traditional
CTT item analysis. Rasch analyses also turned out to be more appropriate
for our cloze test analysis and revision purposes than did traditional CTT
item analyses. Linguistic analyses of items that fit the Rasch model
revealed that blanks representing certain categories of words (i.e., function
words rather than content words, Germanic-origin words rather than
Latinate-origin words, and, to a greater extent, relatively high-frequency
words) were more likely to work well for norm-referenced test (NRT)
purposes. In addition, this study found that different items were
functioning well for the two nationalities.
Introduction
Taylor (1953) first proposed the use of cloze tests for evaluating the
readability of reading materials in US elementary schools. In the 1960s and
1970s, a number of studies appeared on the usefulness of cloze for English as
a second language (ESL) proficiency or placement testing (see Alderson,
1979, for a summary of this early ESL research). Since then, as Brown
(2013) noted, research on using cloze in ESL proficiency or placement
testing has continued, but the results have been inconsistent at best, with
reported reliability estimates ranging from .31 to .96 and criterion-related
validity coefficients ranging from .43 to .91.
While the literature has focused predominantly on fixed interval (i.e.,
every nth word) deletion cloze tests, other bases have been used for
developing cloze tests. For example, rational deletion cloze was
developed by selecting blanks based on word classes (cf. Bachman, 1982,
1985; Markham, 1985). Tailored cloze involved using classical test theory
(CTT) item analysis techniques to select items and thereby create cloze
tests tailored to a particular group of students (cf. Brown, 1988, 1989,
2013; Brown, Yamashiro, & Ogane, 1999, 2001; Revard, 1990).
For the most part, cloze studies have been based on CTT. However,
Item Response Theory (IRT), including Rasch analysis, has been applied
to cloze in a few cases. Baker (1987) used Rasch analysis to examine a
dichotomously scored cloze test and found that “observed and expected
item characteristic curves show reasonable conformity, though with some
instances of serious misfit…no evidence for departure from unidimensionality
is found for the cloze data…” (p. iv). Hale, Stansfield, Rock, Hicks,
Butler, and Oller (1988) found that IRT provided stable estimates for cloze
in their study of the degree to which groups of cloze items related to
different subparts of the overall Test of English as a Foreign Language
(TOEFL). Hoshino and Nakagawa (2008) used IRT in developing a
multiple-choice cloze test authoring system. Lee-Ellis (2009) used Rasch
analysis in developing and validating a Korean C-test. However, Rasch
analysis has not been used to study the effectiveness of individual items.
The Study
Certainly, no work has investigated the degree to which cloze items
function well when analyzed using both CTT and IRT frameworks, and
little research has examined the functioning of cloze items in terms of their
linguistic characteristics. To address these issues and others, the following
research questions were posed, all focusing on the individual items.
Participants
A total of 7,468 English as a foreign language (EFL) students
participated in this study: 2,298 of these EFL students were studying at 18
different universities in Japan as part of their normal classroom activities;
the remaining 5,170 EFL students were studying at 38 universities in
Russia (see Appendix 1-A for a list of the participating universities in both
countries). In Japan, about 38.3% of the participants were women, and
61.7% of the participants were men; in Russia, 71.7% of the participants
were women, and 28.0% were men, with the remaining 0.3% giving no
response. The participants in Japan were between 18 and 24 years old, while in
Russia they were between 14 and 46 years old. The data from Japan were
collected as part of Brown (1993, 1998); the data from Russia were
collected in 2012-2013 and served as the basis of Brown, Janssen, Trace,
and Kozhevnikova (2013). Though these samples were convenience
samples (i.e., not randomly selected), they were relatively large, which is
important as this sample size permits robust analyses of these cloze data.
It is critically important to stress that in this study we are interested in
how linguistic background affects different analyses; we do not make any
claims for the generalizability of these results to the EFL populations of all
undergraduate students in university-level institutions in these countries.
In fact, we want to stress that the samples from Japan and Russia cannot
be said to be comparable given the sampling procedures and the very
different proportions of university seats per million people available in the
two countries.
Measures
The 50 cloze tests used in this study were first created and used for
Brown (1993). The 50 passages were randomly selected from among the
adult-level books at a public library in Florida. Passages were chosen from
each book by randomly selecting a page then working backwards for a
reasonable starting place. Passages were between 366 and 478 words long
with a mean of 412.1 words. Each passage contained 30 items, and the
deletion pattern was every 12th word, which created a fairly high degree of
item independence relative to the more typical 7th-word deletion pattern.
The first and last sentences of all passages were left intact to provide
context. Appendix 1-B shows the layout of the directions, example items,
and answer key.
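To make the deletion procedure concrete, here is a minimal sketch in Python (the function name and details are ours, not the authors' actual tool) of building a fixed-interval cloze passage with every 12th word deleted and the first and last sentences left intact:

import re

def make_cloze(passage, nth=12, max_items=30):
    # Split into sentences; keep the first and last sentences intact.
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    first, last = sentences[0], sentences[-1]
    words = " ".join(sentences[1:-1]).split()
    blanked, answers = [], []
    for i, word in enumerate(words, start=1):
        if i % nth == 0 and len(answers) < max_items:
            answers.append(word)                     # exact-answer key
            blanked.append("(%d) ______" % len(answers))
        else:
            blanked.append(word)
    return " ".join([first] + blanked + [last]), answers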
A 10-item cloze passage was also administered to all participants to act
as anchor items (i.e., items that provide a common metric for making
comparisons across tests and examinee samples). This anchor-item cloze
was first created in a study by Brown (1989), wherein it was found that
these 10 items were functioning effectively.
To check the degree to which the English in the cloze passages was
representative of typical written English, the lexical frequencies for all 50
passages combined were calculated (see Appendix 1-C) and compared to
the frequencies reported for the same words in the well-known Brown
Corpus (Francis & Kučera, 1979, 1982; Kučera & Francis, 1967). We felt
justified in comparing the 50 passages to this particular corpus for two
reasons. First, following Stubbs (2004), though the Brown Corpus is
relatively small, it is
“still useful because of their careful design ... one million words of written
American English, sampled from texts published in 1961: both informative
prose, from different text types (e.g., press and academic writing), and
different topics (e.g., religion and hobbies); and imaginative prose (e.g.,
detective fiction and romance).” (p. 111)
(Brown, 1998). Thus, we felt reasonably certain that these passages and
cloze items were representative of the written English language, or at a
minimum the genres of English found in US public library books.
Procedures
The 50 cloze tests were distributed to intact classes by teachers such
that every student had an equal chance of receiving each of the 50 cloze
test passages. In Japan, 42-50 participants completed each cloze test, with
a mean of 46.0 participants completing each passage. In Russia, 90-122
completed each cloze test (Mean = 103.4). All examinees in both countries
completed the 10-item anchor cloze. Twenty-five minutes were allowed
for completing the tests. Exact-answer scoring was used (i.e., only the
word found in the original text was counted as correct). This was done for
two reasons: (a) we wanted each item to be interpretable as fillable by a
single lexical item for analysis purposes; and (b) with the hundreds of
items and thousands of examinees in this study, using an acceptable-
answer scoring or any other of the available scoring schemes would
clearly have been beyond our resources.
Analyses
Initially, CTT statistics were used to analyze the cloze test data. These
statistics included: the mean, standard deviation, minimum and maximum
scores, reliability, item facility, and item discrimination. Rasch analyses
were also used in this study to calculate item difficulty measures and to
identify misfitting test items. We used FACETS (Linacre, 2014a) analysis
rather than WINSTEPS because the former allowed us to easily analyze
our nested design (i.e., multiple tests administered to different groups of
examinees). Or as Linacre put it, “Use Winsteps if the data can be
formatted as a rectangle, e.g., persons-items … Use Facets when Winsteps
won’t do the job” (Linacre, 2014b, np).
We performed the analyses in several steps. Initially, we needed to
determine anchor values through a separate FACETS analysis of only the
10 anchor items that were administered across all groups of participants.
Then, we created a FACETS input file to link our 50 cloze tests by using
our 10 anchor items (see Appendix 1-D for a description of the actual code
that was used). There were three facets in this analysis: test-takers, test
version, and test items. By using the FACETS program, we were able to
combine the 50 different cloze procedures for both nationalities into a
single analysis using anchor items, and put all of the items onto the same
true interval scale for ease of comparison (see e.g., Bond & Fox, 2007, pp.
75-90). Four of the total 1,500 items had blanks that were either missing or
made no sense, thus the total number of valid cloze items was 1,496.
Appendix 1-D also shows how we coded the data for the analysis. In
order to analyze separate tests in a single analysis using a common set of
anchor items, each examinee required two lines of response data. The first
line corresponds to the set of items for the particular cloze procedure, set
up by examinee ID, test version, the range of applicable items (e.g., 101-
130 for items 1-30 on Test 1), followed by the observed response for each
item. An additional line was also needed for examinee performance on
anchor items, with the same coding format as above except for a common
range of items for all examinees (31-40). The series of commas within the
data indicates items that were removed as explained above. Using the
same setup, we were able to run the program separately for the Russia
sample, the Japan sample, and the combined data set (i.e., with the two
samples analyzed together as one).
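As a concrete illustration of this two-line coding scheme (a sketch with hypothetical helper names; the actual input file layout is shown in Appendix 1-D), the following Python function emits the pair of response lines for one examinee:

def facets_lines(examinee_id, test_no, cloze_responses, anchor_responses):
    # Items on test t are numbered t*100 + 1 through t*100 + 30
    # (e.g., 101-130 on Test 1, 5001-5030 on Test 50); anchor items
    # are numbered 31-40 for every examinee. Removed items are coded
    # as empty fields, producing the runs of commas seen in the data.
    lo, hi = test_no * 100 + 1, test_no * 100 + 30
    cloze = ",".join("" if r is None else str(r) for r in cloze_responses)
    anchor = ",".join("" if r is None else str(r) for r in anchor_responses)
    return ("%d,%d,%d-%d,%s" % (examinee_id, test_no, lo, hi, cloze),
            "%d,%d,31-40,%s" % (examinee_id, test_no, anchor))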
Results
Classical Test Theory
Reliability. Tables 1-1 and 1-2 also show the reliability estimates of the
various cloze passages for the two nationalities. These cloze tests
functioned somewhat less reliably with the Japan sample (ranging from
.17 to .87) than with the Russia sample (ranging from .65 to .92). This
pattern could be a consequence of the greater variation and perhaps the
larger sample sizes in Russia. A synthesis of the cloze passages’ reliability
estimates is shown in Table 1-3.
Table 1-1: Descriptive statistics and reliability for the 50 cloze tests in the Japan sample.

Test Mean SD Min Max N r
1 5.23 3.16 0 15 48 0.71
2 4.21 3.42 0 13 47 0.86
3 2.02 2.13 0 10 48 0.74
4 7.54 3.87 2 16 46 0.80
5 3.98 2.79 0 13 47 0.73
6 5.11 3.23 0 14 47 0.80
7 6.14 3.41 0 16 43 0.83
8 3.16 2.27 0 8 45 0.46
9 2.85 2.46 0 11 46 0.77
10 2.54 2.31 0 8 46 0.83
11 5.94 3.36 0 16 46 0.74
12 8.98 3.97 0 21 47 0.79
13 2.87 1.71 0 8 46 0.50
14 3.23 2.50 0 9 47 0.68
15 9.18 3.42 4 18 49 0.68
16 1.36 1.41 0 6 48 0.65
17 1.38 1.25 0 5 46 0.35
18 1.02 1.09 0 3 50 0.50
19 4.76 2.88 0 10 50 0.70
20 4.38 3.24 0 15 47 0.86
21 9.92 4.44 0 19 48 0.84
22 3.70 2.86 0 11 47 0.84
23 3.64 2.40 0 11 43 0.65
24 2.96 2.26 0 9 47 0.44
25 5.36 2.74 0 12 46 0.63
26 2.68 1.56 0 5 47 0.17
27 2.34 2.72 0 13 47 0.87
28 2.58 2.17 0 8 43 0.57
29 2.32 1.77 0 7 44 0.64
30 9.56 3.28 3 16 48 0.72
31 3.78 3.08 0 15 46 0.83
32 3.83 2.53 0 9 42 0.77
33 2.14 1.87 0 6 44 0.63
34 5.87 2.92 0 13 45 0.82
35 6.63 3.66 0 17 45 0.72
36 5.00 2.05 0 9 46 0.51
37 5.46 3.66 0 13 48 0.77
38 1.71 1.57 0 8 48 0.75
39 2.51 1.98 0 9 47 0.65
40 3.49 1.90 0 9 43 0.66
41 2.87 2.51 0 10 43 0.76
42 4.41 3.10 0 18 44 0.81
43 1.43 1.45 0 7 44 0.19
44 3.24 2.52 0 10 46 0.67
45 6.55 3.87 0 16 42 0.79
46 2.16 1.82 0 7 47 0.31
47 3.79 2.33 0 11 43 0.69
48 2.69 2.12 0 11 42 0.74
49 4.56 2.81 0 11 49 0.75
50 2.49 2.70 0 12 45 0.77
Mean 4.11 2.61 0.18 11.34 45.96 0.69
Note: SD = standard deviation; Min = minimum score; Max = maximum
score; N = number of examinees; r = reliability.
Table 1-2: Descriptive statistics and reliability for the 50 cloze tests in the Russia sample.

Test Mean SD Min Max N r
1 6.78 3.99 0 16 120 0.75
2 7.06 4.94 0 19 102 0.85
3 3.94 3.71 0 14 103 0.81
4 9.82 6.12 0 21 105 0.89
5 6.54 4.38 0 22 106 0.82
6 5.34 4.19 0 16 102 0.83
7 8.07 6.22 0 20 103 0.90
8 3.13 3.67 0 24 101 0.86
9 4.08 3.67 0 23 105 0.81
10 3.77 4.24 0 22 102 0.87
11 5.74 4.53 0 17 101 0.85
12 9.27 4.86 0 20 115 0.83
13 3.30 3.89 0 17 105 0.86
14 5.10 4.70 0 17 107 0.87
15 8.10 5.60 0 21 106 0.89
16 2.30 2.70 0 11 115 0.77
17 2.55 2.29 0 10 109 0.65
18 1.60 2.27 0 15 100 0.78
19 6.15 5.08 0 30 102 0.88
20 5.41 5.01 0 24 97 0.89
21 10.32 6.96 0 27 103 0.92
22 3.74 3.64 0 14 102 0.83
23 3.58 3.36 0 14 102 0.79
24 2.13 2.37 0 10 101 0.71
25 4.63 4.55 0 15 102 0.87
26 4.35 3.25 0 21 100 0.77
27 3.48 3.07 0 15 100 0.75
28 4.01 3.81 0 18 102 0.84
29 3.39 2.70 0 11 102 0.70
30 12.82 5.39 0 22 111 0.83
31 4.88 3.89 0 14 101 0.82
32 4.96 3.22 0 12 101 0.79
33 2.82 2.57 0 10 102 0.71
34 7.11 4.43 0 18 102 0.83
35 6.72 5.54 0 25 103 0.87
36 4.81 4.11 0 16 96 0.83
37 8.38 5.46 0 24 103 0.87
38 2.42 2.44 0 14 106 0.74
39 3.62 3.44 0 12 103 0.80
40 3.87 4.39 0 24 90 0.88
41 4.53 3.56 0 14 101 0.79
42 4.78 4.10 0 20 93 0.84
43 2.09 2.56 0 15 99 0.76
44 4.80 4.28 0 19 102 0.85
45 9.24 6.59 0 21 101 0.91
46 3.69 3.49 0 14 93 0.80
47 3.19 2.79 0 12 104 0.73
48 2.98 3.36 0 18 108 0.75
49 4.39 4.10 0 15 122 0.86
50 3.57 3.04 0 13 109 0.76
Mean 5.07 4.05 0 17.52 103.40 0.82
Note: SD = standard deviation; Min = minimum score; Max = maximum
score; N = number of examinees; r = reliability.
Item Analyses. One aspect of CTT item analysis that testers often examine
while developing norm-referenced tests (NRTs) is item facility (IF).
Brown (2005) recommends keeping items with an IF in the range between
.30 and .70 and discarding or replacing any items with IF values outside
that range. The bottom of the third column of numbers in Table 1-4 shows
that only 19.1% (Japan = 14.0%; Russia = 24.1%) of the items overall
were functioning well in the .30-.70 range. Interestingly, 27.6% of the
items (Japan = 31.7%; Russia = 23.5%) were not functioning at all (i.e.,
nobody answered correctly, hence IF = .00) and over 50% of the items
were in the .01 to .29 range, which further confirms that these items were
generally too difficult for these two samples.
Another aspect of CTT item analysis that testers often consider when
developing NRTs is item discrimination (ID, calculated using the point-
biserial correlation coefficient in this study). ID values can range from .00
for items that do not discriminate at all between the high and low
performing examinees to 1.00 for items that are discriminating perfectly
between the high and low examinees. Negative discrimination values,
which can range down to -1.00, indicate the degree to which items are
measuring differently from the total scores on the test. Generally, CTT test
designers try to use items with the highest positive ID values available
when developing and revising NRTs. Ebel (1979) suggests the following
ID value ranges for test development: poor ID (.00-.19), marginal ID (.20-
.29), good ID (.30-.39), and very good ID (.40 and higher).
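For readers who want to compute these CTT statistics themselves, here is a minimal Python/NumPy sketch (the helper names are ours; it uses the uncorrected total score, as the chapter does not specify whether item-excluded totals were used):

import numpy as np

def item_facility(responses):
    # Proportion of examinees answering each item correctly (IF);
    # responses is a 0/1 matrix with examinees as rows, items as columns.
    return responses.mean(axis=0)

def item_discrimination(responses):
    # Point-biserial correlation of each item with the (uncorrected)
    # total score. Items with zero variance (e.g., IF = .00) yield nan.
    totals = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], totals)[0, 1]
                     for j in range(responses.shape[1])])

def ebel_band(id_value):
    # Ebel's (1979) ranges quoted above.
    if id_value >= .40:
        return "very good"
    if id_value >= .30:
        return "good"
    if id_value >= .20:
        return "marginal"
    return "poor"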
Table 1-5 shows the frequencies and percentages of cloze items in
terms of different ID value ranges for both nationalities. The Total row at
the bottom shows that on average 28.8% of the items contributed nothing
to the discrimination of these tests, though that percentage was
considerably higher in Japan (41.2%) than in Russia (16.4%). While large
proportions of items were Very Poor, Poor, or Marginal discriminators
when used with both test-taker samples, the cloze tests when used in
Russia had considerably more items in the Good and Very Good categories
(19.1% and 10.1%, respectively) than in Japan (9.2% and 2.1%,
respectively).
Table 1-5: Frequencies and percentages of cloze items in each ID value range for the two nationalities.

(Columns: .00 or negative | Very poor (.01-.09) | Poor (.10-.19) | Marginal (.20-.29) | Good (.30-.39) | Very good (.40+) | Total.)

Frequencies
Japan 616 194 258 258 138 32 1,496
Russia 246 173 292 348 286 151 1,496
Total 862 367 550 606 424 183 2,992

Percentages
Japan 41.2% 13.0% 17.2% 17.2% 9.2% 2.1% 100%
Russia 16.4% 11.6% 19.5% 23.3% 19.1% 10.1% 100%
Total 28.8% 12.3% 18.4% 20.3% 14.2% 6.1% 100%
Rasch Analyses

[Table 1-6: summary Rasch statistics for the Japan, Russia, and combined samples — item RMSE, item separation, item reliability, fixed χ², and the numbers of fitting, underfitting, and misfitting items.]
The Item RMSE is the root mean square standard error statistic used
in calculating the separation index discussed next, but it can be interpreted
on its own as a standard error. The lower the value of RMSE, the better the
data fit the model. In this study, the RMSE values were high (ranging from
.76 to 1.16) which indicates that a number of items were not fitting the
model as well as might be desired (as discussed above).
The item separation index is an estimate of the spread of the item
estimates relative to their precision, “the number of standard errors of
spread among the items” (Bond & Fox, 2007, p. 59), which is to say that
this measure reports reliability in units of standard error. Higher values are
desired in this case. Notice that the separation index is higher for the
combined data than for the single nationalities, and also higher for the
samples in Russia than for those in Japan. In all three cases, then, the item
difficulty estimates are well spread out relative to their precision.
The item reliability statistic shown in Table 1-6 is similar to
Cronbach’s alpha (Bond & Fox, 2007), and as with Cronbach’s alpha, is
on a scale from 0.00 to 1.00. High reliability for items in this case
indicates that items’ measures of difficulty are predicted to be ordered
similarly across different iterations of Rasch modeling. The analyses
indicated moderate item reliability of .70 for the Japan samples,
considerably higher reliability of .87 for the Russia samples, and even
higher reliability at .89 for the combined samples.
The chi-square (fixed) values test the following hypothesis: “Can this
set of elements be regarded as sharing the same measure after allowing for
measurement error?” Thus, for this design, the following hypothesis is
being tested: Can these items be thought of as equally difficult? Clearly,
since all of the chi-square (fixed) statistics in this study were found to be
significant (p < .01), this hypothesis must be rejected; that is, the answer
is no, the items cannot be thought of as equally difficult. Thus, the
variations in item difficulty estimates are probably due to factors other
than chance.
Rasch Vertical Rulers. Tables 1-7 and 1-8 present the vertical rulers
that resulted from our FACETS analyses for Japan and Russia. The first
column shows the scale for the vertical ruler, which represents the range of
scores on a true interval logit scale, centered on 0. Note that FACETS
requires at least one facet to be fixed (i.e., centered on 0.00) in order to
set the parameters of the scale. We chose to center tests on 0.00 because
they were the same for both groups. Because persons and items were the
more interesting categories, they were set to float (i.e., non-centered) to
reveal their positions relative to one another. In these rulers, the range of
logit scores is from low scores at about -6 to high scores at +7. The second
column shows the test-taker ability measures for Japan ranging from about
2.5 down to -6. The third column shows the relative difficulty of the 50
cloze tests when used in Japan. The fourth column shows the test item
difficulties for Japan with a number of items at +4 (i.e., maximally
difficult because nobody answered them correctly) and others ranging
down below -5. The fifth column shows the logit scores again, and the
sixth column shows the test takers in Russia on the scale ranging from
about 6.5 down to -5. The seventh column shows the test versions in
Russia. The eighth column shows the test item difficulties in Russia with a
number of items at +7 (i.e., meaning that they were maximally difficult in
that nobody answered them correctly) and the others ranging down to
below -3.
Table 1-7: Vertical rulers for test taker ability, test version difficulty,
and test item difficulty for Japan.
[FACETS vertical ruler, not reproduced: a logit scale from -6 to +7 showing test-taker ability for Japan (from about +2.5 down to -6), the relative difficulty of the 50 test versions (clustered near 0), and item difficulties (from +4, items that nobody answered correctly, down to below -5). Legend: in the test-taker column * = 23 persons and . = 4; in the test-version column * = 3 and . = 1; in the item column x = 48 and * = 24.]
Table 1-8: Vertical rulers for test taker ability, test version difficulty,
and test item difficulty for Russia.
[FACETS vertical ruler, not reproduced: a logit scale from -6 to +7 showing test-taker ability for Russia (from about +6.5 down to -5), the relative difficulty of the 50 test versions (clustered near 0), and item difficulties (from +7, items that nobody answered correctly, down to below -3). Legend: in the test-taker column * = 34 persons and . = 28; in the test-version column * = 4 and . = 1; in the item column * = 20 and . = 10.]
The Rasch analyses results in Table 1-9 provide a brighter picture with
larger proportions of items, 66.1% and 84.7% for the Japan samples and
Russia samples, respectively, fitting the model and thus being analyzable
and useable. The association between the nationality and fit (analyzed in a
2 x 2 contingency table with Japan and Russia on one dimension and
misfit and fit on the other) indicated a small but demonstrable degree of
association (22%) between these two factors (χ² = 139.26, df = 1, p < .001;
phi = .22). We conclude from Table 1-9 that the CTT item analysis
statistics indicated that only small numbers of items are functioning well,
while the Rasch analysis indicated that relatively larger numbers of items
fit the Rasch model and were thus analyzable, interpretable, and useable
for NRT purposes.
Further analyses provided additional detail. It turned out that the
number (and percentage) of the 1,496 items that were functioning well in
the CTT item analysis (i.e., with ID = .30+) were as follows: 106 (7.1%)
worked for both nationalities; 331 (22.1%) worked uniquely in the Russia
samples; and 64 (4.3%) worked uniquely in the Japan samples. Hence, in
the CTT analyses, considerable differences occurred in which items were
working for each of the nationalities. Similarly, in the Rasch analyses, the
number (and percentage) of the 1,496 items which fit were as follows: 938
(62.7%) fit for both nationalities; 329 (22.0%) fit uniquely for the Russia
takers; 51 (3.4%) fit uniquely for the Japan test-takers. Again,
considerable differences surfaced for which items were fitting for each of
the nationalities, but less so than in the CTT analyses.
Given that the Rasch analysis turned out to be more sensitive and
appropriate for analyzing the effectiveness of the items in this study, the
remaining analyses were based on the Rasch item fit statistics. The
linguistic characteristics considered here were (a) the parts of speech of
the word in each blank, (b) whether the word was a content or function
word, (c) whether it was of Latinate or Germanic origin, and (d) the
frequency of the word in the Brown Corpus. Based on previous research
(Brown, 1992), we had a reasonable expectation that these linguistic
characteristics would have some relationship with item performance.
Parts of Speech and Rasch Item Fit. Table 1-10 shows the item misfit
and fit in the Rasch analyses separately for Japan and Russia for the
different parts of speech across all 1,496 items, presented in alphabetical
order by part of speech. Note that the chi-square (χ²) statistic for items that
fit for parts of speech by Japan and Russia was only 13.68 with df = 16,
which is not significant even at the very liberal p > .50. The Cramer's V
statistic was .078. Thus, counter to our expectations, the pattern of item fit
for the parts of speech does not vary for the two nationalities beyond what
would be expected by chance alone (in this case, with p > .50) and the
association between test-taker linguistic background and parts of speech is
only about 7.8%. Of course, none of this demonstrates that these frequencies
match what would be expected by chance alone, but it did convince us that
no variations in this table warranted further scrutiny.
Content/Function Words and Rasch Item Fit. Table 1-11 shows the
number of content and function words that misfit or fit depending on
nationality for the 1,496 items. The χ² statistic for the word type by nationality
2 x 2 contingency table for the frequencies of fitting items was 5.02 with
df = 1, which is significant at the .025 level. The phi statistic here was .05.
Thus, these fluctuations in frequencies are significantly different from
chance at .025 and were associated to a very small degree (5%) with
content versus function distinction. Visual inspection of the percentages
shown in this table indicated that (a) fewer items in both the content and
function word categories fit in the Japan sample, (b) in the Russia sample,
a somewhat higher percentage of function words fit the Rasch model than
content words, and (c) in the Japan sample, a considerably higher
percentage of function words fit than content words.
[Table 1-10: item misfit and fit for each part of speech in the Japan and Russia data for all 1,496 items; columns give the Japan fit, Japan misfit, Japan % fit, Russia fit, Russia misfit, Russia % fit, % difference, and total per part of speech.]
Table 1-11: Item misfit for word type (content & function) in the
Japan and Russia data for 1,496 items.
(Columns: Word Type, Japan Misfit, Japan Fit, Russia Misfit, Russia Fit, Total, Japan % Fit, Russia % Fit, % Difference.)
Content 424 601 197 828 1,025 58.6 80.8 22.2
Function 83 388 32 439 471 82.4 93.2 10.8
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Germanic/Latinate Word Origin and Rasch Item Fit. Table 1-12 shows
the number of Latinate and Germanic words that misfit or fit in the Rasch
models constructed for the 1,496 items, used with test-takers from two
nationality backgrounds. The χ² statistic for the word origin by nationality
2 x 2 contingency table for the frequencies of fitting items was 6.17 with
df = 1, which is significant at the .025 level. The phi statistic here indicates
that the association was .05. Thus, these fluctuations in frequencies are
significantly different from chance at .025 and were associated to a very
small degree (5%) with Latinate versus Germanic distinction. Visual
inspection of the row percentages shown in this table indicates again that
(a) a smaller proportion of items fit in the Japan samples than in the Russia
sample, (b) in both the Russia and Japan samples, a considerably higher
proportion of the Germanic words fit the Rasch model than Latinate
words, and (c) Latinate words were less likely to fit than Germanic words
for the Russia sample and considerably less for the Japan sample.
Word Frequency and Rasch Item Fit. The point-biserial correlation
coefficients between whether or not the items fit with the raw frequencies
were .222 for the Japan data and .123 for the Russia data. Because the
frequencies had skewed distributions, we also transformed those
frequencies using a natural log transformation. The point-biserial
correlation coefficients between whether or not the items fit with the
transformed frequencies were .363 for the Japan data and .279 for the
Russia data, which were significant at p < .01. These results indicate
Rasch item fit estimates were somewhat related to item frequencies.
Table 1-12: Item misfit for word origin (Latinate & Germanic) in the
Russia and Japan data for 1,496 Items.
(Columns: Word Origin, Japan Misfit, Japan Fit, Russia Misfit, Russia Fit, Total, Japan % Fit, Russia % Fit, % Difference.)
Latinate 219 181 114 286 400 45.3 71.5 26.2
Germanic 288 808 115 981 1,096 73.7 89.5 15.8
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Pearson correlation coefficients between the item logit scores and raw
item frequencies were -.215 (p < .05) for the Japan data and -.294 (p < .01)
for the Russia data. The Pearson correlation coefficients between the item
logit scores and the transformed item frequencies were -.356 for the Japan
data and -.430 for the Russia data, both of which were significant at p <
.01. These correlations were negative because, as we expected, as the
magnitude of the difficulty estimates increased the frequencies decreased.
Nonetheless, these results indicate Rasch item difficulty estimates were
also somewhat related to item frequencies.
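These correlations are simple to reproduce; the following Python/NumPy sketch uses made-up illustrative values (for a dichotomous fit code, Pearson's r computed this way is the point-biserial coefficient):

import numpy as np

fit = np.array([1, 1, 0, 1, 0, 1])                  # 1 = item fit the Rasch model
freq = np.array([4500, 980, 12, 2100, 3, 650])      # raw Brown Corpus frequencies
logit = np.array([-1.2, 0.3, 2.8, -0.4, 3.5, 0.9])  # Rasch difficulty estimates

log_freq = np.log(freq)                        # natural log transformation
print(np.corrcoef(fit, log_freq)[0, 1])        # cf. .363 (Japan), .279 (Russia)
print(np.corrcoef(logit, log_freq)[0, 1])      # cf. -.356 (Japan), -.430 (Russia)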
Table 1-13 shows the Rasch item fit frequencies separately for the
different levels of vocabulary frequency in the Brown Corpus across
all 1,496 items for the two nationalities. Examining the percentages of
item fit on the right side of Table 1-13, it is easy to see that, below a
certain frequency threshold (i.e., as items become less and less frequent),
the infrequent items did not fit the models well. For instance, lexical items
that occurred fewer than 1,000 times in the Brown Corpus were much less
likely to fit (i.e., accounted for a much lower percentage) than more
frequent lexical items in the Japan data. Similarly, lexical items that
occurred fewer than 50 times in the Brown Corpus were much less likely
to fit than lexical items that occurred more often in the Russia data. This
result is intuitive in that less-frequent lexical items are likely to be more
unpredictably known or not known by test takers than the more frequent
items and thus are less likely to fit.
Table 1-13: Item misfit as a function of word frequency (in the Brown
Corpus found in Francis & Kučera, 1979) in the Japan and Russia
data for 1,496 items.

[Table body not reproduced: columns give the Japan fit, Japan misfit, Japan % fit, Russia fit, Russia misfit, Russia % fit, % difference, and total for each Brown Corpus frequency band.]
Discussion
A number of research studies (Brown, 1988, 1989, 2013; Brown,
Yamashiro, & Ogane, 1999, 2001; Revard, 1990) have examined the
issues involved in developing cloze items based on CTT item analysis.
This chapter expands on those analyses and is one of a few that apply
Rasch analysis to cloze testing. It is also one of the first studies to
systematically examine the performances of large numbers of students
from two distinct language backgrounds on 50 different cloze tests.
Recall that the title of the study was How well do cloze items work and
why? With regard to the first part of that question, How well do cloze
items work?, as mentioned above, the literature on the topic has shown
considerable variation in how well cloze tests function with reliability
estimates ranging from .31 to .96 and validity coefficients ranging from
.43 to .91 (Brown, 2013). The present study was designed to look a bit
closer at these issues by addressing five research questions.
The descriptive statistics in Tables 1-1 and 1-2 indicated that the raw
score means were low for the 50 thirty-item cloze tests administered to
samples from two different linguistic backgrounds. Reliability estimates
varied widely depending on the different cloze test forms and the language
background of the groups. IF statistics indicated that about one fifth of the
test items were contributing effectively to the variance in cloze scores for
these different test-taker populations; similarly, ID values also indicated
that about one fifth were Good or Very Good discriminators.
From a CTT perspective, then, many of the 50 tests in this study were
not functioning particularly well in terms of central tendency, dispersion,
and reliability. These mixed results are consistent with the literature and
may be due to the fact that large numbers of the 1,496 individual items
were either not working at all or were not particularly effective as NRT
items in terms of IF and ID. These data potentially indicate that random
deletion patterns may not be the most effective way to build cloze tests;
indeed, item creation based on rational selection that considers lexical
items with relatively high word frequency may be a more productive
strategy for the development of cloze items either manually or through
automatic generation (as advocated by Coniam, 2013).
How do Rasch item difficulty measures differ for the test takers from
different linguistic backgrounds?
In the Rasch analyses, the items also proved to be difficult, that is, they
were generally suitable for students of high or very high ability levels as
indicated by the logit scores for the two test-taker groups. For both test-
taker groups, ability estimates were lower than the item logits in many
cases. Thus, as in the CTT analyses, a number of items were suitable for
students above the general ability levels of these samples. Nonetheless, the
same vertical rulers indicate that a fairly high proportion of items was also
suitable for the examinees in these samples.
In what ways do Rasch item fit patterns differ in terms of factors such
as linguistic background and four cloze item linguistic features: parts
of speech, word type, word origin, and word frequency?
What does all of this mean for cloze test design? Based on the results
of these analyses, it appears that using higher proportions of frequent
words (i.e., with frequencies over 50 in the Brown Corpus) should help
produce items that fit for samples like that in Russia. However, for
samples like that in Japan, items based on words with frequencies of 1,000
or more appear to be more likely to fit the Rasch model. It is also
advisable to use higher proportions of items requiring Germanic words (as
opposed to Latinate words) as it could help produce somewhat more items
that fit for samples like those in Russia and Japan (though somewhat more
so for samples like that in Russia). Using higher proportions of items
requiring function words (as opposed to content words) may help produce
somewhat more items that fit for samples like those in Russia and Japan
(though more so for samples like that in Russia). Note that increasing the
proportions of function words might increase the degree to which
grammar knowledge is being tested.
Certainly, sound test development practices (and the results of this
study) dictate that the best strategy for producing effective cloze tests is to
pilot those tests with larger numbers of items and then use Rasch analysis
to select those items that fit the sorts of examinees being tested. If that is
not possible, it may help to select items that tend to require function words
and words of Germanic origin in the blanks, or more importantly, words
that occur frequently in English.
Conclusion
Implicit in the second part of the question posed by the title of this
paper is the question: Why do cloze items and tests operate the way they
do? Clearly, one reason for the items being as difficult as they proved to
be is that they were natural cloze items (i.e., cloze tests developed from
passages randomly selected from a large collection of native-speaker
texts). As first demonstrated in Brown (1993), such natural cloze tests
tend to be difficult even for university level students of English, especially
when scored using the exact-answer scoring method as was the case in this
study.
The generally wider dispersion of scores found for the Russia samples
may have occurred because (a) these test-takers varied more widely in
ability levels than the Japan samples, (b) the potential for variation was
greater as a result of their higher means, or (c) both. The differences in
reliability may have been due to the higher means in the Russia samples,
to the greater variance, to the larger sample sizes, or to differences in the
test-takers (e.g., higher motivation, more familiarity with cloze format,
etc.).
The fact that much larger proportions of items functioned well in the
Rasch item analyses than in the traditional CTT analysis may be explained
by the nature of Rasch analyses which provide item difficulty estimates
based on the probability of an average test-taker answering a given item
correctly rather than on the proportion of examinees who answered
correctly as in CTT. As a result, the Rasch item difficulty estimates were
not sample dependent and were not affected as much as the CTT item
analyses statistics by the relative difficulty of the items in these 50 tests for
both samples. Hence, we were able to identify greater numbers of
functioning items, even those items which might be challenging for the
test-takers, and understand at least partially why and how the items were
functioning linguistically in interesting and interpretable ways.
In addition, the fact that we used multifaceted Rasch analysis in the
form of FACETS analysis made it possible to simultaneously analyze
(using 10 anchor items administered to all test takers) 50 cloze tests with
test-takers and items nested within tests (i.e., with different test takers and
items on each test). FACETS analysis also made it possible to link the 50
cloze tests from two nationalities and thereby put the test-taker ability
estimates and item difficulty estimates on the same scales for all tests and
both nationalities. Thus, we were able to learn that Rasch analysis is more
appropriate for our cloze test analysis and revision purposes, indeed,
considerably better than traditional CTT analyses. In addition, we found
that blanks representing certain categories of words (i.e., function words
and words of Germanic origin, and to a greater extent relatively high
frequency words) are more likely to work well for NRT purposes.
If we were to revise any or all of the 50 cloze tests in this study by
selecting only those items that functioned well from a CTT perspective (as
described in Brown, 1988) or from a Rasch perspective (based on the
results of this study), we are convinced that very different tests would
surface for the Russia and Japan samples because different items were
functioning well in the two samples. This is consistent with the starting
point for this study which was that cloze items are just another “family of
item types” (Mullen, 1979, p. 21). In fact, cloze is no more than “a
technique for producing tests, like any other technique” (Alderson, 1979,
p. 226), though, according to this study, they provide a more effective set
of items from the Rasch perspective than from the CTT viewpoint. Thus,
there is no reason to “…think that cloze tests are somehow different from
other tests” (Brown, 2013, p. 26), and we should no doubt pilot and revise
cloze tests just as we would any other tests to tailor them to the specific
range of abilities involved. However, we should also consider selecting
items based on the recommendations of this study, and thereby extend the
notion of rational deletion in useful ways.
References
Alderson, J. C. (1979). The cloze procedure and proficiency in English as
a foreign language. TESOL Quarterly, 13(2), 219-227.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL
Quarterly, 16(1), 61-70.
—. (1985). Performance on cloze tests with fixed-ratio and rational
deletions. TESOL Quarterly, 19(3), 535-555.
Baker, R. L. (1987). An investigation of the Rasch model in its application
to foreign language proficiency testing. Doctoral thesis, University of
Edinburgh, UK.
Bond, T., & Fox, C. M. (2007). Applying the Rasch model: Fundamental
measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence
Erlbaum Associates.
Brown, J. D. (1988). Tailored cloze: Improved with classical item analysis
techniques. Language Testing, 5(1), 19-31.
—. (1989). Cloze item difficulty. JALT Journal, 11(1), 46-67.
—. (1993). What are the characteristics of natural cloze tests? Language
Testing, 10(2), 93-116.
—. (1998). An EFL readability index. JALT Journal, 20(2), 7-36.
—. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
—. (2013). My twenty-five years of cloze testing research: So what?
International Journal of Language Studies, 7(1), 1-32.
Brown, J. D., Janssen, G., Trace, J., & Kozhevnikova, L. (2013). Using
cloze passages to estimate readability for Russian university students:
A preliminary study. In M. A. Kulinich, V. A. Levchenko, E. G.
Kashina, L. A. Kozhevnikova, & E. A. Sokolova (Eds.),
Профессиональное развитие преподавателей английского языка в
условиях модернизации образовательной системы. Материалы
международной научно-практической конференции. Самара, 25-
26 марта 2013. [English language teacher professional development:
Scaling New Heights. A Collection of Conference Papers. Samara,
March 25th–26th, 2013]. Samara, Russia: Samara State University.
Appendix 1-B

Native Language________________________________________

DIRECTIONS:
• Read the passage quickly to get the general meaning.
• Write only one word in each blank. Contractions (example: don't) and possessives (John's bicycle) are one word.
• Check your answers.
NOTE: Spelling will not count against you as long as the scorer can read
the word.
Answer Key
past
day
how
said
about
jobs
energy
he
…
Appendix 1-D

100101,1,101-130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0 ; Examinee #1 cloze item responses
100102,1,101-130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
…
105041,50,5001-5030,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100101,1,31-40,0,0,0,,,0,0,0,0,0; Test taker #1 anchor item response
100102,1,31-40,0,0,1,,,0,1,0,0,0
…
105041,50,31-40,0,1,1,,,1,1,0,0,0
CHAPTER TWO

ESTIMATING ABSOLUTE PROFICIENCY LEVELS
IN SMALL-SCALE PLACEMENT TESTS

KAZUO AMMA
Abstract
In traditional placement tests a candidate’s proficiency level is
assessed based on the total test score. This estimation is inappropriate for
two reasons. Firstly, it may be affected by the arbitrary combination of test
items with varying difficulties. Too many relatively difficult items might
lead to underestimating a candidate’s true proficiency level, while too
many easy items might result in overestimation. Secondly, looking at the
total placement test score is not informative enough because it does not
show the absolute proficiency level of the candidates. This chapter
proposes using a logistic regression analysis, which properly assesses a
candidate’s proficiency level as well as the confidence interval, provided
the difficulty level of individual test items is defined in advance. The
difficulty scale can be any standardized proficiency scale (e.g., the
Common European Framework of Reference) as long as the test item
difficulty is projected on it. This technique can further allow continuous
estimation for incomplete performance (i.e., when an open-ended answer
is partially correct) as well as binary scoring (i.e., correct/incorrect). As
the main output of this kind of analysis is the proficiency level on the
difficulty scale, candidates can be placed directly in the corresponding
class level. As long as item difficulty information is provided, the
estimation can be conducted regardless of the number of candidates taking
the placement test and it can be applied to individualized online learning
programs.
Introduction
The goal of a placement test is to place the candidates on a proficiency
scale with properly defined descriptions of the target language behaviour.
The assessment of the candidates’ proficiency level should be absolutely
stipulated (i.e., criterion-referenced) and thus should not be affected by
arbitrary addition/deletion of test items. In other words, the candidate’s
proficiency measure must stay constant regardless of how many items in
the test are above or below his/her proficiency level. If test items are
too difficult, the candidate will hardly answer them correctly; if they are
too easy, the candidate will quite probably get the answers right. But these
results should not affect a candidate’s proficiency assessment. As an
alternative to traditional score-based assessment, this chapter proposes a
psychometric solution to estimate the candidates’ proficiency level which
refers to a set of proficiency criteria defined in advance. Such assessment
has often been done manually and subjectively with reduced reliability as
a result. Logistic regression analysis, however, is a statistical tool that
calculates the estimated mean proficiency level of a candidate as well as
the range of the confidence interval. The chapter also reports on two
candidates in a small-scale placement test and shows how the logistic
regression analysis describes the different characteristics of their
proficiency. The discussion that follows employs a psychometric rather
than a psychological perspective (Henning, 1992). Although placement
testing involves a number of issues such as test design, reliability, validity,
and decision-making (Fulcher, 1997; Plakans & Burke, 2013; Wall,
Clapham, & Alderson, 1994), the setting of item difficulty and the
streaming of difficulty levels in the examples included are assumed to be
accurate and reliable, in order to make the argument simple and clear.
Background
Although the estimation procedure presented here is independent of
any particular language proficiency/difficulty model in theory, its practical
application is based on a linear grading system of proficiency/difficulty.
Among various second language (L2) proficiency scales the most
comprehensive and influential is the Common European Framework of
Reference for Languages (CEFR). It describes what a learner can do when
he/she has reached a certain proficiency level. The following are
descriptors of overall reading comprehension in six levels of proficiency
(Council of Europe, 2001, p. 69). Table 2-1 refers only to reading skills
since the placement test to be described later in the chapter deals almost
exclusively with reading comprehension.
Table 2-1: CEFR descriptors of overall reading comprehension.

Level Descriptor
A1 Can understand very short, simple texts a single phrase at a time,
picking up familiar names, words and basic phrases and rereading
as required.
A2 Can understand short, simple texts on familiar matters of a
concrete type which consist of high frequency everyday or job-
related language. Can understand short, simple texts containing
the highest frequency vocabulary, including a proportion of
shared international vocabulary items.
B1 Can read straightforward factual texts on subjects related to
his/her field and interest with a satisfactory level of
comprehension.
B2 Can read with a large degree of independence, adapting style and
speed of reading to different texts and purposes, and using
appropriate reference sources selectively. Has a broad active
reading vocabulary, but may experience some difficulty with
low frequency idiom.
C1 Can understand in detail lengthy, complex texts, whether or not
they relate to his/her own area of speciality, provided he/she can
reread difficult sections.
C2 Can understand and interpret critically virtually all forms of the
written language including abstract, structurally complex, or
highly colloquial literary and non-literary writings. Can
understand a wide range of long and complex texts, appreciating
subtle distinctions of style and implicit as well as explicit
meaning.
Item #1 #2 #3 #4 #5
Points 10 10 10 10 10
Difficulty A1 A2 B1 B2 C1
“The farther person ability is below item difficulty, the more unlikely will
be success in responding to the item. Similarly, the farther person ability is
above item difficulty, the more unlikely will be failure in responding to the
item.” (p. 122)
p(x) = EXP(β0 + β1x) / (1 + EXP(β0 + β1x))     (1)

where x represents the difficulty level and p(x) represents the probability
of either pass or fail of the item. β0 and β1 are parameters that characterize
this candidate, and EXP represents an exponent of e, a constant equal to
2.71828, the base of the natural logarithm. The candidate's ability matches
the item difficulty at the point where the probability of response is 0.5.
This formula for logistic regression can be found in various books on
multivariate statistics, e.g., Lloyd (1999). Logistic regression is an analytic
method often used in item analysis. Figure 2-1 below shows an example of
a graphical representation of logistic regression analysis applied to a test
item in an actual test conducted for Japanese university students of English
as a foreign language (EFL) (N = 316) in which one grammatically correct
sentence should be chosen out of four options: (a) Grandma went
shopping, (b) Grandma went to shop, (c) Grandma went shop, and (d)
Grandma went to shopping (Amma, 2001). Note that the responses are
reduced to either ‘Pass’ (for option (a)) or ‘Fail’ (for other options or no
answer).
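As a minimal sketch of this estimation in Python (our code, not the author's; the chapter's reference list cites JMP for its analyses), equation (1) can be fit to one candidate's pass/fail record and the proficiency read off as the level at which p(x) = 0.5, namely -β0/β1:

import numpy as np
import statsmodels.api as sm

def estimate_proficiency(levels, passed):
    # levels: difficulty level of each item; passed: 1 = pass, 0 = fail.
    # Note: a record with no contradictory responses is perfectly
    # separated, and the maximum-likelihood fit will not converge.
    X = sm.add_constant(np.asarray(levels, dtype=float))
    result = sm.Logit(np.asarray(passed, dtype=float), X).fit(disp=0)
    b0, b1 = result.params
    return -b0 / b1     # level at which the pass probability is 0.5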
“Let’s say that we know that a particular person has a 75% chance of
succeeding on a question about state capitals. This 75% probability of
success means that he has 3 chances succeeding to 1 of failing, so that the
scale value is the natural logarithm of 3/1 = log (3) = 1.1 logarithmic units
(“logits”). A 50% chance of success, or a 50% chance of failure would be 1
chance of succeeding to 1 chance of failing, giving a scale value of the
logarithm of 1/1 = log (1) = 0.” (pp. 4-5)
1. Education
The literacy rate in Finland is 99% and the number of newspapers and
books printed per capita is one of the highest in the world. The nine-year
comprehensive school (peruskoulu) is one of the most equitable systems in
the world—tuition, books, meals and commuting to and from school are
free. All Finns learn Swedish and English in school and many also study
German or French. (Source: Lehtippu, 1996, p. 31).
#14. Why was the period before the Renaissance called the ‘Middle Age’
(underlined part)?
What was special about this test was that, unlike ordinary entrance
examinations, the test writer specified a difficulty level for each item as he
wrote the test. Since the test writer was in contact with the teaching staff
who were familiar with the level of teaching and goals of the curriculum,
it was easy to connect the difficulty levels to the proficiency required in
the specific course (see Table 2-2).
After the test was administered the candidates’ individual responses for
open-ended questions were judged as either pass or fail, depending on
whether they satisfied the required proficiency of the item in question. In
the case of multiple-choice items, the responses were simply pass or fail.
The rater’s job was to calculate the estimated proficiency level of the
candidates using the pass/fail information. For example, Candidate A
passed items #6, #10, and #17, whose difficulty levels were all 6, but failed
items #8 and #12, which were in Level 7. From this fact alone we may infer
that her proficiency estimate is somewhere between 6 and 7. But we also
have to consider the contradictory responses. She failed the relatively easy
item #19 (Level = 6), and passed the relatively difficult item #14 (Level =
7). Candidate B had more such contradictory responses (see Table 2-7).
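Applying the estimate_proficiency sketch shown earlier to just the Candidate A responses named in this paragraph (her full record in Table 2-7 contains more items, so the number below is illustrative only):

levels = [6, 6, 6, 6, 7, 7, 7]   # items #6, #10, #17, #19, #8, #12, #14
passed = [1, 1, 1, 0, 0, 0, 1]   # pass/fail as reported above
print(estimate_proficiency(levels, passed))  # about 6.6: between Levels 6 and 7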
Figure 2-5: Logistic fit of Candidate A with confidence interval. Bottom layer =
‘fail’.
Figure 2-6: Logistic fit of Candidate B with confidence interval. Bottom layer =
‘fail’.
In the case of Candidate A, item #15 was recoded: it is now Level 5,
because the rater judged her performance as corresponding to Level 5; its
weight is 4, half the original weight; and it contributes one 'Pass' response
and one 'Fail' response. The result of this process shows an expansion of the
confidence interval as well as a drop in proficiency level, even more
notably for Candidate B (see Table 2-12 below).
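This recoding can be reproduced with a frequency-weighted logistic fit; the sketch below assumes an original item weight of 8 (implied by the halving to 4) and placeholder values for the other items:

import numpy as np
import statsmodels.api as sm

levels = np.array([6., 6., 6., 7., 5., 5.])   # last two rows: item #15 split
passed = np.array([1., 1., 0., 0., 1., 0.])   # into one 'Pass' and one 'Fail'
weights = np.array([8., 8., 8., 8., 4., 4.])  # item #15 gets half weight

X = sm.add_constant(levels)
fit = sm.GLM(passed, X, family=sm.families.Binomial(),
             freq_weights=weights).fit()
b0, b1 = fit.params
print(-b0 / b1)   # proficiency estimate after the partial-credit recoding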
Discussion
It appears that the exact estimation leads to decreased accuracy even
though it was intended to increase it. One reason for this seeming decrease
in reliability would be the possible inconsistency of the subjective rating
with the rest of the dichotomous data; the finer the rating with which one
pinpoints the level, the less accurate the conclusion becomes compared
with a rough estimation. That is, an exact judgement that a
candidate’s proficiency is at Level 6.0 when his/her true proficiency is at
Level 6.5 is less accurate than a vague judgement that the candidate’s
proficiency is somewhere below Level 10. It is a matter of whether we
trust the rater’s case judgement or the latent ability structure that the
candidate is assumed to follow. In other words, should our analysis be
data-driven or model-driven? This question may remind us of the contrast
between Item Response Theory (IRT) and Rasch Model analysis. If we
understand the nature of variability in human behaviour, however, the
reality may more likely lie in obscurity than in clearly focused measurement.
One reservation has to be made with this estimation method using
logistic regression analysis. The present data happened to have no missing
data. Where there are some, they will be excluded from the calculation
because the pass/fail value paired with the difficulty level in equation (1)
is not obtained.
out that raters cannot distinguish a simple accidental absence of data from
intentional avoidance of answering, and proposes a correction programme.
Although he deals with self-adaptive tests, the same possibility may occur
in placement as well as other time-constrained tests when the candidate
Conclusion
This chapter described how the use of logistic regression analysis made
it possible to estimate test-takers' absolute proficiency levels, something
that is impossible with traditional placement based on raw scores.
Proficiency was described in terms of a set of levels, which included the
rough can-do statements (Table 2-2). This qualitative information is useful
when the candidates are streamed into classes. The information of the
Abstract
Despite a growing number of bilingual children enrolled in Early
Intervention language services, methods of administering language
assessments to bilingual children are not standardized. This study reports
clinically-meaningful differences in bilingual children’s receptive and
expressive language outcomes when their language skills are assessed in
the primary language versus in both the primary and secondary languages.
Eleven Spanish-English speaking children (ages 1;11 to 2;11) with
language delay enrolled in Early Intervention were assessed using The
Rossetti Infant-Toddler Language Scale (Rossetti, 1990) in their primary
language only, and then in both their primary and secondary languages.
When assessed in only one language, bilingual children’s language skills
were underestimated by 1.4 months for receptive language and 2.2 months
for expressive language; language delay was overestimated by 4.7% for
receptive language and by 7.8% for expressive language. Single-language
assessments would lead to inappropriate Early Intervention referral for 3
of the 11 tested children. It is therefore suggested that assessing bilingual
children in only one language leads to a significant underestimation of
receptive and expressive language abilities and a significant overestimation
of language delay. Consequently, the efficacy, reliability, and validity of the
assessment are compromised and best practice as mandated by speech-
language pathology certification organizations is not achieved.
Introduction
The number of bilingual children in the United States, as well as
throughout the world, is rapidly growing, due, in part, to globalization,
migration, and an increased prevalence of bilingual education options. For
example, of school-age children in the United States, 22% speak a
language other than English in the home (Lowry, 2011). Within certain
areas, such as large cities, an even higher percentage of families speak
more than one language in the home. For instance, a language other than
English is spoken in 35.5% of Chicago residences (United States Census
Bureau, 2013). Children in these homes who are developing more than one
language are generally believed to have language disorders at a similar
rate as children acquiring only one language (Kohnert, 2010). As a result,
the caseload makeup for speech-language pathologists often includes
children with language delay who are developing bilinguals.
When young monolingual and bilingual children fail speech-language
screenings or are referred by pediatricians due to speech-language concerns,
they undergo language assessment to determine eligibility for Early
Intervention services. For example, in Illinois a child is considered eligible
for speech-language services when he or she demonstrates a 30% or more
delay in one or more areas of speech, language, or communication, when
he or she presents with a medical diagnosis that typically results in
developmental delay, or when he or she is determined to be at risk of
substantial developmental delay (Illinois Department of Children and
Family Services, 2003; Illinois Department of Human Services
Community Health and Prevention Bureau of Early Intervention, 2009).
Eligibility for speech-language services through the Early Intervention
program in the United States is often determined based on assessment
outcomes of The Rossetti Infant-Toddler Language Scale (Rossetti, 1990).
The Rossetti is a criterion-referenced assessment of preverbal and verbal
areas of communication and interaction for children up to three years of
age. The skill age at which all criteria are demonstrated and the resulting
percent of receptive or expressive language delay relative to chronological
age determine the children's eligibility for Early Intervention.
The Rossetti is often used in the Early Intervention program as it is
familiar to Early Intervention clinicians across disciplines (e.g.,
occupational therapists, social workers, etc.) (Marchman & Martinez-
Sussmann, 2002) and because few other assessment tools cover a similar
breadth of developmental domains within the birth to three age range. Like
many assessments structured for use with young children (e.g., Bzoch,
League, & Brown, 2003; Hedrick, Prather, & Tobin, 1984; Marchman &
Background
Despite its use within the Early Intervention program, methods of
administering The Rossetti assessment to bilingual children are not
standardized. When The Rossetti is used to assess bilingual children,
accepted practices include measuring language abilities in only the child’s
primary language, in only the child’s secondary language, or across both
developing languages.
One concern with assessing bilingual children’s language skills in only
their primary or secondary language is that developing bilinguals with
language delay often display uneven skill distribution and shifting
development across languages, as well as individual variation in their
developmental trajectories (Kohnert, 2010). For example, a child may
have relatively even expressive vocabulary skills in Spanish and English,
but demonstrate more advanced verb conjugation skills in English. Even in
typically-developing bilingual children, language acquisition is
characterized by variable timeframes and patterns of development, which
cause difficulty in obtaining valid assessment outcomes (e.g., Kohnert &
Goldstein, 2005; Marian, 2008; Marian, Faroqi-Shah, Kaushanskaya,
Blumenfeld, & Sheng, 2009). Therefore, single-language assessment of
developing bilinguals may not accurately reflect their language abilities
and may not be best practice. Indeed, previous research with school-age
bilinguals suggests that both languages should be measured and
considered as a composite in order to reduce the risk of misdiagnosis and
inappropriate individualized education plans (Kohnert, 2008; Kohnert,
2010; Marian et al., 2009; Roseberry-McKibbin, Brice, & O’Hanlon,
2005).
While such risks in the school-age population are well documented,
there is little research examining language assessment methods with birth
to three-year-old bilingual children who have language delays (Dollaghan
& Horner, 2011). Within typically-developing populations, the
language(s) of assessment can affect measures of young bilinguals’ total
vocabulary size (Core, Hoff, Rumiche, & Señor, 2013; Hoff, Core, Place,
Rumiche, Señor, & Parra, 2012).
The Study
The study reported in this chapter looked at the differences in
expressive and receptive language measures on The Rossetti for birth to
three year old bilingual children with language delay when they were
assessed in their primary language versus in both their primary and
secondary languages. It was hypothesized that assessment outcomes
provide a more accurate picture of the developing bilingual’s language
level when skills are measured across both developing languages.
Therefore, it was predicted that when administering The Rossetti to young
bilingual children with language delay in only one language, outcomes
will underestimate language abilities and overestimate language delay.
Participants
Participants were 11 children (2 girls; 9 boys) of Hispanic descent
ranging in age from 1;11 to 2;11 (Mean = 2;5, SD = 0;4.8), born in the
United States to bilingual Spanish-English speaking parents. All
participants included in the study were assigned to Early Intervention
speech-language services and required annual or 6-monthly Early
Intervention mandated reassessment. All participants passed a hearing
screening within one year of the testing date. Verbal consent was obtained
from the participants’ parents prior to the evaluation.
Information about participants’ demographic information, linguistic
Bilingual Language Assessment in Early Intervention 67
backgrounds, and language skills was obtained from parent reports and
Early Intervention initial evaluation reports (see Table 3-1). Five
participants were reported to use English as their primary language; six
participants were reported to use Spanish as their primary language. On
average, participants made 78% of their expressions in their primary
language (SD = 10.8%) and 22% of their expressions in their secondary
language (SD = 10.8%).
Table 3-1: Participants' gender, age, type of diagnosed delay, primary and
secondary languages, and percent expression in the primary and secondary
languages.
Materials
Participants were assessed according to Early Intervention standards
using The Rossetti Infant-Toddler Language Scales (Rossetti, 1990) at
home with the presence of a parent, the treating therapist (first author), and
an interpreter who had been assigned by the program to the child’s case at
the onset of service provision. The Rossetti assesses skills across
developmental domains including Interaction-Attachment (e.g., ‘Plays
away from familiar people’), Pragmatics (e.g., ‘Uses words to protest’),
Gesture (e.g., ‘Gestures to request action’), Play (e.g., ‘Stacks and
Procedure
The Rossetti parent questionnaire and test criteria are available in
Spanish and English; however, this study's administration used only the
English questionnaire and test criteria, as an interpreter was present to
translate the questions from English to Spanish. Parent interviews were
completed in English with Spanish interpretation prior to the assessment to
determine participants’ demographic and linguistic backgrounds, and then
with The Rossetti parent questionnaire during the assessment period.
Follow-up questions and clarification questions were used as needed to
ensure adequate and appropriate interpretation of assessment questions.
Within each assessment period, The Rossetti was administered twice:
first only in the participant’s primary language (i.e., credit was only given
for skills demonstrated or reportedly observed in the primary language),
and then in the child’s primary and secondary languages (i.e., credit was
given for skills in either and/or both languages). During primary language
administration, all activities were conducted in the child’s primary
language only, and the child received credit for skills demonstrated in that
language only. For example, a child whose primary language was Spanish
would not receive credit for a skill demonstrated in English. During dual-
language administration, all activities were conducted in a ratio that
matched the parent-reported ratio of Spanish to English expression.
Children were awarded credit for all skills, regardless of their language of
demonstration. Because the assessment accounts for skills that parents
have observed but that may not have been demonstrated during the
assessment period, and because it is a criterion-referenced assessment with
general skill benchmarks, practice effects across single- and dual-language
assessments were not problematic.
The parent interview, primary language assessment, and dual-language
assessment occurred within the same contact period. Sessions lasted
approximately one hour and involved child-directed and therapist-directed
structured play activities, similar to a typical therapy session (e.g., shared
storybook reading, symbolic play with a toy farm, and putting together
puzzles).
Results
All data were analyzed using paired t-tests to compare outcomes when
assessments were conducted in only the child’s primary language versus
his or her primary and secondary languages. Results revealed that single-
language outcomes underestimated the participants’ receptive and
expressive language skills.
Dual-Language Testing
When assessed in both primary and secondary languages, participants’
receptive skill age was 21.8 months (SD = 6.2 months), representing a
delay of 23.6% (SD = 15.8%). Expressive skill age was 21 months (SD =
6.6 months), representing an average delay of 26.3% (SD = 19%). When
accounting for scattered skills (i.e., skill distribution) across both
languages, average highest receptive skill-age was 24.3 months (SD = 4.9
months) and average highest expressive skill-age was 24.3 months (SD =
4.9 months).
Single-Language Testing
Primary language assessment significantly underestimated participants'
receptive skill age by an average of 1.4 months and underestimated their
expressive skill age by an average of 2.2 months (SD = 1.4 months, t(10) =
5.1640, p < .05) (see Table 3-3 and Figure 3-2). Primary language
assessment also significantly overestimated the language delay by 4.7%
(SD = 5.7%, t(10) = 2.7368, p < .05) for receptive skills and by 7.8% (SD
= 5.4, t(10) = 4.8348, p < .05) for expressive skills (see Figure 3-3). The
findings also suggest that single-language assessment significantly
underestimated scattered skills by 2.5 months (SD = 1.8 months, t(10) =
4.5000, p < .05) in the receptive domain and 1.9 months (SD = 2.0 months,
t(10) = 3.1305, p < .05) in the expressive domain.
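As an illustration of the analysis, the following minimal Python sketch runs the paired comparison on the receptive skill ages in Table 3-2; the use of scipy is an assumption, and the sketch reproduces only the general form of the reported tests.

```python
# A minimal sketch of the paired comparison reported above, using the
# receptive skill ages (months) from Table 3-2.
from scipy import stats

primary = [15, 24, 27, 27, 21, 12, 15, 18, 18, 30, 18]  # primary language only
dual    = [18, 24, 27, 30, 21, 12, 18, 21, 18, 33, 18]  # primary + secondary

t, p = stats.ttest_rel(primary, dual)   # paired t-test, df = n - 1 = 10
print(f"t(10) = {t:.4f}, p = {p:.4f}")  # negative t: single-language lower
```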
Discussion
The results of the present study confirm that assessing bilingual
children in only one language leads to a significant underestimation of
participants’ receptive and expressive language abilities and a significant
overestimation of their language delay. Scattered skill measurement,
which provides treatment planning and skill distribution information, was
also significantly underestimated. As a result of obtaining inaccurate
assessment outcomes, eligibility determination and treatment planning are
therefore compromised when assessing language skills in only one
language, and implementation of best practice (ASHA, 2010) is not
achieved. We conclude that clinicians working with bilingual children
must measure highest skill levels across both languages to obtain accurate
diagnostic and treatment planning information.
Table 3-2: Participants' receptive language skill ages (months; % delay) and
highest skill ages (months) under primary-language-only versus dual-language
assessment.

| Participant | Primary Only (months; % delay) | Dual Language (months; % delay) | Diff. (months) | Diff. (% delay) | Primary Only Highest Skill Age (months) | Dual Language Highest Skill Age (months) | Diff. (months) |
|---|---|---|---|---|---|---|---|
| 1 | 15; 35% | 18; 21% | -3 | -14% | 15 | 21 | -6 |
| 2 | 24; 20% | 24; 20% | -0 | -0% | 24 | 27 | -3 |
| 3 | 27; 4% | 27; 4% | -0 | -0% | 27 | 27 | -0 |
| 4 | 27; 25% | 30; 17% | -3 | -8% | 27 | 30 | -3 |
| 5 | 21; 25% | 21; 25% | -0 | -0% | 24 | 27 | -3 |
| 6 | 12; 60% | 12; 60% | -0 | -0% | 15 | 18 | -3 |
| 7 | 15; 35% | 18; 22% | -3 | -13% | 15 | 18 | -3 |
| 8 | 18; 49% | 21; 40% | -3 | -9% | 21 | 24 | -3 |
| 9 | 18; 14% | 18; 14% | -0 | -0% | 21 | 21 | -0 |
| 10 | 30; 14% | 33; 6% | -3 | -8% | 33 | 33 | -0 |
| 11 | 18; 31% | 18; 31% | -0 | -0% | 18 | 21 | -3 |
| Mean | 20.5; 28.4% | 21.8; 23.6% | -1.4* | -4.7%* | 21.8 | 24.3 | -2.5* |

Note: * = significant difference at p < .05
Figure 3-1: Participants’ receptive language assessment results using The Rossetti
(1990) Language Comprehension subtest. Error bars represent standard errors and
asterisks indicate significant differences at p < .05.
Figure 3-2: Participants’ expressive language assessment results using The Rossetti
(1990) Language Expression subtest. Error bars represent standard errors and
asterisks indicate significant differences at p < .05.
Table 3-3: Participants' expressive language skill ages (months; % delay) and
highest skill ages (months) under primary-language-only versus dual-language
assessment.

| Participant | Primary Only (months; % delay) | Dual Language (months; % delay) | Diff. (months) | Diff. (% delay) | Primary Only Highest Skill Age (months) | Dual Language Highest Skill Age (months) | Diff. (months) |
|---|---|---|---|---|---|---|---|
| 1 | 18; 21% | 21; 9% | -3 | -12% | 18 | 21 | -3 |
| 2 | 24; 20% | 24; 20% | -0 | -0% | 24 | 27 | -3 |
| 3 | 21; 25% | 24; 14% | -3 | -11% | 21 | 27 | -6 |
| 4 | 24; 33% | 27; 25% | -3 | -8% | 33 | 33 | -0 |
| 5 | 24; 14% | 24; 14% | -0 | -0% | 27 | 27 | -0 |
| 6 | 9; 70% | 9; 70% | -0 | -0% | 27 | 27 | -0 |
| 7 | 12; 48% | 15; 35% | -3 | -13% | 21 | 24 | -3 |
| 8 | 18; 49% | 21; 40% | -3 | -9% | 21 | 24 | -3 |
| 9 | 15; 28% | 18; 14% | -3 | -14% | 18 | 18 | -0 |
| 10 | 30; 14% | 33; 6% | -3 | -8% | 21 | 24 | -3 |
| 11 | 12; 53% | 15; 42% | -3 | -11% | 15 | 15 | -0 |
| Mean | 18.8; 34.1% | 21; 26.3% | -2.2* | -7.8%* | 22.4 | 24.3 | -1.9* |

Note: * = significant difference at p < .05
Figure 3-3: Participants' percent language delay for comprehension and
expression under primary-language versus primary-and-secondary-language
assessment. Asterisks indicate significant differences at p < .05.
Clinical Implications
The results of the present study are relevant for Early Intervention
initial evaluation and ongoing assessment methods. Frequently, initial
evaluations assess developing bilinguals in the primary language or
secondary language only, or the evaluation report does not discuss the
language of assessment. Consequently, questions may be raised as to the
accuracy of children’s eligibility determination, as well as their speech-
language treatment planning. For example, if dual-language assessment
protocols are not followed, three of the eleven tested participants would
receive inappropriate referral for Early Intervention services. Although
these three participants would meet the 30% delay criterion when assessed
in only one language and could therefore be eligible for Early Intervention
services, when assessed across both of their languages, these participants’
language skills would fall within the average range for bilingual children.
Assessing children in only one language and inappropriately referring
them for services may cause these children’s families to direct limited
familial resources to the children’s treatment, as well as cause undue stress
on the family. Additionally, occupying a finite number of clinicians and
limited funding is not warranted for these children. Children who are
significantly delayed and who actually meet the eligibility requirements
may linger on a waitlist or receive no services as children whose
development is age-appropriate receive treatment. Furthermore, not
accounting for a child’s second language perpetuates negative bias against
bilingual language learners and the differences in their course of language
development as compared to monolingual language development.
Appropriate treatment planning may also be impacted by single-
language assessment as treating therapists develop therapeutic goals and
establish the language of treatment based on the children’s initial
evaluation reports. Developing a treatment plan based on inaccurate
assessment outcomes and skill distribution information is not best practice,
and may hinder the child in reaching his or her full communicative
potential. Also, due to a lack of continuity and infrequent contact between
assessing and treating therapists in Early Intervention, the treating
therapist may not be able to determine how and in what language the
child’s skills were measured based on unreported or inaccurately-reported
language of assessment in the initial evaluation reports. Consequently, the
Early Intervention language assessment process must accurately and
thoroughly account for developing bilinguals’ composite language skills.
The research presented here has direct implications for how language
assessments should be structured. Prior to initiating the assessment process
for children who are developing more than one language, the assessor
must complete a thorough case history with the child’s parent or caregiver,
utilizing interpretation services as necessary. The case history should
include information related to medical history and current health status
(e.g., birth weight, hospitalizations, familial medical history),
developmental milestones (e.g., age the child first walked, first words),
linguistic environment (e.g., primary language, language input/output,
community language), and concerns regarding the child’s language skills
(e.g., the child uses less than five true words and jargoning to
communicate). Assessments should then measure the child’s highest
language skill across both developing languages, as well as scattered skills
and other qualitative information (e.g., the child produces the pronouns ‘I’
and ‘me’ in English and ‘me’ in Spanish independently, but is able to also
produce ‘yo’ in Spanish given support). For example, a child with a
primary language of English and secondary language of Spanish who is
able to follow 2-step directions in English and 1-step directions in Spanish
should receive credit for following 2-step directions. Measuring the
highest reported and observed language skills across languages ensures
that all of the child’s skills are given credit. As a result, the assessment
Bilingual Language Assessment in Early Intervention 77
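A minimal sketch of the credit rule just described, with a hypothetical helper function (not part of The Rossetti materials): the child is credited with the highest skill level demonstrated in either language.

```python
# A minimal sketch (hypothetical helper, not part of The Rossetti materials)
# of the dual-language credit rule described above: the child receives credit
# for the highest skill level demonstrated in either language.
def dual_language_credit(skill_by_language: dict[str, int]) -> int:
    """Return the highest skill level observed across all languages."""
    return max(skill_by_language.values())

# e.g., following 2-step directions in English, 1-step in Spanish -> credit 2
print(dual_language_credit({"English": 2, "Spanish": 1}))
```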
Conclusion
To conclude, we have shown that assessing bilingual children in only
the primary language can underestimate their language abilities, and may
result in inaccurate eligibility determination and over-identification of
language delays. Therefore, it is vital that language assessments in
children acquiring multiple languages account for abilities across all
developing languages. Measuring children’s skills in all developing
languages (as opposed to skills in only one language) yields a more
accurate and complete assessment, which has immediate benefits for
appropriate service eligibility determination and treatment planning.
While our current findings provide support for the use of dual-
language assessments when determining children’s eligibility for Early
Intervention services, future research will need to explore the use of
single- versus dual-language assessments as evaluated by independent
raters. Although concerns of examiner bias in the present study were
minimized because all evaluations were thoroughly reviewed and
approved by a non-treating clinician not involved in the present study,
more rigorous evaluation methods are prudent to ensure that the
differences between single- and dual-language assessments are reproducible
across a variety of contexts and populations. Assessment outcomes will
also need to be evaluated across other diagnostic tools (e.g.,
Communication and Symbolic Behavior Scales by Wetherby & Prizant,
(1993); The Language Development Survey by Rescorla (1989); etc.).
Finally, future research will need to investigate the magnitude of
misdiagnoses by expanding the participant selection to more diverse
groups of language speakers (e.g., sequential language learners) and
demographic makeups (e.g., high versus low socioeconomic status). By
ensuring that all children receive accurate diagnoses and referrals for Early
Intervention treatment, best practice standards will be met and increased
therapeutic success will be achieved.
References
American Speech-Language-Hearing Association. (2010). Code of ethics.
Retrieved from: http://www.asha.org/Code-of-Ethics/
Bzoch, K. A., League, R., & Brown, V. (2003). The receptive-expressive
emergent language scale third edition. Austin, TX: Pro-Ed.
Core, C., Hoff, E., Rumiche, R., & Señor, M. (2013). Total and conceptual
vocabulary in Spanish-English bilinguals from 22 to 30 months:
Implications for assessment. Journal of Speech, Language, and Hearing
Research, 56(5), 1637-1649.
PENELOPE KAMBAKIS-VOUGIOUKLIS
AND PERSEPHONE MAMOUKARI
Abstract
The study reported in this chapter involved twelve Greek learners of
English in an oral administration of a translated and validated version
(Gavriilidou & Mitits, 2013) of the Strategy Inventory for Language
Learning (SILL) questionnaire (Oxford, 1990). There were two
innovations in this study, the first of which concerns the participants’
reporting of not only the frequency of use of each language learning
strategy (LLS), but also of their confidence in the effectiveness of each
strategy. The employment of this extra parameter gave the researchers the
potential to identify factors in learner strategy use that are not usually
detected when only frequency of use is reported. The second innovation concerns
the use of the bar (Kambaki-Vougioukli & Vougiouklis, 2008) instead of
the usual Likert scale, as this can be more flexible for both the participants
and the researcher. The results of the study showed deviations between the
frequency of strategy use and students' confidence in the effectiveness of
the language learning strategies, indicating that learners either
appreciated the effectiveness of a strategy but did not know how to use it,
or used a strategy without firmly believing in its usefulness. These
findings suggest the need for pedagogical interventions in order to raise
the learners’ awareness of language learning strategies and how to use
them effectively. Additionally, more proficient learners reported a higher
frequency and confidence in LLS use than their less proficient peers, while
the age of the learners did not seem to affect LLS use.
Introduction
Language learning strategies (LLS) are the conscious or semi-
conscious mental processes employed for language learning and language
use (Cohen, 2003). Research has shown that strategies may facilitate
language learning. As a consequence, strategic behavior has greatly
concerned research in language learning (Chamot, 2007; Wharton, 2000).
Moreover, there is enough convincing evidence that language learning
strategies can and should be taught (Chamot, 2005; Cohen & Macaro,
2007; Graham & Macaro, 2008).
Research has also indicated that the use of language learning strategies
can often be variable, since it depends on various factors, such as the
learners' age, their target language proficiency, and the socio-cultural
context (see Tragant & Victori, 2012 and references therein). Moreover,
the different methodological tools selected to investigate use of LLS may
lead to discrepancies between studies.
Background
Strategy Inventory for Language Learning (SILL)
• How familiar are the learners with the certain strategies mentioned
in the questionnaire?
• Are they sure they really employ the strategies they claim they do
or do they think so because they have heard the teacher or their
peers emphasize their importance?
Although one would assume that when learners claim they use a
strategy, they are most likely to consider it effective, we have reasons to
believe that this might not be the case. A series of studies
(Kambaki-Vougioukli, 2012, 2013) included confidence along with
frequency in the SILL questionnaire, namely the learners were asked to
specify not only how frequently they used each strategy but also how
confident they felt of its effectiveness. Results from these studies indicate
that when the learners claim they use a strategy, this does not necessarily
imply that they consider it effective as evidenced by the low confidence
scores. There have also been cases where learners claimed they did not use
a strategy but seemed confident that this strategy would help them in
language learning.
Finally, the close relation between the learners’ proficiency and the
frequency of strategy use would be of particular interest together with the
measurement of their confidence in strategy effectiveness.
Moreover, SILL questionnaires are generally in written form and their
data analysis process is usually quantitative. However, the oral
administration of SILL may glean important insights by stimulating the
learners’ individual experiences and by allowing the expression of
attitudes, feelings and behaviors, possibly opening up new topic areas. A
researcher might be able to better explain why a particular response was
given through a qualitative analysis of such results, alongside a
quantitative one.
that influence the choice of strategies and the teaching of strategies. Another
study by Gavriilidou and Papanis (2010) investigated the effectiveness of
direct strategy teaching with suggested activities for Muslim students. In
2009, Psaltou-Joycey and Kantaridou investigated multilingualism in
relation to the use of learning strategies as well as learning styles.
Learning strategies are also investigated in the project "Thales 2012",
with the translation and validation of the SILL questionnaire in Greek and
Turkish, aiming at the collection of useful data regarding learning strategy
use.
The previous studies suggest that there is a close relation between the
learners' proficiency and the frequency of strategy use; however, there has
been no recording of the learners' confidence that the strategies they
employ are actually effective for their learning.
The study reported in this chapter is part of a larger research project
that involved both Turkish-Greek bilingual students and native-Greek
students. Students provided their responses to the SILL questionnaire in
the form of oral protocols, i.e., face-to-face interviews, in order to have
their frequency of strategy use and confidence of strategy effectiveness
recorded. The oral administration of the SILL allowed interviewees to ask
for clarifications and the researcher to pose further questions and reach
more accurate conclusions about students’ use of language learning
strategies.
The Study
This study is part of a wider investigation conducted in Thrace, Greece
with two groups of learners of English, one Muslim, i.e., native speakers
of Turkish, and the other Christian, i.e., native speakers of Greek. The
terms Christian and Muslim are conventionally used to distinguish the two
groups. The Muslim group results have already been presented elsewhere
(Kambakis-Vougiouklis, Mamoukari, Agathopoulou, & Alexiou, 2013). In
this chapter, we focus on the LLS of the Greek native speakers and both
their frequency of strategy use as well as their confidence in the
effectiveness of those strategies. The study was guided by the following
research questions:
1. Does the learners' confidence in the effectiveness of a strategy
affect their choice and frequency of strategy use?
2. Are there any problematic items in the initial version of the SILL
questionnaire, i.e., items that are not well understood by the
learners?
3. Is the learners’ strategic behavior affected by their proficiency in
English (in combination with their age), and if so, how?
Participants
The learners in our study were all Greek and were recruited from the
first three grades of a public secondary school in Thrace (a prefecture in
the north-east of Greece). There were a total of 12 participants (six male
and six female), aged 12-15 years and learning English as a foreign
language. The sample comprised four students from each grade: two of
low and two of high level in English, with one male and one female at each
proficiency level. The learners' level of English language proficiency was
estimated according to their performance in class and their course grades
by their English teacher, who was also one of the investigators in the
present research study. Learners of intermediate English language
proficiency were not included in the sample because previous research
found differences in LLS use only between learners of low and high
proficiency in the target language (Magogwe & Oliver, 2007).
procedure that requires them to ‘feel’ or sense their position on the bar,
rather than consciously think of the wording or having to choose from any
suggested division pre-arranged for them. Replacing the discrete character
of Likert scales by a fuzzy one, such as that of the bar, seems even more
suitable when a questionnaire is not in the learners’ mother tongue and
where insufficient linguistic knowledge of the target language may distort
the validity of the questionnaire. Similarly, at the results processing stage,
when using a Likert scale, researchers must decide in advance how many
divisions will be used. By contrast, such an initially predetermined
decision is not required by the employment of the bar. Moreover, it is
possible to process the same data using different subdivisions, for a
number of reasons including that of comparability with different research
studies.
The bar was first introduced at a length of 10 cm but was later
modified to 6.2 cm, which is 10 divided by the golden ratio (10/1.618 ≈ 6.2).
The reason for this change is that, as argued, since human eyes are used to
the decimal system, people can easily divide a 10 cm long bar equally, which
is not desirable in our case. On the other hand, a bar length of 6.2 cm avoids familiar
divisions, leaving the participant free to choose from an infinite number of
points (Vougiouklis & Kambaki-Vougioukli, 2011). Finally, Kambaki-
Vougioukli et al. (2011) compared the fuzzy bar with the Likert scale in an
application of a departmental evaluation questionnaire among all students
of the Department of Education in Alexandroupolis, Greece, asking the
students to specify which method they preferred. The results yielded an
overwhelming majority of 98% in favour of the bar.
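As a small illustration of this flexibility, the following Python sketch (a hypothetical helper, not the authors' software) rescales a mark measured on the 6.2 cm bar to a [0, 1] score and then bins the same datum into any number of divisions chosen at the processing stage.

```python
# A minimal sketch (hypothetical helper, not the authors' software) showing
# why bar data permit post-hoc subdivision: a mark measured in cm on the
# 6.2 cm bar is rescaled to [0, 1] and can then be binned into any number
# of divisions chosen at the processing stage.
BAR_LENGTH_CM = 6.2

def bar_to_unit(mark_cm: float) -> float:
    """Rescale a mark on the 6.2 cm bar to a [0, 1] score."""
    return max(0.0, min(1.0, mark_cm / BAR_LENGTH_CM))

def bin_score(score: float, divisions: int) -> int:
    """Read a [0, 1] score on a scale with `divisions` equal bins (1-based)."""
    return min(int(score * divisions) + 1, divisions)

score = bar_to_unit(4.1)        # a participant's mark 4.1 cm from the left
print(round(score, 2))          # 0.66
print(bin_score(score, 5))      # the same datum read as a 5-point scale: 4
print(bin_score(score, 7))      # ... or as a 7-point scale: 5
```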
Figure 4-1: An example from the SILL questionnaire employing the [01] bar
for frequency and confidence.
Results
Within the content-analysis technique, all the answers were normalized
into groups on the basis of two criteria: (a) confidence, where the
deviation between frequency of use and confidence in the effectiveness of
each strategy for every single question was examined; and (b)
questionnaire comprehension (wording of the questions that might have
caused some problems). Also, a decision was made on the (arbitrary)
convention that if the difference between the confidence and the frequency
scorings was 6 on the 6.2 bar, then it was negligible and no further
investigation was necessary. If it were higher, we estimated that it would
need investigation.
1. Are the learners confident that the strategy they employ is effective
so they score high confidence where they score high frequency of
use?
2. Do they use certain strategies often but they are not sure of their
effectiveness, so they score lower in confidence?
3. Do they rarely use a strategy but nevertheless score high
confidence in this strategy?
Certain SILL items drew our attention regarding the way the learners
perceived and answered these items, always in relation to the confidence
factor. The items that were of greatest interest are the following:
Q.3 (Memory Strategy): Combining the image with the sound of a new
word. Eight out of the twelve participants scored equally in frequency and
confidence. Two scored lower in confidence, while two students appeared
to be rather puzzled, and one of them paused for quite some time before
scoring. Pausing for quite some time was interpreted as problematic
behavior and was recorded accordingly. The student either did not
understand the description of the strategy or was unsure whether
s/he actually did that or not.
Q.5 (Memory Strategy): I use flashcards in order to remember the new
words (with the new word on one side and the definition or other
information on the other side). Two out of the twelve students scored with
no apparent deviation between frequency and confidence, while ten
students scored much lower in frequency than in confidence. One of the
students asked what exactly was implied by the word ‘flashcards’. The
interviewer explained the word so that the student could proceed with the
scoring. The majority of the students did not use the strategy even though
they considered it to be rather effective. This could be interpreted as a
need for instruction on how to actually make better use of the strategy, or
it may be related to the participants' age: as teenagers, they may not use
flashcards for learning as much as very young learners do.
Q6. (Memory Strategy): I physically act out new English words. Four
students scored equally in frequency and confidence, whereas eight
Q44. (Affective Strategy): I talk about the way I feel when learning
English. Eight out of the twelve students scored equally on both bars,
while four students marked a higher score on the confidence bar.
Most of them made no comments, apart from one who admitted that he
does feel stressed when he talks in English, and another student, who
wanted to make sure he got it right, asked, or rather repeated the
statement, as if seeking further explanation (which was not provided,
as he immediately proceeded with the scoring). The students seem to be
aware of the strategy and also have the confidence that the particular
strategy is helpful.
therefore they extracted similar information from them. The questions that
were more problematic than others, in the sense that they needed further
explanation or were misinterpreted by the students if no clarification was
given, were Q21-I try not to translate word-for-word, and Q27-I read
English without looking up every new word. Similarly, questions Q46-I
ask English speakers to correct me when I talk and Q48-I ask for help
from English speakers were dealt with as if they expressed the same strategy
and therefore had the same impact on the students.
With regards to the two questions above, ‘I ask English speakers to
correct me when I talk’ and ‘I ask for help from English speakers’, there
was one more interesting observation. The lower level students reported
that they do not use these strategies but they believe that seeking help and
correction from others could help them. In contrast, the majority of the
advanced learners scored lower in confidence and some of them verbally
stated that they do not wish to be corrected or that they do not consider it
as a helpful strategy, i.e., they do not like it. This could be explained as a
refusal of the higher proficiency students to be corrected, as these students
are the ones who usually perform well, not only in English, but in other
subjects as well. Consequently, they might consider the correction by a
native speaker or by their teacher as a failure or negative exposure that
could cause them to ‘lose face’.
Discussion
Concerning the first question of this research study about whether
confidence affects learners’ choice of strategy, it was shown that in a
number of items there was great deviation between frequency of use and
confidence in the effectiveness of the strategy. This could imply a need
for strategy instruction, as the learners appear to be confident that a
specific strategy might help them, even if their frequency of use indicates
that they use the strategy rarely or, in some cases, not at all. This
is an important finding as it demonstrates the difference between what is
used and what is considered useful. Such an insight, however, would
have been impossible without the introduction of the parameter of
confidence and without the use of the bar.
As for the second question, a number of items in the questionnaire
were identified as problematic. These items caused confusion among
learners or were considered similar to one another, and often resulted in
incorrect responses. Such items probably need to be revised and reworded
before this instrument is used again (see Dörnyei, 2003, as well as
Roszgowski & Soven, 2010, for similar suggested improvements in
questionnaires).
Finally, concerning the third research question, namely how
proficiency in English affects the learners' strategic behaviour, it is
evident that it does: the students' level (beginner versus advanced)
seems to influence not only their perception of the actual items of the
questionnaire, but also their recorded responses.
Conclusion
The current study has the obvious limitation of a small number of
participants, since it was a pilot study. Future research with a larger
sample would allow quantitative analyses and correlations that would
provide more valid conclusions. This limitation is compounded by the
mode of administration of the suggested instrument: oral administration
through individual interviews cannot be applied to a large number of
learners and therefore yields very restricted data.
As a general conclusion, we could point out that, apart from certain
improvements and/or changes that need to be made to the questionnaire
to render it more appropriate for these specific learners, the need for
instruction is apparent, as it would boost the learners' strategy use,
make them more efficient and autonomous, and probably encourage and
reinforce their self-study. Moreover, the data-collection format could be
adapted so that a larger number of participants could be included, and
therefore more valid information could be extracted through a
differentiated format of the same questionnaire aimed at administration
to larger groups of learners.
Acknowledgement
This study is part of the Thales Project MIS 379335. It was carried out
within the National Strategic Reference Framework (Ε.Σ.Π.Α.) and was co-
funded by the European Union (European Social Fund) and national
resources.
LEE-YEN WANG
Abstract
This research study investigated English as a foreign language (EFL)
college students’ vocabulary acquisition of a group of 52 Academic Words
that were excluded from the national wordlist for high school students in
Taiwan. The study found that the 52 omitted words were acquired
significantly less by both the freshman and senior students in Taiwan
compared with the other non-omitted 518 academic words. In addition, 38
of the 52 omitted words are also on the Academic Vocabulary List (AVL),
which was made available in 2013 (Davies, 2012), as part of the Corpus of
Contemporary American English (COCA). Again, these 38 shared words
were also acquired significantly less than the non-omitted academic words
by the same groups of freshman and senior students in this research study.
These findings highlight the limitations of having a centrally controlled
national wordlist for students to learn, as anything omitted from that list
will have a high probability of being missed subsequently in later
acquisition.
Introduction
In Taiwan, before students can be admitted to a college, they have to
take at least one mandatory national exam, and English is a required
subject. Much like other Asian countries, in Taiwan college entrance
examinations are high-stakes exams (Guo, 2005; Ng & Renshaw, 2009)
and students are encouraged to study hard for them early in their
school careers.
Background
High school English education in Taiwan is under the guidance of the
ERWL, which is published and released by the College Entrance and
Examination Center (CEEC), a non-profit organization commissioned by
the MOE in Taiwan to administer the nationwide college entrance
examination (CEEC, 2015). Wang (2015) compared the ERWL with
West’s General Service List (GSL) (West, 1953), Coxhead’s (2000)
Academic Word List (AWL), the 5,000 Frequency Dictionary (Davies &
Gardner, 2010) from the Corpus of Contemporary American English
(COCA), and the most frequent 500,000 words in the COCA. The study
concluded that the ERWL is a reasonable wordlist, but the comparison
study identified a set of 52 AWL words which were omitted from the
ERWL probably because when Jeng (2005) was compiling the ERWL, he
was unaware of the availability of the AWL by Coxhead (2000). This
study leverages this finding as a prism to investigate the acquisition of
vocabulary by English as a foreign language (EFL) college students in
Taiwan from the perspective of these 52 academic words.
Vocabulary learning is essential for language development. Wilkins
(1972) states that “without vocabulary, nothing can be conveyed” (p. 111),
not ignoring the importance of grammar, but stressing the role of
vocabulary in conveying ideas. Lexical knowledge is essential in all skill
areas. This study also investigates how many of the 52 omitted words
overlap with COCA's AVL and further examines the acquisition of these
overlapped words by Taiwanese students.
Properly assessing a learner’s vocabulary knowledge plays a
significant role in facilitating efficient language teaching and learning.
Assessment can evaluate vocabulary from a variety of perspectives: size,
depth, fluency, and other cognitive and association skills (Meara, 2002;
Meara & Wolter, 2004; Read, 2000; Schmitt, 2014; Sonbul & Schmitt,
2013; Tannenbaum, 2006). Read (1993, 2000) developed tests to assess
word association, including knowledge of collocation, derivative forms of
a stem word, and polysemous meaning senses.
Nation (2001) differentiated the passive vocabulary that a person can
understand in reading and listening, from the active vocabulary, which a
person can use in speaking and writing. A convenient way to measure
students’ passive or receptive vocabulary knowledge is through a checklist,
where students mark YES/NO (Y/N) to self-report whether they know
these words (Meara & Buxton, 1987). These Y/N tests are incapable of
surveying vocabulary depth (Laufer & Goldstein, 2004; Read, 2000), but
they are simple to administer and they are effective in assessing
vocabulary size or breadth (Read, 2007). However, the self-reporting
nature of Y/N tests requires researchers to implement further checks to
assess the reliability of the self-reporting. Pseudowords were introduced to
the Y/N tests in the reading comprehension assessment by Anderson and
Freebody (1983), with two approaches to creating a pseudoword. The first
one is to add a prefix or suffix to a real word, e.g., ‘steal’ becomes
‘stealment.’ Modifying the vowel or the consonant, one or two at a time,
forms the second method of constructing a pseudoword. Pseudowords
cannot be counted as real words, so they are dealt with as ‘hit’ and ‘false
alarm.’ A hit indicates that the pseudoword is correctly recognized as a
pseudoword, while a false alarm is when the test-taker claims to know a
pseudoword as if it were a real word. The relative numbers of these two
measures can indicate if the test result can be trusted (Meara, 1992).
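The following minimal Python sketch shows how such a pseudoword check might be scored, following the chapter's usage of 'hit' and 'false alarm'; the scoring function itself is hypothetical, and the pseudowords are the chapter's own examples.

```python
# A minimal sketch of scoring the pseudoword check in a Y/N checklist,
# following the chapter's usage: a 'hit' is a pseudoword correctly
# recognized as such (answered No); a 'false alarm' is a pseudoword
# claimed as known (answered Yes). The function itself is hypothetical.
def score_pseudowords(responses: dict[str, bool], pseudowords: set[str]):
    """responses maps each item to the Yes/No answer given (True = Yes)."""
    hits = sum(1 for w in pseudowords if not responses[w])
    false_alarms = sum(1 for w in pseudowords if responses[w])
    return hits, false_alarms, false_alarms / len(pseudowords)

responses = {"abendon": False, "availabler": True, "breif": False}
hits, fas, fa_rate = score_pseudowords(responses, set(responses))
print(hits, fas, f"{fa_rate:.2f}")  # a high false-alarm rate casts doubt
                                    # on the candidate's self-report
```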
The Study
This study is based on the findings by Wang (2015) that there is a set
of 52 AWL words that were omitted from the ERWL. A question that
arises is whether the omitted vocabulary items can be acquired later in
college and how they are acquired. In addition, because the COCA also
released the AVL, this study leverages the AVL to investigate the number
of overlaps between the omitted AWL words and the AVL, and the
acquisition of these overlapped words.
Research Questions
In the current study, three questions were posited to investigate the
acquisition of the 52 omitted academic words in the AWL relative to the
set of non-omitted AWL words.
Participants
Two classes of freshman and one class of senior students from a
private university in Taipei, Taiwan participated in this study. This
university has a policy of assigning students to freshman English classes
by the level of their General Scholastic English Ability Test (GSEAT),
which is a mandatory exam every high school senior student in Taiwan
has to take in early February of each year. There are 15 levels in the
English subject. The freshman students who participated in the study were
at GSEAT Level 13 or above, which corresponds to the 76th to 82nd
percentile of all students who took the exam. One student was at Level
15, which corresponds to the 88th to 100th percentile. These two classes of
freshman students had the highest GSEAT score in the university. A total
of 50 freshman students participated in this study. A further 39 senior
students participated in this study from the English department. Table 5-1
summarizes some general information about the study participants.
Table 5-2: Distribution of the sample words across the AWL Levels.
Table 5-4 presents the list of the 82 pseudowords included in the Y/N
checklist. In this table, the pseudowords, the AWL Levels, and the original
words for the pseudowords are presented. For instance, ‘abandon’ was
modified to ‘abendon’ using vowel modification. The word ‘availabler’
was created by adding the suffix '-er' to the word 'available'. The word
'breif' was created from 'brief' with an 'ie' and 'ei' swap. The
purpose was to gauge if pseudowords derived from legitimate academic
words in the AWL would produce different responses between freshman
and senior students.
Table 5-4: Pseudowords with Level information and the original AWL
words.
• OC: the total count of 'Yes' responses to the omitted AWL words
in the checklist.
NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 26/52 = 0.5
RP = NP/OP = 1
If this student keeps the same NC, 110 'Yes', but reports only 39 'Yes'
out of 52 (OT), the calculation will then be as follows:
NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 39/52 = 0.75
RP = NP/OP = 0.667
On the other hand, if this student keeps the same NC, 110 ‘Yes’, but
reports only 13 ‘Yes’ out of 52 (OT), the calculation will then be as
follows:
NP = NC/NT = 110/220 = 0.5
OP = OC/OT = 13/52 = 0.25
RP = NP/OP = 2
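The same computation, expressed as a minimal Python sketch (a hypothetical helper reproducing the three worked scenarios above):

```python
# A minimal sketch of the NP/OP/RP ratio computation worked through above.
NT, OT = 220, 52   # totals of non-omitted and omitted words in the checklist

def rp(nc: int, oc: int) -> float:
    np_ = nc / NT  # NP: proportion of 'Yes' on the non-omitted words
    op = oc / OT   # OP: proportion of 'Yes' on the omitted words
    return np_ / op

print(rp(110, 26))  # 1.0     -> both word groups known in equal proportion
print(rp(110, 39))  # ~0.667  -> omitted words known relatively better
print(rp(110, 13))  # 2.0     -> omitted words known relatively worse
```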
Table 5-5: Descriptive statistics for the checklist counts (partial).

| Year | DV | Mean | SD | Min | Max | Range | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|---|
| Freshmen | NC | 160.80 | 26.439 | 102 | 204 | 102 | 144.23 | 174.12 |
The combined results from Tables 5-5 and 5-6 indicate that the ratio
of the non-omitted to the omitted academic words for senior students was
lower than that for freshman students; and both ratios are significantly
greater than 1, which indicates that these two groups of words are not
known by the students in equal proportion. This can be attributed to the
omission of these words from the ERWL and hence from high school instruction.
Finally, the last question this study attempted to answer was whether
there was a significant difference in the acquisition of the omitted words
that were both on the AWL and AVL by freshman and senior students.
Gardner and Davies (2014) reported that 451 out of the 570 AWL words are in
the first 4,000 most frequent lemmas in the COCA. This leaves an
interesting question about how many of the 52 omitted words are in the
AVL. From COCA’s perspective, this would show how important the
omitted words are. A comparison was performed, and it was found that 38
of them were in COCA’s AVL. This indicates that from the perspective of
COCA’s AVL, these 38 words should not have been omitted by the
ERWL.
Table 5-8 shows the comparison in students’ performance between the
52 omitted academic words and the subset of 38 words (see Table 5-9) that
are in the new academic vocabulary list released by Gardner and Davies
(2014).
Table 5-9: Omitted academic words and their COCA ranks and
frequencies.
(Table columns, in two parallel groups: Word, AWL Level, AVL Rank,
COCA Frequency.)
Looking at the results of the analysis, it is obvious that the shorter list
(38 items) was relatively more difficult for the freshman students than for
the senior students: for the seniors, the NP/OP ratio was 3.183 for the 52
items and 3.229 for the 38 items, whereas for the freshmen these two
values were 4.869 and 5.559, respectively. Furthermore,
the ANOVA comparison between the freshmen and the seniors for the
short list was found to be significant (F(1, 87) = 10.401, p = .002, η² =
0.107, observed power = 0.891), indicating that there is a significant
difference in the acquisition of the 38 words between the freshman and the
senior students.
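As an illustration of this analysis, the following minimal Python sketch runs a one-way ANOVA between two groups and computes eta squared from the sums of squares; the data are randomly generated stand-ins, not the study's scores, and the use of scipy is an assumption.

```python
# A minimal sketch of the reported comparison: a one-way ANOVA between
# freshman and senior scores on the 38-word list, with eta squared computed
# from sums of squares. The data below are illustrative stand-ins only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
freshmen = rng.normal(7, 3, 50)   # hypothetical 'Yes' counts, n = 50
seniors = rng.normal(12, 4, 39)   # hypothetical 'Yes' counts, n = 39

f, p = stats.f_oneway(freshmen, seniors)

grand = np.concatenate([freshmen, seniors])
ss_total = ((grand - grand.mean()) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2
                 for g in (freshmen, seniors))
eta_sq = ss_between / ss_total  # proportion of variance explained by group

print(f"F(1, {len(grand) - 2}) = {f:.3f}, p = {p:.4f}, eta^2 = {eta_sq:.3f}")
```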
Conclusion
The study reported in this chapter shows that there is indeed a
significant difference between freshman and senior college students in
Taiwan in their receptive knowledge of academic words that are on the
ERWL list and those that are not on the list. The freshmen knew an
average of 160.8 words out of the 220 sampled words or 73.1%, while the
seniors reported that they knew 176.41 words out of the 220, or 80.18%.
In contrast, the freshman and the senior students reported that they knew
9.74 (18.7%) and 16.69 (32.09%) words out of the omitted 52 words,
respectively. The average ratio of the non-omitted words to the omitted
words was 4.869 times for the freshman students and 3.183 times for the
seniors. The gap became smaller as students learned more English, but it
was still far from the ideal 1:1 ratio. Further research is needed to find out
the reasons for senior English majors failing to acquire the omitted AWL
words. In addition, pseudowords were found not to have a discriminating
effect between freshman and senior students in this study. A
cross-examination with the new academic vocabulary list from COCA
found that 38 of the omitted AWL words are also present on the new list
(AVL), and there is also a significant difference in acquiring these
overlapped words between the two groups of students. Although vocabulary
testing is better done in context (Read, 2004; Read & Chapelle, 2001), this
354-item checklist serves to show a significant learning discrepancy. The
findings of this study can be useful to the CEEC in order to help refine and
augment the ERWL, while college and high school English language
instructors can use the information in this study to help find ways to teach
the omitted words to their students.
References
Anderson, J. (1980). The lexical difficulties of English medical discourse
for Egyptian students. English for Specific Purposes Newsletter, 37, 4.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the
assessment and acquisition of word knowledge. In B. A. Hutson (Ed.),
Advances in reading/language research (Vol. 2, pp. 231-256).
Greenwich, CT: JAI Press.
Abstract
It is generally accepted that autonomy is a matter of degree, or degrees,
which fluctuate; it could therefore be assumed that the way in which
language teaching and learning is approached can make a significant
difference to the degree of autonomy achieved, and that the degree of
autonomy may in turn influence language learning. It is also well-documented
that assessment plays an influential role in learning and that, like
autonomy, assessment is also a matter of degrees: the greater the degree of
involvement of the learner in the assessment process, the greater the
degree of autonomy that can be achieved. Although assessment in
language learning and teaching contexts is usually intended as assessment
of the language competencies, it is the intention of this chapter to show
that assessment of learning competencies and of competencies for
autonomy should play a role in a curriculum aiming at fostering learner
autonomy and reflexive learning. The research project reported in this
chapter was conducted in a German higher education context and involved
the development and use of a dynamic model of autonomy. Once the
nature of autonomy had been examined and the views of theorists and
practitioners in the field had been taken into account, dimensions of
autonomy and their sub-elements were integrated within a dynamic model
for initiating and continuing pedagogic dialogue between students and
their teachers/advisers at the Freie Universität Berlin. The model of
autonomy proved to be reliable, provided a clear picture to learners and
their teachers/advisers, and showed its potential to be used iteratively.
Being online, the model is available for anyone to use and its value lies in
the formulation of a profile for learners, which helps them understand their
own learning.
Introduction
Assessment plays a central role in language teaching and learning. In
institutional forms of assessment, such as tests or certifications, we can
often observe that entire curricula and syllabi are directed to making
learners able to ‘pass the test’ (Prodromou, 1995). In teaching and learning
settings aiming at the development of learner autonomy, the capacity of
the learner to assess/evaluate their own progress and their learning process
is a pivot in the development of learner autonomy (Dam, 1995).
Empowering learners to assess their own language competencies and their
learning process is therefore one of the main challenges for language
educators. This can be done by putting in place formative assessment
modes, in which learners play an active role through self- or peer-
assessment. These forms of assessment are referred to in the literature as
assessment for learning and assessment as learning (Boud, 2000; Colbert
& Cummings, 2014).
Although assessment in second language education is mostly intended
as assessment of language competencies, researchers in the literature on
learner autonomy have started to investigate, besides modes of assessing
language competencies, forms of assessing the learner's disposition and
capacity for autonomy, i.e., for self-directing their own
learning (Sheerin, 1997). This chapter describes an approach towards
assessment of/for autonomy developed in a German higher education (HE)
setting, at the Freie Universität, Berlin (FUB). The project was undertaken
as part of the author’s doctoral studies, with the aim of encouraging and
promoting the autonomy of the language learners involved, young adult
language learners in HE, using tools which, combined with advising,
aimed to increase learner awareness and to explore the possibility of
greater learner empowerment in the language learning process.
The study was conducted while the author was setting up a self-access
centre, the Centre for Independent Language Learning (CILL), at the FUB.
Recognizing that an essential element of the self-access centre was to
provide learners and learning facilitators, such as teachers and advisers,
with various forms of support for developing learner autonomy, the study
was undertaken with the aim of (i) operationalizing the notion of learner
autonomy on the basis of a critical analysis of existing definitions and
descriptions, and (ii) developing a tool for supporting the learners’
awareness and reflection on their learning competencies.
Figure 6-1: The dynamic model of learner autonomy (Source: Tassinari, 2010, p.
203).
(active listening). The adviser then asks questions seeking further details
and clarification, reformulates the learner's statements, and sums up the
information elicited. Finally, the adviser focuses on what the learner's
next priorities are and asks about their next steps.
This style of pedagogical dialogue is useful in that learners are not left
alone to cope with self-assessment, in which they may have the tendency
to be too kind to themselves or too strict. Without specific criteria with
which to judge, learners might have difficulty in assessing their
competencies in various situations. Having the descriptors allows the inner
perspective of the language learner to interact and be compared with an
external perspective on autonomous language learning. Most importantly,
the dialogue with the adviser/teacher or peers is real and, as such, it has
the potential to unleash powerful and meaningful interaction, where
internalized understandings can be brought to the surface and become
externalised. The benefits to learners are that they are enabled to reflect
deeply, without constraints. They can initiate the topics for discussion and
by doing so they gain insights into their own attitudes and competencies
and establish the basis for decision-making. This capacity for reflection
and consequent action is both the aim and the outcome of the evaluative
reflection process.
The descriptors in the model are not provided with a numeric answer
system, because giving a numeric score to the different answers would
imply a hierarchy among the components and the descriptors and would
severely compromise the learners’ ability to freely choose the components
and the descriptors upon which they would like to reflect. Furthermore, a
scored test is not advisable from a pedagogical point of view, since it
could give learners the false impression that there is a full score to reach.
Therefore, the assessment is purely formative and qualitative, enhancing
metacognitive processes, and, most importantly, it can be repeated when
and as needed, with the learner modifying and changing the focus as
required. All of these features contribute to making the model
dynamic.
Table 6-1: Reasons for choice of components (more than one answer
possible).
What was encouraging was that all but one of the 21 participants gave
positive feedback concerning the self-assessment. The majority found the
model useful and thought that the self-assessment process gave them the
impetus for self-reflection, increased awareness of their learning processes
and helped them set further goals to improve their language learning.
Learners felt that the model also made them more conscious of the
choices and opportunities open to them and more competent in making
decisions with regard to their future learning (see Table 6-2). Such decisions
might include choosing to undertake new learning tasks, trying new
strategies, joining a learning group or finding a tandem partner. Learners
might also decide to leave a course or change courses in order to meet their
more specific learning needs. Difficulties that the learners identified related
to autonomous learning were variations in motivation levels, the challenges
involved in self-assessment of their language skills, and their ability to select
suitable materials, to plan and to manage their time efficiently.
The results of the investigation showed that the dynamic model is a
valid tool which supports evaluation and fosters awareness, reflection and
decision-making. Through reflection on skills and competencies, learners
were brought to a state of greater awareness in which they could identify
their strengths and shortcomings and recognise areas in which they needed
support. This contributed to improving the learning process and to greater
regulation by learners. The following comments by two of the students
illustrate these points:
“I have learned that I have a problem with managing: I always learn, but
before [the self-assessment] I wasn’t aware of this problem. I start
learning, then I get side-tracked and I don’t make progress.” (Student 19)
actually exploit and which you do not, what can be improved and, as for
many things, if I have only my own point of view, maybe I am only able to
see certain things. […] It’s a test that, since it has no grade, one can do it
freely and it allows you to realize your own pros and contras.” (Student 4)
Conclusion
The aims of the research illustrated in this chapter were to create a
dynamic model of learner autonomy which would offer a comprehensive
description of language learning competencies, skills, attitudes and
behaviours, to be used to support reflection in autonomous language
learning processes. The self-assessment proposed by the dynamic model
offers a learner-centred, dynamic and recursive approach which involves
the collaboration of learners and advisers or teachers within a pedagogical
dialogue and can be renegotiated according to the changing focus of the
learning context and/or situation.
The research participants were positive about their use of the dynamic
model, stressing that the self-assessment stimulated their awareness and
their reflection on their learning competencies, and helped them recognize
strengths and/or issues in their learning process. Out of this reflection, they
could better focus on priorities and make decisions for their further
learning. With the use of the dynamic model, learners reached greater
awareness of themselves as autonomous learners, through the processes of
critical thinking and evaluation, which encouraged metacognitive
development.
However, self-assessment, both of language and of learning competencies,
can be very difficult for learners who are not used to it. The descriptors in
the model provide learners with criteria for conducting the self-
assessment. In addition, the pedagogical dialogue with advisers and/or
teachers gives learners the opportunity to compare their own perspective
with an external perspective, and therefore to enhance their self-awareness
and critical reflection, which are key aspects of learner autonomy. Thus,
the pedagogical dialogue aims at encouraging learners to reflect and take
on a more agentic role than they might previously have been accustomed
to in their language learning.
Since learners (and maybe even teachers) may be reluctant to engage
in such innovative practices of self-assessment, it is necessary to integrate
self-assessment in learning and teaching settings that foster autonomy and
to support it through reflection on learners’ and teachers’ roles and beliefs.
Due to the complex, developmental and fluctuating nature of learner
autonomy, a qualitative research approach, such as the one adopted in this
References
Barcelos, A., & Kalaja, P. (2011). Introduction to beliefs about SLA
revisited. System, 39(3), 281-289.
Benson, P. (2001). Teaching and researching autonomy in language
learning. Harlow, Essex, UK: Pearson Education.
—. (2015). Foreword. In C. J. Everhard & L. Murphy (Eds.), Assessment
and autonomy in language learning (pp. viii–xi). Basingstoke, UK:
Palgrave Macmillan.
Boud, D. (2000). Sustainable assessment: Rethinking assessment for the
learning society. Studies in Continuing Education, 22(2), 151-167.
Breen, M. P., & Mann, S. J. (1997). Shooting arrows at the sun:
Perspectives on a pedagogy for autonomy. In P. Benson & P. Voller
(Eds.), Autonomy & independence in language learning (pp. 132-149).
London: Longman.
Candy, P. C. (1991). Self-direction for lifelong learning. San Francisco:
Jossey Bass.
Colbert, P., & Cummings, J. J. (2014). Enabling all students to learn
through assessment. In C. Wyatt-Smith, V. Klenowski & P. Colbert
(Eds.), Designing assessment for quality learning (Vol. 1, pp. 211-
231). Heidelberg, Germany: Springer.
Cooker, L. (2012). Formative (self-)assessment as autonomous language
learning. Doctoral thesis, University of Nottingham, UK.
Abstract
The study reported in this chapter explored whether portfolio-based
assessment is effective in fostering learner autonomy through a
longitudinal study in a Chinese junior high school. A three-dimensional
learner autonomy scale was administered to both the experimental and
control groups. The questionnaire findings revealed that English Learning
Portfolios (ELP) were conducive to helping students gain learner
autonomy, which was further supported by the case study results. The
study also showed that the ELP template and development need constant
negotiation and adjustment in accordance with learner needs and
environmental constraints. Therefore, some implications and suggestions
are provided in this regard.
Introduction
Learner autonomy has been a popular research topic in the past thirty
years since Holec (1981) first used the word ‘autonomy’ in his report of
the Council of Europe’s Modern Language Project. In China, during the
past decade, learner autonomy, as a remedy for the long-standing problem
of teacher-centeredness, has already found its way into the National English
Curriculum Standards (NECS) (2001), a national English teaching
syllabus for primary and secondary English language education in China.
To help learners assume a more active role in English language
learning, formative assessment has been suggested by the NECS, in
addition to summative assessment, so that learners can assess their own
performance and that of their peers. In doing so, it is believed that learners
can attach more importance to the learning process than the learning
results as learning and assessment are reciprocally integrated (Little &
Erickson, 2015). In fact, the issue of test-oriented, time-consuming but
ineffective English language teaching in China has received a lot of
criticism since the late 1990s (Dai & Hu, 2009). However, research by Shu
(2004) has shown that most English teachers in secondary schools were
unaware of the exact requirements or suggestions put forth in 2001 by the
NECS. In practice, teachers in many schools still immersed their students
in exercises and discrete point quizzes and tests, activities which were not
designed to improve student language competence. Because of the
high-stakes nature of the senior high school entrance examinations, tests
were still considered to be the most powerful measure of students’
performance and teachers’ teaching abilities.
The issue of over-emphasis on teaching and formal assessment is also
evident in a core language journal published in Chinese, namely, Foreign
Language Teaching in Schools. For example, even a decade after the
NECS was introduced, the dominant theme of the papers published in the
journal in 2011 was classroom teaching, which constituted 51.4% of all
journal content, followed by high-stakes tests, which covered 13.8% of the articles.
Articles on formative assessment were conspicuously lacking.
To bridge the gap between societal needs, educational policy and
reality, researchers and scholars launched collaborative university-school
English language teaching (ELT) projects (Wang & Zhang, 2014). The
present study was but a part of a collaborative ELT project between a
junior high school and a foreign language university. The goal of the
three-year longitudinal study was to empower teachers and learners with
innovative development in course design, assessment and teacher training.
The present study focused on reform in assessment by integrating the
English learning portfolio (ELP) into the assessment system in an effort to
foster learner autonomy.
Background
Learner autonomy (LA), though considered “an elusive notion” (Bown,
2009, p. 572) and embodied in various terms (Dickinson, 1987; Sheerin,
1991; Wenden, 1991; White, 1999; Zimmerman & Schunk, 2001), is a
three-dimensional concept in the present study: metacognitive, affective
and social. Metacognition is accepted as a key element in LA as learners
are supposed to take charge of their own learning (Holec, 1981) and “take
relevant research about the portfolio and learner autonomy. Gong and Luo
(2002) introduced what portfolio assessment is and how to develop it in
schools by giving some examples and practical suggestions. However,
there was no systematic description about any empirical studies on this
issue. Rao (2006) carried out a 6-month study during which he integrated
the portfolio in his class instruction among university students and gained
positive feedback from students, but his study did not include a control
group. Lo (2010) recorded how she used a reflective portfolio to promote
learner autonomy in a journalism course. In her study, questionnaires were
administered to learn more about learners’ gains in journalism.
The longitudinal study reported in this chapter posited two research
questions:
The Study
This study took place in a Chinese junior high school as part of a large
ELT project, in collaboration with a university research team, which was
made up of six doctoral candidates and their supervisor. The needs
analysis in the preparatory phase revealed that English teaching at that
junior high school was still teacher-dominant and test-oriented, though
teachers and students agreed on the importance of the communicative
function of English. The ELP was thus used as a means of assessment in
order to change the test-oriented teaching and learning and to promote
learner autonomy. A quasi-experiment was conducted to examine the
effectiveness of portfolio assessment in fostering learner autonomy over a
period of three years. There were two groups of students: the experimental
group (EG), who received the ELP assessment intervention, and the
control group (CG), who received the traditional assessment.
Participants
The number of participants grew over the duration of the study as
every year new students were enrolled in the school. The newly-enrolled
students were assigned to twenty parallel classes according to their
performance on the placement test covering three subjects: Chinese, math
and English. Every year, two out of the twenty classes were randomly
selected to form the EG.
In Year 1, the EG was composed of EG1-1 (Year 1, Class 1) and
EG1-2 (Year 1, Class 2), with a total of 109 students. Two English
teachers participated in the project on a voluntary basis. In the following
two years, another two classes were recruited into the EG. Similarly, the
CG also grew in number in the three consecutive years, Year 1, Year 2 and
Year 3. Table 7-1 displays the numbers of study participants over the three
years of data collection.
This study excluded data collected from the learners enrolled in Year 3
because of the drastic changes in the local educational policy and the
forthcoming entrance examinations, but the questionnaires were still
administered.
The participants for the case study were selected by stratified case
sampling (Duff, 2007). A total of 6 participants were selected according to
their language proficiency, gender, and learning goals: Alice from EG1-1,
Bob and Elina from EG1-2, Cathy and Jack from EG2-1 (Year 2, Class 1),
David from EG2-2 (Year 2, Class 2). Pseudonyms have been used in the
reporting of the data to protect the privacy and identity of the students. The
demographic information for the case study participants is displayed in
Table 7-2.
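To make the selection procedure concrete, the following is a minimal Python sketch of stratified case selection of this kind, assuming a hypothetical roster of EG students with proficiency, gender and learning-goal columns; the file name and column names are illustrative, not taken from the study.

import pandas as pd

# Hypothetical roster of EG students; columns are illustrative.
students = pd.read_csv("eg_students.csv")  # name, class, proficiency, gender, goal

# Draw one case per proficiency-by-gender stratum and keep the first six,
# so that the selected cases vary on the stratifying characteristics.
cases = (students
         .groupby(["proficiency", "gender"], group_keys=False)
         .apply(lambda stratum: stratum.sample(1, random_state=7))
         .head(6))
print(cases[["name", "class", "proficiency", "gender", "goal"]])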
ELP Treatment
The use of the portfolio assessment in the school underwent four
phases: preparatory phase, phase 1, phase 2 and phase 3 (see Table 7-3). In
phase 1 and phase 2, the ELP was a must for the EG students; in phase 3,
the ELP was not required but voluntary, in order to mirror the progression
from reactive autonomy to proactive autonomy.
The ELP template was adapted from the European Language Portfolio,
and it was composed of the learner profile, learning records and learning
materials, corresponding to biography, passport and dossier of the
European Language Portfolio respectively.
The initial portfolio template was proposed by the research team and
negotiated with the school teachers for the details. The learner profile
included the age of starting English learning, English learning goal(s),
learning style (O’Brien, 1990, cited in Joy, 2012), and the most
the students’ feedback, the research team and the EG teachers simplified
the portfolio template.
In phase 2, EG1 and EG2 students were required to keep the simplified
ELPs, which were revised according to the suggestions from students and
teachers. This time, EG2 students learned to develop the portfolio with the help
of not only their teachers but also their peers. EG1 students presented their
learning outcomes and shared their experiences in the orientation week. Small
adjustments were made based on the on-going needs analysis in the process.
In phase 3, EG1, EG2 and EG3 students were not required but
encouraged to develop ELPs. The questionnaire and interviews were
conducted to see whether there was a difference between EG and CG.
Instruments
Data were gathered through the ELP, the questionnaire and the case
studies, and the data from these multiple sources helped achieve the effect
of triangulation (Duff, 2007; Nunan & Bailey, 2009).
A 5-point Likert scale on learner autonomy was adapted from Oxford
and Burry-Stock’s (1995) Strategy Inventory for Language Learning
(SILL). This instrument was used for two main reasons: first, the SILL has
been the most widely accepted questionnaire for evaluating LA; second, it
incorporates the metacognitive, affective and social dimensions of language
learning, which is consistent with the three-dimensional construct of learner
autonomy (i.e., metacognition, willingness and communication) used in this
study. The questionnaire items focused on measurable behaviors, including
items portraying metacognition such as ‘I have got my own way of learning
English’, ‘I adjust my learning methods sometimes’, ‘I have a clear
learning goal’, ‘I make learning plans according to my learning goal’, and
‘I often reflect on my English learning’. Examples of items depicting
willingness were the following: ‘I am interested in English’, ‘I have
confidence in English learning’, and ‘I feel happy when learning English’.
Example items showing communication were: ‘I exchange my views of
English learning with peers’, ‘I talk with others in and out of class’, ‘I take
the chance to speak English’, and ‘I also use nonverbal language (gestures,
facial expressions, etc.) to express myself when I communicate with
others’. The questionnaire was designed, translated and administered in
Chinese to avoid misunderstandings and misinterpretations.
All members of the research team were involved in the validation of
the questionnaire items, initially putting forward 43 items. The
questionnaire was then piloted with 395 learners; after a reliability check
using Cronbach's alpha coefficient, its construct validity was examined via
factor analysis. The results showed that the KMO value was 0.920 and that
Bartlett's test of sphericity was significant (p < .001). Principal component
analysis revealed nine rather scattered factors. As a result, items with
extraction communalities lower than 0.5 were eliminated from the scale.
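As an illustration of this validation pipeline, here is a minimal Python sketch, assuming the pilot responses sit in a CSV file with one row per learner and one Likert-scale column per item; the file name, library choices and thresholds are illustrative, since the chapter does not specify the software used.

import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity,
    calculate_kmo,
)

def cronbach_alpha(items: pd.DataFrame) -> float:
    # Internal-consistency reliability: k/(k-1) * (1 - sum of item
    # variances over the variance of the summed scale).
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

responses = pd.read_csv("pilot_responses.csv")  # 395 learners x 43 items

print("Cronbach's alpha:", round(cronbach_alpha(responses), 3))

chi2, p = calculate_bartlett_sphericity(responses)  # sphericity check
_, kmo_total = calculate_kmo(responses)             # sampling adequacy
print(f"KMO = {kmo_total:.3f}, Bartlett p = {p:.3f}")

# Principal-component extraction of nine factors; items whose extraction
# communalities fall below 0.5 are dropped, as in the study.
fa = FactorAnalyzer(n_factors=9, method="principal", rotation=None)
fa.fit(responses)
communalities = pd.Series(fa.get_communalities(), index=responses.columns)
retained = communalities[communalities >= 0.5].index.tolist()
print(len(retained), "items retained")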
Another factor analysis was conducted to verify the reliability and
construct validity of the revised 29-item scale. Statistical analysis revealed
autonomous than CG learners, to which the use of the ELP might have
contributed, especially in students managing and planning their own
learning.
Phase 1
EG learners’ significant progress in their metacognitive abilities was
well supported by their weekly plans and reflections. Alice’s ELP won
favorable comments from her teacher Allan and her peers, and was
selected as one of the best 5 at the end of the first term. In the learning
journals, Alice viewed the portfolio as an aid to time-management:
“We will be clear about what to do instead of wasting time after making
plans. When we finish homework every evening, we will try to reach our
goal. And on weekends, we will spend time in learning English according
to plans and often check if the plan has been fulfilled. Otherwise, we might
waste time in watching TV and playing with computers.” (A-J2-11)
According to the journal entry, the portfolio helped Alice manage her
time and learning. Learning plans are an essential part of the portfolio:
they can guide students on what to learn and give them insight into when
best to learn outside the classroom. The research team
and teachers gave students a list of recommended learning materials. Then,
students had the freedom to select the materials according to their tastes
and interests. As the school did not give students much extra time during
the school day, students had to use their free time to do the assigned
homework for the various subjects. Therefore, Alice’s plan of learning
English mostly on weekends was quite practical. Table 7-6 shows an
excerpt of Alice’s monthly learning plans. In the interview (A-I1), she
reported that she had made some plans in primary school but could not
stick to them. Thanks to the portfolio, she could persist in doing
out-of-class learning step by step together with her classmates under the
guidance of the teacher. She also gained a sense of achievement when her
portfolio was selected as one of the best. Her experience proved that the
portfolio could help learners to plan, manage and monitor their English
learning.
Elina, another good ELP keeper, viewed the portfolio as her learning
outcome and a stock of learning materials:
“I can carry out my plans and make some adjustments. For the out-of-class
learning, I’d like to separate my own learning materials from the teacher’s
so that I can clearly know what I have gained after class. More importantly,
I have begun to review all the materials in ELP. After all, we develop ELP
for our own sake…” (E-J2-11)
“This winter holidays did not last long but I felt happy and substantial for I
benefited a lot from autonomous learning. I read two parts of Gulliver’s
Travels and wrote two book reports. Besides, I wrote two poems and
learned to sing four English songs. I translated an article and made two ppt.
Also, I finished learning 7A, New Concept English and Li Yang English
various subjects, they had little time for extra learning. Therefore, the
teachers and the research team decided to streamline the ELP and use it
mainly for learning during weekends and holidays. In the second term of
Year 1, the ELP was simplified. Based on the ongoing needs analysis, in
Year 2, the ELP was further revised and mainly focused on planning and
reflecting in addition to self-assessment and peer-assessment in class.
Phase 2
For Cathy, David and Jack, the ELP template underwent some changes
in format. Their weekly plans were moved out of the ELP folder and they
were put together with the reflections in the learning journals so that the
students could clearly see whether or not they had accomplished their
plans.
As the research team and teachers were more experienced in the
second year, activities in the orientation week were used to increase the
students’ awareness of planning and doing out-of-class learning. After
watching the film Akeelah and the Bee and interacting with the EG1
students, Cathy immediately took action. She began to keep learning
journals in English in the hope of improving her English in addition to
reflecting. She wrote in the journal:
“I keep writing learning journals in English every week and Carol [her
teacher] corrects my mistakes. In this way, I benefit a lot, not only to learn
English but also in writing.” (C-J2-16)
Therefore, the weekly journals not only helped Cathy learn how to
learn but also helped her improve her English writing. The journal also
served as a bond in the teacher-student communication.
Similarly, David summarized his first year of learning as follows:
“I can better manage my study with the help of the teacher and ELP after I
entered this junior school about a year ago. The weekly plan in my
learning journals and the monthly plan can help me well arrange my study.
I often cooperate with my classmates in class, and practice dialogues and
prepare for presentations. I like this learning mode.” (D-J2-16)
His journal also showed how the ELP helped raise learners' metacognitive
awareness in planning and managing their learning. As a matter of fact,
students not only learned to plan their learning but also made use
of learning skills and strategies and tried to find out the most suitable way
of learning for themselves. As Cathy put it:
“This week I learned how to learn English and how to present what I learn.
Thank you, Carol. You taught me how to take notes and use notes. The
learning strategies we learn in every unit were really helpful.” (J-J1-2)
To Jack, the ELP served as the bond between him and Carol. When
encountering some problems in English learning, Jack would ask
questions in the learning journal. For instance, he wrote:
“Carol, I have two questions. First, will we learn phonetic symbols in the
junior high school? I did not learn in the primary school. Second, what can
I do if I meet with some new words in a test?” (J-J1-3)
“Don’t worry. We will learn phonetic symbols in the following week. Then
you can learn to pronounce new words and correct wrong pronunciations.
As for the new words you meet with in reading, I have two suggestions.
Firstly, it is common to run into new words. You can skip them if they do
not keep you from understanding the content or guess the meaning in the
context. Secondly, …”
Conclusion
This longitudinal study showed that ELP assessment employed as an
obligatory element in English learning in the first two phases enabled
learners to move closer to reactive autonomy and gradually progress to
proactive autonomy with teacher guidance and peer feedback.
The questionnaire results revealed significant differences between the
EG and CG students in LA, particularly in metacognition, while the case
studies validated the positive relationship between the ELP and LA among
EG learners in a high-stakes context. Though the study adopted a
top-down approach, ongoing needs analysis proved important. In the study,
the ELP template was negotiated and adjusted more than once to cater for
learner needs and to consider contextual constraints in the high-stakes
learning context.
The positive feedback concerning the simplified ELP template also
shows that there is no one-size-fits-all or best portfolio template. It is
always important for researchers, teachers and learners to figure out the
most suitable one for their learners in a certain context. A clear purpose
and focus in the portfolio is equally important, as it can help learners, who
may otherwise find it too difficult to achieve their goals and fulfill their
plans, develop their interest and confidence in the learning process. This
way, ineffective learners like Jack can also develop their interest in
English learning. In addition, Jack’s case also showed that portfolio use
should be adapted to the individual learner. Teachers are expected to
provide sufficient scaffolding and assistance in goal-setting and
References
Banfi, C. S. (2003). Portfolios: Integrating advanced language, academic,
and professional skills. ELT Journal, 57(1), 34-42.
Benson, P. (2005). Teaching and researching autonomy in language
learning. Beijing: Foreign Language Teaching and Research Press.
Bown, J. (2009). Self-regulatory strategies and agency in self-instructed
language learning: A situated view. The Modern Language Journal,
93(4), 570-583.
Confessore, G. J., & Park, E. (2004). Factor validation of the Learning
Autonomy Profile (Version 3.0) and extraction of the short form.
International Journal of Self-directed Learning, 38(1), 39-58.
Cotterall, S. (1995). Readiness for autonomy: Investigating learner beliefs.
System, 23(2), 195-205.
Cummins, P. W., & Davesne, C. (2009). Using electronic portfolios for
second language assessment. The Modern Language Journal, 93(1),
848-867.
Dai, W., & Hu, W. (2009). Foreign language education in China:
Development and research (1949-2009). Shanghai: Shanghai Foreign
Education Press.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge:
Cambridge University Press.
Duff, P. (2007). Case study research in applied linguistics. Lawrence
Erlbaum Associates Inc.
Gong, Y., & Luo, S. (2002). The alternatives in English assessment.
Beijing: People’s Education Press.
González, J. Á. (2009). Promoting student autonomy through the use of
the European Language Portfolio. ELT Journal, 63(4), 373-382.
Holec, H. (1981). Autonomy and foreign language learning. Oxford:
Pergamon Press.
Joy, M. (2012). Learning styles in the ESL/EFL classroom. Beijing:
Foreign Languages Teaching and Research Press.
CAROL J. EVERHARD
Abstract
In investigations of the theory and practice of foreign language
learning and teaching, attention has often been focused on the evaluation
of language competencies, on the one hand, and on the nature of autonomy
in language education, on the other. Although learner-centred assessment
is recognized as an essential element in promoting learner autonomy, the
matter of the relationship and interconnection between teaching, learning,
assessment and autonomy remains largely unexplored. With reference to
empirical research conducted in a higher education context in Greece, the
author takes a closer look at the nature of the relationship between
teaching, learning, assessment and autonomy. The study explores how
practice in peer- and self-assessment of oral and writing tasks can both
help develop language competencies and allow learners to exercise more
autonomy in the learning process. By sharing assessment power with
learners and encouraging them to take part in assessment processes using
pre-determined sets of criteria and offering feedback to peers, an attempt
was made to move from summative methods of assessment, which align
with transmissional modes of teaching and assessment of learning, to more
formative methods of assessment, which align with transactional modes of
teaching and assessment for learning. Based on the study findings, it is
argued that the activation of criterial thinking, metacognitive processes
and awareness facilitates progression to more transformative approaches
to teaching and to sustainable assessment or assessment as learning, which
leads to more liberatory and lifelong learning.
Introduction
Although a great deal of interest has been generated in autonomy in
language learning and its promotion, at the same time it has become
apparent to teachers and researchers that assessment practices have not
really changed in ways that might help promote autonomy and, indeed,
may even come into conflict with the learner-centred practices which the
fostering of autonomy demands. The research project which is reported in
this chapter explored the use of assessment, not as a means of policing,
ranking or providing certification, but rather as a means to promote a
greater sense of ‘being’ as a learner and to create a greater sense of ‘self’
(Everhard, 2012). Conventional assessment passes judgment on learners
and categorizes their learning efforts by means of grades. This type of
summative assessment highlights overall weakness, but does not help
learners identify exactly where their weak points lie and how these
weaknesses can be overcome and converted to strengths. Hence the need
for more formative approaches in which assessment aims to promote
learning through increasing learner awareness.
Murphy (2015) has pointed out that although the rhetoric of an
institution may indicate the desire to encourage and promote autonomy in
learners, particularly in the distance education setting she describes, in
practice, the type of teaching, guidance, materials and support on offer
may actually contradict these aims. Nguyen and Walker (2016) indicate
that inappropriate or poorly thought-through assessment practices can
detract the most from the good teaching on offer, since the benefits of
good teaching are not captured in the assessment approaches used (Boud,
1995, cited in Nguyen & Walker, 2016).
Raappana (1997) notes the dangers of such contradiction in a Finnish high
school setting and believes that opportunities for metacognition and self-
assessment among learners are key to promoting autonomy and
encouraging teachers to avoid teaching to the test. Black and Jones (2006),
in a British primary education language-learning context, also highlight
the importance of metacognition and how the assimilation of criteria,
through peer-assessment, helps learners progress to assessing their own
work with greater ease and understanding. Heron (1981, p. 66) highlights
the fact that “people are not used to criterial thinking” and so what is
important is “consciousness raising about criteria and criterial thinking”,
to enable self- and peer-assessment tasks and skills to become part of the
language learner’s repertoire.
What will be described in this chapter are details of an approach which
was taken in a Greek higher education (HE) setting towards assessment.
The participants in the research study were 1st year English majors and the
aim was to encourage and promote the autonomy of the language learners
involved by sharing assessment power. The study used particular
pedagogical and assessment tools, combined with advising sessions, to
increase learner awareness. It explored the possibility of greater
empowerment and autonomy of language learners through implementing
learner-focused assessment.
The Study
The setting for the research project, entitled the Assessment for
Autonomy Research Project (AARP), was undertaken in the context of a
first-year first-semester Language Mastery I course at the School of
English (SOE), at the Aristotle University of Thessaloniki (AUTh). The
aims of the research were to establish if learners, when given the
opportunity to peer-assess, could do so with objectivity using the criteria
given and whether once having had practice in peer-assessment, learners
could use the same criteria to conduct self-assessment with the same
objectivity. It was hoped that by assuming such responsibility and being
given the opportunity for critical and criterial thinking, learners could use
the assessment process as an opportunity for learning, through offering
and receiving feedback and playing a role in their assessment that had
equal standing with the instructor’s. In addition, by developing greater
self-awareness, the assessment process might lead learners to greater
autonomy and self-direction in their language learning.
As would be expected, the majority of participants were indeed first-
year students who had entered the SOE through competitive state exams.
The number of students participating in the project each year averaged out
at around 50, with about 250 participants in total over the 5-year period.
An anomaly of the entrance examination system was that students had
achieved high grades in order to qualify for entry to the SOE, but they
were not necessarily highly proficient in English. Language Mastery I was
the first of two courses on offer for the development of language skills,
with this first course being concerned with description and narrative, while
Language Mastery II focused on argumentative and persuasive discourse.
Each of these courses consisted of four contact hours per week and, as the
students were English majors, all of their courses from the four departments
of English Literature, American Literature, Linguistics and Translation were
delivered in English.
the end of the assessment cycles to gather student opinions about the
assessment process. Having ensured the consistency and reliability of
these instruments, the same pedagogical and research procedures were
followed in the next three years. Again, each year the author-researcher’s
two groups of Language Mastery I students were involved in the same oral
and writing assessment procedures, thus involving in total six groups in
the three-year period.
In the final year of the AARP, in the Post-Study, claims made in the
literature that training is necessary for accurate peer-assessment were put
to the test by providing training for the two groups involved, in the form
of mock assessment, both for the oral presentation and for the first writing
assignment. It should be noted that in the four years previous to the Post-
Study, learners learned how to peer-assess by doing peer-assessment.
There was no training involved.
In the case of the Post-Study intervention, which involved training for
oral peer-assessment, nine older student volunteers offered live
presentations which were assessed by both groups simultaneously, using
the pre-determined criteria for oral assessment previously provided. The
grading and comments of the learners were saved in duplicate and brought
for discussion and comparison with the instructor’s assessment in the next
lesson. In the case of the writing assessment, as suggested by Dickinson
(1992), frozen data were used and five sample answers, drawn from
students’ assignments in previous years, were randomly selected for
training purposes with the learners. They were asked to assess the first
three samples at home and these were then discussed together in class.
Then they were asked to assess the remaining two samples in class
(without consultation), together with a ‘live’ sample for peer-assessment.
On completion, they recorded their assessments in duplicate and their
assessment of the two remaining samples was discussed at the next class
meeting. In addition, appointments were made with students in small
groups so that their peer-assessment of the mock samples and of the ‘live’
example could be discussed with the instructor, as could any issues arising
from their oral assessment training. The outcome of this intervention on
the part of the instructor, using training in peer-assessment, rather than
practice alone, revealed some interesting results. Also, a new questionnaire
was devised, in English rather than Greek, so that information specific to
learners’ impressions and opinions of the training intervention could be
gathered.
The assessment data, both quantitative and qualitative, were processed
and analysed statistically in a number of ways in order to highlight
findings from a variety of perspectives.
Results
In considering the results of the AARP study, it has to be remembered
that although assessment of both Writing and Speaking Skills involved
triangulated peer-assessment, self-assessment and teacher assessment, the
derivation of peer-assessment grades was significantly different in each
case, as will be explained. What is similar, however, is the fact that
learners learned to assess by doing. As mentioned previously, there was no
attempt to train them to peer-assess until the last year of the 5-year project.
Some of the results derived from the statistical analysis are shown in
Tables 8-1 and 8-2. If we consider the ANOVA results derived from
assessment of the first writing assignment, concerned with paragraph
writing, under the column ‘Writing 1’ in Table 8-1, we see that in the Pre-
Study, AY1, there were no significant differences between the groups,
indicating alignment between peer-, self- and teacher assessment. The
same occurred with the second writing assignment, designated ‘Writing 2’
in Table 8-1, which was concerned with essay writing. Again, there were
no significant differences between peer-, self- and teacher assessment.
With regards to the Main Study (AY2-AY4), there was some consistency
in the assessment of the Writing 1 task, but not the same type of
consistency as in the Pre-Study (AY1). Only one out of the two groups in
each year, in each case the second group, namely Groups D, F and H,
showed alignment in assessment between peer-, self- and teacher
assessment.
In the other groups, namely Groups C, E and G, in Writing 1, there was
divergence in assessment. There was no consistency in how these groups
diverged, as revealed by the Tukey-Kramer test. The Pearson correlation
coefficients, on the other hand, revealed some alignment between self-
assessment and teacher-assessment (S-T) in Group C, with a value of 0.61
and in Group F, with a value of 0.51. In the case of Group C, there was a
peer-assessment and teacher-assessment (P-T) correlation value, similar to
that of the S-T, with a value of 0.62.
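For readers wishing to replicate this kind of triangulated comparison, the following is a minimal Python sketch, assuming per-student grades from self- (A), peer- (B) and teacher- (C) assessment for a single group; the grades below are randomly generated for illustration and are not the AARP data.

import numpy as np
from scipy.stats import f_oneway, pearsonr
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
self_a = rng.normal(8.5, 0.8, 40)   # A: self-assessment grades
peer_b = rng.normal(7.5, 1.2, 40)   # B: peer-assessment grades
teach_c = rng.normal(7.4, 0.7, 40)  # C: teacher-assessment grades

# One-way ANOVA across the three assessor types.
f_stat, p_val = f_oneway(self_a, peer_b, teach_c)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.3f}")

# Tukey-Kramer post-hoc comparisons reveal which assessors diverge,
# yielding patterns such as A>B=C as reported in Table 8-1.
scores = np.concatenate([self_a, peer_b, teach_c])
labels = ["A"] * 40 + ["B"] * 40 + ["C"] * 40
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))

# Pearson correlations between assessor pairs (the P-T and S-T columns).
print("P-T r =", round(pearsonr(peer_b, teach_c)[0], 2))
print("S-T r =", round(pearsonr(self_a, teach_c)[0], 2))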
In Writing 2, in the Main Study (AY2-AY4), the ANOVA results
revealed only one group, Group F, with no significant differences between
peer-, self- and teacher assessment, similar to the results in Writing 1.
Similarly, Groups A and B in the Pre-Study (AY1) showed no significant
differences for either Writing 1 or Writing 2. For the remaining groups,
the Tukey-Kramer tests revealed some consistency in assessment
behaviour. For Groups C, D and E, there was non-alignment between self-
assessment and peer-assessment, but there was alignment between peer-
assessment and teacher assessment. Also, in the case of Groups G and H,
Table 8-1: One-way ANOVA with Tukey-Kramer comparisons and Pearson
correlations for Writing 1 and Writing 2, by year and group.

                Writing 1                                  Writing 2
Year & Group    ANOVA (Tukey-Kramer)   Pearson             ANOVA (Tukey-Kramer)   Pearson
AY1  Group A    n.s.                   –                   n.s.                   –
AY1  Group B    n.s.                   –                   n.s.                   –
AY2  Group C    p = .002 (A>C=B)       P-T 0.62, S-T 0.61  p < .001 (A>B=C)       0.48
AY2  Group D    n.s.                   –                   p = .010 (A>B=C)       –
AY3  Group E    p = .006 (A>C only)    –                   p = .002 (A>B=C)       –
AY3  Group F    n.s.                   S-T 0.51            n.s.                   –
AY4  Group G    p = .005 (A>B only)    0.47                p = .048 (A>B only)    –
AY4  Group H    n.s.                   0.66                p = .017 (A>B only)    –
AY5* Group I    p < .001 (A>B>C)       0.65                p = .021 (A>B only)    –
AY5* Group J    n.s.                   0.66                n.s.                   –

Note: n.s. = non-significant; ANOVA: A = Self, B = Peer, C = Teacher;
Pearson: S-T = Self-Teacher, P-T = Peer-Teacher; AY = Academic Year;
AY5* = only participants in the Post-Study Intervention exercises have been
included.
Table 8-2: One-way ANOVA with Tukey-Kramer comparisons and Pearson
correlations for the Oral Presentation, by year and group.

Year & Group    ANOVA (Tukey-Kramer)   P-T    S-T
AY1  Group A    n.s.                   0.75   –
AY1  Group B    n.s.                   0.51   –
AY2  Group C    n.s.                   0.45   –
AY2  Group D    n.s.                   –      –
AY3  Group E    p < .001 (A=B>C)       –      –
AY3  Group F    n.s.                   0.46   0.42
AY4  Group G    n.s.                   0.44   –
AY4  Group H    n.s.                   0.52   0.45
AY5* Group I    n.s.                   0.51   –
AY5* Group J    n.s.                   –      –

Note: n.s. = non-significant; ANOVA: A = Self, B = Peer, C = Teacher;
Pearson: S-T = Self-Teacher, P-T = Peer-Teacher; AY = Academic Year;
AY5* = only participants in the Post-Study Intervention exercises have been
included.
throwing light on the results for these groups shown in Table 8-1. In Table
8-3:
1. ‘Actual’ refers to the means derived from all the participants who
were involved in self-assessment after conducting peer-assessment,
which was the normal procedure on the AARP and it is that on
which the results in Table 8-1 are based.
2. ‘Variation 1’ presents the means derived from learner-assessment
for those students from Groups C and D, who submitted self-
assessment with their assignments, before peer-assessment
processes took place. The S-A means presented in Variation 1 are
therefore those based on their first self-assessment, before
involvement in peer-assessment.
3. ‘Variation 2’ presents the same means as in ‘Actual’ (i.e., self-
assessment conducted after peer-assessment), but with the means
derived only from those students who performed self-assessment
twice, i.e., only those students who took part in Variation 1 (a
sketch of how these three sets of means can be computed follows
this list).
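By way of illustration, here is a small sketch of how the three sets of means could be computed, assuming a hypothetical long-format table of self-assessment grades with a column marking whether each grade was the first (submitted with the assignment) or second (after peer-assessment) self-assessment; all file and column names are illustrative.

import pandas as pd

sa = pd.read_csv("self_assessments.csv")  # student, group, round, grade

# 'Actual': post-peer-assessment (round 2) self-assessment means, all students.
actual = sa[sa["round"] == 2].groupby("group")["grade"].mean()

# Students who self-assessed twice (the Variation 1/2 subset).
twice = sa.groupby("student")["round"].nunique() == 2
subset = sa[sa["student"].isin(twice[twice].index)]

# 'Variation 1': their first self-assessments; 'Variation 2': their second.
var1 = subset[subset["round"] == 1].groupby("group")["grade"].mean()
var2 = subset[subset["round"] == 2].groupby("group")["grade"].mean()

print(pd.DataFrame({"Actual": actual, "Variation 1": var1, "Variation 2": var2}))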
Table 8-3: AARP mean scores for Writing 2, for Groups C and D,
with ANOVA results for self-assessment variations.
Some interesting points emerged from these data. Firstly, from the
teacher-researcher’s point of view, it is gratifying to see that there seems
to be consistency in her assessment of Groups C and D, in that the T-A
means were calculated at 7.42 for Group C and 7.41 for Group D. With
regard to both peer-assessment and self-assessment, Group C tends to be a
little more generous than Group D.
What is very interesting is the exact match between peer-assessment and
teacher-assessment in Group D. It is strange that this same alignment does
not appear in Group D self-assessment, with the discrepancy in S-A means
with T-A means rising from 0.82 for the first self-assessment to 1.42 for
the second self-assessment. There is also an increase in discrepancy from
1.03 to 1.36 in Group C, when comparing the means for S-A and T-A,
between the first self-assessment and the second. This increase in
discrepancy is alarming, when one would actually expect closer alignment
between S-A and T-A after the experience of peer-assessment.
What is most significant is the fact that the highest means in all cases,
whether Actual, Variation 1 or Variation 2 for both Groups C and D were
produced by S-A, with higher S-A values in each case awarded by Group
C, as compared with Group D. Most interesting also is the fact that the
second self-assessment exceeds the first self-assessment, with S-A means
rising from 8.54 to 8.87 in Group C and even more steeply, from 8.11 to
8.71 in Group D. This inflated self-assessment differs from findings in the
Far East where modesty prevails and self-assessment means tend to be
lower than T-A and P-A, both for writing (Matsuno, 2007, 2009) and
speaking (Chen, 2006).
With regard to Standard Deviation (SD), the highest level of SD was
displayed by P-A in all cases, indicating that Peers were awarding a wider
range of grades than both S-A and T-A. One-way ANOVA revealed
significant p values for both Groups C and D in the Actual and in
Variation 2, but in the case of Variation 1, while Group C still produced a
p value of .002, which is considered significant, in the case of Group D,
the p value was 0.147, which is considered non-significant. Tukey-Kramer
Multiple Comparison Tests revealed very similar results in Groups C and
D (Actual) and Groups C and D (Variation 2), with the pattern A>B=C,
indicating alignment between peer-assessment and teacher assessment.
In order to understand better the differences in assessment behaviour
between Variation 1 and Variation 2, a paired sample t-test of the two self-
assessments of the two groups was conducted and the results are shown in
Table 8-4. Both the ANOVA and the t-test revealed significant increases
in S-A values for both Groups C and D, with the paired t-test revealing an
increase of 0.33 for Group C and an increase of 0.60 for Group D between
the first self-assessment conducted and the second. These increases occur
after peer-assessment processes, which we would have expected to have
been a form of training in using the criteria and to have had more of a
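As a minimal sketch of the paired comparison described here, the following assumes two aligned arrays holding each student's first and second self-assessment grades; the values are illustrative only, not the AARP data.

import numpy as np
from scipy.stats import ttest_rel

first_sa = np.array([8.0, 8.5, 7.5, 9.0, 8.0, 8.5])   # before peer-assessment
second_sa = np.array([8.5, 8.5, 8.5, 9.0, 8.5, 9.0])  # after peer-assessment

# Paired sample t-test of the two self-assessments for the same students.
t_stat, p_val = ttest_rel(second_sa, first_sa)
print(f"mean increase = {np.mean(second_sa - first_sa):.2f}, "
      f"t = {t_stat:.2f}, p = {p_val:.3f}")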
Qualitative Data
Qualitative data were gathered from participants at the end of each
AARP semester by means of a post-questionnaire. In the case of Groups C
and D, completed questionnaires were received from only 13 participants
from Group C and 16 students from Group D. Analysis of their responses
to 10 of the questions showed a lack of agreement between the groups
concerning objectivity in peer-assessment, how easy self-assessment was,
and how easy it was to be objective in self-assessment. Table 8-5 shows
some responses by students to these questions, which seem to be
representative of opinions in these groups.
Table 8-6: Modified extract from the AARP model showing the
relationship between assessment and degrees of autonomy (Based on
Everhard 2014, 2015a; Harris & Bell, 1990).
Conclusion
While Tassinari's research project in Berlin (this volume; Tassinari,
2015) focuses on learning competencies, this project in Thessaloniki
focuses on language competencies; however, there are some similarities
between the two studies. Before proceeding to peer- and self-assessment
tasks, participants in the AARP were encouraged to reflect on their
speaking and writing skills by means of the Learner Contract and small-
group counselling meetings with the teacher. As with Tassinari’s dynamic
model, where learners reach awareness of themselves as autonomous
learners, in the case of the AARP they become aware of their strengths
and weaknesses in the productive skills and in their approach to language
References
Benson, P. (2001). Teaching and researching autonomy in language
learning (1st ed.). Harlow, UK: Pearson Education.
Black, P., & Jones, J. (2006). Formative assessment and the learning and
teaching of MFL: Sharing the language learning road map with the
learners. Language Learning Journal, 34(1), 4-9.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003).
Assessment for learning: Putting it into practice. Maidenhead, Berks.
& New York, NY: Open University Press.
Boud, D. (1995). Enhancing learning through self-assessment. London:
RoutledgeFalmer.
Bowen, J. (1988). Student self-assessment. In S. Brown (Ed.), Assessment:
A changing practice (pp. 47-70). Edinburgh, UK: Scottish Academic
Press.
Chen, Y.-M. (2006). Peer and self-assessment for English oral
performance: A study of reliability and learning benefits. English
Teaching & Learning, 30(4), 1-22.
Dam, L. (1995). Learner autonomy, 3: From theory to practice. Dublin:
Authentik.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge,
UK: Cambridge University Press.
—. (1992). Learner autonomy, 2: Learner training for language learning.
Dublin: Authentik.
Dörnyei, Z. (1994). Motivation and motivating in the foreign language
classroom. The Modern Language Journal, 78(3), 273-284.
Everhard, C. J. (2012). Re-placing the jewel in the crown of autonomy: A
revisiting of the ‘self’ or ‘selves’ in self-access. Studies in Self-Access
Learning Journal, 3(4), 377-391. Retrieved from:
http://sisaljournal.files.wordpress.com/2009/12/everhard1.pdf
—. (2014). Exploring a model of autonomy to live, learn and teach by. In
A. Burkert, L. Dam & C. Ludwig (Eds.), The answer is learner
autonomy: Issues in language teaching and learning (pp. 29-43).
Faversham: IATEFL.
—. (2015a). The assessment-autonomy relationship. In C. J. Everhard &
L. Murphy (Eds.), Assessment and autonomy in language learning (pp.
8-34). Basingstoke: Palgrave Macmillan.
—. (2015b). Investigating peer- and self-assessment of oral skills as
stepping-stones to autonomy in EFL higher education. In C. J.
Everhard & L. Murphy (Eds.), Assessment and autonomy in language
learning (pp. 114-142). Basingstoke: Palgrave Macmillan.
Harris, D., & Bell, C. (1990). Evaluating and assessing for learning (2nd
ed.). London & New York: Kogan Page & Nichols Publishing.
Harris, M. (1997). Self-assessment of language learning in formal settings.
English Language Teaching Journal, 51(1), 12-20.
Heron, J. (1981). Assessment revisited. In D. Boud (Ed.), Developing
student autonomy in learning (pp. 55-68). London: Kogan Page.
Hunt, J., Gow, L., & Barnes, P. (1989). Learner self-evaluation and
assessment – a tool to autonomy in the language learning classroom. In
V. Bickley (Ed.), Language teaching and learning styles within and
across cultures (pp. 207-217). Hong Kong: Institute of Language in
Education, Education Department.
Janssen-van Dieten, A.-M. (1989). The development of a test of Dutch as
a second language: The validity of self-assessment by inexperienced
subjects. Language Testing, 6(1), 30-46.
Little, D. (2009). The European language portfolio: Where pedagogy and
assessment meet. Strasbourg: Council of Europe.
Matsuno, S. (2007). Self-, peer-, and teacher-assessment in Japanese
university EFL writing classrooms. Doctoral thesis, Temple
University, Tokyo, Japan.
—. (2009). Self-, peer-, and teacher-assessments in Japanese university
EFL writing classrooms. Language Testing, 26(1), 75-100.
Murphy, L. (2015). Autonomy in assessment: Bridging the gap between
rhetoric and reality in a distance language learning context. In C. J.
Everhard & L. Murphy (Eds.), Assessment and autonomy in language
learning (pp. 143-166). Basingstoke: Palgrave Macmillan.
Nguyen, T. T. H., & Walker, M. (2016). Sustainable assessment for
lifelong learning. Assessment & Evaluation in Higher Education,
41(1), 97-111. doi: 10.1080/02602938.2014.985632
Nunan, D. (1997). Designing and adapting materials to encourage learner
autonomy. In P. Benson & P. Voller (Eds.), Autonomy and
independence in language learning (pp. 192-203). Harlow, Essex:
Longman.
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking
criteria in the use of peer assessment. Assessment and Evaluation in
Higher Education, 21(3), 239-250.
Orsmond, P., Merry, S., & Reiling, K. (1997). A study in self-assessment:
Tutor and students’ perceptions of performance criteria. Assessment &
Evaluation in Higher Education, 22(4), 357-368.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived
marking criteria in peer and self-assessment. Assessment & Evaluation
in Higher Education, 25(1), 23-38.
Abstract
The effectiveness of teachers significantly correlates with students’
educational achievement. Ensuring teachers are well prepared for the
classroom is of primary importance in the educational system of any
country. In the field of English as an additional language (EAL), there are a
number of tools that have been developed to gauge teacher readiness for
taking on the profession. However, the existing tools and resources have a
North-American orientation and they are not a good fit for other contexts
that may require a different cultural sensitivity on the part of the teacher.
This is particularly acute in situations where teachers trained in the West are
employed in non-western contexts as is the case in the United Arab Emirates
(UAE). The purpose of the project described in this chapter was the creation
of a contextually relevant resource for independent learning and self-
assessment to strengthen EAL teachers’ content knowledge, pedagogical
knowledge, and professional dispositions. This chapter describes the context
of the study and the process of developing the resource by first compiling
EAL standards and indicators that are culturally responsive and cater to the
needs of EAL teachers in the UAE and the greater Gulf Region, and then by
producing over 300 assessment items designed to measure specific EAL
indicators. Valuable insights from the project leaders are shared in an effort
to facilitate the process of developing similar resources for other EAL
contexts. It is envisioned that the resource will support teacher learning and
Introduction
Recent studies at the classroom level have found that teacher
effectiveness is a strong determinant of differences in student learning
(Cavalluzzo, Barrow, Henderson et al., 2015; Cowan & Goldhaber, 2015;
Hawk, Coble, & Swanson, 1985; Tchoshanov, 2011). Empirical evidence
has shown that students who have high quality teachers make significant
and lasting learning gains. These findings make the identification and
evaluation of teacher effectiveness a major priority in today’s classrooms.
Despite the importance of highly effective teachers, there is no one scale
or agreed-upon list of criteria to evaluate what makes an effective teacher
in general or an effective second language teacher in particular. This
dearth of research carries over into the Middle East context where studies
about teacher effectiveness are lacking.
An important research strand in teacher effectiveness and one that is
central to the conceptualization and development of the teacher resource
described in this chapter is the knowledge base on what is known as self-
efficacy. Self-efficacy, a belief in one’s capability to execute the actions
necessary to achieve a certain level of performance, is an important
influence on behavior and affect, relating to individuals’ goal setting,
effort expenditure and levels of persistence (Deemer & Minke, 1999;
Bandura, 1993). When applied to teachers, the self-efficacy (or teacher
efficacy) construct has been associated with teachers’ instructional
practices and attitudes toward students (Ashton, Webb, & Doda, 1983;
Bender, Vail, & Scott, 1995; Midgley, Anderman, & Hicks, 1995).
In addition, teacher efficacy has been defined as both context and
subject-matter specific. In terms of context, the effects of efficacy have
been studied with pre-service, novice and in-service teachers at various
levels of education (i.e., elementary, middle and secondary school) and in
various contexts (i.e., urban, suburban and rural). However, no empirical
findings to date have been presented on the efficacy of teachers of English
as an additional language (EAL) in the Gulf or Middle East context. It was
this lack of empirical evidence and adequate resources for measuring
language teacher effectiveness that served as the motivation for the
creation of this context-specific localized teacher resource.
Finally, teacher efficacy has also been associated with subject-matter
knowledge and the belief that teachers must have both declarative and
procedural knowledge to successfully navigate today's classroom.
Hawk, Coble, and Swanson (1985), for example, studied students of
certified and uncertified mathematics teachers. They matched teachers by
level of experience and school, and used
pre- and post-tests of student achievement at the beginning and end of the
academic year in the specific curriculum taught. The study results showed
that students of certified teachers in mathematics performed significantly
better than those who were taught by uncertified teachers in mathematics in
both general mathematics and algebra.
With regard to early career teachers, Tchoshanov (2011) maintains that
there are significant correlations between new teachers’ content knowledge,
knowledge of student learning variables, and the quality of their lessons.
Content knowledge isolated from other teacher knowledge, such as
pedagogical content knowledge, curriculum knowledge, epistemological
knowledge, and knowledge of the learners, may not give a complete picture
of the relationship between teacher content knowledge and student
achievement.
The studies reviewed here show that students of teachers with
professional knowledge perform better than those taught by teachers without
professional pedagogical and content knowledge. Ensuring that English
language teachers, in the Gulf Region and beyond, have the required
professional knowledge is crucial and of the utmost importance to help
increase student achievement.
The Study
The study reported in this chapter is part of a larger project that aimed
to design and implement a self-evaluation instrument for assessing EAL
teacher readiness in the context of the UAE and the greater Gulf Region. It
was envisioned that the creation of the instrument would involve five
stages (see Figure 9-1). In the first stage (Stage 1), an initial review of the
existing tools for assessing EAL teacher readiness would be performed by
the project team.
Figure 9-1: The five stages of creating the instrument.
Stage 1: Review resources on EAL teacher standards.
Stage 2: Identify context-specific standards and indicators.
Stage 3: Develop items to test specific indicators.
Stage 4: Pilot the resource and evaluate its validity and reliability.
Stage 5: Make the resource available to in-service and pre-service EAL
teachers in the UAE.
Review of the Tools for EAL Teachers
One of the major stages in the process of creating our contextually-
relevant and localized teacher development resource was examining what
other resources (i.e., standards documents, tests, etc.) were available for
teacher use. In an initial review of the literature, many sources of
standards for EAL teacher performance were found to exist. We decided
to base the development of our standards and indicators on two of the most
widely used in the public domain: the TESOL Professional Teaching
Standards and the Praxis. The TESOL standards are organized into five
domains:
Domain 1: Language
Domain 2: Culture
Domain 3: Planning, Implementing and Managing Instruction
Domain 4: Assessment
Domain 5: Professionalism
Praxis
The Praxis Core Academic Skills for Educators tests and the Praxis
Subject Assessments feature selected-response and essay questions that
measure the content and pedagogical knowledge necessary for a beginning
teacher (ETS, 2015). For more information about the Praxis test see:
https://www.ets.org/praxis/
Upon reviewing the Praxis standards, it became apparent that they too
had a North-American orientation as seen in the following example of an
indicator and associated questions for Cultural Understanding in Module 2
(Cultural and Professional Aspects of the Job) of Praxis:
Question 63. A middle school ESOL student who has been in the United
States for two years is being discussed in a team meeting. It is noted that
the student is still at the beginning ESOL level, has difficulty focusing on
assignments, has poor recall, and displays several inappropriate behaviors.
The teachers have checked the student’s educational history, which
indicates that the same problems were seen the year before. Which of the
following would be an appropriate next step?
(A) Wait at least six more months because the student has not been in the
United States long enough to be evaluated for special education services.
(B) Send a letter home to the student’s parents urging them to help stop the
inappropriate behaviors from occurring.
(C) Develop a pre-referral intervention plan to improve the student’s
classroom and study skills.
(D) Refer the student to the special education team and ask for testing and
a physiological evaluation.
Explanation:
Your awareness of appropriate channels for evaluating the special
education needs of ESOL students is tested here. Before ESOL students
are referred for special education evaluations, pre-referral interventions
should be attempted. Based on the response to the intervention, the student
may then be referred for a special education evaluation if needed.
Procedure
The project was initiated at the College of Education in one of the
federal higher education institutions in the UAE. The College of Education
had a special EAL teacher education programme which had received
recognition from the TESOL International Association (http://www.tesol.
org) as part of the National Council for Accreditation of Teacher
Education/Council for the Accreditation of Educator Preparation
(NCATE/CAEP) accreditation process. The lead faculty of the EAL
programme headed the project team. Fifteen academics from notable
higher education institutions in the region were selected to participate in
the project. All team members were working in countries within the Gulf
region, namely, the UAE, Oman, and Qatar.
The 15 team members were selected based on a number of factors,
including academic background and qualifications, research publications
in the field, and work experience in EAL teacher training. The project
coordinators also sought representation from the three local federal
institutions and other regional institutions and entities. As far as academic
credentials are concerned, two team members had master's degrees and
the rest of the team (13) had terminal degrees in language education or
general education. Seven of the 15 members were Arabic-English
bilingual academics with many years of teaching experience in K-12 and
tertiary institutions. Three members were language assessment supervisors
at their respective institutions. One of the language assessment supervisors
was serving at the time as the President of TESOL International
Association. Six team members were applied linguists and one of the
applied linguists was serving as the Associate Dean for the largest English
as a foreign language programme in the UAE. One team member was a
psychologist and another one was a special needs expert. The Chair of the
project was the Dean of the College of Education that had undertaken the
project and the principal investigator was a language education academic
with over 25 years of teaching experience, 7 years in K-12 and 18 years in
EAL teacher education.
Once assembled, the team members were divided into five groups and
a leader was appointed for each group. The five groups worked on the
creation of the resource by first reviewing EAL teacher international
standards and associated indicators using a set of shared resources. These
resources comprised core course textbooks and supporting material as well
as teacher eligibility guidelines from different countries. The
indicators for each domain were then contextualized and adapted for the
Gulf Region and the greater Middle East and North Africa (MENA)
Region.
Once agreement among the members of each group had been reached
as to the set of indicators to be used for each domain, the groups
proceeded with the production of contextually relevant multiple-choice
items that covered the standards and indicators for each domain
respectively. Prior to item production, training was held for all item
writers on how to generate objective closed-response multiple-choice
items (MCIs) using internationally accepted guidelines (see Coombe,
Folse, & Hubley, 2007; Rodriguez, 1997; Statman, 1988). Moreover, item
writers were to link their questions to Bloom’s Taxonomy (Anderson &
Krathwohl, 2001; Bloom, Englehart, Furst, Hill, & Krathwohl, 1956).
Writing balanced MCIs for higher-order cognitive skills proved to be more
challenging than writing items for basic or factual knowledge.
In addition to each MCI, the members of each team also had to provide
an explanation as to why a particular answer was the right answer for each
MCI and/or why the distractors were the wrong answer options. This
feature of the assessment instrument was included to ensure that the EAL
teachers who would use it would be able to learn from it and not simply
find out which of their answers were wrong and which were right by using
a simple answer key. Once the MCIs and their explanations were written,
each group reviewed the items they created for their assigned domain
before sending them for internal review.
For the internal review process, each group was assigned the MCIs
from another group to review, make recommendations for improvements,
and provide feedback. Once the internal review of the items was
completed, each group revised their own MCIs and then the items were
sent to a group of external reviewers. The external reviewers chosen for
the project comprised professionals in the field of language assessment
who were responsible for compiling and administering large-scale
assessment instruments. Their task was to review the indicators and the
MCIs per indicator using a set of specific criteria (see Table 9-2).
Table 9-2: Criteria for the external review of the indicators and the
MCIs.
The external reviewers reviewed each and every item from each
group. A total of 450 MCIs were reviewed. The review revealed that
writing MCIs with plausible distractors is a challenging task. About 33%
of the total items were discarded after the review process was complete.
The results of the external reviews were then collated and each of the five
groups had to revise their MCIs accordingly.
Item 1
Paul’s family moved to Oman. The family adopted several of the values
and traditions of the Omanis like celebrating Eid, but held on to some
characteristics of their own culture such as traditional American dishes.
This is an example of cultural __________.
(A) segregation
(B) dehumanization
(C) integration
(D) discrimination
Explanation:
Cultural integration occurs when one cultural group preserves some
distinctive aspects of its own culture, while adopting many of the values,
attitudes, and traditions of the dominant culture.
Therefore, the correct answer is (C).
Item 2
Who are polychronic time-oriented people?
Explanation:
Because Western cultures are mostly monochronic and Middle Eastern
cultures are polychronic, it is important for Arab students to know the
difference. Polychronic time-oriented people are people who adjust their
time to suit their needs and may have to do many things simultaneously.
Therefore, the correct answer is (A).
Item 3
Fatima’s family moved from the United Arab Emirates to the United
Kingdom. While she appreciates many aspects of the new culture, she
prefers to wear Emirati traditional clothing and eat Emirati food, and she
makes a special effort to continue learning about Emirati heritage. Fatima
exhibits a high level of __________.
(A) assimilation
(B) cultural identity
(C) linguistic diversity
(D) bigotry
Explanation:
Cultural identity is part of people’s self-perception, as they prefer to follow
the traditions, nationality, language, ethnicity and social class of their
distinct culture. Assimilation is when people adapt to the prevailing culture
of the majority. Linguistic diversity is having a variety of languages, and
bigotry is an act of prejudice and racism. Therefore, the correct answer is
(B).
Item 4
A student ended an academic note to his teacher with this:
‘Wish peace be with you.’ This is an example of:
(A) Code-switching
(B) L1 interference
(C) Avoidance
(D) Displacement
Explanation:
Code-switching is when a student who speaks English and Arabic uses
words from both languages in the same sentence. Avoidance is a
communication strategy used by a learner when he avoids talking about a
topic because he does not have the necessary language resources to talk
about it. Displacement is a linguistic term indicating the capability of
language to communicate about things that are not immediately present.
The student example in this question is a direct translation from his first
language (Arabic), also known as L1 interference. Therefore, the correct
answer is (B).
Item 5
Children in the UAE attend bilingual schools where all subjects are taught
in both English and Arabic. At the end of school these children will be:
Explanation:
According to the definitions of Bilingualism, a natural bilingual is
someone who has not undergone any specific training in a second
language; a coordinate bilingual is someone whose two languages are
learned in distinctly separate contexts; a minimal bilingual is someone with
only a few words and phrases in a second language; and a compound
bilingual is someone whose two languages are learned at the same time,
often in the same context such as the New School Model bilingual
programme in the UAE. Therefore, the correct answer is (D).
Item 6
Children from an Arab background will typically be diglossic as they use a
dialectal variety of some sort at home, but will be taught in Standard
Arabic and English at school. The pattern of errors in English that these
children make will be more influenced by:
Explanation:
An Arab child becomes a relatively proficient user of Standard Arabic
after 5 to 6 years of formal education, that is, by the age of 12-13. The
dialectal variety, as a real mother tongue, is mastered at a much earlier
age. So the younger the child, the more likely it is that his/her English
errors will be influenced by his/her dialectal Arabic. Conversely, the older
the child the more likely that his/her Standard Arabic is well established
and affects his/her English performance. Therefore, the correct answer is
(B).
Item 7
When a language learner says: “Inshallah, I will see you next week”, it is
an example of:
(A) L1 interference
(B) Fossilization
(C) Code switching
(D) Pidginization
Explanation:
This question tests your understanding of common theoretical terms in the
field of language acquisition. When a student uses a direct translation from
his first language (Arabic) into English, it is called L1 interference.
Fossilization refers to the loss of progress in the acquisition of a second
language despite continuous exposure to the second language.
Pidginization is when a group of people, who do not have a common
language, use a simplified language for communication. When words from
L1 and L2 end up in the same sentence, it is referred to as code-switching.
In this case the student used “Inshallah”, which is an Arabic word, in the
same sentence with English words. Therefore, the correct answer is (C).
Item 8
The magazine TESOL Arabia Perspectives is most likely to contain which
of the following?
(A) Articles about the teaching and learning of English with a focus on the
United States and abroad.
(B) Articles about the teaching and learning of English with a focus on the
Middle East and North Africa.
(C) Articles about the teaching and learning of Arabic with a focus on the
Middle East and North Africa.
(D) Helpful tips for teachers of Arabic and English in the Middle East.
Explanation:
TESOL Arabia Perspectives is the quarterly publication of the TESOL
Arabia association (tesolarabia.org) and discusses the teaching and
learning of English with a focus on the Middle East and North Africa.
Therefore, the correct response is (B).
Item 9
In order to study at federal tertiary institutions in the United Arab
Emirates, prospective Emirati students must take the _____.
Explanation:
The Common Educational Proficiency Assessment-English (CEPA-
English), administered through the National Admissions and Placement
Office, is a requirement for admission to federal tertiary institutions in the
UAE. Therefore, the correct response is (A).
Item 10
A teacher is designing a new grammar assessment for her English class.
She wants to design an assessment that reflects a situation the student is
likely to encounter in the “real world”. Therefore, the selected theme and
context is visiting Yas Mall. The teacher is considering the _____ of the
assessment.
(A) transparency
(B) practicality
(C) washback
(D) authenticity
Explanation:
Authenticity in assessment involves designing “real-life” tasks in which
students use and apply their knowledge and skills. Yas Mall is a major
shopping center in Abu Dhabi, and people from all over the region travel
to shop there. Therefore, the correct response is (D).
Conclusion
The project outlined in this chapter will contribute to the body of
knowledge of EAL teacher education by developing a contextually
relevant independent study and self-assessment resource for EAL pre-
service and in-service teachers in educational institutions in the Gulf
Region. The independent learning and self-assessment resource will
support teacher learning and assessment of critical knowledge and
understanding that go beyond basic factual knowledge. The resource will
provide opportunities for teachers to become not only direct beneficiaries,
but also stakeholders in determining the nature of support required in a
format and time that is most beneficial to them, through self-assessment
and self-regulation of their learning.
Teacher education instructors may use the resource as part of the
formative assessment of the EAL teacher candidates’ learning in the core
curriculum courses. The use of the instrument may generate quantitative
data that teacher education units may use to improve the curricula. The
data may show the strengths and the challenges of teacher education
programmes.
One of the long-term goals for this project is to turn the resource from
a hard-copy format into an app that teachers can use on mobile devices.
In addition, the resource will be designed to provide differentiated
learning opportunities along with synchronous (real-time) reporting that
can help test-takers identify areas of strength and areas to improve, and
plan strategies to support their learning.
The resource will be designed to reinforce and evaluate teachers’ ability
to solve problems and analyze teaching-learning situations, understand
relationships that contribute to effective teaching and successful learning,
and to predict and interpret test-takers’ progress and achievement.
Therefore, by using the resource, EAL teachers’ professional content
knowledge and pedagogical capabilities may improve, thus increasing their
effectiveness in the classroom, increasing their employability in schools,
and meeting the competitive labor market requirements as well as the
licensure standards adopted by educational authorities.
References
Anderson, L. W., & Krathwohl, D. (2001). A taxonomy for learning,
teaching, and assessing: A revision of Bloom's taxonomy of
educational objectives. New York: Longman.
Ashton, P., Webb, R. B., & Doda, N. (1983). A study of teachers’ sense of
self-efficacy (Final Report, National Institute of Education Contract No.
400-79-0075). Gainesville, FL: University of Florida (ERIC document
number ED 231 834).
Bandura, A. (1993). Perceived self-efficacy in cognitive development and
functioning. Educational Psychologist, 28(2), 117-148.
—. (1991). Social cognitive theory of self-regulation. Organizational
Behavior and Human Decision Processes, 50(2), 248-287.
—. (1983). Self-efficacy determinants of anticipated fears and calamities.
Journal of Personality and Social Psychology, 45(2), 464-469.
Bender, W. N., Vail, C. O., & Scott, K. (1995). Teachers’ attitudes toward
increased mainstreaming: Implementing effective instruction for
students with learning disabilities. Journal of Learning Disabilities,
28(2), 87-94.
Bloom, B., Englehart, M., Furst, E., Hill, W., & Krathwohl, D. (1956).
Taxonomy of educational objectives: The classification of educational
goals. Handbook I: Cognitive domain. New York, Toronto: Longmans,
Green.
Blue, G. (1994). Self-assessment of foreign language skills: Does it
work? CLE Working Paper, 3, 18-35.
Cavalluzzo, L., Barrow, L., Henderson, S. et al. (2015). From large urban
to small rural schools: An empirical study of National Board
Certification and teaching effectiveness. CNA Analysis and Solutions.
Retrieved from: http://www.cna.org/sites/default/files/research/IRM-
2015-U-010313.pdf
Central Intelligence Agency. (n.d.). The World Factbook. Middle East:
United Arab Emirates. Retrieved from:
https://www.cia.gov/library/publications/the-world-
factbook/geos/print/country/countrypdf_ae.pdf
Coombe, C., Folse, K., & Hubley, N. (2007). A practical guide to assessing
English language learners. Ann Arbor, MI: University of Michigan
Press.
Cowan, J., & Goldhaber, D. (2015). National Board Certification:
Evidence from Washington State. CEDR Working Paper 2015-3.
Seattle, WA: University of Washington.
Developing a Tool for Assessing English Language Teacher Readiness 199
Dajani, H., & Pennington, R. (2014, June 4). New licensing system for
teachers in the UAE. The National. Retrieved from:
http://www.thenational.ae/uae/education/new-licensing-system-for-
teachers-in-the-uae
Deemer, S., & Minke, K. (1999). An investigation of the factor structure
of the Teacher Efficacy Scale. The Journal of Educational Research,
93(1), 3-10.
Eggen, P., & Kauchak, D. (2007). Educational psychology: Windows on
classrooms. Pearson Merrill Prentice Hall.
Educational Testing Service. (2015). Praxis. Retrieved from:
www.ets.org/praxis
—. (2011). The Praxis Series: The Official Study Guide: English to
Speakers of Other Languages. ETS.
Gitsaki, C., & Bourini, A. (2012). Innovative approaches to teaching: A
teacher professional development program for grades 6-9. In H. Emery
& F. Gardiner-Hyland (Eds.), Contextualizing EFL for young learners
(pp. 3-24). Dubai, UAE: TESOL Arabia.
Hawk, P., Coble, C. R., & Swanson, M. (1985). Certification: It does
matter. Journal of Teacher Education, 36(3), 13-15.
International Business Publications. (2013). United Arab Emirates
Country Study Guide. Volume 1, Strategic Information and
Developments. Washington D.C.: IBP.
Kane, T., & Staiger, D. (2008). Estimating teacher impacts on student
achievement: An experimental evaluation. NBER Working Paper No.
14607. Retrieved from: http://www.nber.org/papers/w14607
Layman, H. (2011). A contribution to Cummins' Thresholds Theory: The
Madaras Al Ghad Program. Master’s dissertation, the British
University in Dubai, Dubai, UAE.
Midgley, C., Anderman, E., & Hicks, L. (1995). Differences between
elementary and middle school teachers and students: A goal theory
approach. Journal of Early Adolescence, 15, 90-113.
Midraj, J. (In Press). Self-Assessment. In J. Liontas & M. DelliCarpini
(Eds.), The TESOL Encyclopedia of English Language Teaching. New
York: Wiley.
Ministry of Education. (2014). English as an International Language
(EIL): National Unified K–12 Learning Standards Framework. Dubai:
Ministry of Education.
Pasternak, M., & Bailey, K. M. (2004). Preparing nonnative and native
English-speaking teachers: Issues of professionalism and proficiency.
In L. D. Kamhi-Stein (Ed.), Learning and teaching from experience:
Abstract
The definition of the linguistic aspects and the domain within which to
operate in order to assess foreign language teachers’ proficiency has been
a challenge in language assessment. Foreign language teachers’
proficiency is understood as how and when linguistic knowledge and the
competence for communication lead to effective language use in teaching
contexts. In Brazil, the discussion about the characteristics and the quality
of teachers’ language is justified by the results of several studies that attest
to the low proficiency level, mainly in oral skills, achieved by pre-service
language teacher trainees and in-service teachers in various teaching
contexts. Given the importance of investigating teachers’ language, and
the connections between the domain, the testing instruments and the
criteria on which to base a valid assessment of their language proficiency,
researchers in Brazil have been exploring the implementation of the
EPPLE, a language examination for foreign language teachers. This
chapter reports on some results from these investigations, focusing on
lexical frequency and variety, as well as accuracy and grammatical
complexity. Data were collected by means of instruments designed
especially to assess foreign language teachers’ proficiency–the TECOLI (a
test for listening comprehension in Italian), the TEPOLI (a test for oral
proficiency in English) and the EPPLE examination. The data and
discussion presented in this chapter can support revisions of the construct
of the EPPLE examination.
Introduction
Language proficiency is a requirement for foreign language teachers
who are non-native speakers (NNS) of the language they teach, yet neither
higher nor lower levels of proficiency in teacher talk have been fully
established by means of comprehensive scales that could be used to assess,
for example, foreign language teachers in a large country such as Brazil,
with its variety of schooling contexts. The definition of both the linguistic aspects and the
domain in which to operate to assess this proficiency has been a challenge
in language assessment.
Teachers’ proficiency is understood as teachers’ linguistic performance
on occasions where linguistic knowledge and communicative competence
lead to effective language use in teaching contexts. Our discussion about
the domain, the linguistic aspects and the quality of teachers’ language is
justified by the results from several studies that attest to the low proficiency
level, mainly in oral skills, among students in a number of pre-service and
in-service teacher education courses, as well as in teaching contexts,
especially in ELT in regular schools.
In order to investigate the testing instruments and the criteria on which
to base a valid assessment of foreign language teachers’ proficiency,
researchers from four Brazilian public universities–State University of Sao
Paulo (UNESP), State University of Rio de Janeiro (UERJ), State
University of Maringa (UEM) and University of Brasilia (UnB)–and three
tertiary level institutions, Faculty of Technology (FATEC), Paulista
University (UNIP) and UNISEB/Estacio (a private university centre in the
city of Ribeirao Preto, in the state of Sao Paulo), have been investigating
foreign language teachers’ proficiency through the implementation of the
EPPLE (Exame de Proficiência para Professores de Língua Estrangeira),
a language examination for foreign language (FL) teachers (Consolo,
2008; Consolo, Lanzoni, Alvarenga, Concário, Martins, & Teixeira da
Silva, 2009, 2010). The researchers interested in the development of the
EPPLE examination are also members of the ENAPLE-CCC (Ensino e
Aprendizagem de Línguas: Crenças, construtos e competências), a
research group hosted by UNESP and recognized by the CNPq (Conselho
Nacional de Desenvolvimento Científico e Tecnológico), the national
council for research and technology.
This chapter reports on four studies within the EPPLE project, the
present main interest of the ENAPLE-CCC group, focusing on the testing
instruments and the criteria used to assess FL teachers' proficiency.
The Study
Our research contexts comprise Letters courses in public universities in
Brazil, as well as in-service teacher courses, in which the data for the
studies we review were collected. The Letters course (Curso de Letras) is
a four-year, sometimes five-year, university course in Brazil that educates
language teachers in Portuguese and in other languages. Letters students
can usually opt for certification in one language or in two languages,
which they should be able to teach after graduation. All the data and
information discussed in this chapter derive from studies conducted to
investigate the implementation and validation of the EPPLE examination.
This means that, rather than collecting new data for this discussion, we
draw on empirical data generated by our colleagues, available for public
consultation, and on information available for members of our research
group.
The participants, in the studies by Baffi-Bonvino (2010), by Borges-
Almeida (2009a) and by Silva Neto (2014), are graduating students from
Letters courses (English and Portuguese languages), some of whom were
already working as teachers of English as a foreign language (EFL) in
private language schools. In Veloso’s (2012) study, participants were
students graduating from Letters courses (Italian and Portuguese), and
certified teachers of Italian as a FL.
The research instruments were two different tests of FL proficiency,
especially designed for teachers and teachers-to-be, and the oral test of the
EPPLE examination. Each of these instruments is described below.
The TECOLI
The TECOLI, fully described in Veloso (2012), is an abbreviation for
the Teste de Compreensão Oral em Língua Italiana (Listening Comprehension
Test in Italian), a product of Veloso's doctoral investigation based on six
versions of a listening comprehension test designed for and administered
to (future) teachers of Italian as a FL in Brazil and in Italy. The final and
revised version of the TECOLI is based on data from five contexts of
pre-service teacher education courses (Letters courses) with a focus on
Italian as a FL in Brazil.
The TECOLI can be a reference for tasks to test listening skills by
means of language samples and test items that were reviewed by teachers
of Italian as a FL and also underwent a detailed statistical analysis. The
bulk of the detailed results from the investigation conducted by Veloso
(2012), thus, provides grounds for the testing of FL teachers’ listening
comprehension skills and can support the assessment of oral skills in the
EPPLE examination.
The TEPOLI
The TEPOLI, short for Teste de Proficiência Oral em Língua Inglesa
(Consolo, 2004), is a test of oral proficiency in English and consists of an
interview based on a set of pictures, some of which are accompanied by
short texts, taken from EFL course books and magazines. The pictures
work as visual prompts in the first testing task, on which the topics for the
oral interaction between examiner and examinee(s) are based. Examinees
can take the TEPOLI individually or in pairs, and when two examinees are
tested, this task is geared towards encouraging the examinees to interact
not only with the examiner but also with each other.
As of 2003, the TEPOLI includes a second test task that consists of a
role-play that aims at assessing the production of oral language that
encompasses the metalanguage EFL teachers are expected to use in
teaching contexts, for example, when they explain or talk about the
English language with their students. This task has been incorporated in
the EPPLE oral test, as described in the following section.
The data on EFL student-teachers’ proficiency in English discussed by
Baffi-Bonvino (2010) and Borges-Almeida (2009a) are largely based on
results of the TEPOLI and the levels of oral proficiency in English
achieved by students graduating from Letters courses in Brazil. All the
student-teachers who participated in Baffi-Bonvino's and in Borges-
Almeida's studies, as well as in the study by Silva Neto (2014), reported
The EPPLE
EPPLE stands for Exame de Proficiência para Professores de Línguas
Estrangeiras; it is a proficiency examination for FL teachers that evaluates
and classifies their linguistic proficiency, henceforth LPFLT (language
proficiency of foreign language teachers), a
type of language proficiency that is both general and specialized. The
examination aims at teachers-to-be, that is, undergraduates about to
conclude an initial teacher education programme at a tertiary level, usually
in a Letters course (in the case of Brazil), and FL teachers already engaged
in the profession and responsible for foreign language lessons to young
children, in primary and secondary education, at university and in private
language schools. The examination may also be taken by teachers engaged
in further education at a postgraduate level.
LPFLT includes the abilities of comprehension and production in the
foreign language targeted by a given version of the EPPLE, in both spoken
and written modes. The EPPLE has so far been designed and piloted only
in English. However, a detailed study of items to test listening
comprehension in Italian has already been carried out and presented by
Veloso (2012), as reported earlier in this chapter, and our proposal
includes plans to produce the EPPLE in other foreign languages taught in
Brazil such as French, German and Spanish.
General language proficiency, seen as part of the LPFLT, is
characterized by the quality of performance in a given language as it is
used by the majority of its speakers in a variety of everyday situations:
holding informal and formal conversations, reading informative texts and
everyday documents, understanding oral language in spoken messages and
short videos, and producing e-mail messages and written texts for social
networking, for example. With regard to the specialized
proficiency of FL teachers, the main part of LPFLT that mostly determines
language proficiency for professional demands, it encompasses the use of
a given FL for educational purposes, for example, to manage classroom
discourse and communication in language teaching contexts (Consolo,
2007). In this sense, teachers' language includes providing information,
giving instructions and pedagogical explanations, evaluating students'
performance, reading academic texts and teaching materials, understanding
audio and video used for pedagogical purposes, and producing materials
and instruments to evaluate students. More
detailed reviews and studies about teachers’ language have been the focus
of studies by members of the EPPLE research team and their supervisees,
such as Andrelino (2014), Colombo (2014), Ducatti (2010) and Fernandes
(2011).
The EPPLE examination comprises two tests: a paper for reading
comprehension and written production, and a test of listening comprehension
and speaking skills. The written test contains comprehension questions
about texts of general interest for language teachers, and items in which
the candidate must deal with writing tasks likely to be carried out by FL
teachers such as writing questions for a reading comprehension exercise
and correcting mistakes in texts produced by language students. Tasks that
require the production of argumentative texts, short messages sent by
electronic mail, or summaries of academic texts can also appear in this paper.
A sample of the written test is available at www.epplebrasil.org.
The speaking test, if given in its face-to-face format, is conducted
with pairs of candidates in the presence of one examiner who manages the
test and another examiner who acts more as an observer and rater.
A fully electronic version of the EPPLE in English was produced and
has been applied to student-teachers since 2011. The ‘electronic EPPLE’ is
a computer-based examination that includes the tasks for the oral test, to
be done in around 25 minutes, and the tasks for the written test, to be done
in the second part. The whole electronic examination lasts around two and
a half hours. The electronic EPPLE includes recorded instructions given to
the candidates at the beginning of the examination, and a task to test the
camera and the microphone connected to the computer before the oral test
starts. Candidates can report faulty equipment, and the examiner(s) and/or
technician(s) in charge of the computer laboratory can help solve technical
problems that may occur.
The answers produced by the candidates are recorded in the computers and
in a data bank to be rated at a later date.
The speaking test, in both the face-to-face and the electronic versions,
has four parts. In the first part the candidates are expected to speak about
themselves: about personal and professional information, previous
experiences as FL students, and professional expectations for the future.
The second part of the oral test is based on a brief video segment, or on
two short video extracts, that firstly must be understood so that a
discussion about the content in the video(s) can be carried out by the
candidates, with each other, and with the examiner conducting the test. In
the computer-based version of the EPPLE, candidates answer questions
about the video, and the screen design for Part 1 of the video tasks is
shown in Figure 10-1.
Figure 10-1: The EPPLE examination, Oral Test, Part 1.
Extract 1
01 AL: uh (+) well MC (+) I (+) I was listening to (+) to you and (+) and your colleague
02 in the class during the (+) the roleplay (+) and I (+) and I noticed that (+) you know
03 (+) you’ve got some problems during your speaking (+) uh (+) and (+) and in your
04 grammar too (+) but (+) I (+) you know (+) one thing that is really (+) worrying (+) is
05 (+) uh (+) the USE of (+) auxiliary words (+) like when you said (+) uh (+) I not worry
06 about the environment (+) what’s missing here? There’s something missing right (+)
07 because as you know in Portuguese uh (+) you know (+) in Portuguese structure (+) in
08 English structure (INCOMP) so (+) you were speaking at the present in the moment (+)
09 you know (+) you want to tell your colleague (+) that you are not WORRIED (+) if you
10 use the verb to be (+) you are not worried and you need (+) you can use an adjective
11 (+) but here {ASC} you use WORRY (+) the verb (+) so you can not just put NOT there
12 and that’s it (+) it’s missing {ASC} something and (+) this something is (+) the
13 auxiliary (+) that you know is (+) the auxiliary (+) do (+) so you can tell I do not worry
14 (+) about environment (+) maybe you wanna emphasize (+) or you can just (+) contract
15 that (+) you know (+) like I don’t worry about (+) environment (+) you know (+)
16 because (+) so try to pay attention to (+) to (+) you know (+) when we make negative
17 using the simple present (+) right (+) you (+) you need to use (+) uh you know (+)
18 either don’t or (+) doesn’t for third person (+) right (+) so (+) that’s it
(TEPOLI, 28 Nov 2005. Source: Baffi-Bonvino, 2007, pp. 269-270)
AL’s performances in the two FCE mock tests and in the TEPOLI
were equivalent, if we compare the marks given to this candidate.
Similarly, equivalence was found for three other participants in the study,
LR, MR and MC (see Table 10-1).
Table 10-1: Marks in the FCE and in the TEPOLI (Source: Baffi-
Bonvino, 2007, pp. 276-277).
The criteria used for each of the three tests were based on qualitative
descriptors. Both tests, FCE and TEPOLI, are based on holistic scales that
consider the final product, and the test-takers’ performance is described
through many specific linguistic and communicative criteria.
Silva Neto (2014) analysed the lexical competence of pre-service
teachers who were graduating from a Letters course in a public university
in the state of Sao Paulo. The data were obtained by means of a trial
version of the EPPLE oral test, in the two formats aforementioned: a face-
to-face test conducted with pairs of students, and a preliminary semi-
electronic version of the oral test administered to the same students in a
computer language laboratory. The lexical characteristics and quality of
English in the speech of test-takers were analysed, such as the relevance
and type of vocabulary used in the target language when interacting, the
suitability of the lexical items to the test tasks, negotiations of meaning
that might have arisen from the difference between the lexical competence
of the candidates, the appropriateness of lexical items to the content of the
speech and the coefficient of frequency of the item according to the
subject matter. The results obtained by comparing the data from the face-
to-face test and its ‘semi-electronic’ version show that the students’
performances do not vary significantly in the two versions of the test.
Silva Neto (2014) claims that his results point to the need for revision of
the descriptors for the vocabulary produced in the EPPLE oral test and the
introduction of an analytical scale to rate speech that considers the
differences between proficiency bands based not only on the frequency
factor of lexical items but also on their appropriateness to the speaking
context.
Based on the results presented by Baffi-Bonvino (2010) and by Silva
Neto (2014), we recommend a combination of holistic scales with
analytical ones for the EPPLE oral test, which would be more adequate to
assess candidates’ oral proficiency. The existing scales assess the oral
production as a combination of the oral skills involved in the tasks and,
although the descriptors within each band focus on separate linguistic
aspects, linguistic features are seen to operate interdependently so as to
contribute to or impede communication. Analytical scales, on the other
hand, would make it possible to assess each of the criteria involved in oral
production in a separate way.
The AS-unit (analysis of speech unit) more clearly presents how to deal
with the disfluent mechanisms of speech. An AS-unit is defined as a single
speaker's utterance consisting of an independent clause, or sub-clausal
unit, together with any subordinate clause(s) associated with either
(Foster, Tonkyn, & Wigglesworth, 2000).
Band    Clauses per Unit          Words per Unit
        Oral Test    Seminar      Oral Test    Seminar
B       1.45         1.46         6.62         9.07
C       1.36         1.35         6.79         8.69
D       1.35         0.93         7.32         6.90
E       1.26         1.30         6.32         7.98
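To make the two indices in the table concrete, the following minimal
sketch (an illustration with invented data and function names, not part of
the EPPLE studies) shows how clauses per AS-unit and words per AS-unit
could be computed once a transcript has been hand-segmented into
AS-units and clauses:

    # Illustrative only: computing the two complexity indices shown above.
    # Each AS-unit is represented as a list of clause strings.

    def complexity_indices(as_units):
        """Return (clauses per AS-unit, words per AS-unit) for one speaker."""
        units = len(as_units)
        clauses = sum(len(unit) for unit in as_units)
        words = sum(len(clause.split()) for unit in as_units for clause in unit)
        return clauses / units, words / units

    # Two AS-units; the first contains a main clause plus a subordinate clause.
    sample = [
        ["I do not worry about the environment", "because I recycle"],
        ["try to pay attention to the auxiliary"],
    ]
    cpu, wpu = complexity_indices(sample)
    print(f"{cpu:.2f} clauses per unit, {wpu:.2f} words per unit")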
Conclusion
The process of designing the EPPLE and its consolidation as a
language examination in the context of educating FL teachers in Brazil is
on the way towards a revised construct for the examination and the
definition of assessment criteria informed and supported by past and
present research studies.
Even though the results so far achieved by our research team are
mainly about English and Italian, we encourage the inclusion of other
languages in future projects and studies that can support further revision of
the constructs of the EPPLE examination and, as a consequence,
contribute to the areas of foreign language teaching and language testing,
as well as foreign language teacher education.
Once the EPPLE is widely used, as pointed out by Consolo and
Teixeira da Silva (2013), it is expected to motivate a revision of the course
contents and aims in pre-service and in-service teacher education,
especially in the Letters courses in Brazil. The standards established by
such an examination may be considered a reference for LPFLT, and for
the quality of language teaching and learning in the Brazilian educational
contexts.
Acknowledgement
This project was supported by FAPESP (Fundação de Amparo à
Pesquisa do Estado de São Paulo, process 2014/10544-0).
References
Andrelino, P. J. (2014). Análise da estrutura genérica das instruções na
fala do professor de Inglês: Contribuições para o teste oral do EPPLE
(An analysis of the generic structure in English teachers’ talk:
Contributions to the EPPLE oral test). Doctoral thesis, UNESP, Sao
Jose do Rio Preto, Brazil.
Bachman, L. F. (1990). Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice.
Oxford: Oxford University Press.
Baffi-Bonvino, M. A. (2010). Avaliação da proficiência oral em Inglês
como língua estrangeira de formandos em Letras: Uma proposta para
validar o descritor ‘vocabulário’ de um teste de professores de língua
Inglesa (The assessment of oral proficiency in English as a foreign
language of graduating students in a Letters course: A proposal to
validate the descriptor for vocabulary in a test for English language
teachers). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.
—. (2007). Avaliação do componente lexical em inglês como língua
estrangeira: Foco na produção oral (Assessment of the lexical
component in English as a foreign language: Focus on oral
production). Master’s dissertation, UNESP, Sao Jose do Rio Preto,
Brazil.
Borges-Almeida, V. (2009a). Precisão e complexidade gramatical na
avaliação de proficiência oral em Inglês do formando em Letras:
Implicações para a validação de um teste (Grammatical precision and
complexity in the assessment of oral proficiency of Letters graduating
students: Implications for a test validity). Doctoral thesis, UNESP, Sao
Jose do Rio Preto, Brazil.
—. (2009b). Pausas preenchidas e domínios prosódicos: Evidências para a
validação do descritor fluência em um teste de proficiência oral em
língua estrangeira (Filled pauses and prosodic domains: Evidence for
the validation of the descriptor for fluency in a foreign language oral
proficiency test). ALFA: Revista de Linguística, 53(1), 167-193.
Borges-Almeida, V., & Consolo, D. A. (2010). Investigating accuracy and
complexity across levels: The search for a valid scale for the Language
Abstract
Culture, education, attitude and language proficiency have been
viewed as the major causes of second language writer plagiarism
(Amsberry, 2009; Erkaya, 2009). However, research data that would
sufficiently substantiate these claims are scarce. The study described in
this chapter investigates the relationship between plagiarism and
vocabulary knowledge in the writing of over 200 students of English as a
second language. It uses both lexical error and vocabulary size assessment
as measures of vocabulary command. The study relies on an instructional
software tool called Grammarly, which identifies both textual borrowing
and language errors, as well as on the Vocabulary Size Test (VST) to
measure students’ vocabulary knowledge. The results indicate that there is
some correlation between the error count and plagiarism, and a strong
negative correlation between vocabulary size and plagiarism rate.
Therefore, the findings seem to suggest that poor vocabulary command
could be a major cause of plagiarism in second language writers. Based on
these findings, the importance of systematic vocabulary teaching and
learning as a strategy to avoid plagiarism emerges.
Introduction
Plagiarism, or using someone else’s words or ideas without
acknowledgment, appears in both native speaker (NS) and non-native
speaker (NNS) university student writing (Jaeger & Brown, 2010). Text
borrowing or appropriation, as plagiarism is sometimes called, has caused
considerable concern in higher education.
Background
There are two types of L2 vocabulary knowledge: receptive and
productive (Nation, 2006). Receptive vocabulary is usually larger than
productive vocabulary and enables learners to comprehend what they read
and listen to. Productive vocabulary, on the other hand, facilitates the
productive skills of speaking and writing. In addition to vocabulary size,
which is expressed in the number of words a learner knows, vocabulary is
also measured in terms of depth (Beglar & Nation, 2007). Depth concerns
everything a learner knows about a word, including ways of spelling and
pronouncing it, the sentence structure it requires, its part of speech, the
functions it can have in connected discourse, the contexts in which it can
possibly occur, other words that may accompany it, the idiomatic
expressions it is known to build and the connotations it can have (Folse,
2004). It is expected that in productive skills, such as speaking and
writing, a larger vocabulary size would have the effect of a greater lexical
range used, while a greater depth of vocabulary knowledge would result in
a more accurate use of vocabulary.
Lexical range is one of the measures of language proficiency. The
underlying vocabulary size has been found to greatly affect reading
comprehension.
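A simple, commonly used proxy for lexical range is the type-token ratio
of a writing sample; the sketch below is our illustration, not a measure
used in the studies reviewed here:

    # Illustrative proxy for lexical range: type-token ratio (TTR),
    # the number of distinct words divided by the total number of words.
    import re

    def type_token_ratio(text):
        tokens = re.findall(r"[a-z']+", text.lower())
        return len(set(tokens)) / len(tokens) if tokens else 0.0

    print(type_token_ratio("the cat sat on the mat and the dog sat down"))
    # 8 distinct words out of 11 tokens -> about 0.73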
Dodigovic (2013) also found that poor paraphrasing skills were closely
associated with plagiarism.
Other aspects of lexis have not been commonly associated with
plagiarism. In particular lexical error, which should be an indicator of
lexical accuracy or depth of lexical knowledge (Folse, 2004; Nation,
2006), has barely been examined in the context of L2 writing. According
to Augustin Llach (2011), despite the fact that lexical errors emerge as the
most numerous in the available studies, research in this area is still scarce.
The lack of accuracy, otherwise known as language error, is significant in
three respects: it informs the teacher about what should be taught; it
informs the researcher about the course of learning; and it is an outcome of
the learner’s target language hypothesis testing (James, 1998).
Vocabulary size is another aspect of vocabulary knowledge that might
be associated with plagiarism. While research focus in this area has
predominantly been on the receptive command, which enables learners to
read and listen with comprehension, not much is known about the
productive vocabulary command, which enables them to speak and write
proficiently, and its relationship with plagiarism. According to Nation
(2006), the size of productive vocabulary required for successful speaking
or writing is much smaller than the receptive vocabulary size required for
successful reading or listening. However, there might be some indications
that in L2 contexts there is little difference between the productive and
receptive vocabulary knowledge (Schmitt, 2001), which suggests that any
measure of receptive vocabulary knowledge could be helpful as a
productive vocabulary knowledge indicator.
Another important parameter in the context of ESL plagiarism might
be the academic vocabulary (Coxhead, 2000) or Academic Word List
(AWL) and the ESL student writer’s ability to use this vocabulary in
writing. Studies by Augustin Llach (2011), Storch and Tapper (2009), and
Deng, Lee, Varaprasad, and Leng (2010) tracked the development of
academic vocabulary in the writing of ESL students over the duration of
an academic English course and found evidence of significant
improvement. However, the impact of this improvement on the amount of
plagiarism has largely remained unexplored. Similarly, Dodigovic, Li,
Chen and Guo (2014) examined a range of academic vocabulary errors
committed by Chinese learners of English. However, they did not conduct
their investigation in the context of textual borrowing or plagiarism.
Therefore, examining lexical insufficiency as a possible cause of
plagiarism emerges as a worthwhile research goal. To this end, the study
reported here focused on Chinese learners of English at an English-
medium university in China and investigated the relationship between the
rate of lexical error and vocabulary command on the one hand and the
amount of plagiarism in the students’ writing on the other.
The Study
The research question that guided this investigation was: To what
extent is plagiarism related to Chinese students’ English vocabulary
command? The participants in this study were 221 Chinese students at an
English Medium Instruction (EMI) University. All of the students were in
their first year, aged between 18 and 20, speakers of Chinese as a first
language and majoring in English. All of the participants had completed
their secondary education in China.
The writing task used for the purpose of this study required expressing
opinions and a critical review of literary sources. Taking into account the
extensive need for quoting and referencing in that particular genre, this
task required the student writers to apply advanced paraphrasing
techniques in order for the writing to maintain its originality. The writing
samples ranged from 800 to 1,200 words and represented a typical
Anglo-American academic genre commonly found at the tertiary
educational level in English-speaking countries (Dodigovic, 2005).
Instruments
Grammarly
Grammarly is a web-based instructional software tool that identifies both
textual borrowing and language errors in student writing; it was used to
analyse the writing samples.
The Vocabulary Size Test (VST)
The Vocabulary Size Test (VST) was used to measure the size of the
participants’ vocabulary (Beglar & Nation, 2007). This test has been
specifically developed to “provide a reliable, accurate, and comprehensive
measure” (Beglar, 2010, p. 103) of NNS English learners’ receptive
vocabulary in its written form, including the 14,000 most frequent word
families in English. This test, available in both electronic and hard-copy
format, was used in its paper-based format and it was corrected manually.
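For readers unfamiliar with the VST, the usual scoring convention (an
assumption based on Beglar and Nation's published versions, not a detail
given in this chapter) is that the 14,000-family form has 140 items, each
sampling 100 word families, so a raw score converts to a size estimate as
follows:

    # Sketch of the usual VST scoring convention (assumed, not from this study):
    # 140 items cover 14,000 word families, i.e., 100 families per item.

    def vst_size_estimate(correct_items, total_items=140, families_covered=14000):
        families_per_item = families_covered // total_items  # 100
        return correct_items * families_per_item

    print(vst_size_estimate(87))  # 87/140 correct -> about 8,700 word families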
Procedure
Student writing was analysed using the Grammarly web-based engine
from the perspective of plagiarised content and lexical error. The
identified errors were entered into a database following a manual
identification accuracy check. The VST was administered in class in
hard-copy format two weeks after the writing samples were collected, and
was filled out manually by the participants. It was also manually marked and
moderated by two independent markers using the answer key. All of the
activities were carried out with adherence to the ethical standards called
for in the Belmont Report, Declaration of Helsinki and Nuremberg Code.
Data Analysis
The Pearson product-moment correlation coefficient (r), commonly
used to reveal a possible linear association between two variables in
larger samples where a normal distribution can be expected (Stoynoff &
Chapelle, 2005), was used to calculate the correlation between the
lexis-related variables and the plagiarism rate. This coefficient is one of
the commonly used measures of effect size, although many who use it may
not be aware that it is an effect size index (Ellis, 2010). In the
discussion of r values below, reference to ‘effect size’ is made; this
serves as a response to recent calls, from researchers in China (e.g.,
Wei, 2012) and abroad (e.g., Ellis, 2010; Larson-Hall, 2012), for paying
more attention to effect size, which according to the Publication Manual
of the American Psychological Association (APA) is …
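For illustration, the correlation described above can be computed as in the
following minimal Python sketch. This is not the authors' actual analysis
script; the variable names and values are hypothetical stand-ins for the
per-essay lexical error rates and plagiarism rates.

    # Minimal sketch (hypothetical data, not the study's dataset):
    # Pearson's r between lexical error rate and plagiarism rate.
    from scipy.stats import pearsonr

    lexical_error_rate = [0.021, 0.034, 0.015, 0.040, 0.028, 0.019]  # per essay
    plagiarism_rate = [0.12, 0.08, 0.15, 0.05, 0.10, 0.14]           # per essay

    r, p = pearsonr(lexical_error_rate, plagiarism_rate)
    print(f"r = {r:.4f}, p = {p:.4f}")  # r itself serves as the effect size index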
Results
Lexical Errors
Based on the Grammarly output, all errors were divided into three
categories: lexical, grammar and punctuation. For the purpose of this
paper, only lexical errors are of interest. Using Grammarly’s
categorisation, lexical errors were divided into four categories: confused
words, spelling mistakes, wordiness and colloquial speech. To arrive at a
more comprehensive picture, the correlation results for two types of
scenarios are presented here: results with the outliers and without them
(see Table 11-1).
Table 11-1: Correlation coefficient (effect size) values for lexical errors.
[Columns: cut-off point; N; confused words; spelling mistakes; wordiness;
colloquial speech; combined.]
According to Table 11-1, the results for the lexical errors with the cut-
off point for plagiarism rate set at 5% revealed that Pearson’s correlation
coefficient values were the lowest for the confused words, spelling
mistakes and wordiness: r = -0.0018 (p < .05), r = -0.0085 (p < .05), and
r = -0.0382 (p < .05) respectively. However, the statistical analysis of
the use …
Vocabulary Size
Vocabulary Size Test (VST) results were obtained for 107 out of the 221
research participants. Hence, the 107 pairs of VST and plagiarism rate
results were correlated, without excluding any data sets. The value of the
correlation between the plagiarism rate and vocabulary size was high
(r = -0.7791). This is a strong negative correlation, representing a ‘large’ effect
and meaning that high vocabulary scores indicate low plagiarism levels
(and vice versa). Based on this outcome, it is safe to assume that the larger
the vocabulary size of second language writers, the less the chance they
will resort to plagiarism when engaging in academic writing.
Discussion
Unlike the reviewed literature (Shi, 2006; Yu, 2013) which purports
that lexical errors might be a cause of plagiarism in higher education,
the results of the current study revealed a weak and not significant
correlation between plagiarism and lexical error rate, suggesting that this
might not be the case. While one might argue that the result might be
statistically significant once the sample size is increased, the afore-
reported findings concerning the strength of correlation (effect size)
remain stable across studies (including those with much larger samples)
because effect size measures, unlike the p values, are unaffected by sample
size (Meline & Wang, 2004).
Since the writing analysed in this study consisted of academic essays in
which the students were using verbatim text from a variety of academic
sources, text which was lexically correct, it is also possible that the
results of the study were distorted by the presence of verbatim
borrowings, which could have significantly reduced the proportion of the
students’ original writing and in turn might have masked the real error
rate.
Nevertheless, based on the results of this study, it might be safe to
assume that lexical error or the absence of lexical accuracy is not a major
cause of plagiarism among Chinese students at EMI universities in China.
Furthermore, it seems that vocabulary depth, as the construct underlying
lexical accuracy, might not be directly related to plagiarism. On the other
hand, vocabulary size, given its strong negative correlation with the
textual borrowing rate, suggests that a large vocabulary might be
negatively related to the level of plagiarism. In other words, NNS writers
with a large vocabulary might be less likely to plagiarise, regardless of
how well they know the words they are familiar with. This is consistent
with Dodigovic (2013), in which the plagiarism rate was reduced by
focusing on paraphrasing skills, a skill that requires both receptive
vocabulary knowledge and a large vocabulary size, both of which are tested
by the VST. Similarly, the present study indicates that, in the case of a
limited vocabulary, a good command of the entire depth of the words known
is unlikely to safeguard against plagiarism in free writing.
Conclusion
The objective of this study was to explore the relationship between
Chinese English learners’ lexical errors and vocabulary size on the one
hand and the amount of borrowed content in their written prose on the
other. The study was carried out with a group of 221 Chinese students
majoring in English at an EMI university located in China. Learners’
lexical errors and unoriginal content were identified using Grammarly’s
enhancement and plagiarism detection engine, while the vocabulary size
was determined using the VST. The data were then statistically analysed
using Pearson’s correlation coefficient.
The results of the study suggest that vocabulary size might be a factor
requiring more attention in the context of fighting plagiarism in the higher
education sector, but the depth of vocabulary, which has a bearing on the
lexical accuracy, might not. The outcome indicates that pedagogical effort
should be invested in the vocabulary growth of the target learner
population. This can be done either by stimulating deliberate vocabulary
learning through the use of vocabulary cards, games and fun activities or
through extensive reading programs which rely on a combination of
graded readers and authentic texts. Vocabulary size testing as well as other
methods of vocabulary assessment should become a more common
practice (Schmitt, 2001), so that through washback they might positively
impact educational practice.
Acknowledgement
This paper is a part of the output from the Jiangsu Higher Education
Learning and Teaching Reform project #2015JSJG253 entitled
Computational methods of lexical transfer detection in the English writing
of Chinese-English bilinguals, funded by the Jiangsu Department of
Education, China.
References
Amsberry, D. (2009). Deconstructing plagiarism: International students
and textual borrowing practices. The Reference Librarian, 51(1), 31-44.
Augustin Llach, M. P. (2011). Lexical errors and accuracy in foreign
language writing. Bristol: Multilingual Matters.
Bacha, N. N., & Bahous, R. (2010). Student and teacher perceptions of
plagiarism in academic writing. Writing and Pedagogy, 2(2), 251-280.
Ballard, B., & Clanchy, J. (1984). Study abroad: A manual for Asian
students. Kuala Lumpur: Longman.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test.
Language Testing, 27(1), 101-118.
Beglar, D., & Nation, P. (2007). A vocabulary size test. The Language
Teacher, 31(7), 9-13.
Biber, D. (2012). Register as a predictor of linguistic variation. Corpus
Linguistics and Linguistic Theory, 8(1), 9-37.
Chuo, T.-W.I., & Wenzao, U. (2007). The effects of WebQuest writing
instruction program on EFL learners’ writing performance, writing
apprehension and perception. TESL-EJ, 11(3), 1-27.
Coxhead, A. (2000). The academic word list. TESOL Quarterly, 34(2),
213-238.
de Jaeger, K., & Brown, C. (2010). The tangled web: Investigating
academics’ views of plagiarism at the University of Cape Town.
Studies in Higher Education, 35(5), 513-528.
Deng, X., Lee, K. C., Varaprasad, C., & Leng, M. L. (2010). Academic
writing development of ESL/EFL graduate students in NUS.
Reflections on English Language Teaching, 9(2), 119-138.
Dodigovic, M. (2005). Artificial intelligence in second language learning:
Raising error awareness. Clevedon: Multilingual Matters.
—. (2013). The role of anti-plagiarism software in learning to paraphrase
effectively. CALL-EJ, 14(2), 23-37.
Dodigovic, M., Li, H., Chen, Y., & Guo, D. (2014). The use of academic
English vocabulary in the writing of Chinese students. English
Teaching in China, 5, 13-20.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power,
meta-analysis, and the interpretation of research results. Cambridge:
Cambridge University Press.
Erkaya, O. R. (2009). Plagiarism by Turkish students: Causes and
solutions. Asian EFL Journal, 11(2), 86-103.
Evans, F. B., & Youmans, M. (2000). ESL writers discuss plagiarism: The
social construction of ideologies. Boston University Journal of
Education, 182(3), 49-65.
Folse, K. (2004). Vocabulary myths. Ann Arbor: University of Michigan
Press.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London:
Longman.
Hyland, K. (2001). Bringing in the reader: Addressee features in academic
articles. Written Communication, 18(4), 549-574.
James, C. (1998). Errors in language learning and use: Exploring error
analysis. London, England: Longman.
Lankamp, R. (2009). ESL student plagiarism: Ignorance of the rules or
authorial identity problem? Journal of Education and Human
Development, 3(1), 1-8.
Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria:
Teachers of English to Speakers of Other Languages.
Ward, J. (2009). EAP reading and lexis for Thai engineering undergraduates.
Journal of English for Academic Purposes, 8(4), 294-301.
Wei, R. (2012). Zaitan waiyu dingliang yanjiu zhong de xiaoying fudu
[Effect size in L2 quantitative research revisited]. Xiandai Waiyu
[Modern Foreign Languages], 35(4), 416-422.
Wessa, P. (2014). Free Statistics Software. Office for Research Development
and Education, version 1.1.23-r7. Retrieved from:
http://www.wessa.net/
Wilkins, D. A. (1972). Linguistics in language teaching. London: Edward
Arnold.
Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing
quality. College Composition and Communication, 32(2), 189-204.
Yang, W. (1989). Cohesive chains and writing quality. Word, 40(1-2),
235-254.
Yu, T. (2013). Relationship between the EAP classroom approach and
plagiarism. Unpublished manuscript, Final Year Project, Xi’an
Jiaotong-Liverpool University, Jiangsu, China.
CHAPTER TWELVE
Abstract
This study has assessed the language learning strategies used by a
group of undergraduate students at a tertiary institute in Fiji to find out if
there are any correlations with their academic writing proficiency. Data for
language learning strategy use were collected through a standard
questionnaire, using Oxford’s (1990) Strategy Inventory for Language
Learning (SILL). In-depth interviews were also conducted to further
explore the students’ language learning strategies (LLS) in early
childhood. An error analysis of students’ written texts was undertaken to
determine proficiency in academic language. The Statistical Package for
the Social Sciences (SPSS) was used for quantitative data analysis. The
results of this study showed that students used language learning strategies
with a medium frequency. Metacognitive strategies were used most
frequently followed by cognitive ones while affective and memory
strategies were used the least frequently. There was no significant
difference in the number and type of errors made in students’ written texts
both before and after writing strategies were taught. In the final analysis,
using Pearson’s correlation coefficient, there was no significant correlation
found between strategy use and the academic language proficiency of the
participants. Both successful and unsuccessful English language learners
used the same strategies with almost the same frequency. This study
concludes that proficiency in the academic writing of Fiji students is not
influenced by their use of language learning strategies.
Introduction
In the 21st century, English has become the dominant global language
and it can be established that today English is used as a medium of
communication by more non-native than native speakers (Crystal, 1997;
Graddol, 1997). Globalization, the current advances in technology and
social media all have fueled a demand for English. In Fiji and the Pacific,
as elsewhere in the world, English has taken on an increasingly important
role and the individual reasons for this vary widely: from personal growth
and enhancement to higher education and better employment opportunities.
This is evident from the increasing number of enrolments in primary,
secondary and tertiary institutes where English has become a mandatory
subject. People with good communication skills and qualifications in
English are sought after in most work places and educational institutes. In
Fiji, as well as in most Pacific Islands, people use English as a lingua
franca (ELF) to communicate amongst themselves. For the majority of the
urban dwellers, English is the language of business, education,
entertainment, politics, and everyday living. However, in the rural areas,
the use of English as the language of daily living is not as high.
In Fiji, students study English as a compulsory subject throughout the
thirteen years of their primary and secondary school life. However,
when they enroll in tertiary institutions, it becomes apparent that
proficiency in their academic writing skills has not developed much over
the years. Educators in Fiji tertiary institutes find that in spite of eight
years of primary and four to five years of secondary school education with
English as the medium of instruction (EMI), students who enroll in the
local universities have weak academic writing skills. Though no
comprehensive research data is currently available on the exact areas of
weaknesses in academic writing of Fiji students, the following errors are
most commonly found in their written texts: tense, subject-verb
agreement, weak sentence structures, mechanics (in particular punctuation
and spelling), usage of articles, vocabulary, connectives, participles, word
forms, word choice, and direct and reported speech. Apart from
weaknesses in grammar and punctuation, students lack appropriate skills
and knowledge of the structure of formal letters, essays and reports. This
study investigates to what extent the use of language learning strategies
can enhance the academic writing proficiency of Fijian undergraduate
students, who may not be aware of such strategies and therefore may not
use appropriate ones to enhance their language learning.
Background
Language Learning Strategies
Researchers in second language learning and acquisition have long
recognized the role of the learner in the learning process and this subject is
the object of enquiry in much research. According to Ehrman and Oxford
(1995), the role of the learner is complex and determined by certain
variables which might correlate with successful language learning. Since
Rubin’s (1975) article on the good language learner, there has been much
interest and discussion on what makes some people successful at learning
a second or foreign language (Ellis, 1994; Grenfell & Harris, 1999;
Naimen, Frolich, Stern, & Todesco, 1978; Nakatani, 2006; O’Malley &
Chamot, 1990).
Much research has been done over the last three decades on the
characteristics and traits of successful language learners which can be
taught to less successful learners in ways that would benefit them. It now
appears that there is a multitude of factors that can affect language
learning and these include: personality type, learning style, aptitude,
motivation, and, the focus of this research, language learning strategies
(Ehrman & Oxford, 1995; Rubin, 1975).
The field of language learning strategies, which can be defined as the
methods learners use to aid their learning of a second or foreign language,
is complex. Focused research on this subject began in the 1970s (Naimen
et al., 1978; Rubin, 1975) with identifying and classifying good language
learning strategies. However, there is still much discussion going on about
the classification of these strategies and their relevance to language
learning and acquisition (Hsiao & Oxford, 2002). What has become clear
is that there can be effective methods or techniques that learners can use to
learn a second or foreign language successfully (Lessard-Clouston, 1997;
Oxford, 1990). But the challenge still remains for many teachers and
researchers on how to isolate the language strategies and teach them to
learners in a way that can improve their ability to use the second or foreign
language and put them on a path to a more self-directed and independent
learning (Chamot, 2005).
According to Scollon and Scollon (2004), language is “a multiple,
complex and kaleidoscopic phenomenon” (p. 272). When one thinks about
the intricacies of a language, its design, structure, grammatical systems,
phonology, and how it is used according to audience, purpose and context,
the challenges of learning a second language become overwhelming. At
this point, it is important to distinguish between English as a second
language (ESL) and English as a foreign language (EFL), as there are …

… those with lower proficiency. Peacock and Ho (2003) also found a positive
correlation between twenty-seven strategies and learner proficiency. The
most frequent strategies used were compensation followed by cognitive,
metacognitive, social, memory and affective strategies. Higher proficiency
learners used cognitive and metacognitive strategies more frequently than
those with lower proficiency. Similar results on the correlation between
high levels of proficiency and an increased use of both direct and indirect
strategies were found in earlier research by Green and Oxford (1995).
Early research, from the 1970s and 1980s, found that successful
language learners had “a strong desire to communicate, were willing to
guess when unsure, and were not afraid of being wrong or appearing
foolish” (Rubin, 1975, p. 43). However, these learners were mindful of
correctness, form and meaning and monitored their own language as well
as that of those surrounding them. These strategies were not employed
universally by all successful language learners. It depended on the
learners’ target language proficiency, age, situation and cultural differences.
Fillmore (1982) reported similar findings in her research on individual
differences. She found that successful learners also used social strategies
as they “spent more time...socializing” (p. 285). By and large, research has
shown that a number of variables, such as gender, ethnicity, proficiency
level, socio-economic background, and level of motivation affect the type
and frequency of strategy use by second/foreign language learners
(Ehrman & Oxford, 1990; O’Malley, Chamot, Stewner-Manzanares,
Russo, & Kupper, 1985; Oxford & Nyikos, 1989).
… test results, and written and spoken tasks done in the classroom (Bremner,
1999; Ketabi & Mohammadi, 2012; Tam, 2013). There is little literature
on error analysis and its correlation with academic language proficiency.
However, researchers have mentioned the importance of error analysis and
its links to academic English proficiency (Michaelides, 1990; Richards,
Plott, & Platt, 1996).
Research done by Cohen (1998), Ehrman and Oxford (1989) and
Oxford (1990, 1993) showed that more frequent use of language learning
strategies is often related to higher levels of academic language
proficiency. However, according to Green and Oxford (1995), the picture
is not crystal clear as “it shows prominent features of the landscape but
only gives hints as to what the trees and buildings in the picture would
look like up close” (p. 261). In their study of university students studying
at different course levels in Puerto Rico, Green and Oxford (1995) found
that there was a positive correlation between strategy use and academic
proficiency. Bremner (1999), working with students from the City
University of Hong Kong, investigated strategy use and its correlation
with language proficiency. The results showed that the participants were
medium users of the learning strategies. The most frequently used strategy
was compensation, followed by metacognitive, cognitive, social, memory,
and affective strategies. The correlations between proficiency and strategy
use showed positive relations with cognitive and compensation strategies,
while there was a negative correlation with affective strategies. Goh and
Kwah (1997) had similar results in their study of Singaporean learners,
while Green and Oxford (1995) found that, in addition to these two
strategies, metacognitive and social strategies also showed positive
variation. As for the negative correlation between proficiency and
affective strategies, it could be that as learners become more proficient in
their language skills, they have less use of such strategies because their
confidence, knowledge and motivation have all increased.
The Study
The aim of this research study was to identify the second language
learning strategies used by tertiary students from the Republic of Fiji, and
investigate the impact these strategies have on their academic writing
skills. The research subjects were 95 first year undergraduate students and
10 final year students in a Bachelor of Arts in Literature/Language
program. Even though it was planned to have a balance of gender and
ethnicity in the sample, on the day of the data collection the females (67%)
outnumbered the males (33%), and the Fijian Indian students (64%) …
Cognitive Strategies:
• Item 6: I watch English language TV shows or go to English
movies (Mean = 3.6).
• Item 8: I write notes, messages, letters, or reports in English (Mean
= 3.6).
• Item 9: I first skim an English passage (read over the passage
quickly) then go back and read carefully (Mean = 3.6).
Metacognitive Strategy:
• Item 2: I notice my English mistakes and use that information to
help me do better (Mean = 3.5).
[Table: Pearson correlation coefficients (r) of gender with ethnicity and
the communication, metacognitive, cognitive, affective, memory and social
strategy categories; reported values: 1, .075, .169, .187, .050, .137,
-.020, -.085.]
Interview Analysis
The interview data confirmed the results from the analysis of the SILL
questionnaire. Most of the participants were not high users of language
learning strategies. Social strategies were seldom used by the participants
to learn English, both within the family and with relatives and friends.
Often social activities were conducted using the participants’ first
language (L1). Hence, social, memory and affective strategies once again
are at the bottom of strategy use.
Error % Error %
Punctuation 14.3 Article 2.7
Word Choice 10.9 Verb Form 2.6
Cut (Unnecessary text) 7.9 Conjunction 2.3
Repetition 7.7 Vague Reference 1.9
Agreement (subject/verb) 7.1 Sentence Fragment 1.9
Plural (singular/plural) 6.3 Capitalization 1.3
Preposition 6.1 Word Order 0.7
Incomprehensible Text 5.3 Inaccurate Quotation 0.7
Missing Word/s 4.8 Parallel Construction 0.3
Word Form 4.1 Missing Space 0.3
Modifier (misplaced) 3.9 Count/Non-Count 0.2
Spelling 3.3 Paragraphing 0.2
Verb Tense 3.1 Formatting 0.01
It was hypothesized in this study that the higher the use of language
strategies, the fewer the errors in students’ academic writing. The Pearson
correlation coefficient was used for this analysis because the data were
parametric. In parametric correlations, the correlation coefficient (r) shows
the strength of the relationship between two variables. Table 12-3 below
shows the results of the Pearson correlation between language learning
strategies and overall errors from all the written texts analyzed. According
to Cohen and Cohen (1983), a correlation coefficient of 0.22 indicates a
small positive linear correlation. The significance was 0.04, which is p <
.05. Therefore, the results are statistically significant. There was a small
positive linear correlation between errors and learning strategies. As more
writing strategies were used by students over the semester, the number of
errors in their written work also increased. Therefore, the results show
that the language learning strategies used by the students did not have
the hypothesized positive impact on their academic language proficiency.
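As a worked illustration of the statistic reported here, the sketch below
computes r directly from its definition: the covariance of the two
variables divided by the product of their standard deviations. The arrays
are hypothetical stand-ins for the SILL strategy means and error counts,
not the study's data.

    # Minimal sketch of Pearson's r from its definition (hypothetical data).
    import numpy as np

    strategy_use = np.array([2.8, 3.1, 3.6, 2.5, 3.4, 3.0, 3.9, 2.7])  # SILL means
    errors = np.array([14, 18, 22, 11, 19, 16, 25, 13])                # error counts

    # r = covariance / (sd_x * sd_y); ddof=1 uses sample statistics throughout.
    r = np.cov(strategy_use, errors, ddof=1)[0, 1] / (
        strategy_use.std(ddof=1) * errors.std(ddof=1))
    print(f"r = {r:.2f}")  # equivalently: np.corrcoef(strategy_use, errors)[0, 1]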
Conclusion
The research study reported in this chapter found that university
students in Fiji used language learning strategies at a medium level. The
most frequent strategies used were metacognitive followed by cognitive
and social strategies. Affective strategies were used the least.
Undergraduate students in Fiji are aware of the strategies they use to learn
English and are taking control of their learning, albeit at a medium level.
The study also found that ethnicity did not have a significant influence on
strategy use. Students’ ethnic background, i.e., whether they were
indigenous Fijian or Fijians of Indian origin, was not significantly
correlated with strategy use. Both major ethnic groups displayed …
This study has shown that second language learners in Fiji are not quite
aware of their learning strategies. There is a need for further research into
the language learning strategies of Fijian students with a larger sample size
and from institutions at all levels: primary, secondary and tertiary. Other
factors must be explored to determine what is impacting academic
language proficiency (or lack of it) among undergraduate students of Fiji.
There is a need to consciously teach language learning strategies in the
teacher training courses so teachers can then integrate strategy use and
training in their lessons. All teachers, irrespective of the subjects they
teach, should be able to identify strategies by name, describe them and
model them. Strategy training should be integrated within the curriculum
rather than taught as a separate entity and it should start with beginner
students even if this means providing the training in the students’ first
language. Students need to have experience with a variety of strategies to
be able to choose the one that works well with them. In case of failure in
language learning, students need to be assured that their failure may not be
due to lack of intelligence, but to the inability to choose appropriate
strategies.
References
Al-Hebaishi, S. M. (2012). Investigating the relationships between
learning styles, strategies and the academic performance of Saudi
English majors. International Interdisciplinary Journal of Education,
1(8), 510-520.
Bharuthram, S. (2012). Making a case for the teaching of reading across
the curriculum in higher education. South African Journal of
Education, 32, 205-214. Retrieved from:
http://www.ajol.info/index.php/saje/article/viewFile/76602/67051
Boyce, A. (2009). The effectiveness of increasing language learning
strategy awareness for students studying English as a second
language. Master’s dissertation, Auckland University of Technology,
New Zealand.
Bremner, S. (1999). Language learning strategies and language
proficiency: Investigating the relationship in Hong Kong. Asia Pacific
Journal of Language in Education, 1(2), 490-514. Retrieved from:
http://utpjournals.metapress.com/content/d27w088833436k7x/
Brown, H. D. (2007). Principles of language learning and teaching (5th
Edition). White Plains, NY: Pearson Education.
CHAPTER THIRTEEN
Abstract
This chapter presents a research study conducted at an English for
Specific Purposes (ESP) one-to-one course focusing on speaking skills, in
order to find out if the course met the students’ learning needs and
prepared them to take the Test of English as a Foreign Language–Internet-
based Test (TOEFL iBT). The study was grounded on the theoretical
principles of teaching ESP, needs analysis, task-based teaching, and
language assessment. The instruments for the data collection were: initial
and final questionnaires; an audio recording of two speaking tasks on the
first and last day of class; and the teacher-researcher’s diaries at the end of
every class containing the students’ perceptions of their performance in
class. The results revealed the students’ satisfaction regarding the course
methodology and material, as well as the students’ perception of
improvement in their speaking and writing skills. The students’ narratives
also indicated the importance of teacher-student interaction and praised the
attention given by the teacher to their emotional aspects. The study
contributes to the field of ESP and language assessment, and fills the
research gap that exists in the teaching of speaking skills in private classes.
Introduction
The increasing number of students seeking to study graduate courses in
English-speaking countries has led to an unprecedented demand for the
Test of English as a Foreign Language–Internet-based Test (TOEFL iBT).
Many universities in English-speaking countries require international …
Background
English for Specific Purposes (ESP)
Needs Analysis
According to Hutchinson and Waters (1987), a needs analysis should
gauge the students’ learning needs and not the teachers’ teaching needs.
For them, the difference between an ESP course and a general English
course is not so much the nature of the need, but the awareness of such
need. This is one of the most important aspects for ESP course design,
which should be divided between the target-situation needs (what the
student must do in the target-situation), that can be further subdivided into
necessities, lacks and wants, and the student learning needs (what the
student should do to learn).
Dudley-Evans and St. John (1998) also consider needs analysis
extremely important for ESP courses, as it allows for a much more focused
course. They claim that needs analysis is the process to determine “what to
do” and “how to do” a course (Dudley-Evans & St. John, 1998, p. 121).
The data collection for the needs analysis can be carried out through
questionnaires, interviews, surveys, assessments and discussions. Long
(2005, p. 19) states that “in changing times, educators are increasingly
relying on their needs analysis results in order to develop new courses.”
But he also warns that the respondents are usually the very same students
who are not always aware of what they will need in the target language
(L2). One example is the international students who are preparing to attend
graduate courses in English-speaking countries.
Tasks in ESP
For Willis (1996), task-based teaching should: stimulate students to use
the target language collaboratively and meaningfully; allow students to
participate in a complete interaction and to use different communication
strategies; and help students develop self-confidence to reach their
communicative goals. Based on the consensus of several researchers and
educators, Skehan (1998) suggested four criteria to define a task: (i) the
meaning is essential; (ii) the focus is on the objective; (iii) the task product
must be assessed, and (iv) there must be a relation to the real world. A
similar concept was also proposed by Willis (1996), as for her, tasks are
activities in which the target language is used by the learners with a
communicative objective in order to reach a result. Willis (1996)
highlights that task-based teaching should give the learners the freedom to …
“The process by which learners plan what they are going to say or write
before commencing a task. Pre-task planning can attend to prepositional
content, to the organization of information or to the choice of language.
Strategic planning is also referred to as pre-task planning.” (Ellis, 2005a, p.
50)
Table 13-1: Aspects and measures of oral production (based on Ellis, 2003).

Fluency: number of words and syllables per minute; number of pauses (of
one/two seconds or longer); number of repetitions and reformulations;
number of words per turn.
Accuracy: number of self-corrections; percentage of error-free clauses;
use of verb tenses/articles/vocabulary/plurals/negatives; ratio of
definite and indefinite articles.
Complexity: number of turns per minute; lexical richness; amount of
subordinate clauses; frequency of use of prepositions and conjunctions.
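To make the fluency measures above concrete, the following minimal sketch
computes two of them from a toy transcript. The annotation format (pauses
marked as parenthesised durations, e.g. "(1.2)") and the function name are
assumptions for illustration, not part of the study's instruments.

    # Minimal sketch: words per minute and pauses of >= 1 second
    # from a pause-annotated transcript (hypothetical format).
    import re

    def fluency_measures(transcript: str, duration_seconds: float):
        # Pauses are assumed to be annotated as parenthesised durations, e.g. "(2.0)".
        pause_lengths = [float(p) for p in re.findall(r"\((\d+(?:\.\d+)?)\)", transcript)]
        # Strip the pause markers, then count the remaining words.
        words = re.findall(r"[A-Za-z']+", re.sub(r"\(\d+(?:\.\d+)?\)", " ", transcript))
        return {
            "words_per_minute": len(words) / (duration_seconds / 60),
            "pauses_1s_or_longer": sum(1 for p in pause_lengths if p >= 1.0),
        }

    sample = "I think (1.2) I would choose the first option because (2.0) it is cheaper"
    print(fluency_measures(sample, duration_seconds=45))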
Ellis (2005a) suggests that when the learner has the opportunity to make
use of strategic planning before the task, his or her language production
will be more fluent and show more complexity. Although there are many …
learning needs of the students and prepared them to take the TOEFL iBT
in a one-to-one class environment.
The Study
This is a qualitative exploratory research study. The main research
questions that guided the study were:
1. What are the students’ needs with regards to taking the TOEFL
iBT?
2. How do students perceive their language development during the
TOEFL iBT preparatory course?
3. How do students perceive the TOEFL iBT preparatory course?
Participants
The participants of this study were 17 students attending the
preparatory course for the TOEFL iBT. They were 7 female and 10 male
young adults, mostly (71%) between 21 and 30 years old. All but one were
graduate students, most of them had advanced levels of English, and only
three were at an intermediate proficiency level. Their language proficiency
level (basic, intermediate, intermediate/advanced or advanced) was
classified informally on the first day of class, taking into consideration
their grammar level, their capacity to express themselves without
hesitations and their vocabulary mastery. In order to protect the students’
identity, their names have been omitted and each student is identified by
the letter ‘A’ followed by a number, from 1 to 17.
Data Collection
The study data collection instruments and the procedures for the data
collection helped to answer the three research questions. The data
collection was divided into three consecutive phases. Table 13-2 provides
a summary of the data collection procedures in each phase of the project,
and the procedures in each of these phases are explained in detail in the
following sections.
Table 13-2: Summary of data collection procedures by phase.

Phase 1: initial questionnaire; recording of the student's initial
speaking task; assessment of the student's initial speaking task.
Phase 2: interview at the end of each class.
Phase 3: final questionnaire; recording and assessment of the student's
final speaking tasks.
Phase 1
The initial questionnaire was used to find out and analyze the students’
learning needs. From the tabulation of the data, it was possible to
understand the target audience profile and the students' needs, and to
tailor the course content for each student.
The initial oral production recording of each student responding to
sample Tasks 1 and 2 of the TOEFL iBT Speaking Section aimed to
provide data on each student, such as fluency, lexical richness,
accuracy, time used to formulate answers, attitude and reaction to the
limited time for responses. Task 1 of the TOEFL iBT Speaking Section
asks the test-taker to give a personal opinion about a topic, and Task 2
asks the test-taker to make a personal choice between two options and give
reasons and examples.
The recordings were assessed according to criteria similar to those used
by the Educational Testing Service (ETS), which is responsible for the
TOEFL iBT, and to the theoretical framework by Ellis (2003) (see Table
13-1 above) regarding fluency, complexity and accuracy of oral production.
Based on these criteria, an evaluation form was developed for the speaking
task …
Phase 2
The second phase of data collection occurred at the end of each class.
An interview consisting of three open-ended questions was carried out
with each participant, in order to understand and evaluate the perceptions
of students with regards to their oral production difficulties and the
activities they performed during that lesson. The three interview questions
were:
With the transcription of all the answers, the data were classified into
three categories elaborated a priori; i.e., activities, difficulties and
performance (see Bardin, 2011). The themes emerged after the analysis of
all the responses which were initially grouped by similarity of content. The
topics that were most often mentioned and later on guided the analysis
were: cognition, affection and methodology.
Phase 3
In the third and final phase of data collection, the final questionnaire
was used in order to find out if the course had met the students’ specific
needs raised in the beginning of the course, and the level of support the
course offered them to take the TOEFL iBT.
Also, the students’ performance on sample Task 1 and 2 of the TOEFL
iBT Speaking Section was recorded. The content of this recording was
compared with the initial speaking tasks recording and provided
information for analysis of the development of students’ speech
production. As with the initial oral production, the evaluation of these final
tasks was performed using the same evaluation form.
The analysis of the participation and performance of the 17 students
aimed at evaluating the adequacy of the course from the students’
perspective and assessing their linguistic ability.
The results showed that all students considered their oral production
either fair (6 students) or good (11 students) before the course started. That
was a good indication that they needed to improve this skill during the
course. It should be noted that, as all the students needed to reach a
minimum score of 85% in the test, a speaking ability rated merely as
‘good’ was not enough to reach the desired score. This need was also
highlighted in the initial questionnaire as their main reason for attending
the course was the enhancement of their speaking ability. Writing skills
were also worked on extensively throughout the ESP course, as this was
the only skill rated as weak (3 students). Interestingly, students reported
having greater difficulty in language production (speaking and/or writing)
and less difficulty in language comprehension (listening and/or reading).
Students’ Performance
With regards to student language development and performance, the
final questionnaire data at the end of the course showed that fluency and
vocabulary were still a problem for students. Although 35% of students
mentioned fluency as the greatest difficulty in speaking English even after
they attended the course, and 29% of them signaled a lack of vocabulary
as something that would still hinder their oral production, the vast majority
(71%) felt more confident at the end of the course (see Figure 13-1).
In the students' perception, factors that are more measurable, such as
lack of vocabulary or grammar, caused less concern, or were even
minimized, when compared to the more subjective factors, such as
fluency and objectivity when describing details and reasons in the strictly
timed answers. Interestingly, all these factors are interrelated, because the
grammar and the vocabulary level will influence the fluency and the
objectivity of the answers within the 45 seconds allowed in the TOEFL
iBT.
At the end of the course 47% of students claimed to feel more
confident and fluent in English (see Figure 13-2). The high number of
answers related to greater confidence (71%) shows that one of the main
initial difficulties of the students was overcome by the end of the course.
[Figure 13-2: number of answers per category; vertical axis: number of
answers (0-7).]
In the final questionnaire, students were asked to rate (on a scale of 0
to 5) their level of learning as a result of the activities in the regular
classes. Figure 13-3 shows the results.
[Figure 13-3: Student self-perceived learning (N = 17). Vertical axis:
number of answers; horizontal axis: rating from 5 (highest) to 0 (lowest).]
Results showed that 7 students (41%) rated their learning with the
highest score (5) and 9 students (47%) gave a 4 for their language
ability at the end of the course. According to these results, it can be
inferred that the ESP course met the needs mentioned by the students in
the initial questionnaire at the beginning of the course.
In both the initial and final questionnaires, students were asked to rate
their four language skills. This question offered four response options:
excellent, good, fair and poor. The following two figures (Figure 13-4 and
Figure 13-5) show the evolution of this perception from the students'
point of view, while Table 13-4 compares the initial with the final
student perceptions.
[Figure 13-4: Students’ perceptions of their language skills before the
course (N = 17). Series: Reading, Listening, Speaking, Writing;
categories: Weak, Fair, Good, Excellent.]
[Figure 13-5: Students’ perceptions of their language skills after the
course (N = 17). Series: Reading, Listening, Speaking, Writing;
categories: Weak, Fair, Good, Excellent.]
Table 13-4: Students' perceptions of their language skills, initial vs. final.

Skill       Same   Better   Worse
Reading      12      4        1
Listening    12      4        1
Speaking      9      7        1
Writing       5     11        1
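The comparison in Table 13-4 amounts to classifying each student's final
self-rating against the initial one on the ordered four-point scale. A
minimal sketch, with hypothetical ratings rather than the study's data,
is given below.

    # Minimal sketch: counting same/better/worse perceptions per skill.
    SCALE = {"weak": 0, "fair": 1, "good": 2, "excellent": 3}

    def compare(initial, final):
        counts = {"same": 0, "better": 0, "worse": 0}
        for before, after in zip(initial, final):
            if SCALE[after] > SCALE[before]:
                counts["better"] += 1
            elif SCALE[after] < SCALE[before]:
                counts["worse"] += 1
            else:
                counts["same"] += 1
        return counts

    # Hypothetical speaking self-ratings for five students:
    print(compare(["fair", "good", "fair", "good", "weak"],
                  ["good", "good", "excellent", "fair", "good"]))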
[Figure: individual results for students A1-A17.]
… their speaking skills; at the end of the course, 33% of the students
stated that their oral production had improved significantly, and three of
the students were positively surprised by how much they had improved.
At the end of the study, and from a careful analysis of the data (see
Figure 13-4 and 13-5), it was possible to quantify the improvement
perceived by students in all four language skills. All findings were
compiled into a single spreadsheet; the data were then compared and
analyzed with the goal of finding the most usual pattern of improvement
among the different students, as well as the most common correlations
among the skills studied. Table 13-5 summarizes the results of this
analysis.
Surprisingly, the improvement in the perception of written production
was much higher (38.7%) when compared to the improvement of oral
production (17.6%), although the latter was the main focus of the
preparatory course for the TOEFL iBT.
The average perception of all students for each one of their skills was
taken into account, both at the beginning of the course and at the end of it.
From these values, the median variation was calculated. As the figures
show in the table above, in the comprehension skills, i.e., reading and
listening, there was only an increase of 12.0% and 11.5% respectively in
the way students viewed their improvement.
The course in question, despite having a narrower focus, was able to help
students improve more than one communicative ability: even though it
focused on the practice of speaking skills, the wide variety of materials
and the extra writing content used to support this focus may also have
helped students develop their other language skills, such as writing.
According to the students’ scores above, it can be said that there was
indeed much learning over the course, as the majority of the students (13
of them, or 76%) obtained a higher score than what they needed to be
accepted into their graduate courses. Among the four students who did not
reach the desired score, two of them had an intermediate level of English
with grammar difficulties, and the two others reported serious anxiety
issues during the test.
Every year since 2006, ETS has published a report with the average scores
of students around the world. The figures reflect each skill and the total
score. The average score in the period when the present study took place
was 80 points (see Table 13-7).
Conclusion
Through the data obtained in the initial needs analysis, I had the
opportunity to investigate students’ needs and tailor the ESP course to
achieve certain goals, aligned with the needs and weaknesses of each
student. Based on the initial questionnaire, it was possible to detect that for
most students (53%) oral fluency was the greatest difficulty. After the
course, it was found that this problem was indeed overcome, not only in
terms of students’ perceptions, but also in terms of the evaluation of their
speaking skills in the initial and final speaking tasks. The study also
showed that students experienced an increase in their self-esteem and self-
confidence to express themselves in English. The feedback and the …
Acknowledgement
This research study received support from CAPES, the Brazilian
Agency for the Development of Graduate Studies.
References
Bardin, L. (2011). Análise de Conteúdo. São Paulo: Edições 70.
Basturkmen, H. (2010). Developing courses in English for specific
purposes. Great Britain: Palgrave Macmillan.
Clapham, C. (2000). Assessment and testing. Annual Review of Applied
Linguistics, 20, 147-161.
Dudley-Evans, T., & St. John, M. (1998). Developments in English for
specific purposes: A multi-disciplinary approach. Cambridge:
Cambridge University Press.
Ellis, R. (2003). Task-based language learning and teaching. Oxford:
Oxford University Press.
—. (2005a). Instructed second language acquisition - A literature review.
Auckland: The University of Auckland.
—. (2005b). Planning and task performance in a second language.
Language Learning & Language Teaching Series. Amsterdam: John
Benjamins.
CHAPTER FOURTEEN
SELWYN A. CRUZ
AND ROMULO P. VILLANUEVA JR.
Abstract
The Far Eastern University (FEU) has one of the highest numbers of
international undergraduate students among Philippine universities. A
considerable number of these students are Korean students of English as a
foreign language (EFL) who are enrolled in general education courses,
such as English language classes which are also attended by Filipino
learners for whom English is a second language (ESL). Recognizing the
constantly increasing population of international students in the university,
this research study intended to compare the English grammar proficiency
of learners from two Asian English varieties, namely Philippine English
and Korean English. In fulfilling the objectives of the study, 30 Korean
and 30 Filipino students were randomly selected to answer a 130-item
grammar test based on the syllabus of their course, namely, Introduction to
Language Arts English (Eng AN). Recommendations and implications for
English language teaching and learning are also discussed in the study.
Introduction
The term ESL (English as a Second Language) is attributed to the use
of English in countries like the Philippines and India where English is
used in daily activities but not as the main language. On the other hand, …
Background
Looking at the concept of grammatical proficiency, Chomsky posited
that grammatical competency involves one’s knowledge of grammar.
Hymes (1972), however, thought that this concept was inadequate; thus,
the elaboration that the grammatical proficiency that Chomsky was
referring to, alongside the meaning or value of one’s utterance, is part of
what is termed as communicative competence. Canale and Swain (1980)
supported this idea and added that: …

… empirical evidence for revising the English language syllabus intended
for first year students.
The Study
A total of 60 first year students, 30 Filipino and 30 Korean, from
various disciplines, enrolled in Eng AN or Comm Arts 1 at the Far Eastern
University (FEU), were the participants of this study. Only those taking
the course for the first time were chosen to take part in the study in order
to ensure the reliability of results since a prior exposure to the course
materials could result in apparent differences in performance. The first
year students were chosen because the syllabus for first year students
mainly concentrates on grammar. Convenience sampling was used in the
study because filtering the entire population of freshman students in terms
of class standing and proficiency level was not feasible at the time that
the study was conducted.
The Filipino participants consisted of 9 males and 21 females from the
Institute of Arts and Sciences enrolled in the Bachelor of Science Major in
Medical Technology. All of the participants were students of one of the
researchers. The Korean participants were all part of a Filipino for
Foreigners class, which was handled by a colleague. The Korean
participants comprised 19 males and 11 females who were taking different
courses. The researchers were not able to gather a homogeneous group of
Korean participants in terms of the courses they were enrolled in because
of the relatively small number of foreigners in each class compared to
Filipinos. The Filipino participants' ages ranged from 15 to 18 years
while the Koreans' ranged from 17 to 20 years. There were no specific
requirements for students to be part of the study apart from being enrolled
in the Eng AN class. At the time the data were gathered, the students were
about to have their midterm examination; hence, a good portion of the
syllabus was expected to have been taught in the class already.
Instruments
A 130-item grammar test covering the parts of speech and other
aspects of the English language that require the use of rules was used to
collect data. All grammatical aspects in the test were mostly based on the
syllabus of the Basic English course (Eng AN or Comm Arts 1) which is a
course for all first year students in FEU. There was an average allocation
of five test items per grammatical aspect as suggested by Brown (2005) on
language testing. The researchers took the questions from the grammar
book of Folse, Ivone and Pollgreen (2005) and modifications were made
for contextualization. There were numerous targeted grammatical aspects
to be examined in the current study; hence, an objective type test was
needed for convenience in marking and analysis. The test contained
multiple choice questions, cloze tests, gap fill exercises, and fill in the
blank type of questions. A pilot test was administered with 2nd year
English Language students (21 Filipinos, 7 Koreans and 2 Chinese). The
Filipino students obtained a general mean of 101.64 while the Koreans had
a mean score of 96.08. Minor modifications were made to the test to make the
questions more suitable for the first year students.
Procedures
The test was administered separately to each group. The Filipinos were
given the test during their Eng AN class. The class was composed of 45
students, so 30 students were randomly selected to take the test, while the
remaining 15 students were asked to perform a classroom task. The
Korean learners, on the other hand, were given the test during their
Filipino for Foreign Students class. At that time, there were exactly 30
students enrolled in the class.
The students were given forty minutes to answer the test, since the
participants of the pilot test had been able to accomplish it in 25 to 35
minutes. After the test, the researchers marked all of the test papers
over two separate days. Two colleagues helped verify the reliability of
the marks: the marks were re-checked for possible errors and a recount of
the test scores was also conducted.
Data Analysis
The researchers made use of descriptive statistics to analyze the data. In
addition, the researchers devised a scale in order to measure the level of
grammar proficiency of the participants (see Table 14-1). The researchers
also analyzed the participants’ prominent mistakes. The mistakes
committed were collated and tabulated.
Table 14-2: Independent-samples t-test on overall test scores by nationality.

Nationality   Mean     Std. Deviation   Std. Error Mean   t-test   p-value
Korean         92.90       11.174            2.066         2.28      .026
Filipino      100.67       14.615            2.672
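The comparison above is an independent-samples t-test. The sketch below
shows the corresponding computation in Python; the scores are simulated
around the reported means and standard deviations and are not the actual
test results.

    # Minimal sketch: independent-samples t-test (simulated data).
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    korean = rng.normal(92.90, 11.174, 30)     # simulated around the reported statistics
    filipino = rng.normal(100.67, 14.615, 30)  # not the actual test scores

    t, p = ttest_ind(filipino, korean)  # assumes equal variances
    print(f"t = {t:.2f}, p = {p:.3f}")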
Tables 14-3 and 14-4 show the means and standard deviations per
grammar topic for each group of participants (ESL and EFL) with the
corresponding interpretations from the devised scale. Based on the total
mean scores of each group in each area, it can be seen that the ESL
students achieved a proficient level in 12 areas of grammar (Table 14-3),
while the EFL students achieved a proficient level in six areas of grammar
(Table 14-4). On the other hand, there are 10 and 11 areas in grammar
where the ESL and EFL students achieved an average level respectively.
It is also interesting to note that the EFL participants seem to be less
proficient in the present progressive tense, modals and articles, while the
ESL participants appear to be less proficient in adverbs of frequency.
Sentence errors seem to be the most difficult area of grammar for EFL
learners based on their total mean score.
Table 14-3: Means and standard deviations for the ESL participants.
Table 14-4: Means and standard deviations for the EFL participants.
Articles

The learners seem to interchange 'the' with 'a', and it also appears that
they tend to use an article when it is not needed.
Adverbs of Frequency
Modals
EFL learners used 'shall' instead of 'will' and 'can' instead of 'would'.
Both EFL and ESL learners used 'couldn't/wouldn't' or 'didn't have'
instead of 'shouldn't', ignoring the previous sentence that gave them the
clue.
Prepositions
Subject-Verb Agreement
Students had difficulty locating the real subject of the sentence. For
example, in sentence (4) they mistook ‘street’ as the subject rather than
‘vendors’.
Overall, the Korean EFL learners achieved a less proficient level for a
greater number of test items than the Filipino ESL learners. This could be
attributed to the fact that Korean EFL learners studying in the
Philippines have less need to speak English after their classes finish,
because they interact almost exclusively with their fellow Korean students
using their mother tongue.
Conclusion
This small-scale study supports Kachru’s (2005) model in which
learners of English differ in various aspects. The study highlights how
Korean students studying in the Philippines need to be closely monitored
in their English language learning progress since they are in an
environment which may not be too conducive for learning English due to
the fact that the majority of the learners in the class they attend come from
the Outer Circle. A special English class for the Korean students is needed
in order to effectively address their English language needs. However, this
might pose problems since the Korean students are learning two foreign
languages simultaneously (i.e., English and Filipino).
Despite its limitations, this study may offer some insights to English
teachers to modify their course to further address the grammar deficiencies
of both EFL and ESL learners. Knowing the grammar areas where both
EFL and ESL learners are less proficient would give English teachers a
better idea as to how to design their lessons and grammar activities in
order to address these issues. Finally, it is recommended that teachers
incorporate in their teaching strategies activities that would highlight the
use of the target grammar in communicative situations.
Acknowledgement
This study received financial support from the University Research
Center of the Far Eastern University.
References
Bauman, N. (2010). A catalogue of errors made by Korean learners of
English. Paper presented at the Annual Conference of the Korea
Teachers of English to Speakers of Other Languages (KOTESOL),
October 26th-28th, Seoul, South Korea.
Bautista, M. L. S. (2000). Defining standard Philippine English: Its status
and grammatical features. Manila: De La Salle University Press.
Borlongan, A. M. (2010). On the management of innovations in English
language teaching in the Philippines [Editorial commentary]. TESOL
Journal, 2(2), 1-3.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative
approaches to second language teaching and testing. Applied
Linguistics, 1(1), 1-47.
CHAPTER FIFTEEN
GLADYS QUEVEDO-CAMARGO
AND MATILDE VIRGINIA RICARDI SCARAMUCCI
Abstract
This chapter reviews studies on the washback of language assessment
from 2004, when Cheng, Watanabe and Curtis published the first book on
washback methodology, to 2012. Taking into account Alderson and Wall’s
(1993) admonition for the search for empirical evidence about the
phenomenon and the use of a more ethnographic approach in the
investigations, this review aimed at investigating the researchers’
methodological options during this period. By means of an electronic
search, 78 studies from 31 countries were identified. The analyses showed
that Alderson and Wall’s words were heard, as the identified works
significantly diversified the ways to investigate washback by involving
different stakeholders, using a variety of data collection instruments such
as document analysis, questionnaires, interviews and classroom
observation, as well as by adopting quantitative and mixed approaches of
investigation.
Introduction
Washback in language learning, that is, the impact or influence that
external exams, particularly high-stakes exams, as well as achievement
tests, may have on language teaching and learning processes, the
curriculum, material design and stakeholders' attitudes (Scaramucci,
2004), is a relatively new concept (Cheng, 2008). Studies carried out
mainly after the 1990s consolidated the idea that washback is a frequent,
complex and highly important phenomenon that involves several stakeholders.
Geographical Distribution

Continent        Countries      %
Asia                 12       38.7
Europe               11       35.5
South America         3        9.7
North America         2        6.4
Oceania               2        6.4
Africa                1        3.3
Total                31      100
Note: Mean = 5.16; Standard Deviation = 4.95.
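As a quick arithmetic check on the note (assuming, as the totals suggest, that the reported figures are the mean and the sample standard deviation of the six continent counts):

    \[
    \bar{x} = \frac{12 + 11 + 3 + 2 + 2 + 1}{6} = \frac{31}{6} \approx 5.17,
    \qquad
    s = \sqrt{\frac{\sum_{i=1}^{6} (x_i - \bar{x})^2}{6 - 1}}
      = \sqrt{\frac{122.83}{5}} \approx 4.96,
    \]

which agrees with the reported 5.16 and 4.95 up to rounding.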
[Figure: Bar chart of the number of studies identified per year, 2004-2012:
11, 7, 8, 10, 11, 13, 5, 6, 7 respectively (total = 78). These counts equal
the column totals of the table below.]
Methodological Issues

Number of data-collection instruments per study, by year of study

Nr. of instruments  2004  2005  2006  2007  2008  2009  2010  2011  2012  Total     %
1                      0     0     0     1     1     3     0     0     0      5   6.4
2                      6     2     4     4     9     3     1     1     2     32  41.0
3                      3     4     3     0     1     5     2     4     2     24  30.8
4                      1     1     1     3     0     2     2     0     2     12  15.4
5                      1     0     0     2     0     0     0     1     1      5   6.4
Total                                                                        78   100
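For readers who want to recompute the table's margins, the following is a minimal sketch (not the authors' code) in plain Python, with the cell counts transcribed from the table above:

    # Reproduce the row totals, percentage shares, and per-year column
    # totals of the instruments-per-study table.
    counts = {  # key: nr. of instruments; value: studies per year, 2004-2012
        1: [0, 0, 0, 1, 1, 3, 0, 0, 0],
        2: [6, 2, 4, 4, 9, 3, 1, 1, 2],
        3: [3, 4, 3, 0, 1, 5, 2, 4, 2],
        4: [1, 1, 1, 3, 0, 2, 2, 0, 2],
        5: [1, 0, 0, 2, 0, 0, 0, 1, 1],
    }
    years = list(range(2004, 2013))

    grand_total = sum(sum(row) for row in counts.values())
    assert grand_total == 78  # the 78 reviewed studies

    for n, row in counts.items():
        total = sum(row)
        print(f"{n} instrument(s): {total} studies "
              f"({100 * total / grand_total:.1f}%)")

    # Column sums give the number of studies per year, i.e. the bar
    # heights in the figure above.
    per_year = {year: sum(row[i] for row in counts.values())
                for i, year in enumerate(years)}
    print(per_year)  # {2004: 11, 2005: 7, ..., 2009: 13, ..., 2012: 7}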
Document Analysis
Documents are the basis for the majority of qualitative research studies
(Schensul, 2008). Document analysis can be a main source of data or a
complementary instrument, depending on the object and aims of the study
(Lüdke & André, 1986). All 78 studies reviewed used information
obtained from documents, even when the researchers did not explicitly
say so; Caine (2005) is one such case. Shawcross (2007) and four other
studies reported document analysis as the sole source of data collection,
while others, such as Barletta and May (2006), mentioned it as a
secondary source. In the studies that reported document analysis as one of
the instruments used, the way the analyses were conducted varied
considerably, ranging from quantitative approaches based on data
codification to qualitative approaches, most commonly the researcher's
selection of relevant information followed by possible interpretations of
the data.
Document analysis is therefore inherent to all research that aims at
investigating exam washback, since a thorough understanding of both the
exam and the educational context in which the study is carried out is an
essential condition for any conclusion about this phenomenon.
Questionnaire
“to consider, in detail, the situation in which the texts resulting from such
procedures (interviews and questionnaires) are produced; as well as the
(illocutionary and perlocutionary) values of the act of asking and the ways
of asking which favour the diffusion of assumptions of the researcher
about the required information.” (Machado & Brito, 2009, pp. 140-141 –
our own translation)
Interview
Classroom Observation
Focus Group
Diary
Researcher’s Writing
Student Follow-up
Conversational Analysis
Conclusion
Based on Alderson and Wall's (1993) article, in which the authors call
language assessment researchers' attention to the need to adopt
ethnography in the search for empirical evidence on washback, this study
aimed at investigating whether that admonition had any effect on
subsequent work. By means of an electronic literature review, 78 studies
conducted in 31 countries were identified.
The analysis of the studies revealed that Alderson and Wall’s (1993)
words had an impact on the research community, as the number of
methodological options to investigate language assessment washback had
significantly increased. Evidence of this is the use of mixed-methods
research in which qualitative and quantitative perspectives merged to
produce a more complete picture of the studied phenomenon. Furthermore,
the use of ethnography and triangulation by means of a variety of
stakeholders, sources of information (or documents) and data-collection
instruments was found in the great majority of the studies reviewed.
Taking into consideration Alderson and Wall’s (1993) suggestion for
triangulating the researcher’s perceptions with those of the participants,
inside and outside the classroom, in an attempt to capture the complexity of
the phenomenon, the reviewed studies gave voice to different
stakeholders: teachers, students or examination candidates, teacher
supervisors, school directors, coordinators, school supervisors, educational
authorities, exam designers, students’ parents, and material writers.
As far as the sources of information are concerned, researchers used
different kinds of documents such as previous exams, guidelines and other
exam publications, statistical data on test takers’ performance, and
teaching materials in order to understand more deeply the construct and
the history of the exam they were working with as well as to characterize
the research context.
In relation to triangulation, ten instruments were identified and are
listed here from the most to the least frequently used: document analysis,
References
Alderson, J. C. (2004). Foreword. In L. Cheng, Y. Watanabe, & A. Curtis
(Eds.), Washback in language testing: Research contexts and methods
(pp. ix-xii). New Jersey: Lawrence Erlbaum Associates.
Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied
Linguistics, 14(2), 115-129.
Andrews, S., Fullilove, J., & Wong, Y. (2002). Targeting washback – a
case study. System, 30(2), 207-223.
Barletta, N., & May, O. (2006). Washback of the ICFES Exam: A case
study of two schools in the Departamento del Atlántico. Íkala revista
de lenguaje y cultura, 11(17), 235-261.
Bourdieu, P. (1998). Compreender [Understanding]. In P. Bourdieu (Ed.),
A miséria do mundo (2nd ed., pp. 693-732). Petrópolis: Vozes.
Brinkmann, S. (2008). Interviewing. In L. M. Given (Ed.), The Sage
encyclopedia of qualitative research methods (pp. 470-472). Thousand
Oaks, CA: Sage Publications.
Brown, J. D. (2001). Using surveys in language programs. Cambridge:
Cambridge University Press.
CONTRIBUTORS
Viorica Marian (Ph.D. from Cornell University) is the Ralph and Jean
Sundin Professor of Communication Sciences and Disorders and Professor
of Psychology and Cognitive Science at Northwestern University in the
United States. Since 2000, she has directed the Bilingualism and
Psycholinguistics Research Group, with funding from the National
Institutes of Health and the National Science Foundation. Her research
centers on bilingualism and its consequences for linguistic, cognitive, and
neural function, with a focus on language processing, learning, and
memory. Her research has been disseminated in over 100 publications,
over 200 conference and invited presentations, and receives extensive
press coverage (http://www.bilingualism.northwestern.edu/).
group (English for Specific Purposes, ESP, teaching and learning research
group at PUC-SP), and editorial assistant of the academic journal the
ESPecialist. She is responsible for designing and teaching ESP courses
mainly for language proficiency tests such as TOEFL iBT and IELTS. Her
main research interests include ESP, one-to-one classes, and language
proficiency assessments.