
Article

Language Testing
30(4) 535–556
© The Author(s) 2013
Reprints and permissions:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532213489568
ltj.sagepub.com

Re-examining the content validation of a grammar test: The (im)possibility of distinguishing vocabulary and structural knowledge

J. Charles Alderson
Lancaster University, UK

Benjamin Kremmel
University of Innsbruck, Austria

Abstract
“Vocabulary and structural knowledge” (Grabe, 1991, p. 379) appears to be a key component
of reading ability. However, is this component to be taken as a unitary one or is structural
knowledge a separate factor that can therefore also be tested in isolation in, say, a test of syntax?
If syntax can be singled out (e.g. in order to investigate its contribution to reading ability), this
test of syntactic knowledge would require validation. The usefulness and reliability of using expert
judgments as a means of analysing the content or difficulty of test items in language assessment
has been questioned for more than two decades. Still, groups of expert judges are often called
upon as they are perceived to be the only or at least a very convenient way of establishing key
features of items. Such judgments, however, are particularly opaque and thus problematic when
judges are required to make categorizations where categories are only vaguely defined or are
ontologically questionable in themselves. This is, for example, the case when judges are asked to
classify the content of test items based on a distinction between lexis and syntax, a dichotomy
corpus linguistics has suggested cannot be maintained. The present paper scrutinizes a study by
Shiotsu (2010) that employed expert judgments, on the basis of which claims were made about
the relative significance of the components ‘syntactic knowledge’ and ‘vocabulary knowledge’ in
reading in a second language. By both replicating and partially replicating Shiotsu’s (2010) content
analysis study, the paper problematizes not only the issue of the use of expert judgments, but,
more importantly, their usefulness in distinguishing between construct components that might,
in fact, be difficult to distinguish anyway. This is particularly important for an understanding and
diagnosis of learners’ strengths and weaknesses in reading in a second language.

Keywords
Content analysis, grammar, judgments, reading in a second language, vocabulary

Corresponding author:
J. Charles Alderson, Lancaster University – Linguistics and English Language, County College South,
Lancaster University, Lancaster, LA1 4YL, UK.
Email: c.alderson@lancaster.ac.uk

Alderson (1993a) argues that the use of so-called experts to judge the content or difficulty
of test items is highly questionable as these judgments are often of limited accuracy, reli-
ability and validity. Bachman et al. (1996) and Alderson et al. (2012) have also demon-
strated that judgments about salient item characteristics appear rather obscure and arbitrary
and that agreement between judges is moderate at best. Nevertheless, language testers still
frequently rely on expert judgments when attempting to validate the content or construct
of a particular test. In a study investigating the relative significance of different compo-
nents of second language (L2) reading, Shiotsu (2010), as part of a preliminary study,
employed expert judges to justify a test of 35 items as a valid and suitable measure of
syntactic knowledge. Based on these judgments, Shiotsu (2010) then decided which items
should be included in the test of syntactic knowledge and thus form the basis of further
statistical analyses, their results and inferred claims about the nature of L2 reading.
Given Alderson’s (1993a) caution against the use of expert judgments and other
criticisms that have been voiced concerning the study in question (Brunfaut, 2009),
Shiotsu’s preliminary study should be scrutinized in more detail, analysing closely the
logic and rationale behind the inclusion or exclusion of certain items and replicating
the study to see whether findings can be corroborated with different groups of experts.
The aim of the present paper is thus twofold: It will problematize the use of expert
judgments for content validation in general, but will also discuss the particular diffi-
culty when the judges’ task is to make a clear construct distinction between syntactic
and lexico-semantic knowledge, two categories that, by nature, overlap and have
blurred boundaries.
This paper will present the findings of an examination of Shiotsu’s (2010) content
analysis study. The paper briefly outlines the context of the study and Shiotsu’s (2010)
findings in the original study in order to contextualize and facilitate interpretation of the
insights gained from the present study. It then presents the results of replications of this
study and discusses findings from using an alternative approach to judgment gathering to
investigate whether Shiotsu’s (2010) test can be confirmed as an instrument that mainly
measures syntactic knowledge and therefore has yielded trustworthy results that form the
basis of claims about the relative significance of syntactic knowledge in L2 reading abil-
ity. Finally, the implications of the study for our understanding of L2 reading and for
future research are discussed.

Context
Adopting a component model of reading rather than focusing on cognitive process-
ing, numerous researchers have attempted to model L2 reading ability and
explain the relative contribution of different components to reading, or rather reading
test performance. Amongst these, “vocabulary and structural knowledge” (Grabe,
1991, p. 379) appears to be one of the most prominent components according to
research. Nevertheless, it is generally agreed that vocabulary knowledge best predicts
reading test performance.
Alderson (2000) states that “factor analytic studies of reading have consistently found
a word knowledge factor on which vocabulary tests load highly” (p. 99) and that there-
fore vocabulary knowledge is an important predictor of variance in reading test

Downloaded from ltj.sagepub.com at UNIV OF SOUTHERN CALIFORNIA on April 4, 2014


Alderson and Kremmel 537

performances (Qian, 2002). Baddeley et al. (1985), Dixon et al. (1988), Cunningham et
al. (1990), Beck and McKeown (1991) and Daneman (1991) identified vocabulary as an
important component of fluent L1 reading. Hacquebord (1989), Bossers (1989), Laufer
(1992) and Schoonen et al. (1998) report similar findings for the L2 context. Yamashita
(1999) also claims that L2 vocabulary knowledge surpasses L2 grammar knowledge in
explaining L2 reading variance. Brisbois (1995), using grammar and vocabulary as inde-
pendent predictor variables in her analysis, found that vocabulary measures showed a
higher correlation with reading scores than did the grammar measure.
However, a recent paper by Shiotsu and Weir (2007) criticizes the methodological
bias and shortcomings in previous studies and concludes that “the literature on the rela-
tive contribution of the grammar and vocabulary knowledge to reading performance is
too limited to offer convincing evidence for supporting one or the other of the two pre-
dictors” (Shiotsu & Weir, 2007, p. 105).
Instead, Shiotsu and Weir claim that “the role of vocabulary appears somewhat
overstated while that of grammar understated” (p. 104), which would be in accord-
ance with studies by Alderson (1993b) and Bachman et al. (1989) which found that
grammar tests do explain a substantial percentage of variance in reading test perfor-
mances. Kaivanpanah and Zandi (2009) concluded from their findings that “syntac-
tic behavior is more related to reading comprehension than vocabulary knowledge”
with students’ scores on the TOEFL grammar test outperforming their scores on Qian
and Schedl’s (2004) Depth of Vocabulary Knowledge Test as a predictor of reading
test performance.
While it is beyond the remit of this paper to examine the methodology of all of these
studies which appear to support Shiotsu’s (2010) and Shiotsu and Weir’s (2007) claim,
the present paper will scrutinize the methodology and results of Shiotsu’s (2010) prelimi-
nary study, on which the claim is based that syntactic knowledge is a better predictor of
L2 reading test performance than vocabulary knowledge. Only if Shiotsu’s (2010) test
can be confirmed as measuring mainly, or exclusively, syntactic knowledge, can the
results it has produced be regarded as a reliable basis for any claims regarding the con-
struct of L2 reading ability.

The original study


Shiotsu (2010), in an investigation of the predictive power of a range of components of
L2 reading, employed expert judges in a content analysis to legitimize a test of 35 items
as a valid measure of syntactic knowledge. Shiotsu’s expert group consisted of
“11 L1-English ELT experts with at least a master’s degree in applied or theoretical lin-
guistics or TEFL” (Shiotsu, 2010, p. 62) and three Japanese lecturers of English syntax
with at least a master’s degree in linguistics. Based on an adaptation of Bachman et al.’s
(1995) rating scale for content analysis, these experts were asked to evaluate the content
of 35 discrete multiple-choice items. These items were collated from 15 past TOEFL
(Test of English as a Foreign Language) Structure and Written Expression questions and
20 past TEEP (Test of English for Educational Purposes) grammar items. Judges were
given this new test and were asked to indicate against each item whether they thought it
was mainly testing (a) syntactic knowledge, (b) lexico-semantic knowledge, or (c)
sentence comprehension (Shiotsu, 2010). Lexico-semantic knowledge was loosely
defined as “knowledge of the meanings of certain words and phrases” (Shiotsu, 2010, p.
61), syntactic knowledge as “knowledge of sentence structures and that of acceptable
sequences and forms of words in terms of syntax” (p. 61) and sentence comprehension
as “understanding of the meaning of the overall sentence” (p. 61). No prototypical exam-
ples of the individual categories were provided to the judges. A summary of the judges’
responses is displayed in Table 1, where category A stands for syntactic knowledge, B for
lexico-semantic knowledge, and C for sentence comprehension.

Table 1.  Results of Shiotsu’s original content analysis study (Shiotsu, 2010, p. 63).

Shiotsu’s results (N = 14)

Item A B C
1 12 2 0
2 12 1 1
3 11 0 3
4 7 5 2
5 13 1 0
6 7 3 4
7 11 2 1
8 9 1 4
9 9 4 1
10 6 4 4
11 12 2 0
12 5 8 1
13 13 1 0
14 7 5 2
15 11 1 2
16 9 3 2
17 12 1 1
18 5 9 0
19 10 0 4
20 10 0 4
21 4 0 8
22 11 0 2
23 9 4 1
24 9 1 3
25 8 3 3
26 10 1 3
27 12 1 1
28 8 0 5
29 10 3 1
30 10 1 2
31 7 6 1
32 13 1 0
33 9 4 1
34 11 2 1
35 9 3 1

Of the total 483 ratings,1 331 were for syntactic knowledge (68.5%), indicating that
this 35-item test tends to be one of syntactic knowledge overall. Shiotsu’s rationale for
the inclusion of items in the final test is that any item for which the syntax category did
not receive the highest number of votes should be excluded from the test as it appears to
be testing something other than syntactic knowledge. For this reason, items 12, 18 and
21 were eliminated in the main study on the basis of these findings from the preliminary
study (items highlighted in grey).
However, one could argue that this logic is hardly convincing and that only items
that a majority of judges (in this case at least eight) classified as mainly testing
syntactic knowledge should legitimately be included in the syntactic knowledge meas-
ure. This would mean that another five items would need to be eliminated from further
analyses (items hatched). One could thus claim that for items 4, 6, 14 and 31, the judges
were clearly undecided, while for item 10 the majority of judges thought that it was testing
something other than syntactic knowledge. Importantly, therefore, it remains to be seen
whether scores from a syntax test consisting of the 27 remaining items would have
yielded findings similar to Shiotsu’s results and claims from his main study.2
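
To make the two decision rules concrete, the following minimal sketch (an illustration added here, not part of Shiotsu’s procedure; the vote counts are copied from Table 1) applies both rules to a few items:

```python
# Vote counts (A = syntactic, B = lexico-semantic, C = sentence comprehension)
# for four illustrative items from Table 1 (N = 14 judges).
votes = {4: (7, 5, 2), 10: (6, 4, 4), 12: (5, 8, 1), 21: (4, 0, 8)}

def keep_plurality(a: int, b: int, c: int) -> bool:
    """Shiotsu's rationale: keep the item if syntax (A) received the most votes."""
    return a > b and a > c

def keep_majority(a: int, b: int, c: int, n_judges: int = 14) -> bool:
    """Stricter alternative: keep only if A holds an absolute majority (>= 8 of 14)."""
    return a > n_judges / 2

for item, (a, b, c) in sorted(votes.items()):
    print(item, keep_plurality(a, b, c), keep_majority(a, b, c))
# Item 4 (7-5-2) is kept under the plurality rule but dropped under the majority
# rule; items 12 and 21 are dropped under both, as in the original study.
```
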
To investigate this further, five replications of the original study were conducted,
three times with different groups of judges using the same methodology as the original
study, twice using a slightly modified methodological approach. The results and implica-
tions of these studies will be presented in the following.

First replication study


In order to subject the findings of Shiotsu’s preliminary study to closer inspection, his
content analysis study using expert judges was replicated. Twenty-one international par-
ticipants in a workshop on diagnostic reading assessment, all involved in language test
development at a national or institutional level, of whom 15 were also teaching at a
university level, were asked to judge each of the 35 individual items as to whether they
thought it was mainly testing (a) syntactic knowledge, (b) lexico-semantic knowledge, or
(c) sentence comprehension.
The judgment procedure of the original study was replicated with no modifications so
as to highlight the potentially problematic nature of the distinction between the categories
offered by Shiotsu. Although Weir, Hughes and Porter (1990) as well as Lumley (1993)
maintain that inter-judge agreement could be increased by training the experts, this pro-
cedure was deliberately avoided in all replication studies as the authors did not want to
clone judges through training (Alderson, 2000) but establish the problematic nature of
the construct. Alderson (2000) further states that a forced agreement through
discussion in such a study would only show “the success of the cloning process” (p. 96)
but not provide an unbiased picture of what experts actually thought the items were
testing; hence, just as in the reference study, no training, discussion or category modifica-
tion was offered to the judges. The results of this study can be found in Table 2.

Table 2.  Results of the first replication study.


Replication study 1 (N = 21)

Item A B C A/B A/C


1 20 1 – – –
2 11 6 4 – –
3 15 0 6 – –
4 9 11 – 1 –
5 14 5 2 – –
6 15 4 2 – –
7 13 – 8 – –
8 16 – 5 – –
9 12 6 3 – –
10 8 11 2 – –
11 14 5 2 – –
12 3 15 3 – –
13 16 3 2 – –
14 5 13 1 1 1
15 19 1 1 – –
16 13 7 1 – –
17 13 7 1 – –
18 8 13 – – –
19 13 – 8 – –
20 13 – 8 – –
21 8 1 12 – –
22 18 – 3 – –
23 15 3 3 – –
24 15 – 6 – –
25 6 4 11 – –
26 13 2 6 – –
27 18 3 – – –
28 7 – 14 – –
29 15 4 2 – –
30 12 5 4 – –
31 5 13 3 – –
32 18 – 3 – –
33 10 6 5 – –
34 11 1 9 – –
35 13 4 4 – –

The replication study confirmed the legitimacy of Shiotsu’s exclusion of items 12, 18
and 21 from the test. While the group of judges also found that this was in principle a test
of syntax (437 out of 735 ratings included the syntax category), nine items clearly emerged
as problematic (items 4, 10, 12, 14, 18, 21, 25, 28 and 31), even if Shiotsu’s questionable
rationale for item exclusion was applied. In addition, item 33 does not show a clear ten-
dency or majority vote in terms of its categorization. This replication study therefore
indicates that items 4, 10, 14, 25, 28, 31 and 33, which were all included in Shiotsu’s original
test, cannot be unambiguously justified in this test of syntactic knowledge.

Second replication study


An international group of 19 language testing experts, all with at least a master’s degree
in linguistics or applied linguistics, was identified as potential judges for a second replica-
tion study. Employing the identical methodology as in the original and the first replication
study, the test was sent out via email to the experts, 16 of whom returned the completed
content analysis to the researchers. Of these 16 responses, two judges had to be removed
from the analysis as it emerged from their responses and comments that they had not
adhered to the instructions provided. The 14 remaining judges all had at least five years of
professional experience in language testing; nine of them were L1 English speakers, and
eight of the 14 held a PhD in linguistics, applied linguistics or language testing.
However, as Weir, Hughes and Porter (1990) state, “experience does not guarantee relia-
bility of judgment” (p. 507) and it has yet to be shown which kind of expert group makes
the best judges of item content. The number of judges in the second replication study was
thus identical to the number of judges in the original study (N = 14). For this reason and
also because different groups of expert judges came up with different results, the results
of the individual studies were not collated but are reported separately in what follows. A
summary of the results from the second replication study can be found in Table 3.

Table 3.  Results of the second replication study.

Replication study 2 (N = 14)

Item A B C A/B A/C A/B/C B/C n.a.
1 13 1 – – – – – –
2 9 5 – – – – – –
3 11 1 2 – – – – –
4 8 6 – – – – – –
5 10 4 – – – – – –
6 11 2 – 1 – – – –
7 13 1 – – – – – –
8 12 1 1 – – – – –
9 4 9 1 – – – – –
10 7 3 3 – – – – 1
11 8 6 – – – – – –
12 4 10 – – – – – –
13 12 1 1 – – – – –
14 7 7 – – – – – –
15 11 2 – – – – 1 –
16 8 6 – – – – – –
17 9 4 1 – – – – –
18 5 9 – – – – – –
19 11 1 2 – – – – –
20 12 1 1 – – – – –
21 5 1 6 – – 1 – 1
22 12 1 1 – – – – –
23 10 2 1 1 – – – –
24 11 – 2 1 – – – –
25 5 3 6 – – – – –
26 11 1 1 – 1 – – –
27 10 4 – – – – – –
28 7 1 6 – – – – –
29 9 4 1 – – – – –
30 10 4 – – – – – –
31 7 5 2 – – – – –
32 11 1 1 1 – – – –
33 6 4 4 – – – – –
34 11 1 1 1 – – – –
35 12 2 – – – – – –

Of the total 497 ratings,3 329 included the category syntactic knowledge (66.2%),
confirming Shiotsu’s finding that this 35-item test tends to be one of syntactic knowledge
overall. In addition, Shiotsu’s elimination of items 12, 18 and 21 appears to be justified
as they were judged to be testing something other than syntactic knowledge.
However, applying Shiotsu’s rationale for excluding items if the number of votes for the
syntax category is not the highest of the three categories, three more items would have to be
eliminated from further analyses on the basis of the results of this replication study. Items 9,
14 and 25 emerge as being non-syntax items from the second replication study judgments.
Applying the alternative principle that a majority decision (eight judges or more) is neces-
sary for the inclusion of an item in the syntax test, another four items would not meet this
criterion and would have to be eliminated. The summary of judgments for items 10, 28, 31
and 33 does not provide conclusive evidence that these are syntax items. It is therefore
questionable whether these items are worth including in a measure of syntactic knowledge.
This finding seems not only to buttress Alderson’s (1993a) above-mentioned caution but
also to relativize and undermine Shiotsu’s results and the claims of the main study.

Third replication study


The results of the second replication study were then compared with another study by
Alderson (2011) who also attempted to replicate Shiotsu’s original content analysis

Downloaded from ltj.sagepub.com at UNIV OF SOUTHERN CALIFORNIA on April 4, 2014


542 Language Testing 30(4)

Table 3.  Results of the second replication study.

Replication study 2 (N = 14)

Item A B C A/B A/C A/B/C B/C n.a.


1 13 1 – – – – – –
2 9 5 – – – – – –
3 11 1 2 – – – – –
4 8 6 – – – – – –
5 10 4 – – – – – –
6 11 2 – 1 – – – –
7 13 1 – – – – – –
8 12 1 1 – – – – –
9 4 9 1 – – – – –
10 7 3 3 – – – – 1
11 8 6 – – – – – –
12 4 10 – – – – – –
13 12 1 1 – – – – –
14 7 7 – – – – – –
15 11 2 – – – – 1 –
16 8 6 – – – – – –
17 9 4 1 – – – – –
18 5 9 – – – – – –
19 11 1 2 – – – – –
20 12 1 1 – – – – –
21 5 1 6 – – 1 – 1
22 12 1 1 – – – – –
23 10 2 1 1 – – – –
24 11 – 2 1 – – – –
25 5 3 6 – – – – –
26 11 1 1 – 1 – – –
27 10 4 – – – – – –
28 7 1 6 – – – – –
29 9 4 1 – – – – –
30 10 4 – – – – – –
31 7 5 2 – – – – –
32 11 1 1 1 – – – –
33 6 4 4 – – – – –
34 11 1 1 1 – – – –
35 12 2 – – – – – –

study. The results of this study, in which 14 university professors of applied linguistics
participated, are shown in Table 4.

Table 4.  Results of the third replication study.

Replication study 3 (N = 14)

Item A B C A/B B/A B/C A/C C/A C/B C/A/B A/B/C C/B/A A? ?
1 10 1 – 2 1 – – – – – – – –
2 7 3 1 1 1 – 1 – – – – – –
3 12 – 2 – – – – – – – – – –
4 5 4 1 2 1 – – – 1 – – – – –
5 9 2 – 1 – – 1 – – – – 1 –
6 12 1 – – – – – – – – – 1 –
7 9 1 – – – – 3 – – – – 1 –
8 12 1 – – – – – 1 – – – – –
9 7 4 – 3 – – – – – – – – –
10 6 5 – 2 1 – – – – – – – –
11 6 3 – 2 2 – – – – – – 1 –
12 2 10 1 1 – – – – – – – – –
13 13 – – – – – – – – – – 1 –
14 4 8 – – 1 1 – – – – – – –
15 11 1 – – – – – 1 – – – 1 –
16 9 1 1 2 – – – – – – – 1 –
17 9 2 – 3 – – – – – – – – –
18 5 6 – – 1 2 – – – – – – –
19 13 – 1 – – – – – – – – – –
20 12 – 1 – – – 1 – – – – – –
21 3 – 5 – – – – 1 – – – – 5
22 9 – 2 – – – 2 1 – – – – –
23 10 – 2 – – – 2 – – – – – –
24 12 – 1 – – – – 1 – – – – – –
25 4 2 3 3 1 – – – 1 – – – – –
26 7 – 1 1 – – 3 – 1 – 1 – – –
27 12 – – – – – 1 – – – – 1 – –
28 6 – 2 2 – – 1 2 – 1 – – – –
29 10 – 2 – – – 2 – – – – – – –
30 10 – 2 1 – – 1 – – – – – – –
31 4 3 2 2 – – – – 2 – 1 – – –
32 11 – 1 – – – 1 1 – – – – – –
33 5 2 1 3 – – 2 – 1 – – – – –
34 10 – 2 – – – – – – – 2 – – –
35 8 – – 1 – – 2 1 1 – 1 – – –

Table 4 probably best illustrates the insecurity of judges with this categorization task.
Again, the findings seem to suggest that this 35-item test is in general one of syntactic
knowledge, albeit not as convincingly as the results of the studies already discussed. Of the
490 ratings, 294 included the syntax category (60%). As in both the original study and the
second replication study, the judgments for items 12, 18 and 21 clearly suggest that they should
be excluded from this test of syntax as they are tapping into a different reading component.

However, as in the second replication study, ratings for item 14 would also suggest
that, adhering to Shiotsu’s logic, this item should be eliminated as it is not clearly and
convincingly an item testing syntactic knowledge. Items 9 and 25, identified as possible
drop-outs in the second replication study, would not have to be eliminated on the basis of
the judgments of the third replication study if Shiotsu’s principle for inclusion was applied.
However, neither of these items seems to be a clear-cut syntax item, as Table 4 sug-
gests. Applying the alternative principle of majority decisions to the judgments, these
two (items 9 and 25) and another eight items (items 2, 4, 10, 11, 26, 28, 31 and 33) would
emerge as problematic, four of which (items 10, 28, 31 and 33) were also identified as
problematic in the second replication study.

Fourth replication study – approximate replication with alternative rating approach
In a further study, the same test was sent out to a different group of expert judges, this time
with a different rating grid. Since the two categories ‘syntactic knowledge’ and ‘lexico-
semantic knowledge’ appeared to be predominantly used in the earlier replication studies,
only these two categories were selected for study. Nineteen judges, comparable in exper-
tise to those in the second replication study, were asked via email to indicate against each
of the items what they thought each item was mainly testing, grading their judgment on a
continuum of a 6-point Likert scale ranging from (1) ‘mainly syntactic knowledge’ to (6)
‘mainly lexico-semantic knowledge’. Eleven completed content analyses were returned to
the researchers. One judge had to be removed from the file as the ratings and comments
suggested that the instructions provided had not been adequately followed. The summary
of the Likert scale judgments (mean scores) is shown in Table 5.

Table 5.  Results of the fourth replication study.

Replication study 4 (N = 10)

Item Mean Item Mean Item Mean
1 2.7 13 1.2 25 3.9
2 3.8 14 4.75 26 1.9
3 2 15 1.5 27 1.8
4 2.8 16 3.1 28 2
5 1.5 17 2.6 29 2.4
6 3.2 18 3.7 30 2
7 1.3 19 1.2 31 4.7
8 1.3 20 1.3 32 1.9
9 3.5 21 1.7 33 3.8
10 3.1 22 1.1 34 2
11 2.8 23 1.8 35 2
12 4.3 24 1.3

Since the rating scale ranged from 1 to 6, 3.5 was chosen as the cut score. An item
showing a mean rating higher than 3.5 indicates that it tends to the lexico-semantic end
of the continuum and thus suggests that it should not be included in a syntactic knowl-
edge measure. Again, the results suggest that items 12 and 18 should be removed from
the test as they are testing lexico-semantic knowledge rather than syntactic knowledge.
Item 21, identified as a drop-out in the other investigations, does not emerge as problem-
atic from these results. This, however, may be due to the fact that it was classified as a
‘sentence comprehension’ item in the other studies, for which there was no category
available in this rating procedure.
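
To illustrate the cut-score rule, here is a minimal sketch; the ratings shown are hypothetical, not the judges’ actual data:

```python
from statistics import mean

# Midpoint of the 6-point scale (1 = mainly syntactic, 6 = mainly lexico-semantic).
CUT = 3.5

# Hypothetical ratings from ten judges for two invented example items.
ratings = {
    "item leaning lexico-semantic": [5, 5, 4, 5, 5, 5, 4, 5, 5, 5],
    "clear syntax item": [1, 1, 1, 2, 1, 1, 1, 1, 1, 1],
}

for label, rs in ratings.items():
    m = mean(rs)
    verdict = "exclude (lexico-semantic)" if m > CUT else "retain (syntactic)"
    print(f"{label}: mean = {m:.2f} -> {verdict}")
```
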
However, item 14, identified as a potential drop-out in both the first and the second
replication study and not convincingly justified as a syntax item in the original study,
clearly emerges as a problematic item with a mean rating of 4.75. This strongly suggests
that this item should be removed from the syntactic knowledge measure. Items 9, 25, 31
and 33, highlighted as potentially problematic in both the second and the third replication
study, also have values well above the cut-score and might thus not be justifiable as items
testing syntactic knowledge. The previous findings for item 28, which suggested that this
item might also be removed from the syntactic knowledge measure, could not be con-
firmed in the fourth replication study. For item 2 the results of the third replication study,
in which this item was identified as a potential drop-out, could be confirmed.

Fifth replication study


In a fifth replication study, the same group of judges as in replication study 1 were asked
to judge the content of the test items again, this time classifying items according to the
rating grid of replication study 4. The results are displayed in Table 6.

Table 6.  Results of the fifth replication study.

Replication study 5 (N = 21)

Item Mean Item Mean Item Mean
1 2.19 13 2.19 25 3.76
2 2.62 14 4.14 26 2
3 1.71 15 2.05 27 2.48
4 3.19 16 2.95 28 3.29
5 1.86 17 2.57 29 2
6 2.52 18 3.9 30 2.43
7 2 19 1.67 31 3.76
8 1.38 20 1.71 32 1.86
9 3.05 21 2.9 33 3.1
10 3.48 22 1.76 34 2.43
11 3.33 23 2.05 35 2.33
12 3.95 24 1.71

Again the cut-score was set at 3.5, as in the fourth replication study: an item
showing a mean rating higher than 3.5 was taken to be testing lexico-
semantic knowledge rather than syntactic knowledge. Interestingly, three items fewer
than in the fourth replication study were found to be problematic. Items 2, 9 and 33 were
rated by this group as tending to test syntactic knowledge. This echoes the judgments of
this group on these items using the original classification grid. Also, item 10, identified
clearly by the group as a lexico-semantic item in the first replication study, is just below
the stipulated cut with a mean rating of 3.48.

Overview of results
Table 7 shows which items were identified as potential candidates for exclusion from the
test of syntactic knowledge in question by the five replication studies outlined above.
‘xx’ marks items that should be excluded according to the original rationale of Shiotsu,
that is, if the syntax category did not receive the highest number of votes. ‘x’ marks
items that should be excluded according to an alternative principle, that is, if the syntax
category did not receive the majority of votes. ‘?’ marks problematic items identified in
studies four and five, which employed a slightly different methodology and thus also a
different rationale for item exclusion.

Table 7.  Overview of results.

Problematic items identified by replication studies

Item Shiotsu study Study 1 Study 2 Study 3 Study 4 Study 5
1 – – – – – –
2 – – – x ? –
3 – – – – – –
4 x xx – x – –
5 – – – – – –
6 x – – – – –
7 – – – – – –
8 – – – – – –
9 – – xx x ? –
10 x xx x x – –
11 – – – x – –
12 xx xx xx xx ? ?
13 – – – – – –
14 x xx xx xx ? ?
15 – – – – – –
16 – – – – – –
17 – – – – – –
18 xx xx xx xx ? ?
19 – – – – – –
20 – – – – – –
21 xx xx xx xx – –
22 – – – – – –
23 – – – – – –
24 – – – – – –
25 – xx xx x ? ?
26 – – – x – –
27 – – – – – –
28 – xx x x – –
29 – – – – – –
30 – – – – – –
31 x xx x x ? ?
32 – – – – – –
33 – x x x ? –
34 – – – – – –
35 – – – – – –

Contrary to Shiotsu’s study, which only yielded three problematic items, the five replica-
tion studies, conducted using methods of gathering expert judgments in content analyses
identical or similar to those of the original study, found that only 20 items out of 35
emerged as unproblematic, clear syntax items across all studies. Fifteen items in total
were shown to require further scrutiny as they could clearly be questioned: items 2, 4, 6,
9, 10, 11, 12, 14, 18, 21, 25, 26, 28, 31 and 33 (see Appendix).
Three items (14, 25 and 31), which were not excluded in the original study, were identi-
fied as problematic by all five replication studies. As all five replication studies suggest that
these three items cannot justifiably be maintained as syntax items, the legitimacy of the test
in question and all results and claims based on it are questioned. An exclusion of at least
these three items (in addition to the three originally excluded items) and a subsequent re-
run of the original analysis, as well as a comparison of results against a re-analysis using
the 20 remaining syntactic items, appears necessary as different findings might result.

How many judges are sufficient?


An important issue to address when using expert judgments is how many judges are
needed to reliably classify items. Obviously, the larger the sample of judgments and the
more agreement there is between judges the more accurate the classifications will be.
With binary judgment responses, such as ‘does this item test this construct yes/no’, the
properties of the binomial distribution can be used to allow the application of inferential
statistics to the issue, via one-tailed lower-bound confidence intervals.
Due to the relatively small sample sizes inherent in the collection of judgment data the
score confidence interval (Wilson, 1927) should be used to provide the intervals, over the
more usual Wald test, in line with the recommendations of Agresti & Coull (1998).
Table 8 gives the approximate 0.05 lower bound of the one-tailed score confidence inter-
val for various numbers of judges at various levels of agreement. In order to use the table
to assess the levels of agreement we have, we need to select a cut-off value for the mini-
mum amount of agreement acceptable. In line with the studies in this paper, a value of 50%
minimum agreement has been chosen.
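
The values in Table 8 are consistent with the one-tailed lower limit of the Wilson (1927) score interval evaluated at each column’s threshold proportion. On that assumption, for an agreement proportion $\hat{p}$ among $n$ judges and a one-tailed critical value $z$ ($\approx 1.645$ at the 0.05 level), the bound is

$$
L = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
$$

For instance, $\hat{p} = 0.80$ with $n = 8$ gives $L \approx 0.51$, matching the 80%+ column in the eight-judges row.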

Table 8.  One-tailed lower 0.05 score confidence interval.

Number of judges Percentage agreement

100% 90%+ 80%+ 70%+ 60%+ 50%+


5 65% 53% 44% 35% 27% 20%
6 69% 57% 47% 37% 29% 22%
7 72% 59% 49% 40% 31% 24%
8 75% 62% 51% 41% 33% 25%
9 77% 64% 53% 43% 34% 26%
10+ 79% 65% 54% 44% 35% 27%
15+ 85% 71% 59% 49% 39% 30%
20+ 88% 74% 62% 52% 42% 33%

In order to accept that an item has been reliably classified as having 50% minimum
agreement, with a level of confidence of 0.05, the value in the cell for the lower bound of
the confidence interval must be 50% or greater.
For example, if six out of eight judges (75%) agreed that an item tested a specific
construct, a value of ~41% would be read from Table 8 (down the 70%+ column, in the
row for eight judges). Thus, we could not say, with a level of confidence of
0.05, that if we collected more data the value for agreement would not drop below 50%.
So either more data must be collected, or the judges should not be considered to be pro-
viding sufficient evidence that the item is testing the specific construct, given the cut-off
of 50%. However, if seven out of eight judges (88%) agreed then a value of ~51% would
be read and we could accept the evidence, with a 95% level of confidence, that the item
tested the specific construct.
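
The following sketch (a minimal illustration, again assuming Table 8 evaluates the Wilson bound at each column’s threshold proportion) reproduces the worked examples above:

```python
import math

def wilson_lower_bound(p_hat: float, n: int, z: float = 1.6449) -> float:
    """One-tailed lower limit of the Wilson (1927) score interval.

    p_hat: threshold (or observed) proportion of judges agreeing
    n:     number of judges
    z:     one-tailed critical value (1.6449 for the 0.05 level)
    """
    centre = (p_hat + z * z / (2 * n)) / (1 + z * z / n)
    half_width = (z / (1 + z * z / n)) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return centre - half_width

# The worked examples above, with eight judges:
print(round(wilson_lower_bound(0.70, 8), 2))  # 0.41 -> 6/8 cannot secure the 50% cut-off
print(round(wilson_lower_bound(0.80, 8), 2))  # 0.51 -> 7/8 can

# The minimum proportions noted below (5/5, 8/10, 14/20) all just clear 50%:
for k, n in [(5, 5), (8, 10), (14, 20)]:
    print(k, "of", n, wilson_lower_bound(k / n, n) >= 0.5)  # True in each case
```
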
This method of assessing agreement, if it were adopted as a standard, would have two
main benefits. First of all, it would provide guidance to researchers as to the number of
judges that they need for acceptable levels of confidence in their classifications. And,
secondly, it would provide a standardized tool for researchers using judgments which
would allow better comparability between the judgments gathered by different studies.
However, it should be noted that this method tends towards the conservative; an observed
agreement of 50% will never yield an acceptable result and minimum proportions of 5/5,
8/10 and 14/20 are required to accept agreements with a 0.05 level of confidence.4

Discussion and conclusion


These results have several implications, not only for the study in question by Shiotsu
(2010). One implication is renewed caution against conducting content analyses and con-
tent validation through expert judgments alone. The second inference to be drawn from
the five replication studies is that a much clearer definition is needed of what construct
is being tested for both syntactic knowledge and lexico-semantic knowledge, at the very
least for diagnostic purposes.
The first implication is not a novel finding. However, it is a finding that remains highly
relevant and worth replicating, given the fact that expert judgments
are still used frequently, if not exclusively, to validate the content or construct of tests. If
a sufficient number of studies have already made the problematic nature of ‘experts’ and
their judgments in language testing clear, the field needs to ask itself why it is that
‘expert’ judgments are still often solely relied upon in these matters. The question
becomes even more pertinent when considering that the literature has, over the past dec-
ades, suggested several alternative (statistical) methods of content validation (Buck &
Tatsuoka, 1998; Lee & Sawaki, 2009, who propose using Q-matrices of ‘expert’ judg-
ments against item attributes). But a Q-matrix is, after all, nothing more than a collection
of human judgments, and if the human judgments comprising the Q-matrix are incorrect,
the resulting diagnostic classifications will also be incorrect. However, it is rare for other
empirical but non-statistical approaches to be employed instead of ‘expert’ judgments.
The fact that the use of ‘expert’ judgments continues to be a widespread and recom-
mended procedure (e.g. in large-scale assessments such as OECD PISA in 2000, 2003,
2006, 2009, as well as 2012) indicates that the findings of the present studies are still
worth discussing in order to further raise awareness.
In addition, the qualifications, degree of expertise and reliability of judges in content
validation studies need to be problematized as well. It is not at all clear exactly what
criteria should be used to qualify judges as ‘experts’, and the authors are not aware of any
studies, including the reference study, that would provide supporting evidence for such
criteria or qualificatory credentials.
The fact that all ‘experts’ employed in the replication studies held a degree in linguis-
tics or applied linguistics should ensure that the judgment categories were familiar and
that “their qualification for the task should not be in doubt” (Weir, Hughes & Porter,
1990, p. 507), but the different amounts of experience within the expert groups should be
acknowledged as a limitation of all such studies to date.
The second, arguably more important and more interesting insight concerns the con-
struct of ‘grammar’ and, potentially, also ‘L2 reading ability’, since one would ideally
require a clear distinction to be made between vocabulary and structural knowledge, par-
ticularly for diagnostic assessment. Such a clear distinction between syntactic and lexico-
semantic knowledge, however, might be neither achievable nor desirable since several
linguists have argued for abandoning the vocabulary–grammar dichotomy. Although lexis
and grammar have traditionally been kept apart, evidence from corpus linguistics suggests
that vocabulary and grammar, because of the highly patterned structure of language, “are
in fact inseparable” (Römer, 2009, p. 141). Römer argues that the traditional grammar–
lexicon dichotomy “may hold true for sentences which have been invented in order to
illustrate it, but it collapses when we consult real language data” (2009, p. 142). Lewis
(1993) claims that “the grammar/vocabulary dichotomy is invalid” (p. vi) and argues that
“language consists of grammaticalised lexis” (p. vi). Lewis (1993) further maintains that
“dichotomies simplify, but at the expense of suppression” (p. 37) and suggests placing
“lexical items” (p. 89), that is, words, multi-word units, polywords (e.g. phrasal verbs) or
collocations, on a cline or scale instead. Nattinger and DeCarrico (1992) claim that “lexi-
cal phrases [are] form/function composites, lexico-grammatical units that occupy a posi-
tion somewhere between the traditional poles of lexicon and syntax” (p. 36). Sinclair
(2004) asserts that “so strong are the co-occurrence tendencies of words, word classes,
meanings and attitudes that we must widen our horizons and expect the units of meaning
to be much more extensive and varied than is seen in a single word” (p. 39), suggesting
that traditional tests of vocabulary employed to investigate the contribution of vocabulary
knowledge to reading ability only paint half the picture and that “lexicogrammar”
(Sinclair, 2004, p. 39) should perhaps instead be treated as a unitary component of reading
ability rather than attempting to distinguish between vocabulary and grammar.
This concern about the real divisibility of the two components has already been raised
by Shiotsu and Weir (2007) and Brunfaut (2008) in comments on the original study in
question. Findings from other studies investigating the relative contribution of vocabu-
lary knowledge and grammar knowledge appear to confirm that the relation between
syntax and lexis is a continuum, as researchers have consistently found high correlations
between the two components (Brisbois, 1995; Shiotsu & Weir, 2007; Brunfaut, 2008). It
might therefore be of interest for future research to construct tests of lexis along more
phraseological lines and to examine tests of formulaic sequences or multi-word units
to see whether they would account for the same amount of variance in reading test per-
formances as traditional vocabulary and grammar measures taken together. In any case,
testers and applied linguists need to recognize the slipperiness of the slope between the
constructs and need to qualify or describe their dichotomies. Using Likert scales symbol-
izing this continuum as operationalized in replication studies 4 and 5 instead of categori-
cal classifications might be a first step towards this but further replication studies and
increased problematization of judgments employed in research are needed.
Most importantly, future research needs to define its constructs better and to avoid
simplistic statements to the effect that Grammar is more important than Vocabulary;
rather, it should make more nuanced and properly researched statements about which
aspects of which constructs seem more or less relevant to predicting reading ability in a
second language.

Acknowledgements
We wish to thank Ari Huhta and Tineke Brunfaut as well as the judges who took part in the various
replication studies, and the anonymous reviewers for their valuable feedback on earlier versions of
this paper. Part of this paper is based on a Master’s dissertation submitted to Lancaster University,
UK in December 2012.

Funding
This research received no specific grant from any funding agency in the public, commercial, or
not-for-profit sectors.

Notes
1. For some items, no response was given by some judges, which is why the total number of rat-
ings does not add up to the expected 14 × 35 = 490.
2. The findings from Shiotsu’s main study (2010) showed that “Syntactic Knowledge (β =
.73, p < .001) is the strongest predictor of the overall Passage Reading Comprehension
performance while Vocabulary Breadth (β = .13, p < .05) […] made additional but much
smaller contributions to the prediction” (p. 124f.). In Shiotsu and Weir’s (2007) study, feed-
ing partly into Shiotsu’s main study (2010), syntax was shown “to exceed vocabulary in
standardized regression weight (.61* vs. .34*), percentage of reading variance explained
(79% vs. 72%) and percentage of reading variance uniquely explained (11% vs. 4%)”
(Shiotsu & Weir, 2007, p. 114).
3. In contrast to the original study, there are seven extra ratings (14 × 35 = 490) in this study,
which is due to the fact that some raters allocated some items to two or more categories.
4. We are indebted to our colleague Gareth McCray for this solution, the rationale and the
examples.

References
Agresti, A., & Coull, B. A. (1998). Approximate is better than ‘exact’ for interval estimation of
binomial proportions. The American Statistician, 52(2), 119–126.
Alderson, J.C. (1993a). Judgments in language testing. In D. Douglas & C. Chapelle (Eds.), A new
decade of language testing research (pp. 46–57). Alexandria, VA: TESOL.
Alderson, J.C. (1993b). The relationship between grammar and reading in an English for academic
purposes test battery. In D. Douglas & C. Chapelle (Eds.), A new decade of language testing
research (pp. 203–219). Alexandria, VA: TESOL.
Alderson, J.C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Alderson, J.C. (2011). Investigating content analysis judgments. Unpublished manuscript.
Alderson, J. C., Brunfaut, T., McCray, G., & Nieminen, L. (2012). Components of reading in first
and second language, test item difficulty and overall reading ability. Paper presented at AAAL,
Boston, MA, March 24–27.
Bachman, L.F., Davidson, F., Lynch, B., & Ryan, K. (1989). Content analysis and statistical mod-
eling of EFL proficiency tests. Paper presented at The 11th Annual Language Testing Research
Colloquium, San Antonio, Texas.
Bachman, L.F., Davidson, F., Ryan, K., & Choi, I.C. (1995). An investigation into the compara-
bility of two tests of English as a foreign language. Cambridge: Cambridge University Press.
Bachman, L.F., Davidson, F., & Milanovic, M. (1996). The use of test method characteristics in
the content analysis and design of EFL proficiency tests. Language Testing, 13, 125–150.
Baddeley, A., Logie, R., Nimmo-Smith, I., & Brereton, N. (1985). Components of fluent reading.
Journal of Memory and Language, 24, 119–131.
Beck, I.L., & McKeown, M. (1991). Conditions of vocabulary acquisition. In R. Barr, M.L. Kamil,
P. Mosenthal & P.D. Pearson (Eds.), Handbook of Reading Research, Vol. II (pp. 789–824).
Mahwah, NJ: Lawrence Erlbaum.
Bossers, B. (1989). Lezen in de tweede taal: een taal- of leesprobleem? [Reading in the second
language: a language or reading problem?] Toegepaste Taalwetenschap in Artikelen, 38,
176–188.
Brisbois, J.E. (1995). Connections between first- and second-language reading. Journal of Read-
ing Behavior, 27, 565–584.
Brunfaut, T. (2008). Foreign language reading for academic purposes. Students of English (native speak-
ers of Dutch) reading English academic texts. Unpublished PhD thesis, University of Antwerp.
Brunfaut, T. (2009). The relative contribution of grammar and vocabulary to explaining reading
test performance. Paper presented at the 6th Annual Conference of EALTA, Turku, Finland.
Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing:
examining attributes of a free response listening test. Language Testing, 15(2), 119–157.
Cunningham, A.E., Stanovich, K. E., & Wilson, M.R. (1990). Cognitive variation in adult college
students differing in reading ability. In T.H. Carr & B.A. Levy (Eds.), Reading and its develop-
ment: Component skills approaches. San Diego, CA: Academic Press.
Daneman, M. (1991). Individual differences in reading skills. In R. Barr, M.L. Kamil, P. Mosen-
thal & P.D. Pearson (Eds.), Handbook of reading research, Vol. II (pp. 512–538). Mahwah,
NJ: Lawrence Erlbaum.
Dixon, P., LeFevre, J.A., & Twiley, L.C. (1988). Word knowledge and working memory as predic-
tors of reading skill. Journal of Educational Psychology, 80(4), 465–472.
Grabe, W. (1991). Current developments in second-language reading research. TESOL Quarterly,
25(3), 375–406.
Hacquebord, H. (1989). Tekstbegrip van Turkse en Nederlandse leerlingen in het voortgezet
onderwijs. [Reading Comprehension of Turkish and Dutch Students Attending Secondary
Schools] Groningen: RUG.
Kaivanpanah, S., & Zandi, H. (2009). The role of depth of vocabulary knowledge in reading com-
prehension in EFL contexts. Journal of Applied Sciences, 9, 698–706.
Laufer, B. (1992). Reading in a foreign language: how does L2 lexical knowledge interact with the
reader’s general academic ability? Journal of Research in Reading, 15, 95–103.
Lee, Y., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL reading and
listening assessments. Language Assessment Quarterly, 6, 239–263.
Lewis, M. (1993). The lexical approach: the state of ELT and a way forward. London: Language
Teaching Publications.
Lumley, T. (1993). Reading comprehension sub-skills: Teachers’ perceptions of content in an EAP
test. Melbourne Papers in Language Testing, 2(1), 25–60.
Nattinger, J., & DeCarrico, J. (1992). Lexical phrases and language teaching. Oxford: Oxford
University Press.
Qian, D.D. (2002). Investigating the relationship between vocabulary knowledge and academic
reading performance: an assessment perspective. Language Learning, 52, 513–536.
Qian, D.D., & Schedl, M. (2004). Evaluation of an in-depth vocabulary knowledge measure for
assessing reading performance. Language Testing, 21, 28–52.
Römer, U. (2009). The inseparability of lexis and grammar: corpus linguistic perspectives. Annual
Review of Cognitive Linguistics, 7, 141–163.
Schoonen, R., Hulstijn, J., & Bossers, B. (1998). Metacognitive and language-specific knowledge
in native and foreign language reading comprehension: an empirical study among Dutch stu-
dents in grades 6, 8 and 10. Language Learning, 48(1), 71–106.
Shiotsu, T. (2010). Components of L2 reading: Linguistic and processing factors in the reading
test performances of Japanese EFL learners. Cambridge: Cambridge University Press.
Shiotsu, T., & Weir, C.J. (2007). The relative significance of syntactic knowledge and vocabulary
breadth in the prediction of reading comprehension test performance. Language Testing, 24,
99–128.
Sinclair, J.McH. (2004). Trust the text: Language, corpus and discourse. London: Routledge.
Weir, C.J., Hughes, A., & Porter, D. (1990). Reading skills: Hierarchies, implicational relation-
ships and identifiability. Reading in a Foreign Language, 7(1), 505–510.
Wilson, E. B. (1927). Probable inference, the Law of Succession, and statistical inference. Journal
of the American Statistical Association, 22, 209–212.
Yamashita, J. (1999). Reading in a first and a foreign language: a study of reading comprehension
in Japanese (the L1) and English (the L2). Unpublished PhD thesis, Lancaster University.

Appendix A:  Problematic test items (Shiotsu, 2010)

2. The metal was ______ hot that he couldn’t touch it.
A. very B. too C. so D. extremely

4. ______ many years he studied hard for his doctorate.
A. During B. For C. Since D. From

6. My research findings were not ______ to be published.
A. interesting so B. interesting enough
C. enough interesting D. so interesting

9. ______ a pity you did not check the figures with your partner.
A. What’s B. That’s C. There’s D. It’s

10. The penguin is a bird adapted to life ______ on land and in water.
A. both B. not only C. and D. either

11. My results are the same ______ yours.
A. that B. as C. than D. like

12. ______ how hard he worked, his tutor never commented on it.
A. Of no account B. No matter C. Without regard D. Mindless

14. I ______ to finish my thesis next year.
A. intend B. think C. decide D. will

18. I am taller than you ______ three inches.
A. with B. by C. of D. in

21. At thirteen ______ at a district school near her home, and when she was fifteen, she saw her first article in print.
A. the first teaching position that Mary Hawes had
B. the teaching position was Mary Hawes’ first
C. when Mary Hawes had her first teaching position
D. Mary Hawes had her first teaching position

25. ______ some mammals came to live in the sea is not known.
A. Which B. Since C. Although D. How

26. ______ their nests well, but also build them well.
A. Not only brown thrashers protect B. Protect not only brown thrashers
C. Brown thrashers not only protect D. Not only protect brown thrashers

28. Biochemists use fireflies to study bioluminescence, ______ .
A. the heatless light given off by certain plants and animals
B. certain plants and animals give off the heatless light
C. which certain plants and animals give off the heatless light
D. is the heatless light given off by certain plants and animals

31. Conifers first appeared on the Earth ______ the early Permian period, some 270 million years ago.
A. when B. or C. and D. during

33. ______ a baby turtle is hatched, it must be able to fend for itself.
A. Not sooner than B. No sooner C. So soon that D. As soon as

Appendix B:  Test items judged to test syntactic knowledge (Shiotsu, 2010)

1. ______ of the students has started the course.
A. Several B. Both C. Neither D. Most

3. By the time this course finishes ______ a lot about engineering.
A. I will learn B. I learn C. I will have learnt D. I have learnt

5. We found ______ to understand his lecture.
A. difficulty B. difficult C. so difficult D. it difficult

7. As a result of his lectures she ______ by this new approach to teaching.
A. was influenced B. has influenced C. influenced D. had influenced

8. If he had known the problem, he ______ the task.
A. will not have undertaken B. had not undertaken
C. should not undertake D. would not have undertaken

13. Caramel is a brown substance ______ by the action of heat on sugar.
A. form B. forming C. formed D. forms

15. You’d better ______ to the doctor next time you feel ill.
A. to go B. going C. go D. gone

16. ______ I need is a long holiday.
A. What B. That C. Which D. The which

17. He is ______ proud man that he would rather fail than ask for help.
A. so a B. such C. a so D. such a

19. ‘Your English is very good.’
‘It should be. I ______ it ever since I started school.’
A. have been learning B. was learning
C. had learned D. had been learning

20. If only he ______ down the results when he did the experiments!
A. writes B. had written C. has written D. was writing

22. Vitamin C, discovered in 1932, ______ first vitamin for which the molecular structure was established.
A. the B. was the C. as the D. being the

23. The behavior of gases is explained by ______ the kinetic theory.
A. what scientists call B. what do scientists call
C. scientists they call D. scientists call it

24. Ironically, sails were the salvation of many steamships ______ mechanical failures.
A. they suffered B. suffered C. were suffered D. that had suffered

27. The name Nebraska comes from the Oto Indian word ‘nebrathka,’ ______ flat water.
A. to mean B. meaning C. it means D. by meaning

29. Rich tobacco and champion race horses have ______ of Kentucky.
A. long been symbols B. been long symbols
C. symbols been long D. long symbols been

30. Today’s libraries differ greatly from ______
A. the past B. those of the past C. that are past D. those past

32. There are very few areas in the world ______ be grown successfully.
A. where apricots can B. apricots can
C. apricots that can D. where can apricots

34. Tungsten, a gray metal with the ______, is used to form the wires in electric light bulbs.
A. point at which it melts is the highest of any metal
B. melting point is the highest of any metal
C. highest melting point of any metal
D. metal’s highest melting point of any

35. Rattan comes from ______ of different kinds of palms.
A. its reedy stems B. the reedy stems C. the stems are reedy D. stems that are reedy
