
Analysis of the Vocabulary Size Test

Paul Leeming

Abstract
Vocabulary is an important element of language proficiency, and acquisition of an
extensive vocabulary should be a goal for every language learner (Nation, 2008). As
teachers, it is important that we incorporate a focus on vocabulary into our language
courses, but in order to do so it is helpful to have some knowledge of our students' current vocabulary level (Beglar, 2010). The Vocabulary Size Test is a free vocabulary test available online, designed to measure students' vocabulary level, with 140 items covering the first 14,000 words of English. This paper presents a
Rasch analysis of the first 80 test items used with university students in Japan. Rasch
analysis was used to determine the relative difficulty of each item, and also to assess the
validity and usefulness of the test in this context. Results show that although the test is
useful in assessing the relative vocabulary levels of students, the items do not behave
entirely as predicted by the difficulty according to the word levels, and care should be
taken by teachers hoping to use the test to ascertain student knowledge of different
vocabulary levels.

Introduction
Vocabulary is an important element of language proficiency. Put simply, if we do
not know a given vocabulary item we are unlikely to understand it in spoken or written
text, and will be unable to express ourselves fully in a second language. Vocabulary
level will determine what level of reading materials can be used in class (Hu & Nation, 2000), among many other decisions regarding teaching materials. Teachers aim to build students' vocabulary, and most language courses feature some focus on vocabulary (Nation, 2001). Having decided to include a vocabulary component, however, teachers are faced with the difficult decision of what vocabulary to include. Should there be a focus on the General Service List of vocabulary (West, 1953), which has proved its reliability over time, or should teachers concentrate on developing a more academic vocabulary (Coxhead, 2011), and assume that students know the most basic vocabulary?

教養・外国語教育センター紀要

In order to answer this question we need a test of our students' current vocabulary level, and one such test is the Vocabulary Size Test (VST) developed by Paul Nation and David Beglar (2007).
This paper begins by introducing and describing the VST, before presenting a
Rasch analysis of the results of the first 80 items of the test which were administered to
81 students in a Japanese university context. The test was limited to the first 80 items because the students in the current study were relatively low in English proficiency (average TOEIC score of 390). The analysis will ascertain whether the theoretical hierarchy of difficulty for the words is borne out in a Japanese context, and whether the test is effective in differentiating between the different vocabulary sizes of the students in this study. A correlation analysis with TOEIC test scores will be used to determine how closely vocabulary knowledge can be considered a measure of overall English language proficiency. A basic knowledge of the Rasch model (Rasch, 1960) is assumed, although the key parts of the model are briefly described (for a comprehensive introduction to the Rasch model for measurement see Bond & Fox, 2007).

The Vocabulary Size Test


The VST is a free online test of English vocabulary (available at http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx). It was developed by Nation and Beglar (2007), and is meant as a measure of vocabulary proficiency rather than a specific diagnostic of the levels at which students may lack vocabulary. The words for the test are based on word lists developed by Nation (2006), which were derived from the British National Corpus (BNC). Nation created word lists for the first 14,000 word families of English based on the BNC but made some adaptations. The BNC is largely a written corpus of English, and therefore some words typical of the written register appear in the first one-thousand words. Nation (2006) therefore adapted the list to make it more representative of spoken English. Questions for the VST are in multiple-choice format, and students are required to select the option that has the same meaning as the word presented. The test has 10 items for each 1,000-word level. An example item targets the word SEE: students are presented with four options for its meaning (see Appendix 1 for the first five questions of the test used in this study). For a full description of the development of the test see Nation and Beglar (2007).


Beglar (2010) conducted a comprehensive Rasch analysis of the test, and part of this study is a replication of his work. He first tested the hypothesis that the items increase in difficulty as frequency declines, so that the eight-thousand-word list should be more difficult than the one-thousand-word list, and found support for this. Beglar (2010) went on to test the item fit and dimensionality of the test, and found that the majority of the items fit the Rasch model and that the test was a unidimensional measure of receptive vocabulary knowledge. He also established strong reliability for the test and concluded that "test-takers were measured with a high degree of precision on multiple versions of the test" (Beglar, 2010, p. 116). For the current study, in addition to the Rasch analysis mirroring that conducted by Beglar (2010), I conducted a correlation analysis to assess how closely vocabulary relates to overall English proficiency as measured by the TOEIC test, and the results of this analysis are presented.

The Rasch Model for Measurement


The Rasch model for measurement was created by Rasch (1960), and there are several key principles to the model. First, the model assumes that the difficulty of an item is derived from its interaction with the person, and that each item has a specific difficulty. Based on the difficulty of the items, the model provides an overall measurement for each person on a common scale. The unit of measurement is the logit, which can take negative and positive values. This means that we are able to attain a scaled score for each person which can be used in subsequent statistical analyses.
Another assumption of the Rasch model is that all of the items measure the same unidimensional construct. Items are regarded as misfitting the model if they measure a different construct, and there are several ways of assessing the dimensionality of a given measure. In this study I discuss item fit and the Rasch Principal Components Analysis (PCA) of item residuals, both of which are used to assess dimensionality. The PCA is somewhat similar to a standard factor analysis in seeking to determine the dimensionality of a test. The Rasch model can therefore be used both to assess dimensionality and to attain scaled scores for the people in a given study. Rasch analysis in the current study was conducted using the software WINSTEPS (Linacre & Wright, 2007).
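The logit scale described above has a concrete probabilistic meaning: under the dichotomous Rasch model, the chance of a correct answer is a logistic function of the gap between person ability and item difficulty. A minimal sketch in Python (the function name is my own, not part of WINSTEPS):

```python
import math

def p_correct(ability, difficulty):
    """Dichotomous Rasch model: probability that a person of the given
    ability (in logits) answers an item of the given difficulty
    (also in logits) correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Ability equal to difficulty gives exactly a 50% chance of success,
# which is how the Wright map later in this paper is read.
print(p_correct(0.0, 0.0))            # 0.5
print(round(p_correct(1.0, 0.0), 2))  # 0.73
```

A person one logit above an item's difficulty thus succeeds roughly 73% of the time, which is why logit differences, not raw scores, are comparable across a test.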


Methodology
Participants
The participants in the study were 81 students(58 male and 23 female students)
in a first-year compulsory English communication course of a science department at a
private university in western Japan. The age range of students was from 18 to 22, with
77 first year students and 4 students who were repeating the course and were in the
third or fourth year. All of the participants were native speakers of Japanese. Students
in the school of science and engineering are grouped according to major rather than
English proficiency, and students in the three classes that participated in this study
were majoring in biology, chemistry, and physics. The entry requirements were slightly
different for each major so there were slight differences between classes in terms of
English proficiency, and within each class there was a range of English proficiencies.
Although generally proficiency levels were similar, in one class there was a difference of
over 600 points on the TOEIC test between the highest and lowest level students, as
one student had spent her childhood in America and therefore scored highly on the
TOEIC test. All students had six years of formal education in English in Japanese
secondary schools. Only three of the participants had lived in English-speaking countries
for periods greater than one year, and two other students had taken part in short study
abroad programs. Approximately 15% of the students had experience of English
conversation classes outside of general education.

Test administration
The VST was administered to students at the start of the academic year and was
given to the students online through the website Survey Monkey. Instructions were
given in Japanese and the test format was explained. Students were given 15 minutes to
complete the test, which consisted of the first 80 items of the VST. This gave students a little over 10 seconds per item (15 minutes for 80 items is roughly 11 seconds each), the rationale being that if students knew an item this would be sufficient time, while preventing excessive guessing. Students were told to skip items if they did not know the answer. The majority of students were unable to finish the test in the allotted time, and the last 10 items were attempted by only a small number of students.
The TOEIC test was administered by the university approximately two months after the vocabulary test. The test is a measure of English proficiency specifically designed to test business English and predict how effectively one can function in a business environment (see http://www.ets.org/toeic for details). The first part of the test focuses on listening and the second part on reading, with questions related to grammar and vocabulary included in the reading section. The vocabulary in the test tends to be of a more academic or business nature, and is more likely to be covered by the Academic Word List (Coxhead, 2011). Students were required to take the test, but it was zero stakes and students' motivation was low. Colleagues proctoring the test claimed that it was not uncommon for students to sleep during the test, and therefore the reliability of the scores is somewhat limited.
Following the administration of both tests, a Rasch analysis was performed on the VST results using WINSTEPS version 3.64.2 (Linacre & Wright, 2007). The logit scores derived from the test were correlated with the students' TOEIC scores.

Results and Discussion


Rasch Analysis of Item Fit and PCA of Item Residuals for the Vocabulary Size Test
The 80 items covering the first eight 1,000-word levels of the VST were analyzed with the Rasch dichotomous model. Table 1 shows the item fit statistics for the 80 items. The item number indicates the frequency level, so that item 35 is from the fourth 1,000-word list, while item 74 is from the eighth. Item fit statistics reveal problematic items which may be measuring a different dimension or may not be behaving as predicted by the Rasch model. McNamara (1996) recommended assessing item fit to the model using values within twice the standard deviation of the mean fit as an acceptable range. Therefore items with infit mean square values within the range .86 to 1.14 (i.e., 2 SDs) were considered to display acceptable fit to the Rasch model in this study. Following this criterion, all items fit the model with the exception of items 72 (Palette; infit MNSQ = .85) and 61 (Olive; infit MNSQ = .80). There is no obvious reason why these two items would misfit the model. Olive is a loanword that is regularly used in Japanese, but there are other examples of loanwords in the test that fit the Rasch model. Palette is a difficult item but again is not exceptional. Beglar (2010) found that most items fit the Rasch model, and the above items fit the model in his study.
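The acceptable-fit rule described above is easy to apply programmatically. A sketch using a few of the infit mean-square values from Table 1 (the .86–1.14 band is the 2-SD criterion stated in the text):

```python
# Flag items whose infit mean square falls outside the acceptable band
# (mean ± 2 SD, i.e. .86–1.14 as reported in the text).
infit_mnsq = {"72 Palette": .85, "61 Olive": .80,
              "60 Veer": .99, "10 Basis": 1.02}

LOW, HIGH = .86, 1.14

misfits = [item for item, mnsq in infit_mnsq.items()
           if not (LOW <= mnsq <= HIGH)]
print(misfits)   # ['72 Palette', '61 Olive']
```

Applied to the full set of 80 values in Table 1, this rule flags exactly the two items discussed above.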


Table 1. Item fit statistics for the 80 VST items

Item            Measure   SE   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD   Pt-measure corr.

60 Veer 2.64 .47 .99 .1 1.14 .4 .15


71 Erratic 2.64 .47 .95 .0 .65 -.6 .29
80 Mumble 2.64 .47 1.00 .1 1.05 .3 .13
77 Locust 2.43 .43 1.03 .2 1.30 .8 .08
10 Basis 2.26 .40 1.02 .2 .97 .1 .30
45 Compost 2.11 .38 .95 -.1 .76 -.5 .16
49 Fracture 2.11 .38 1.11 .5 1.17 .5 .02
34 Tummy 1.97 .36 1.02 .2 .98 .1 .17
40 Allege 1.97 .36 1.02 .2 .99 .1 .18
58 Cavalier 1.97 .36 1.03 .2 .89 -.2 .19
16 Nil 1.84 .35 1.10 .5 1.19 .7 .05
67 Demography 1.84 .35 .95 -.1 .84 -.4 .30
75 Eclipse 1.73 .33 .90 -.4 .72 -.9 .40
78 Authentic 1.73 .33 .96 -.1 .88 -.3 .29
33 Candid 1.62 .32 1.09 .5 1.17 .7 .07
19 Microphone 1.52 .31 1.03 .2 1.05 .3 .19
66 Bloc 1.52 .31 .93 -.3 .82 -.6 .35
72 Palette 1.52 .31 .85 -.7 .70 -1.2 .47
79 Cabaret 1.52 .31 .86 -.7 .83 -.6 .43
29 Rove 1.43 .30 1.08 .5 1.15 .6 .10
69 Azalea 1.43 .30 .89 -.5 .78 -.9 .41
57 Strangle 1.34 .30 1.10 .6 1.19 .8 .08
76 Marrow 1.34 .30 .94 -.3 .86 -.5 .35
65 Bristle 1.25 .29 1.01 .1 .98 .0 .24
73 Null 1.25 .29 .97 -.1 .92 -.3 .31
41 Deficit 1.17 .28 1.13 .8 1.23 1.1 .06
43 Nun 1.17 .28 1.07 .5 1.29 1.3 .11
64 Shudder 1.09 .28 .92 -.5 .93 -.3 .35
68 Gimmick 1.09 .28 .97 -.1 .92 -.4 .31
51 Devious .94 .27 1.07 .6 1.05 .4 .18
59 Malign .94 .27 .94 -.4 .90 -.5 .36
55 Threshold .87 .26 1.00 .1 .95 -.3 .28
70 Yoghurt .80 .26 .95 -.4 .93 -.4 .35
74 Kindergarten .80 .26 .95 -.4 .96 -.2 .34
56 Thesis .74 .26 .97 -.3 .93 -.4 .33
15 Patience .55 .25 .97 -.3 1.05 .4 .30
52 Premier .55 .25 1.10 1.0 1.10 .8 .14
63 Stealth .55 .25 .97 -.3 .96 -.3 .33
39 Remedy .49 .25 .95 -.5 .89 -.9 .38

(Table 1 continued)

Item            Measure   SE   Infit MNSQ   Infit ZSTD   Outfit MNSQ   Outfit ZSTD   Pt-measure corr.

44 Haunt .43 .24 1.04 .4 1.18 1.4 .20


62 Quilt .25 .24 .99 -.1 .97 -.2 .31
48 Peel .19 .24 1.05 .6 1.09 1.0 .21
53 Butler .19 .24 .95 -.6 .94 -.6 .37
32 Latter .14 .24 .97 -.3 .94 -.6 .34
47 Miniature .14 .24 .95 -.7 .90 -1.0 .39
13 Upset .08 .24 1.03 .4 1.02 .3 .26
14 Drawer .03 .24 1.07 1.0 1.11 1.2 .19
04 Figure -.03 .23 1.08 1.1 1.10 1.1 .18
23 Jug -.03 .23 1.12 1.8 1.19 2.1 .11
54 Accessory -.03 .23 .95 -.7 .93 -.8 .37
61 Olive -.08 .23 .80 -3.3 .76 -3.1 .60
31 Compound -.14 .23 1.13 2.0 1.16 1.9 .10
27 Pave -.19 .23 1.11 1.6 1.09 1.1 .16
22 Restore -.25 .23 .97 -.5 .97 -.4 .35
42 Wept -.25 .23 .95 -.8 .93 -.9 .38
24 Scrub -.57 .24 1.08 1.1 1.10 1.1 .18
03 Period -.74 .24 1.14 1.7 1.20 2.0 .08
50 Bacterium -.86 .24 .91 -1.1 .87 -1.2 .43
25 Dinosaur -1.03 .25 .92 -.8 .92 -.6 .40
18 Circle -1.10 .25 1.05 .5 1.06 .5 .22
26 Strap -1.10 .25 1.04 .4 1.00 .0 .24
37 Crab -1.29 .26 .95 -.4 .87 -.8 .37
46 Cube -1.29 .26 .97 -.2 .91 -.6 .34
11 Maintain -1.42 .26 1.00 .1 .99 .0 .27
30 Lonesome -1.57 .27 .96 -.2 1.13 .7 .27
09 Standard -1.64 .28 1.08 .6 1.17 .9 .13
20 Pro -1.64 .28 1.03 .2 1.02 .2 .22
21 Soldier -1.80 .29 .94 -.3 .88 -.5 .34
07 Jump -2.17 .32 .98 .0 .81 -.6 .31
36 Input -2.17 .32 1.04 .3 1.63 2.0 .07
12 Stone -2.28 .33 .98 .0 .85 -.4 .28
17 Pub -2.52 .36 1.06 .3 1.13 .5 .11
35 Quiz -2.81 .40 .94 -.1 1.28 .8 .22
01 See -3.43 .52 1.02 .2 1.02 .2 .11
08 Shoe -3.43 .52 1.03 .2 1.09 .4 .09
28 Dash -3.43 .52 .92 .0 .54 -.8 .33
02 Time -4.88 1.01 1.01 .3 .98 .4 .04
05 Poor -4.88 1.01 .99 .3 .57 .0 .13
06 Drive -4.88 1.01 1.02 .3 1.36 .7 -.02
38 Vocabulary -4.88 1.01 .98 .3 .47 -.2 .16


In order to determine the degree to which these two misfitting items were
influencing the person measures, Rasch person ability estimates were calculated with
and without these two items. The Pearson correlation between the two sets of estimates was .998; thus, it was concluded that these items were not affecting the overall person ability estimates, and they were therefore retained in subsequent analyses.
The Wright map(Figure 1)shows the persons on the logit scale on the far left of
the figure. The persons are displayed as Xs according to their Rasch person ability
measures, with persons with larger vocabularies toward the top of the map and persons
with smaller vocabularies toward the bottom of the map. The items are displayed on the
right side of the figure according to their difficulty estimates: More difficult items are
toward the top of the map and easier items toward the bottom. A person has a 50%
chance of correctly answering an item that is at the same point on the logit scale. The
average measure for persons was -.28, indicating that the items were well matched to
the participants although a little difficult for this group as shown by the negative value
of the mean for persons. This is supported by the Wright Map, which shows that
participants and items are well distributed about the mean and that there are no
significant gaps in the item hierarchy. Linacre (2002) considers gaps in the item hierarchy of greater than .59 logits to indicate a problem, and there are no gaps of this size close to the person ability measures. Again the findings mirror those of Beglar (2010), which showed that the test had sufficient items to avoid floor and ceiling effects and to accurately measure the range in respondents' receptive vocabulary knowledge.
The VST separates words by frequency level, and item difficulty is hypothesized to increase with item number as word frequency decreases (Nation & Beglar, 2007). This claim is generally supported by the distribution of items, with low-frequency words being more difficult. One exception is English loanwords in Japanese, which should generally be slightly easier regardless of their frequency in English, although words such as item 70 (Yoghurt) may be difficult due to spelling. The most difficult items are 60 (Veer), 71 (Erratic), and 80 (Mumble), which are all in the 6,000-word frequency level or above, and these items were too difficult for all the students. The easiest items are 2 (Time), 5 (Poor), 6 (Drive), and 38 (Vocabulary), which were answered correctly by all the participants.


--------------------------------------------------------------------------------
Students with a | More difficult items
larger vocabulary |
3 +
|
|
| 60: Veer 71: Erratic 80: Mumble
| 77: Locust
| 10: Basis
| 45: Compost 49: Fracture
2 + 34: Tummy 40: Allege 58: Cavalier
|S 16: Nil 67: Demography
| 75: Eclipse 78: Authentic
| 19: Microphone 33: Candid 66: Bloc 72: Palette
X | 79: Cabaret 29: Rove 69: Azalea
| 57: Strangle 65: Bristle 73: Null 76: Marrow
X | 41: Deficit 43: Nun 64: Shudder 68: Gimmick
1 XXX T+ 51: Devious 59: Malign
| 55: Threshold 70: Yoghurt 74: Kindergarten
XX | 56: Thesis
XX | 15: Patience 52: Premier 63: Stealth
XXX S | 39: Remedy 44: Haunt
XX | 62: Quilt
XXXXXXXX | 13: Upset 32: Latter 47: Miniature 48: Peel
0 XXXXXXX +M 53: Butler 04: Figure 14: Drawer 23: Jug
XXXXXXXX | 54: Accessory 27: Pave 31: Compound 61: Olive
XXXXXXXXXXX M | 22: Restore 42: Wept
XXXX |
XXXXXXXXX | 24: Scrub
XXX | 03: Period
XX | 50: Bacterium
-1 XXXXXX S + 25: Dinosaur
XX | 18: Circle 26: Strap
XXX | 37: Crab 46: Cube
X | 11: Maintain
T | 09: Standard 20: Pro 30: Lonesome
X |
X |S 21: Soldier
-2 +
X | 07: Jump 36: Input
| 12: Stone
|
| 17: Pub
|
| 35: Quiz
-3 +
|
|
| 01: See 08: Shoe 28: Dash
|
|T
|
-4 +
|
|
|
|
|
| 02: Time 05: Poor 06: Drive 38: Vocabulary
-5 +
Students with a | Easier items
smaller vocabulary |
--------------------------------------------------------------------------------
Figure 1. Wright map for the Vocabulary Size Test items. M = mean; S = 1 SD; T = 2 SDs. Each X = 1 person.

Items 2, 5, and 6 are in the first 1,000 high-frequency words of English and would be expected to be easy. However, item 38 is in the 4,000-word level and is therefore relatively infrequent, and yet it was easy for these students. The vocabulary item for 38 is vocabulary, and although a relatively infrequent word, it is used regularly by the teacher and was used to introduce this test, and was therefore known to all students. Item 35 (Quiz) was also easy for these students, as it is a loanword commonly used in Japanese and therefore known to all students. Item 50 (Bacterium) was relatively easy for these students because they are science students, for whom it is a high-frequency word. Item 10 (Basis) is high frequency and yet was hardly known by any of these students; an examination of the item shows that it was difficult for them, with a mixture of responses. Item 16 (Nil) is in the 2,000-word frequency level, and yet proved to be difficult for these students. Nil is not typically taught in Japanese English classes, and although a high-frequency word of English it is quite specialized, being used mainly to talk about football scores in England. This explains why this item was difficult for these students.
The logit difficulty of the items was aggregated by level, and the results are shown in Table 2 below. The results show that item difficulty generally increases by level, although there are several anomalies in the hierarchy. Words in the two-thousand level are harder than the next two levels, and there is a negligible difference in item difficulty between the six- and seven-thousand-word levels. One possible explanation is that the students were in their first year and had been studying vocabulary to pass the university entrance exam. The vocabulary in the entrance exam is unlikely to focus on simple language, and the students had therefore studied the third and fourth thousand-word levels extensively, making these items easier for this particular group of students.

Table 2. Summed logit difficulty by word frequency level

Level   Summed logit difficulty (10 items per level)
1000    -23.82
2000    -4.94
3000    -8.54
4000    -5.1
5000    4
6000    10.15
7000    9.74
8000    17.6
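The aggregation behind Table 2 can be reproduced directly from the item measures in Table 1; for example, summing the ten items of the first 1,000-word level recovers the -23.82 figure:

```python
# Sum the Rasch item difficulties (from Table 1) within a 1,000-word level.
# Shown here for the first 1,000-word level only (items 1-10).
measures_1000 = {
    "01 See": -3.43, "02 Time": -4.88, "03 Period": -.74, "04 Figure": -.03,
    "05 Poor": -4.88, "06 Drive": -4.88, "07 Jump": -2.17, "08 Shoe": -3.43,
    "09 Standard": -1.64, "10 Basis": 2.26,
}

total = round(sum(measures_1000.values()), 2)
print(total)   # -23.82, matching the 1000-level entry in Table 2
```

Repeating this over all eight levels gives the full column in Table 2; dividing each sum by 10 would give per-item averages without changing the ordering of the levels.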


These results somewhat mirror those of Beglar (2010), who found that the three-thousand-level items were easier than the second thousand, and that there was little difference in difficulty between the high-frequency words. Beglar (2010) explained that most participants in his study were very familiar with the high-frequency words, and therefore there was very little to distinguish differences in knowledge at the lower levels. The results of this study, combined with those of Beglar (2010), suggest that there are potentially some problems with the words at the three-thousand level and that they should be re-examined for difficulty.
The Rasch person reliability estimate was .79, and the person separation statistic was 1.97, which is close to the benchmark of 2.0 (Bond & Fox, 2007), indicating good separation and suggesting that the items were well distributed in terms of difficulty. The Rasch item reliability estimate was .96, indicating high reliability, and the item separation statistic was 4.81, well above the benchmark of 2.0. These results support the conclusion that the test is effective in differentiating between the different levels of vocabulary knowledge of the participants in this study.
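Separation and reliability are two views of the same information: the separation index implied by a reliability R is sqrt(R / (1 − R)). A quick check against the reported values (small discrepancies are expected, since WINSTEPS computes separation from model error variance rather than from a rounded reliability):

```python
import math

def separation(reliability):
    """Rasch separation index implied by a reliability estimate,
    G = sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))

print(round(separation(.79), 2))   # ~1.94 (reported person separation: 1.97)
print(round(separation(.96), 2))   # ~4.90 (reported item separation: 4.81)
```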
Tests should be unidimensional, measuring a single construct, and in order to investigate the dimensionality of the test, a Rasch PCA of item residuals was run. The results showed that 39.3% of the variance (eigenvalue = 51.9) was explained by the Rasch model, 32.3% of the variance (eigenvalue = 42.5) was explained by the items, and 3.6% of the variance (eigenvalue = 4.8) was explained by the first residual contrast. Although the raw variance explained by the Rasch measures fell short of the 50% benchmark (Linacre, 2007), the unexplained variance in the first contrast is below the 5% criterion, and the variance explained by the items is more than four times the unexplained variance in the first contrast, suggesting that the measure is unidimensional (Linacre, 2007).
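The three rules of thumb used here can be written out explicitly; a sketch checking the percentages reported above against each criterion:

```python
# Dimensionality checks applied to the reported variance components
# (percentages taken from the text above).
variance_by_model = 39.3   # % explained by the Rasch measures
variance_by_items = 32.3   # % explained by the items
first_contrast    = 3.6    # % unexplained variance in first contrast

meets_50_benchmark = variance_by_model >= 50               # fails here
contrast_below_5   = first_contrast < 5                    # passes
items_4x_contrast  = variance_by_items > 4 * first_contrast  # passes

print(meets_50_benchmark, contrast_below_5, items_4x_contrast)
# False True True
```

Two of the three criteria are met, which is the basis for treating the measure as unidimensional despite the 50% benchmark not being reached.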
Factor loadings for the 80 items comprising the VST can also be analyzed to determine the dimensionality of the test. Item loadings above .40 are considered strong, and suggestive of different factors or constructs in the test (Bond & Fox, 2007). There were eight items with positive loadings above .40, and two items with negative loadings beyond -.40 (for example, -.69). An analysis of these items did not reveal any factors suggesting that they were measuring different constructs. The items with positive residual loadings were generally low-frequency items that were relatively difficult, while the two items with negative residual loadings beyond -.40 were higher-frequency items that were less difficult. These results, combined with the results of the PCA, indicate no evidence of a meaningful second dimension in the data beyond one attributable to item difficulty; thus, it was concluded that the test items form a fundamentally unidimensional construct. Beglar (2010) reached the same conclusion, with strong results supporting the claim for unidimensionality of the test.

Descriptive Statistics for the Vocabulary Size Test


Following Rasch analysis of the VST, the descriptive statistics were examined in
order to ascertain the normality of the distribution. The results are shown in Table 3.

Table 3. Descriptive statistics for the Vocabulary Size Test (logits)

Mean         -.27
SE of mean    .08
95% CI       [-.42, -.12]
SD            .68
Skewness     -.06 (SE = .27)
Kurtosis      .40 (SE = .54)

The results are in logits attained from the Rasch analysis, and show that the distribution is approximately normal for this measure. There were no outliers on this measure.

Correlation Analysis
In order to test the assumption that the VST is a measure of English proficiency, a bivariate correlation analysis was performed with the listening and reading sections of the TOEIC test. Table 4 shows the results of the correlation analysis.

Table 4. Correlations between the VST and the TOEIC sections

          1. VST   2. LIST   3. READ
1. VST      —
2. LIST    .48       —
3. READ    .56      .69        —

Note. VST = Vocabulary Size Test; LIST = TOEIC listening; READ = TOEIC reading. All correlations significant at p < .01 (2-tailed).


All of the correlations are significant, indicating that there is a clear link between vocabulary size and overall language proficiency. Reading correlates more strongly with the VST, as reading is more likely to allow time for off-line processing of vocabulary, and both reading and the VST use the same modality. The correlations are statistically significant and strong, suggesting that vocabulary size is an important factor in language proficiency, but that other variables are also involved. Again this is expected, as language proficiency is considered to be complex and multifaceted (Fulcher & Davidson, 2007).
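Pearson correlations like those in Table 4 can be computed without special software. A minimal sketch in pure Python (the score lists are invented for illustration, not the study's data):

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: VST logit scores vs. TOEIC reading scores.
vst  = [-1.2, -0.5, 0.0, 0.4, 1.1]
read = [120, 150, 180, 170, 240]
print(round(pearson(vst, read), 2))   # 0.95
```

In the study itself, the Rasch logit scores rather than raw VST totals were correlated with the TOEIC section scores, which is appropriate because logits form an interval scale.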
The pedagogical implications of the current study are that the VST is a useful measure of the receptive vocabulary knowledge of students such as those in the current study, and can be used to differentiate between the vocabulary knowledge of students even within a fairly homogeneous sample. This makes the test useful for researchers looking to measure elements of language proficiency, and also for teachers who may want to use relative proficiency differences when constructing groups within the language classroom. As Nation and Beglar (2007) state, however, the test should not be used to decide which level of vocabulary to focus on, particularly given the mixing of item difficulty across levels observed here. The fact that the test is free and readily available will also appeal to teachers.

Conclusion
The results of the Rasch analysis support the previous analysis by Beglar (2010), and show that the VST is a useful test of the general vocabulary level of students in this context. Loanwords do prove a slight complication in a Japanese context, giving students knowledge of some relatively obscure vocabulary. Also, although the test generally followed the predicted order, in that items in the higher word bands became more difficult, there was quite a large degree of mixing of difficulty, and based on these results it would be inadvisable to use the test to try to identify specific weaknesses in vocabulary within a given level. Again supporting the findings of Beglar (2010), the test was unidimensional and was effective in differentiating between the different levels of vocabulary knowledge of the students in this study, with a good level of separation of items. The strong correlation with the TOEIC test supports the claim that vocabulary is a relatively important part of overall language proficiency, and suggests that teachers should include a vocabulary component in their language courses.


The current study also demonstrated how Rasch analysis can be used to analyze a
test and provide scaled scores for use in further analyses. It should be noted that test
validity and reliability involve the interaction of the test items with the test takers, and
therefore each researcher should conduct an analysis of any tests used to ensure that
they are valid in that specific context.
There are several limitations to the current study which should be considered. The students were given insufficient time to complete the whole test, and therefore results for the seven- and eight-thousand-word levels are likely to be affected by guessing, as some students rushed to finish within the time allotted. This could have influenced the results, and future studies should ensure that students have sufficient time to complete all of the items on the test. Although the correlation with the TOEIC test suggests concurrent validity, it would be useful to assess whether the vocabulary test predicts performance in class.

References

Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27(1), 101-118.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). New Jersey: Routledge.
Coxhead, A. (2011). The Academic Word List ten years on: Research and teaching implications. TESOL Quarterly, 45(2), 355-362.
Fulcher, G., & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85-106.
Linacre, J. M. (2007). A user's guide to WINSTEPS-MINISTEP: Rasch-model computer programs. Chicago, IL: MESA Press.
Linacre, J. M., & Wright, B. D. (2007). WINSTEPS: Multiple-choice, rating scale, and partial credit Rasch analysis [Computer software]. Chicago, IL: MESA.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? The Canadian Modern Language Review, 63(1), 59-82.
Nation, I. S. P. (2008). Teaching vocabulary: Strategies and techniques. Boston, MA: Cengage Learning.
Nation, P., & Beglar, D. (2007). A Vocabulary Size Test. The Language Teacher, 31(7), 9-13.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
West, M. (1953). A general service list of English words. London: Longman, Green & Co.
