

Principles of Language Assessment
2. Validity
'Validity' is an all-encompassing term which is related to questions about what the test is actually
assessing. Is the test telling you what you want to know? Does it measure what it is intended to
measure? A test is not valid, for example, if it is intended to test a student's level of reading
comprehension in a foreign language but instead tests intelligence or background knowledge.
When a new test is constructed it should be assessed for validity in as many ways as possible. The
aspects of validity which are looked at will, of course, depend on the purpose for which the test has
been designed and will partly depend on the importance of the assessment. A teacher writing a
classroom quiz will not have the time or the inclination to carry out many different investigations of
validity, but the constructors of an examination which will affect candidates' futures are duty-bound
to examine as many aspects of validity as possible.
There are different views on the best ways of assessing validity, but there are some key aspects, and
it is good practice to investigate as many of these as possible:
2.1 Construct validity
The term 'construct validity' refers to the overall construct or trait being measured. It is an inclusive
term which, according to some testing practitioners, covers all aspects of validity, and is therefore a
synonym for 'validity'. If a test is supposed to be testing the construct of listening, it should indeed
be testing listening, rather than reading, writing and/or memory. To assess construct validity the
test constructor can use a combination of internal and external quantitative and qualitative
methods. For more about this, see section 6. An example of a qualitative validation technique would
be for the test constructors to ask test-takers to introspect while they take a test, and to say what
they are doing as they do it, so that the test constructors can learn about what the test items are
testing, as well as whether the instructions are clear, and so on.
Construct validation also relates to the test method, so it is often felt that the test should follow
current pedagogical theories. If the current theory of language teaching emphasises a
communicative approach, for example, a test containing only out-of-context, single-sentence,
multiple-choice items, which test only one linguistic point at a time, is unlikely to be considered to
have construct validity.
2.2 Content validity
The content validity of a test is sometimes checked by subject specialists, who compare test items with the test specifications to see whether the items are actually testing what they are supposed to test, and whether they match what the designers say they are testing. (On specifications as a
whole, see Davidson & Lynch 2002, and Alderson, Clapham & Wall 1995: Chapter 2.) In the case of
a classroom quiz, of course, there will be no test specifications, and the deviser of the quiz may
simply need to check the teaching syllabus or the course textbook to see whether each item is
appropriate for that quiz.
One of the advantages of even the most rudimentary content validation is that it identifies those
items which are easy to test but which add nothing to our knowledge of what the students know; it is
tempting for a test writer to write easy-to-test items, and to ignore essential aspects of a foreign
language, for example, because they are difficult to assess.
2.3 Face validity

Face validity is an important aspect of a test. It refers to the degree to which a test looks right, and appears to measure the knowledge or ability it claims to measure, based on the subjective judgment of the examinees who take it; that is, it relates to the question of whether non-professional testers such as parents and students think the test is appropriate. If these non-specialists do not think the test is testing candidates' knowledge in a suitable manner, they may, for example, complain vociferously, and the candidates may not tackle the test with the required zeal. If the test lacks face validity, it may not work as it should, and may have to be redesigned. (See Alderson, Clapham & Wall 1995: 172-73.)
2.4 Criterion-related validity
The aspects of test validity described so far relate to the 'internal' validity of the test. However, the test constructors may also assess the 'external', 'criterion-related' validity of a test. To assess criterion-related validity, the students' test scores may, for example, be correlated with other measures of the students' language ability, such as teachers' rankings of the students, or with the scores on a similar test. Such measures assess the concurrent validity of the test. Similarly, the future ability of the students can be assessed (the test's predictive validity) to see if the test can accurately foretell how the candidates will fare in the future. For example, if a test is supposed to assess whether students have a high enough level of a foreign language to be able to teach that language to secondary school children, the test should be validated, perhaps by classroom observation, to see whether students who have passed the test do actually have enough of the foreign language to be able to teach it in the classroom.
3. Reliability
The reliability of a test is an estimate of the consistency of its marks: a reliable test is one where a student will get the same mark whether he or she takes the test, for example, on a Monday morning or a Tuesday afternoon, possibly with a different examiner. A test must be reliable, as a test cannot be valid unless it is reliable. However, the converse is not true: it is perfectly possible to have a reliable test which is not valid. For example, a multiple-choice test of grammatical structures may be wonderfully reliable, but it is not valid if teachers are not interested in the grammatical abilities of their students and/or if grammar is not taught in the related language course.
If the test consists of right/wrong items such as multiple-choice items or some sorts of short-answer questions, a reliability estimate such as the Alpha Coefficient or Kuder-Richardson 21 may be calculated (see Alderson, Clapham & Wall 1995: 87-89), but if the test consists of an essay or an oral interview, then other forms of test reliability must be estimated. A statistic which can be used by the statistically sophisticated is based on Generalizability Theory (see Crocker & Algina 1986: Chapter 8), and such methods are widely used for 'high-stakes' tests, but more simple measures can also be estimated, such as correlations between the scores a marker gives on Day 1 and Day 5 (intra-rater reliability) and correlations between two different markers' scores (inter-rater reliability), along with calculations of whether the levels of raters' marks, as well as the order of the scores, are similar.
4. Washback
Any language test or piece of assessment must have positive washback (backwash), by which I mean that the effect of the test on the teaching must be beneficial. This should be held in mind by the test constructors, as it is only too easy to construct a test which leads, for example, to candidates learning material by heart or achieving high marks by simply applying test-taking skills rather than genuine language skills (see Wall 1997). Washback is closely related to consequential validity (impact): how well the use of assessment results accomplishes the intended purposes and avoids unintended effects.
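The text above mentions two of the simpler reliability estimates: Kuder-Richardson 21 for right/wrong items, and correlations between markers' scores. As a minimal sketch of how these are computed, assuming invented score data (all names and numbers below are hypothetical, for illustration only):

```python
# Illustrative sketch only: the score data are invented, and the functions
# are minimal implementations of two estimates named in the text.

def kr21(total_scores, n_items):
    """Kuder-Richardson 21: k/(k-1) * (1 - M*(k-M)/(k*V)), where M and V are
    the mean and variance of total scores on a k-item right/wrong test."""
    n = len(total_scores)
    mean = sum(total_scores) / n
    var = sum((x - mean) ** 2 for x in total_scores) / n  # population variance
    return (n_items / (n_items - 1)) * (1 - mean * (n_items - mean) / (n_items * var))

def pearson(xs, ys):
    """Pearson correlation, e.g. between two markers' scores (inter-rater
    reliability) or one marker's Day 1 and Day 5 scores (intra-rater)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: 8 students' total scores on a 20-item multiple-choice
# test, and two markers' independent scores on the same 8 essays.
totals = [12, 15, 9, 18, 14, 11, 16, 13]
marker1 = [55, 62, 48, 70, 58, 51, 66, 60]
marker2 = [57, 60, 50, 72, 55, 53, 68, 59]

print("KR-21:", round(kr21(totals, 20), 3))
print("Inter-rater r:", round(pearson(marker1, marker2), 3))
```

For comparing the order of the scores, or for correlating test scores with teachers' rankings (concurrent validity), a rank correlation such as Spearman's rho would be the usual choice; for essays or interviews marked by several raters, the Generalizability Theory approach mentioned above is more appropriate.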