
ENGLISH LANGUAGE TESTING

CHAPTER I
EVALUATION IN LANGUAGE LEARNING

This chapter discusses the position of evaluation in language learning, the explanation of some
terms in language testing, the functions of evaluation, and the target of evaluation in language learning.

A. The Place of Evaluation in Language Learning


In general, the process of teaching and learning, including language learning, involves three main instructional components which serve as the anchor points of instructional design: learning goal, learning activity, and evaluation. These components are closely related and cannot be separated from each other.
The learning goal is the first component of instructional design; it contains the general and specific objectives of a particular instructional program. These objectives are used as the main reference in planning and implementing the instructional activity.
Next, the learning activity is any activity done interactively by teachers and students, using books and other learning tools, to reach the learning goal. These activities should be based on, and directed toward, the designed objectives.
The last component is evaluation, which has the purpose of measuring how far the learning goal has been achieved through the implementation of the learning activity. Based on the evaluation, one can therefore decide whether or not the learning activity is adequate for reaching a certain learning goal. From this, we know that the relationship among the three components is not only one-way, but also iterative and cyclical (Djiwandono, 2006:5).
In the process of teaching and learning, especially language teaching, evaluation is intended to reveal the students’ level of language mastery. As it deals with language mastery, the four language skills are concerned: listening, reading, speaking, and writing. Besides the language skills, language components such as grammar, phonology, and vocabulary should also be evaluated.

B. Evaluation, Assessment, Measurement, and Testing


In the discussion of language testing, some technical terms that are variously defined by experts are commonly used, including evaluation, assessment, measurement, and test. Instead of contrasting differing opinions, the broadly similar definitions by Allison (1999), Brown (2004, 2006), and Djiwandono (2008) are referred to here.
Djiwandono (2008:14) uses the terms evaluation and assessment almost interchangeably in the context of the teaching learning process. He mentions that, in general, evaluation is a systematic gathering of information for the purpose of making decisions. The information deals not only with the students’ improvement in achieving the learning goals but also with the accomplishment of the teaching learning program in general. Similarly, the process of gathering information about what students know and can do in order to show their progress in the teaching learning process is called assessment. This is supported by Brown (2004:4), who states that assessment is an ongoing process that encompasses a much wider domain than simply measuring the students’ performance at identifiable times in a curriculum. Further, in his other book, Brown (2006:402) explains that any student’s answer, comment, or written work can be assessed at any time, intentionally or incidentally, by the teacher to reveal the student’s performance.
While assessment can be used as an alternative term for evaluation, measurement is one means of evaluation; it deals with the process of quantifying the characteristics of persons or things according to explicit procedures or rules. The result of measurement can be used for evaluation or assessment.
With regard to tests, Brown (2006:401) sees a test as an instrument or procedure designed to elicit performance from learners with the purpose of measuring their attainment of specified criteria. Tests always occur at identifiable times in a curriculum, when learners muster all their faculties to offer peak performance, knowing that their responses are being measured and evaluated.
The same opinion is proposed by Djiwandono (2008:12), who states that a test is a tool or procedure used to measure the students’ language proficiency. From a test, the teacher obtains a quantitative score which can be analyzed by the tester. A further discussion of tests is presented in the next chapter.
Concerning the relationship between evaluation, measurement, and testing, Allison (1999:8) draws a conclusion by quoting Bachman’s statement that all tests involve measurement, but not all measurement involves testing. Some evaluation also involves measurement, whether through tests or other forms of measurement, while some does not. Conversely, not all tests and other measurements are used for evaluation. It is only when the results of tests are used as a basis for making decisions that evaluation is involved.

C. The Functions of Evaluation in Teaching Learning Program


In daily life, the word evaluation is commonly associated with the students’ achievement at the end of the teaching learning process. Parents and students are usually concerned with whether they get an A, B, C, or D, or a 6, 7, 8, or 9, for a certain language class. In fact, evaluation has more functions and objectives than that. To explain the functions of evaluation in a language program, the following discussion, based on Djiwandono (2008:5-8), is presented.
As most people associate it, evaluation is certainly concerned with the students’ improvement in achieving the learning goals. It is commonly believed that the better the result of evaluation, the higher the students’ achievement. Therefore, when students find that their evaluation result is poor, they can review and improve their learning techniques and habits to raise their achievement, as indicated by the result of subsequent evaluation.
Besides being very useful for the students, the result of evaluation is also very useful in giving the teacher feedback on his or her learning activities and interaction with the students in the classroom. Although the teacher is not the only factor determining the success of a teaching learning process, an unsatisfying evaluation result can prompt him or her to reconsider whether the teaching learning process has been planned well, whether the planned activities have been implemented well, whether the materials, teaching techniques, and media have been selected well, and so on. This kind of reflection is crucial and must be done by the teacher if he or she expects a successful teaching learning program.
Finally, the result of evaluation can also give beneficial feedback to curriculum designers in determining the appropriate learning goal for the students. As mentioned earlier, the three components of learning are interrelated: learning goal, learning activity, and evaluation. When the result of evaluation is not satisfying, it is possible that the learning goal set earlier does not really suit the students’ needs and conditions and should therefore be modified to achieve effective learning.

D. The Target of Evaluation in Language Learning


As the name indicates, evaluation in language learning must focus on evaluating the learners’ language proficiency. Language proficiency is commonly divided into four language skills: listening, reading, speaking, and writing. Listening skill is the skill of understanding speech in a second or foreign language. Reading skill is the skill of perceiving a written text in order to understand its contents. Speaking skill is the ability to express ideas in acceptable spoken English, and writing skill is the ability to express ideas in acceptable written English.
Johnson (2001:269) classifies these four skills in two conventional ways. The first classification is based on the medium, with listening and speaking occurring in the spoken medium, and reading and writing in the written medium. The second divides them into the receptive skills of listening and reading and the productive skills of speaking and writing.
Besides the four language skills, the other target of evaluation in language learning is language
components that consist of vocabulary, pronunciation, and grammar. Vocabulary deals with the
meaning of words and word formation in the language. While pronunciation concerns the way a certain
sound or sounds are produced, grammar has something to do with the description of the structure of a
language and the way in which linguistic units such as words and phrases are combined to produce
sentences in the language.
Those language skills and components described earlier should be evaluated well in order to
know the students’ language proficiency. The detailed aspects of each skill and component will be
described in the next chapter.

Discussion
1. Mention the three main components of the teaching learning process and explain the relationship between them.
2. Discuss the differences between ‘evaluation’ and ‘test’.
3. How can the result of evaluation affect the students’ learning?
4. What do you think a teacher can do when the result of evaluation is not very satisfying?
5. In what case should the learning goal be modified?
6. When students are involved in a language program, what aspects need to be evaluated?
CHAPTER II
APPROACHES AND TYPES OF LANGUAGE TESTING

This chapter explains the approaches to language testing, their historical background, and some important issues in language testing. This is followed by classifications of language tests based on purpose, interpretation of results, and scoring method.

A. Approaches to Language Testing: A Brief History and Some Current Issues in Classroom Testing
In order to gain a comprehensive understanding of classroom-based testing, the classification of approaches to language testing is presented here in relation to the history of language testing over the past half-century, followed by a discussion of some current issues in classroom testing. Heaton (1988) classifies the approaches to language testing into four main types: (1) the essay-translation approach, (2) the structuralist approach, (3) the integrative approach, and (4) the communicative approach. Brown (2004) adds one more: performance-based assessment.

The Essay-Translation Approach (Traditional Approach)


As other languages began to be taught in educational institutions, the Classical Method, better known as the Grammar Translation Method, was adopted as the chief means of teaching foreign languages (Brown, 2001:18). There was no teaching of how to master oral communication in the language; attention was given only to reading proficiency with a grammatical focus.
The essay-translation or traditional approach, or what Brown (1996:23) calls the pre-scientific movement, is associated with the grammar-translation method. Such an approach has existed for ages; the end of the movement is usually delimited by the onset of the structuralist approach, but clearly it has no real end in language teaching because, without a doubt, such teaching and testing are going on in many parts of the world at this very moment.
No special skill or expertise in testing is required for this approach. The subjective judgment of the teacher is considered to be of paramount importance. Tests usually consist of essay writing, translation, and grammatical analysis (often in the form of comments about the language being learnt). This approach is said to lack theoretical justification on the basis of linguistics, psychology, or education. Therefore, a more communicative approach to testing was demanded.
The Structuralist Approach (Discrete Approach)
This approach is characterized by the view that language learning chiefly concerns the systematic acquisition of a set of habits. It draws on the work of structural linguistics, in particular the importance of contrastive analysis and the need to identify and measure the learners’ mastery of the separate elements of the target language: phonology, vocabulary, and grammar. The skills of listening, speaking, reading, and writing are also separated from one another as much as possible because it is considered essential to test one thing at a time. It was claimed that an overall language proficiency test should sample all four skills and as many linguistic discrete points as possible (Brown, 2004:8). However, such an approach demanded a decontextualization that often confused test-takers. So, as an era emphasizing communication, authenticity, and context arrived, new approaches were sought.

The Integrative Approach


Since discrete-point testing could not meet the demand of communicative competence, which is global and requires the integration of all language skills and components, the integrative approach emerged. This approach involves the testing of language in context and is thus concerned primarily with meaning and the total communicative effect of discourse. Consequently, integrative tests do not seek to separate language skills into neat divisions in order to improve test reliability; instead, they are often designed to assess the learners’ ability to use two or more skills simultaneously. Thus, integrative tests are concerned with the global view of proficiency which, it is argued, every learner possesses regardless of the purpose for which the language is being learnt. Integrative testing involves functional language but not the use of functional language.
Brown (2004:8-9) provides two types of tests that have historically been claimed to be examples of integrative tests: cloze tests and dictation. A cloze test is a reading passage (perhaps 150 to 300 words) in which roughly every sixth or seventh word has been deleted; the test-taker is required to supply words that fit into those blanks. According to the theoretical constructs supporting the effectiveness of the cloze test, the ability to supply appropriate words in the blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, reading skills and strategies, and an internalized ‘expectancy’ grammar (enabling one to predict the item that will come next in a sequence).
Dictation is a familiar language-teaching technique that evolved into a testing technique. Essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or an audiotape) and write what they hear, using correct spelling. The proponents of this test type argue that dictation is an integrative test because it taps into the grammatical and discourse competencies required for other modes of performance in a language. Success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory, and, to an extent, some expectancy rules to aid short-term memory.
However, despite the strong argument for an indivisible view of language proficiency (well known as the unitary trait hypothesis), some research results indicated that the hypothesis was not entirely right. Those studies found significant and widely varying differences in performance on an ESL proficiency test, depending on the subjects’ native country, major field of study, and graduate versus undergraduate status. Farhady, cited in Brown (2004:9), gave the example that Brazilians scored very low in listening comprehension and relatively high in reading comprehension. Based on that research and other related studies, the unitary trait hypothesis was then called into question.

The Communicative Approach


When many educators were beginning to abandon the unitary trait hypothesis, the focus shifted to designing communicative language-testing tasks. The communicative approach to language testing is, however, sometimes linked to the integrative approach, since both emphasize the importance of the meaning of utterances rather than their form and structure. The fundamental difference between the two approaches is that communicative tests are concerned primarily (if not totally) with how language is used in communication. Consequently, most aim to incorporate tasks which approximate as closely as possible to those facing the students in real life. In other words, language test performance must correspond to language use. Language use is often emphasized to the exclusion of language usage, and success is judged in terms of the effectiveness of the communication which takes place rather than formal linguistic accuracy.
Communicative testing presented challenges to test designers in identifying the kinds of real-world tasks that language learners were called upon to perform. Weir (in Brown, 2004:10) listed some of the points that test designers should take into account: where, when, how, with whom, and why the language is to be used, on what topics, and with what effect. These are precisely the concerns of authenticity of tasks and genuineness of texts.

Performance-Based Assessment
In language courses and programs around the world, test designers are now tackling this new
and more student-centered agenda. Instead of just offering paper-and-pencil selective response tests,
performance-based assessment of language typically involves oral production, written production,
open-ended responses, integrated performance (across skill areas), group performance, and other
interactive tasks. To be sure, such assessment is time-consuming and therefore expensive, but those extra efforts pay off in the form of more direct testing, because students are assessed as they perform actual or simulated real-world tasks. In technical terms, higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts.
In an English language-teaching context, performance-based assessment is likely to lie on the borderline between formal and informal assessment. The differences between formal and informal assessment will be described in the section on traditional and alternative assessment.

Some Current Issues in Classroom Testing


a. Traditional and Alternative Assessment
In recent years, there has been a growing interest in the application of assessment procedures that are radically different from traditional forms of assessment. More authentic forms of assessment, such as portfolios, interviews, journals, project work, and self- or peer assessment, have become increasingly common in the ESL classroom.
It is difficult to draw a clear line of distinction between the traditional and alternative
assessment because some forms of assessment fall in between the two and some combine the best of
both. Table 2.1 will describe both concepts in general. It should be noted, however, that the table
represents some overgeneralizations and should therefore be considered with caution. Besides, it
shows a bias toward alternative assessment, and one should not be misled into thinking that everything
on the left-hand side is tainted (negative) while the list on the right-hand side offers salvation to the field
of language assessment. Therefore, it is better for us to consider both kinds of assessment fairly. The
traditional assessment should be valued and utilized for the functions that it provides. At the same time,
we might all be stimulated to look at the right-hand list and ask ourselves if, among those concepts,
there are alternatives to assessment that we can constructively use in our classrooms.

Table 2.1 Traditional and Alternative Assessment


Traditional Assessment | Alternative Assessment
One-shot, standardized exams | Continuous long-term assessment
Timed, multiple choice format | Untimed, free-response format
Decontextualized test items | Contextualized communicative tasks
Scores suffice for feedback | Individualized feedback and washback
Norm-referenced scores | Criterion-referenced scores
Focus on the right answer | Open-ended, creative answers
Summative | Formative
Oriented to product | Oriented to process
Non-interactive performance | Interactive performance
Fosters extrinsic motivation | Fosters intrinsic motivation

Considerably more time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback. The payoff of the latter, however, comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student’s ability (Brown, 2004:14).

b. Computer-Based Testing
Recent years have seen a burgeoning of assessment in which the test-taker performs
responses on a computer. Almost all computer-based test items have fixed, closed-ended responses.
However, tests like the Test of English as a Foreign Language (TOEFL) offer a written essay section
that must be scored by humans (as opposed to automatic, electronic, or machine scoring).
A specific type of computer-based test, a computer-adaptive test, has been available for many
years but has recently gained momentum. In a computer-adaptive test (CAT), each test-taker receives
a set of questions that meet the test specifications and that are generally appropriate for his or her
performance level. The CAT starts with questions of moderate difficulty. As the test-taker answers each question, the computer scores it and uses that information, as well as the responses to previous questions, to determine which question will be presented next. As long as examinees respond correctly, the computer typically selects questions of greater or equal difficulty. Incorrect answers, however, typically bring questions of lesser or equal difficulty. The computer is programmed to fulfill the test design as it continuously adjusts to find questions of appropriate difficulty for test-takers at all performance levels. In a CAT, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. As a result, test-takers cannot skip questions, and once they have entered and confirmed their answers, they cannot return to them or to any earlier part of the test.
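To make the selection logic concrete, the following minimal Python sketch shows how an adaptive test of this kind might pick its next question. It is only an illustration of the idea described above: the item bank, the 1-to-10 difficulty scale, the fixed number of items, and the function names are invented for this sketch and do not represent the algorithm of the TOEFL or any other real CAT.

def run_cat(item_bank, get_answer, num_items=10):
    """item_bank: list of (question, correct_option, difficulty) tuples,
    with difficulty on an arbitrary 1-10 scale.
    get_answer(question): returns the option the test-taker chooses."""
    remaining = list(item_bank)
    target = 5            # start with an item of moderate difficulty
    score = 0
    for _ in range(min(num_items, len(item_bank))):
        # present the unused item whose difficulty is closest to the current target
        item = min(remaining, key=lambda it: abs(it[2] - target))
        remaining.remove(item)
        question, correct, difficulty = item
        if get_answer(question) == correct:
            score += 1
            target = min(10, difficulty + 1)   # a correct answer brings a harder item next
        else:
            target = max(1, difficulty - 1)    # an incorrect answer brings an easier item next
    return score

Because each question is scored before the next one is chosen, a test-taker cannot skip an item or return to an earlier one, exactly as described above.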

B. Types of Language Testing


We use tests to obtain information. The information that we hope to obtain will of course vary
from situation to situation. It is possible, nevertheless, to categorize tests according to a small number
of kinds of information being sought. This categorization will prove useful both in deciding whether an
existing test is suitable for a particular purpose and in writing appropriate new tests where these are
necessary. Four types of tests described by Hughes (1989:9-14) are proficiency tests, achievement tests, diagnostic tests, and placement tests. Djiwandono (2008) and Brown (2004:43) add one more type: the aptitude test.

Proficiency Tests
Proficiency tests are designed to measure people’s ability in a language regardless of any
training they may have had in that language. The content of a proficiency test, therefore, is not based
on the content or objectives of language courses which people taking the test may have followed.
Rather, it is based on a specification of what candidates have to be able to do in the language in order
to be considered proficient. ‘Proficient’ means having sufficient command of the language for a
particular purpose.
An example of a proficiency test would be a test used to determine whether a student’s English is
good enough to follow a course of study at a British university. Such a test may even attempt to take
into account the level and kind of English needed to follow courses in particular subject areas.
There are other proficiency tests which, by contrast, do not have any occupation or course of
study in mind. For them, the concept of proficiency is more general. A typical example of a
standardized proficiency test is the Test of English as a Foreign Language (TOEFL) produced by the
Educational Testing Service. The function of such a test is to show whether candidates have reached a certain standard with respect to certain specified abilities. Usually, such tests are administered by bodies independent of teaching institutions and so can be relied on by potential employers to make fair comparisons between candidates.
Proficiency tests have traditionally consisted of standardized multiple choice items on grammar,
vocabulary, reading comprehension, aural comprehension and sometimes a sample of writing.

Achievement Tests
In contrast to proficiency tests, achievement tests are directly related to language courses. The
purpose of this kind of test is to establish how successful individual students, group of students, or the
courses themselves have been in achieving objectives.
There are two kinds of achievement tests: final achievement tests and progress achievement tests. Final achievement tests are those administered at the end of a course of study, and their content should be based directly on the course objectives. Progress achievement tests, on the other hand, are intended to measure the progress that students are making. Since ‘progress’ is towards the achievement of the course objectives, these tests too should relate to objectives (short-term objectives).
Diagnostic Tests
Diagnostic tests are used to identify students’ strengths and weaknesses. They are intended primarily to ascertain what further teaching is necessary; therefore, they are designed to diagnose particular aspects of a language.
A diagnostic test in pronunciation, for example, might have the purpose of determining which phonological features of English are difficult for a learner and should therefore become part of the curriculum. Usually, such tests offer a checklist of features for the administrator (often the teacher) to use in pinpointing difficulties. It is not advisable to use a general achievement test for diagnosis, since a diagnostic test should provide information on what students need to work on in the future and will therefore typically offer more detailed, subcategorized information on the learner. Conversely, achievement tests are useful for analyzing the extent to which students have acquired language features that have already been taught.

Placement Tests
Placement tests, as their name suggests, are intended to provide information which will help to
place students at the stage (or in the part) of the teaching program most appropriate to their abilities.
Typically, they are used to assign students to classes at different levels.
Placement tests can be bought, but this is not to be recommended unless the institution concerned is
quite sure that the test being considered suits its particular teaching program. No one placement test
will work for every institution and the initial assumption about any test that is commercially available
must be that it will not work well.
The placement tests which are most successful are those constructed for particular situations. They depend on the identification of the key features at different levels of teaching in the institution. Such placement tests will result in accurate placement.

Aptitude Test
Finally, we need to consider the type of test that is given to a person prior to any exposure to
the second language, a test that predicts a person’s future success. A language aptitude test is
designed to measure a person’s capacity or general ability to learn a foreign language and to be
successful in that undertaking. Aptitude tests are considered to be independent of a particular
language.
Two standardized aptitude tests have been used in the US – the Modern Language Aptitude
Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB). Both are English-language tests that require students to perform such tasks as memorizing numbers and vocabulary, listening to foreign words, and detecting spelling clues and grammatical patterns.

Norm-Referenced Test versus Criterion Referenced Test


In terms of the interpretation of test results, there are two classifications of tests: norm-referenced tests and criterion-referenced tests. In a norm-referenced test, a student’s result is interpreted in relation to the results of other students, in the form of a mean (average score), median (middle score), standard deviation, and/or percentile rank. The purpose of such a test is to place test-takers along a mathematical continuum in rank order. Scores are usually reported back to test-takers in the form of a numerical score (for example, 230 out of 300) and a percentile rank (such as 84 percent, which means that the test-taker’s score was higher than those of 84 percent of the total number of test-takers, and lower than those of the remaining 16 percent, in that administration).
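As a simple illustration, a percentile rank like the one in this example can be computed from the raw scores of a single administration. The short Python sketch below uses one common definition (the percentage of test-takers who scored below the given score); the numbers mentioned in the comment are the hypothetical ones from the example above.

def percentile_rank(score, all_scores):
    """Percentage of test-takers in this administration who scored
    below the given score (one common definition of percentile rank)."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# If percentile_rank(230, scores) returns 84.0, the test-taker with 230 out of 300
# scored higher than 84 percent of the group in that administration.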
A criterion-referenced test, on the other hand, does not compare one student’s performance with that of other students. Rather, it classifies students according to whether or not they are able to perform some task satisfactorily. It is designed to give students feedback, usually in the form of grades, on specific course or lesson objectives. The tasks are set and the performances are evaluated; it does not matter whether all students are successful or not. Those who perform satisfactorily pass; those who do not, fail. This means that students are encouraged to measure their progress in relation to meaningful criteria, without feeling that, because they are less able than their fellows, they are destined to fail.

Objective Testing versus Subjective Testing


The distinction here is between methods of scoring. If no judgment is required on the part of the scorer, the scoring is objective. Typical forms of such testing are the multiple choice test, the matching test, and the true-false test (Djiwandono, 2008:37).
If the scoring requires the scorer’s judgment, it is called subjective. The composition or essay test is an example of a subjective test.

Discussion
1. Try to summarize the strengths and the weaknesses of each testing approach!
2. Among the five approaches described in this book, which one do you think is the most effective for assessing the students’ language proficiency?
3. How can computer-based assessment assist the test administrator and the test-takers?
4. By the end of a language learning program, the teacher gives the students a test. What kind of test does she administer?
5. Give one example of proficiency test and explain the purpose of such test.
CHAPTER III
PRINCIPLES OF LANGUAGE TESTING

This chapter explores how principles of language assessment can and should be applied to formal
tests, but with ultimate recognition that these principles also apply to assessments of all kinds. Brown
(2004) proposes five such principles: reliability, validity, practicality, authenticity, and washback.

A. Reliability
A reliable test is consistent and dependable. If the students are given the same test on two
different occasions, the test should yield similar results. The word ‘similar’ is used here because it is
almost impossible for the test-takers to get exactly the same scores when the test is repeated the
following day. This is because of the fact that human beings do not simply behave in exactly the same
way on every occasion, even when the circumstances seem identical. Therefore, the more similar the
scores are, the more reliable the test is.
Hughes (1989) presents some examples of students’ test scores. Table 3.1a shows the scores obtained by ten students who took a 100-item test (Test A) on one occasion, and the scores they obtained a day later. Note the size of the difference between the two scores for each student.

Table 3.1a Scores on Test A

Student | Score obtained | Score obtained on the following day
Bill | 68 | 82
Mary | 46 | 28
Ann | 19 | 34
Harry | 89 | 67
Cyril | 43 | 63
Pauline | 56 | 59
Don | 43 | 35
Colin | 27 | 63
Irene | 76 | 62
Sue | 62 | 49

Now look at Table 3.1b which displays the same kind of information for another 100-item test B. Again
note the difference in scores for each student.

Table 3.1b Scores on Test B


Student | Score obtained | Score obtained on the following day
Bill | 65 | 69
Mary | 48 | 52
Ann | 23 | 21
Harry | 85 | 90
Cyril | 44 | 39
Pauline | 56 | 59
Don | 38 | 35
Colin | 19 | 16
Irene | 67 | 62
Sue | 52 | 57

From the tables above, it can be seen that the differences between the two sets of scores are much smaller for Test B than for Test A. It can therefore be concluded that Test B appears to be more reliable than Test A, although in practice, with such a small number of test-takers, claims about reliability could not be made so simply.

The Reliability Coefficient


It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients allow us to compare the reliability of different tests. The ideal reliability coefficient is 1: a test with a reliability coefficient of 1 would yield precisely the same results for a particular set of test-takers regardless of when it happened to be administered. By contrast, a test with a reliability coefficient of zero would give sets of results quite unconnected with each other.
Quoting Lado, Hughes (1989) mentions that the reliability coefficients of vocabulary, structure, and reading tests are usually in the range of 0.90 to 0.99, while those of listening comprehension tests are usually in the 0.80 to 0.89 range. Speaking tests may be in the range of 0.70 to 0.79.
There are several ways of obtaining a reliability coefficient, as follows.

1. Test-Retest Method
To arrive at the reliability coefficient of a test, we first have to obtain two sets of scores for comparison. The most obvious way of obtaining these is to have a group of subjects take the same test twice. This is known as the test-retest method.
Consequently, there will be two sets of scores, from the first and the second administration of the same test. The correlation between the two sets of scores is then calculated, usually with the Pearson Product-Moment formula.

Formula 3.1 Pearson Product-Moment formula


rxy = Σ (Y – Ȳ)(X – X̄) / (N · Sy · Sx)

where
rxy = Pearson product-moment reliability coefficient
Y = each student’s score on test Y
Ȳ = the mean score on test Y
Sy = the standard deviation of scores on test Y
X = each student’s score on test X
X̄ = the mean score on test X
Sx = the standard deviation of scores on test X
N = the number of students who took the two tests

However, this method has a weakness: if the second administration of the test is too soon after the first, subjects are likely to recall the items and their responses to them, and the reliability will be spuriously high.
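As an illustration of the calculation itself, the short Python sketch below applies the Pearson formula (Formula 3.1) to the two administrations of Test B from Table 3.1b. The helper function is written only for illustration; population standard deviations are used so that the computation matches Formula 3.1 exactly.

import math

def pearson(x, y):
    """Pearson product-moment coefficient (Formula 3.1), using population
    standard deviations so that r = sum((Y - Ybar)(X - Xbar)) / (N * Sy * Sx)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * sx * sy)

# Test B scores from Table 3.1b: first administration and the following day
first_day  = [65, 48, 23, 85, 44, 56, 38, 19, 67, 52]
second_day = [69, 52, 21, 90, 39, 59, 35, 16, 62, 57]
print(pearson(first_day, second_day))   # a value close to 1 indicates high test-retest reliability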

2. Alternate Forms Method


The recall effect of the test-retest method can be reduced by using two different forms of the same test (the alternate forms method). This method also results in two sets of scores, which are then correlated using the Pearson Product-Moment formula (see Formula 3.1).

3. Split Half Method


The most common methods of obtaining the necessary two sets of scores involve only one
administration of one test. Such method provides us with a coefficient of ‘internal consistency’. The
most basic of these is the split half method. In this, the subjects take the test in the usual way, but each
subject is given two scores. One score is for one half of the test, the second is for the other half. The
two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken
twice. In order for this method to work, it is necessary for the test to be split into two halves which are
really equivalent, through the careful matching of items (in fact where items in the test have been
ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be
adequate).
The correlation between the two sets of scores is then calculated using the Pearson Product-Moment formula. However, this gives only the reliability of half the test; a further calculation using the Spearman-Brown Prophecy formula is needed to estimate the reliability of the whole test.

Formula 3.2 Spearman-Brown Prophecy formula


rxx = 2r / (1 + r)

where
rxx = the reliability of the whole test
r = the correlation coefficient between the two halves, obtained from the Pearson Product-Moment formula
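As a minimal sketch, and assuming each student’s item-by-item scores (1 for correct, 0 for incorrect) are available in the layout shown in the code, the whole split-half procedure might look like this in Python; the odd/even split and the function name are illustrative choices only.

from statistics import correlation  # available in Python 3.10+

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per student.
    Splits the test into odd- and even-numbered items, correlates the two
    half scores, and applies the Spearman-Brown Prophecy formula (Formula 3.2)."""
    odd_half  = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r = correlation(odd_half, even_half)                  # half-test correlation
    return (2 * r) / (1 + r)                              # whole-test estimate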

4. Kuder-Richardson Reliability
Kuder-Richardson reliability requires only one administration of the test. A correct answer to an item is given 1 point, while an incorrect answer is given 0. The two most commonly used formulas are KR-20 and KR-21; the latter is considered simpler than the former.

Formula 3.3 KR-20 formula

KR-20 = [k / (k – 1)] × [1 – Σpq / S²]

where
k = the number of items
p = the proportion of correct answers for an item
q = the proportion of incorrect answers for that item
S = the standard deviation of the total scores

Formula 3.4 KR-21 formula

KR-21 = [k / (k – 1)] × [1 – X̄(k – X̄) / (kS²)]

where k and S are as above and X̄ is the mean of the total scores.
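Assuming the same kind of data (one row of 1/0 item scores per student), the two formulas can be sketched in Python as follows; the variance here is computed over the students’ total scores.

def kr20(item_scores):
    """Kuder-Richardson formula 20 (Formula 3.3) for dichotomously scored items."""
    n = len(item_scores)                 # number of students
    k = len(item_scores[0])              # number of items
    totals = [sum(row) for row in item_scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n   # S squared
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n        # proportion answering item i correctly
        sum_pq += p * (1 - p)                             # q = 1 - p
    return (k / (k - 1)) * (1 - sum_pq / variance)

def kr21(item_scores):
    """Kuder-Richardson formula 21 (Formula 3.4): needs only the mean and variance."""
    n = len(item_scores)
    k = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * variance))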

5. Rater Reliability
Besides the methods of estimating reliability mentioned previously, there is another kind of reliability specialized for subjective tests, in which responses cannot simply be judged as correct or incorrect and a rater is involved in the process of judgment. Examples of such tests are tests of writing and speaking.
Rater reliability also requires two sets of scores, but the scores are not obtained from the test-retest, alternate forms, or split half method. Rather, the two sets of scores come from either intra-rater or inter-rater scoring.
Intra-rater reliability is estimated when one scorer or rater does the scoring twice. Hence, two sets of scores are obtained and the correlation coefficient between them is calculated using the Pearson Product-Moment formula.
Inter-rater reliability is estimated when two scorers or raters do the scoring. Then, as in intra-rater reliability, the two sets of scores obtained from the two raters are correlated to get the coefficient.

The Possible Causes of Unreliability


The issue of the reliability of a test may be best addressed by considering a number of factors that may contribute to the unreliability of a test. The following possible causes of unreliable test results are proposed by Mousavi, as quoted by Brown (2004:21-22).
First, the most common test-taker-related issue in reliability is caused by temporary illness, fatigue, a ‘bad day’, anxiety, and other physical or psychological factors, which may make an observed score deviate from a test-taker’s true score. Also included in this category are such factors as a test-taker’s strategies for efficient test taking.
Secondly, human error, subjectivity, and bias may affect the scorer or the rater. Rater
unreliability occurs when two or more scorers yield inconsistent scores of the same test, possibly for
lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases.
Rater unreliability issues are not limited to contexts where two or more scorers are involved.
Intra-rater unreliability is a common occurrence for classroom teachers because of unclear scoring
criteria, fatigue, bias toward particular good and bad students, or simple carelessness. One solution to
such intra-rater unreliability is to read through about half of the test before rendering any final scores or
grades, then to recycle back through the whole set of tests to ensure an unbiased judgment.
Next, unreliability may also result from the conditions in which the test is administered. One example would be the administration of a test of aural comprehension in which a tape recorder played the items, but because of street noise outside the building, students sitting next to the windows could not hear the tape accurately. Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.
Finally, the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well under a time limit. Poorly written test items (items that are ambiguous or that have more than one correct answer) may also become a source of test unreliability.
How to make tests more reliable
As we have seen, there are two components of test reliability: the consistency of test-takers’ performance from occasion to occasion, and the reliability of scoring. Some ways of achieving consistent performance from test-takers are suggested first, followed by ways of achieving scorer reliability.
 Take enough samples of behavior/performance. The more items on a test, the more reliable the test will be. Of course, the items should be relevant to the objective so that they make an additional contribution to the reliability of the test. While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the test-takers become so bored or tired that the performance they exhibit becomes unrepresentative.
 Do not allow too much freedom. In some kinds of language test, such as writing tests, there is a tendency to offer test-takers a choice of questions and then allow them a great deal of freedom in the way they answer them. Therefore, if the test-takers are given a choice, the range of possible answers should be restricted.
 Write unambiguous items. Test-takers should not be presented with items whose meaning is unclear or which allow unanticipated answers. The best way to prevent ambiguous items is to have colleagues criticize the test items and look for alternative interpretations. The other way of preventing ambiguous test items is to pre-test the items on a group of people comparable to those for whom the test is intended.
 Provide clear and explicit instructions. This applies both to written and to oral instructions. A common fault of test writers is to assume that students will know what is intended despite carelessly worded instructions. Here too, colleagues’ criticism is the best means of avoiding the problem.
 Ensure that tests are well laid out and perfectly readable. Too often, tests are badly typed or handwritten, have too much text in too small a space, or are poorly reproduced. As a result, students are faced with additional tasks which are not related to the performance being measured.
 Make test-takers familiar with the format and testing techniques. If the test format and techniques are not familiar to test-takers, they are likely to perform less well than they would otherwise. For this reason, every effort must be made to ensure that all test-takers have had experience of doing similar tests.

The following suggestions are intended to achieve scorer reliability.


 Use items that permit objective scoring. Multiple choice items permit objective scoring, although good multiple choice items are difficult to write and always need extensive pre-testing. An alternative to the multiple choice item is the open-ended item that has a single one-word correct response.
 Make comparisons between test-takers as direct as possible. In a writing test, scoring compositions that are all written on one topic will be more reliable than scoring compositions when the test-takers are allowed to choose from five or six topics.
 Provide a detailed scoring key. Acceptable answers and the points for any possible responses should be specified. For higher scorer reliability, the key should be as detailed as possible in its point assignment.
 Train scorers. This is especially important where scoring is subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions accurately.
 Identify test-takers by number, not name. Scorers inevitably have expectations of test-takers whom they know, especially in subjective testing. Studies have shown that even when test-takers are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. Identification by number only will reduce such effects.
 Employ multiple, independent scoring. As a general rule, when the testing is subjective, all scripts should be scored by at least two independent scorers, neither of whom should know how the other has scored a test paper. A senior colleague should then compare the scores and investigate any discrepancies.

B. Validity
The most complex criterion of an effective test and the most important principle of language
testing is validity. It is the extent to which inferences made from assessment results are appropriate,
meaningful, and useful in terms of the purpose of the assessment (Gronlund in Brown, 2004:22). A test should test what its writer wants to test. Test validity presupposes that the writer can be explicit about what is to be tested and takes steps to ensure that the test reflects realistic use of the particular ability to be measured (Weir, 1993:19). A valid test of reading ability actually measures reading ability, not previous knowledge or some other irrelevant variable. To measure writing ability, one might ask students to write for 15 minutes and then simply count the words for the final score. Although this would be easy to administer (practical) and the scoring would be quite dependable (reliable), it would not be considered a valid test of writing ability because it takes no account of comprehensibility, organization of ideas, or other factors of writing ability.
How is the validity of a test established? The four types of validity below provide evidence for establishing it.

Content Validity
A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, and so on, being tested. It is obvious that a grammar test, for instance, must be made up of items testing knowledge of grammar; however, this alone is not enough to ensure content validity. The test will have content validity only if it includes a proper sample of the structures or content relevant to the purpose of the test. It would be absurd for an achievement test for intermediate learners to have the same content as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures being tested; a comparison of the test specification and the test content is the basis for judging content validity.

Criterion-Related Validity
Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable test. This independent test is the criterion measure against which the test is validated, and this kind of validity is called criterion-related validity.
There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity (Hughes, 1989:23). Concurrent validity is established when the test and the criterion are administered at about the same time. Demonstrating concurrent validity usually requires one group of students to take two tests: the new test being developed and another well-established test. For instance, to demonstrate the criterion-related validity of a new test called the Test of Overall ESL Proficiency (TOESLP), a test developer might administer it to a group of students wishing to study English in the USA. As a criterion measure, the test developer might also administer a well-established test, the TOEFL, to the same group of students. The two sets of scores obtained from the tests are then correlated. If the calculation yields, say, 0.95, this indicates a very strong relationship between the two sets of scores, and it can be concluded that the new test is as good as the TOEFL.
Predictive validity, on the other hand, concerns the degree to which a test can predict test-takers’ future performance. To demonstrate this kind of validity, the test developer might administer a certain test before the students start a course. After one semester, the same group of students might take the same test, and the scores resulting from the first and second administrations are correlated. The closer the correlation is to 1, the stronger the relationship between the two sets of scores, and the better the test predicts the students’ future performance.

Construct Validity
A test is said to have construct validity if it can be demonstrated that it measures just the ability which it is supposed to measure. The word ‘construct’ refers to any underlying ability which is hypothesized in a theory of language ability. Brown (2004:25) mentions that a construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perception.
The use of an underlying theory in language testing can be illustrated as follows. It is assumed that the ability to write involves a number of sub-abilities, such as control of punctuation, style, and grammar. Given such knowledge of the sub-abilities of writing, we would not, of course, develop a test of writing in the form of multiple choice items, where control of punctuation cannot be demonstrated. The appropriate form of a writing test is therefore one that asks the test-takers to write.

Face Validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which purported to measure pronunciation ability but which did not require the test-takers to speak might be thought to lack face validity. This is true even if the test’s construct and criterion-related validity can be demonstrated. Face validity is hardly a scientific concept, yet it is very important: a test which does not have face validity may not be accepted by test-takers, teachers, education authorities, or employers.

C. Practicality
Besides being reliable and valid, an effective test is practical. This means that it is not
excessively expensive, stays within appropriate time constraints, is relatively easy to administer, and
has a scoring procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a
student five hours to complete is impractical – it consumes more time (and money) than necessary to
accomplish its objective. A test that takes a few minutes for a student to take but several hours for the examiner to evaluate is impractical for most classroom situations.
D. Authenticity
The next principle of a good test is authenticity. The idea of authenticity is usually associated with real-world tasks (tasks likely to be performed in the real world). More specifically, an authentic test is described by Brown (2004:35) as having the following characteristics:
 The language in the test is as natural as possible.
 Items are as contextualized as possible rather than isolated.
 Topics and situations are meaningful (relevant, interesting, enjoyable, and/or humorous) for the
students.
 Some thematic organization is provided, such as through a story line or episode.
 Tasks represent, or closely approximate, real-world tasks.

E. Washback
Washback generally refers to the effects a test has on instruction. It can take the form of how students prepare for the test; “cram” courses and “teaching to the test” are examples of such washback. Another form of washback, which occurs more in classroom assessment, is the information that ‘washes back’ to the students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment. Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback. Formal tests can also have positive washback, but they provide little washback if the students receive only a simple letter grade or a single overall numerical score.
The challenge to teachers is to create classroom tests that serve as learning devices through
which washback is achieved. Students’ incorrect responses can become windows of insight into further
work. Their correct responses need to be praised. Finally, washback enhances a number of basic
principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, and
interlanguage (Brown, 2004:29).

Discussion
1. Review the five basic principles of language testing and define them clearly!
2. Reliability can be achieved through many ways. Explain the ways briefly!
3. There are many factors causing unreliability of a test. Discuss the factors concisely!
4. Why do you think that content validity is important?
5. Do you think that face validity is essential in achieving a valid test?
6. What things should be considered to develop a practical test?
7. What do you know about an authentic test?
8. Give examples of washback that can be achieved by test administration!
CHAPTER IV
TESTING LANGUAGE SKILLS AND COMPONENTS

This chapter discusses several ways of testing language skills and components, preceded by some explanation of various test techniques.

A. Test Techniques and Testing Overall Ability


Test techniques are means of eliciting behavior from test-takers that will reveal their language abilities. Some test techniques are discussed here, followed by the cloze test and dictation, which are used to test overall ability.

Multiple Choice Test


Multiple choice items take many forms, but their basic structure is as follows.
There is a stem:
Andy has been here ….. half an hour.
and a number of options, one of which is correct, the others being distractors:
A. during
B. for
C. while
D. since

It is the test-takers’ task to identify the correct or most appropriate option (in this case B).
The multiple choice technique has several advantages. The most obvious is that scoring can be perfectly reliable; it should also be rapid and economical. A further considerable advantage is that more items can be included than in other forms of test, since the test-takers only have to make a mark on the paper.
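The reliability and speed of scoring come from the fact that marking is a purely mechanical comparison with an answer key, as the tiny Python sketch below illustrates; the answer key and the responses shown are invented for this example.

answer_key = {1: "B", 2: "D", 3: "A"}          # item number -> correct option

def score(responses):
    """responses: dict mapping item number to the option the test-taker marked."""
    return sum(1 for item, key in answer_key.items() if responses.get(item) == key)

print(score({1: "B", 2: "C", 3: "A"}))         # prints 2: two of the three items are correct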
Despite these advantages, the multiple choice technique also has some limitations. It tests only recognition knowledge and cannot give an accurate picture of test-takers’ performance. A multiple choice grammar test score, for example, may be a poor indicator of someone’s ability to use grammatical structures: the person who can identify the correct response in the item above may not be able to produce the correct form when speaking or writing. Therefore, the construct validity of the technique is questionable.
In addition, the multiple choice technique gives test-takers a chance of guessing the correct answer, and it will never be known what part of any particular individual’s score has come about through guessing.
Writing successful multiple choice items is also extremely difficult. Hughes (1989:61) lists some of the commonest problems in multiple choice tests: there may be more than one correct answer, there may be no correct answer, there may be clues in the options as to which is correct, and the distractors may be ineffective.
Practicing the language through multiple choice items is also not the best way for students to improve their command of it, since much attention is usually paid to improving one’s guessing rather than to the content of the items. For this reason, Hughes (1989:61) considers multiple choice tests to have harmful backwash.
Finally, the multiple choice technique is said to facilitate cheating, because the responses (a, b, c, d) are so simple that they can easily be communicated to other students nonverbally.
All in all, the multiple choice technique is best suited to relatively infrequent testing of large numbers of test-takers. In order to make effective multiple choice items, Djiwandono (2008:47) suggests that the test developer be careful in formulating the stem, the correct answer, and the distractors. The stem should be in the form of a complete sentence whenever possible. To discourage guessing, it is important that the options be comparable in form, content, and length; such options will force the students to think critically.

Matching Test
Matching tests require the students to match two parts of a test. The two parts are usually interrelated in terms of meaning or content and are typically presented as two lists: the first usually consists of statements or questions, while the second consists of responses. To make a matching test effective, the number of responses should be greater than the number of statements; this is meant to make the students think critically up to the last question.

True False Test


Like the matching test, the true-false test has two parts. The first part consists of a list of statements; the second part is the pair of choices true (T) or false (F) listed beside each statement. The students choose true (T) when the statement is considered correct and false (F) when it is not.

Cloze, C-Test, and Dictation: measuring overall ability


Cloze, C-Test, and dictation techniques are recommended as means of measuring overall ability
because they are considered economical. The original form of the cloze test involves deleting a number of
words in a passage, leaving blanks, and requiring the test-taker to replace the original words. After
a short unmutilated 'lead in', it is usually about every seventh word which is deleted. The cloze
procedure seems very attractive: it is easy to construct, administer, and score.
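To make the construction procedure concrete, here is a minimal sketch in Python of building a cloze passage by deleting roughly every seventh word after a short lead-in; the function name, parameters, and sample passage are purely illustrative, not taken from any published test.

def make_cloze(passage, lead_in_words=15, nth=7):
    # Replace every nth word after the lead-in with a numbered blank
    # and keep the deleted words as the answer key.
    words = passage.split()
    answers = []
    for i in range(lead_in_words, len(words), nth):
        answers.append(words[i])
        words[i] = "({}) ______".format(len(answers))
    return " ".join(words), answers

passage = ("There are usually five men in the crew of a fire engine. "
           "One of them drives the engine. The leader sits beside the driver. "
           "The other firemen sit inside the cab of the fire engine.")
cloze_text, key = make_cloze(passage)
print(cloze_text)   # the mutilated passage with numbered blanks
print(key)          # the deleted words, to be used as the scoring key

In practice the tester would still check each blank by hand, since a purely mechanical deletion sometimes removes words that cannot be restored from context.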
The C-Test is really a variety of cloze which is considered superior to the general cloze
procedure. Instead of whole words, it is the second half of every second word which is deleted. An
example follows.
There are usually five men in the crew of fire engine. One o__ them dri__ the eng__. The
lea___ sits bes___ the dri___. The ot___ firemen s__ inside t__ cab o__ the f___ engine. T__
leader h__ usually be___ in t___ Fire Ser___ for ma___ years. H___ will kn___ how t___
diff___ sorts o___ fires. S__, when t___ firemen arr___ at a fire, it is always the leader who
decides how to fight a fire.

The supposed advantages of the C-Test over the more traditional cloze are that only exact scoring is
necessary and that shorter passages are possible. A possible disadvantage of the C-Test is that it is harder
to read than a cloze passage.
Dictation is a testing technique in which a passage is read aloud to students, with pauses
during which they have to write down what they heard as accurately as possible (Richards et al., 1992).
A dictation test gives results similar to those obtained from a cloze test. In predicting overall ability, it has
the advantage of involving listening ability. It is also easy to create and relatively easy to administer, but
certainly not easy to score. Because of this scoring problem, partial dictation may be considered as an
alternative. In this technique, part of what is dictated is already printed on the answer sheet. The test-takers
simply fill in the gaps, and the scoring is likely to be more reliable.

B. Testing Language Skills and Language Components


As indicated previously, the targets of language testing are language skills (listening,
speaking, reading, and writing) and language components (grammar, vocabulary, and pronunciation).
The discussion of each skill and component below is summarized from Hughes (1989).

Testing Listening
It may seem rather odd to test listening separately from speaking, since the two skills are
typically exercised together in oral interaction. However, there are occasions, such as listening to the
radio, listening to lectures, or listening to announcements, when no speaking is required.
The testing of listening involves listening macro-skills and micro-skills. The macro-skills of
listening include listening for specific information, obtaining gist of what is being said, following
directions, and following instructions. The micro-skills of listening include interpretation of intonation
patterns and recognition of function of structures. At the lowest level are abilities like being able to
distinguish between phonemes (for example between /w/ and /v/).
There are several types of texts that can be used for listening test such as monologue,
dialogue, or multi-participant. Those types can be in the forms of announcement, talk or lecture,
instructions, directions, etc.
The sources of listening test materials can be recordings of radio broadcasts, television
broadcasts, teaching materials, or even recordings of native speakers that we make ourselves. The most
important thing to consider is that recordings must be good and natural.
There are some techniques that are possibly used in testing listening. They are multiple choice,
short answer, information transfer, note taking, partial dictation and recordings and live presentations.
 Multiple choice The technique has some advantages and disadvantages as discussed
previously. For a listening test, the problem is greater because the test-takers have to listen to a
passage while reading the alternatives/options. Therefore, the options must be short and
simple.
 Information transfer This technique is useful in testing listening since it makes minimal
demands on productive skills. It can involve such activities as the labeling of diagrams or
pictures, completing forms, or showing routes on a map.
 Note taking Where the ability to take notes while listening to a lecture is in question, this activity
can be quite realistically replicated in the testing situation. Test-takers take notes during the
talk, and only after the talk is finished do they see the items to which they have to respond.
 Partial dictation Although partial dictation may not be an authentic listening activity, it may be
possible to administer it when no other test of listening is practical.
 Recording and live presentation The great advantage of using recordings when administering a
listening test is that there is uniformity in what is presented to the test-takers. This is fine if the
recording is to be listened to in a well-maintained language laboratory or in a room with good
acoustic qualities and suitable equipment. If these conditions cannot be met, then a
live presentation is preferred.

Testing Speaking
The objective of teaching speaking is the development of the ability to interact successfully in
the target language, and therefore speaking involves comprehension as well as production. Consequently,
a speaking test should elicit behavior which truly represents the students' ability and
which can be scored validly and reliably.
The materials tested for speaking test include dialog and multi-participant interactions including
operations of language functions such as:
- Expressing: thanks, requirements, opinions, comment, attitude, confirmation,
apology, want/need, information, complaints, reasons/justifications.
- Narrating: sequence of events
- Eliciting: information, directions, service, clarification, help, permission.
- Directing: ordering, instructing, persuading, advising, warning
- Reporting: description, comment, decisions.
There are several formats that can be used to test speaking ability. They are interview, interaction
with peers, and response to tape-recordings. Each format has some techniques.
 Interview It is the most obvious format for testing speaking.
 Questions and requests for information. For questions and requests, yes/no
questions should be avoided.
 Pictures can also be used to elicit descriptions. Series of pictures (or video
sequences) form a natural basis for narration.
 Interaction with peers Two or more test-takers may be asked to discuss a
topic, make plans, and so on.
 Role play Students can be asked to assume a role in a particular situation
and the tester can act as an observer.
 Response to tape-recordings Uniformity of elicitation procedures can be
achieved by presenting all students with the same audio (or video) recordings.
 Imitation The test-takers hear a series of sentences, each of which they
have to repeat in turn.

Scoring will be valid and reliable only if clearly recognizable and appropriate descriptions of criteria
levels are written and scorers are trained to use them. Description of speaking proficiency usually deals
with accent, grammar, vocabulary, fluency and comprehension as in the following examples taken from
Hughes (1989).

Proficiency Description
Accent
1. Pronunciation frequently unintelligible
2. Frequent gross errors and a very heavy accent make understanding difficult, require frequent
repetition.
3. Foreign accent requires concentrated listening, and mispronunciations lead to occasional
misunderstanding and apparent errors in grammar or vocabulary.
4. Marked foreign accent and occasional mispronunciations which do not interfere with
understanding.
5. No conspicuous mispronunciation, but would not be taken for a native speaker.
6. Native pronunciation, with no trace of foreign accent.

Grammar
1. Grammar almost entirely inaccurate except in stock phrases.
2. Constant errors showing control of very few major patterns and frequently preventing
communication
3. Frequent errors showing some major patterns uncontrolled and causing occasional irritation
and misunderstanding
4. Occasional error showing imperfect control of some patterns but no weakness that cause
misunderstanding.
5. Few errors, with no patterns of failure.
6. No more than two errors during the interview.

Vocabulary
1. Vocabulary inadequate for even the simplest conversation.
2. Vocabulary limited to basic personal and survival areas (time, food, transportation, family, etc.)
3. Choice of words sometimes inaccurate, limitations of vocabulary prevent discussion of some
common professional and social topics
4. Professional vocabulary adequate to discuss special interests; general vocabulary permits
discussion of any non-technical subject with some circumlocutions.
5. Professional vocabulary broad and precise; general vocabulary adequate to cope with complex
practical problems and varied situations.
6. Vocabulary apparently as accurate and extensive as that of an educated native speaker.

Fluency
1. Speech is so halting and fragmentary that conversation is virtually impossible.
2. Speech is very slow and uneven except for short or routine sentences.
3. Speech is frequently hesitant and jerky; sentences may be left uncompleted
4. Speech is occasionally hesitant, with some unevenness caused by rephrasing and groping for
words.
5. Speech is effortless and smooth, but perceptibly non-native in speed and evenness.
6. Speech on all professional and general topics as effortless and smooth as a native speaker’s.

Comprehension
1. Understands too little for the simplest type of conversation.
2. Understands only slow, very simple speech on common social and touristic topics; requires
constant repetition and rephrasing.
3. Understands careful, somewhat simplified speech when engaged in a dialogue, but may
require considerable repetition and rephrasing.
4. Understands quite well normal educated speech when engaged in a dialogue, but requires
occasional repetition or rephrasing.
5. Understands everything in normal educated conversation except for very colloquial or low-
frequency items, or exceptionally rapid or slurred speech.
6. Understands everything in both formal and colloquial speech to be expected of an educated
native speaker.
Besides using clear descriptions of criteria levels, the use of more than one scorer will
decrease the subjectivity as described earlier. If two testers are involved in an interview, then they can
independently assess each test-taker. If they disagree, even after discussion, then a third scorer may
be referred to.

Testing Reading
Similar to listening, reading is a receptive skill. The task of the language tester is, then, to
set reading tasks which will result in behavior that demonstrates their successful completion.
The reading macro-skills (directly related to course objectives) are scanning text to locate
specific information, skimming text to obtain the general idea, identifying stages of an argument, and
identifying examples presented in support of an argument. The micro-skills underlying reading are
identifying referents of pronouns, using context to guess the meaning of unfamiliar words, and
understanding relations between parts of a text.
The reading texts can be taken from textbooks, novels, newspapers, magazines, academic
journals, letters, timetables, etc. The texts can be in the forms of newspaper reports, advertisements,
editorials, etc.
The techniques that might be used to test reading skills are multiple choice, true/false,
completion, short answer, guided short answer, summary cloze, information transfer, identifying order
of events, identifying referents, guessing the meaning of unfamiliar words from context.
 Multiple Choice The test-takers provide evidence of successful reading by making a mark
against one out of a number of alternatives. Its strengths and weaknesses have been
presented earlier.
 True/false The test-takers should respond to a statement by choosing one of the two choices,
true or false.
 Completion The students are required to complete a sentence with a single word, for example:
……………was the man responsible for the first steam railway.
 Short answer It is in the form of questions and requires the students to answer briefly, for
example:
According to the author, what does the increase in divorce rates show about people’s expectations of
marriage?

 Guided short answer This is the alternative of short answer in which students are guided to
have the intended answer. They have to complete sentences presented to them, for example:
Complete the following based on the fourth paragraph!
‘Many universities in Europe used to insist that their students speak and write only ………………… Now
many of them accept ………………….. as an alternative, but not a ………………. of the two.

 Summary cloze A reading passage is summarized by the tester, and then gaps are left in the
summary for completion by the test-takers. This is really an extension of the guided short
answer.
 Information transfer One way to minimize demands on writing by test-takers is to require them
to show successful completion of a reading task by supplying simple information in a table,
following a route on a map, labeling a picture, and so on.
 Identifying order of events, topics, or arguments The test-takers can be required to number the
events etc.
 Identifying referents One of the ‘microskills’ listed previously was the ability to identify referents.
An example of an item to test this is:
What does the word ‘it’ (line 25) refer to? ……………………
 Guessing the meaning of unfamiliar words from context This is another microskill mentioned
earlier. Items may take the form:
Find a single word in the passage (between lines 1 and 25) which has the same meaning as ‘making of
laws’.

The above techniques are among the many techniques of testing reading. In scoring the reading test,
Hughes (1989) suggested that errors of grammar, spelling or punctuation should not be penalized, as
long as it is clear that the test-taker has successfully performed the reading task which the item set.
The function of a reading test is to test reading ability. To test productive skills at the same time simply
makes the measurement of reading ability less accurate.

Testing Writing
The best way to test people’s writing ability is to get them to write directly. Indirect tests of
writing ability are difficult to construct accurately, even for professional
testing institutions. There are three things that we should consider in developing a good test of writing:

1. We have to set writing tasks that are properly representative.


It is impossible to have the students do a wide variety of tasks in a short test with few items. It is the test
developer's task to create a representative sample of tasks. One example is provided by Hughes (1989)
in developing a test of English for academic purposes. The purpose of the test is to discover whether a
student's written English is adequate for study through the medium of English at a particular overseas
university. An analysis of needs had revealed that the most important uses of written English were
taking notes in lectures and writing examination answers up to two paragraphs in
length. Using this description, we can define the relevant tasks, such as asking the students to
describe, to explain, to compare and contrast, and to argue for something.

2. The tasks should elicit samples of writing which truly represent the students’ ability.
There are at least two things we can do to obtain samples that properly represent each
student's ability. The first is to set as many tasks as is feasible. The reason for this is that
students' performance on the same task is not consistent, and they are sometimes better at some
tasks than others. Therefore, giving many different tasks will enable the test developer to see the
students' performance as objectively as possible.
The second way to elicit students' writing ability is to test only writing ability. One ability
which at times interferes with the accurate measurement of writing is reading. It is acceptable to expect
students to be able to read simple instructions, but asking the students to read very difficult and long
instructions in a writing test should be avoided, since it will prevent the students from performing adequately
on the writing task. One way to reduce dependence on the students' ability to read is to make use of
illustrations in the form of diagrams, series of pictures, or graphs.

3. The samples of writing can and will be scored reliably.


To facilitate reliable scoring, a test developer should set as many tasks as possible. The more
scores that scorers provide for each student, the more reliable the total score should be. The test-
takers should not be given too many choices of writing tasks. As discussed previously, the test-takers
should perform the same tasks to make the scoring easier. Finally, the samples of writing which are
elicited should be long enough for judgments to be made reliably.
To obtain reliable scoring of writing, the scoring process can be done either holistically or
analytically. Holistic scoring involves the assignment of a single score to a piece of writing on the basis
of an overall impression of it. This kind of scoring has the advantage of being very rapid. The following
is an example of holistic scoring provided by Cohen (1994:327-328).

Holistic Scoring:
5 The main idea is stated very clearly, and there is a clear statement of change of opinion. The essay
is well organized and coherent. The choice of vocabulary is excellent. There are no major or minor
grammatical errors. Spelling and punctuation are fine.
4 The main idea is fairly clear, and change of opinion is evident. The essay is moderately well
organized and is relatively coherent. The vocabulary is good, and there are only minor grammatical
errors. There are few spelling and punctuation errors.

3 The main idea and a change of opinion are indicated but not so clearly. The essay is not well
organized and is somewhat lacking in coherence. The vocabulary is fair, and there are some major
and minor grammatical errors. There are a fair number of spelling and punctuation errors.

2 The main idea and change of opinion are hard to identify in the essay. The essay is poorly
organized and relatively incoherent. The use of vocabulary is weak, and grammatical errors appear
frequently. Spelling and punctuation errors are frequent.

1 The main idea and change of opinion are absent in the essay. The essay is poorly organized and
generally incoherent. The use of vocabulary is very weak, and grammatical errors appear very
frequently. Spelling and punctuation errors are very frequent.

A method of scoring which requires a separate score for each of a number of aspects of a writing task is
said to be analytic. The following is an example of analytic scoring provided by Cohen (1994:328-329).

Analytic Scoring:
Content
5 – Excellent : main ideas stated clearly and accurately, change of opinion very clear
4 – Good : main ideas stated fairly clearly and accurately, change of opinion relatively clear
3 – Average : main ideas somewhat unclear and inaccurate, change of opinion somewhat weak
2 – Poor : main ideas not clear or accurate, change of opinion weak
1 – Very Poor : main ideas not at all clear or accurate, change of opinion very weak

Organization
5 – Excellent : well organized and perfectly coherent
4 – Good : fairly well organized and generally coherent
3 – Average : loosely organized but main ideas clear, logical but incomplete sequencing
2 – Poor : ideas disconnected, lacks logical sequencing
1 – Very poor : no organization, incoherent

Vocabulary
5 – Excellent : very effective choice of words and use of idioms and word forms
4 – Good : effective choice of words and use of idioms and word forms
3 – Average : adequate choice of words but some misuse of vocabulary, idioms and word forms
2 – Poor : limited range, confused use of words, idioms, and word forms
1 – Very Poor : very limited range, very poor knowledge of words, idioms, and word forms

Grammar
5 – Excellent : no errors, full control of complex structure
4 – Good : almost no errors, good control of structure
3 – Average : some errors, fair control of structure
2 – Poor : many errors, poor control of structure
1 – Very Poor : dominated by errors, no control of structure
Mechanics
5 – Excellent : mastery of spelling and punctuation
4 – Good : few errors in spelling and punctuation
3 – Average : fair number of spelling and punctuation errors
2 – Poor : Frequent errors in spelling and punctuation
1 – Very poor : no control over spelling and punctuation

The choice between holistic and analytic scoring depends on the purpose of testing (Hughes,
1989:97). If diagnostic information is required, then analytic scoring is essential. If the scoring is carried
out by a small group of people, then holistic scoring may be appropriate. Analytic scoring is used when
scoring is conducted by heterogeneous, less well-trained people or in a number of different places.
However, whichever is used, multiple scoring involving two or more scorers is suggested.

Testing Grammar
The place of grammar in language teaching is sometimes debatable. Some think that
control of grammatical structure is the core of language ability and that it would be
unthinkable not to test it. For that reason, most proficiency tests include a grammar section, besides the
fact that large numbers of grammar items can be administered and scored within a short
period of time.
In contrast, others see that one cannot accurately predict mastery of a language by measuring
control of what we believe to be the abilities that underlie it. Besides, the backwash effect of a grammar
test may encourage the learning of grammatical structures in isolation, with no apparent need to use
them. Considerations of this kind have resulted in the absence of a grammar component from
some well-known proficiency tests.
However, whether or not grammar has an important place in an institution's teaching, it has to
be accepted that grammatical ability has an important influence on someone's performance. Successful
academic writing, for example, must depend to some extent on command of
elementary grammatical structures. Therefore, it can be said that there is still room for a grammar
component in a language test.
The specification of grammar test should be in line with the teaching syllabus if the syllabus
lists the grammatical structures to be taught. When there is no such list, it becomes necessary to infer
from textbooks or other teaching materials.
There are some techniques that can be used to test grammar. Multiple choice is one alternative.
However, it is not recommended because of the difficulty of finding appropriate distractors. The other
proposed techniques are paraphrase, completion, and modified cloze.
 Paraphrase This technique requires the students to write a sentence equivalent in meaning to
one that is given. It is helpful to give part of the paraphrase in order to restrict the students to
the grammatical structures being tested. An example of testing passive past continuous form
would be:
When we arrived, a policeman was questioning the bank clerk.
When we arrived, the bank clerk ……………………………..

 Completion This technique can be used to test variety of structures. The following is an
example of testing interrogative forms:
In the following conversation, some sentences have been left incomplete. Complete them
suitably. Read the whole conversation before you begin to answer the question.

(Mr. Cole wants a job in Mr. Gilbert’s export business. He has come for an interview.)

Mr. Gilbert: Good morning, Mr. Cole. Please come in and sit down. Now let me see. (1)
Which school ……………………………………………………….?
Mr. Cole: Whitestone College
Mr Gilbert: (2) And when …………………………………………………………...?
Mr. Cole: In 1999, at the end of the summer term.
Mr. Gilbert: (3) And since then, what ……………………………………………….?
Mr. Cole: I worked in a bank for a year. Then I took my present job, selling cars. But I
would like a change now.
Mr. Gilbert: (4) Well, what sort of a job ……………………………………………?
Mr. Cole: I’d really like to work in your Export Department.

 Modified cloze This technique can be in the form of the following example of testing articles:

Write the, a, or NA (No Article) in the blanks.


In England, children go to ..… school from Monday to Friday. ..… school that Mary goes to is
very small. She walks there each morning with ….. friend. One morning, they saw ….. man
throwing ….. stones and ….. pieces of wood at ….. dog. ….. dog was afraid of ….. man.

In the scoring process, the scorer should only score what the item is testing, not something
else. For instance, when the focus is to test pronouns, an error such as a missing third person -s should not
be penalized. Finally, for valid and reliable scoring of grammar items, careful preparation of the scoring
key is necessary.

Testing Vocabulary
The debate on testing vocabulary is similar to that on testing grammar. Clearly, knowledge of
vocabulary is essential to the development and demonstration of linguistic skills. But according to some
people, that does not mean that it should be tested separately.
On the other hand, some argue that time should be devoted to the regular, conscious
teaching of vocabulary. Thus, it is important to test vocabulary in an achievement test
after teaching.
The specification for a vocabulary achievement test should be based on all items presented to
the students in vocabulary teaching. When a placement test is used, the vocabulary being tested
should refer to one of the common published word lists.
Testing vocabulary productively is difficult, and information on receptive ability is usually regarded as
sufficient. The following techniques are therefore suggested only for possible use in achievement tests.
 Pictures The use of pictures can limit the students to lexical items that we have in mind. Some
pictures are provided and the students are required to write down the names of the objects.
This method of testing vocabulary is obviously restricted to concrete nouns which can be
drawn.
 Definitions This may work for a range of lexical items. The following is an example of such test.
A …… is a person who looks after our teeth.
……… is frozen water.
……… is the second month of the year.

But not all items can be identified using a definition. Nor can all words be defined entirely in
words more common or simpler than themselves.
 Gap filling This can take the form of one or more sentences with a single word missing.
Because of the snow, the football match was ….. until the following week.
I ….. to have to tell you this, Mrs. Jones, but your husband has had an accident

To avoid a variety of possible answers, the first letter of the word or even an indication of the number of
letters can be given.

Testing Pronunciation
Heaton (1990) includes pronunciation in the testing of speaking skill. There are at least three
techniques of testing pronunciation: pronouncing words in isolation, pronouncing words in
sentences, and reading aloud.
 Pronouncing words in isolation The importance of listening in almost all tests of speaking,
especially those of pronunciation, should never be underestimated. It is impossible for students
to pronounce words correctly unless they first hear and recognize the precise sound of that
word. In the early stages of learning English, it is useful to base our pronunciation tests on
minimal pairs: that is, pairs of words which differ only in one sound, for example:
bud bird ferry fairy
nip nib boss bus
pill pail knit lit
ball bowl fry fly
sheet seat sport support

Pictures can also be used to test the students’ pronunciation. The students can be shown pictures
and asked to identify the object of each picture. Each picture is based on a possible source of
confusion. For example, a picture of ship can be used to test the students to distinguish between sheep
and ship.
 Pronouncing words in sentences Students can also be asked to read aloud sentences
containing the problematic sounds which we want to test. Sentences are, of course, preferable
because they provide a context for the sounds (as in real life). For example:
There were several people standing in the hole. (hole/hall)
Are you going to sail your boat today? (sail/sell)
Do you like this sport? (sport/spot)

 Reading aloud Reading aloud can offer a useful way of testing pronunciation provided that we
give a student a few minutes to look at the reading text first. When we choose suitable texts to
read aloud, it is useful to imagine actual situations when someone may read something aloud.
For example, people read aloud news on TV, letters, or instructions.

Discussion
1. Using the cloze test passage in this chapter, complete it and say what you think each item is
testing.
2. Discuss when and how multiple choice tests can be used appropriately in an English
classroom.
3. What advantages can we have in testing language proficiency by using dictation?
4. Design a test that requires the test-takers to draw (complete) a simple picture after listening to an
instruction!
5. Can reading aloud be included as one technique of testing reading ability?
6. Do you think grammar should be tested separately?
7. What do you think is the best way of testing writing ability?
CHAPTER V
DESIGNING CLASSROOM TESTS AND STANDARDIZED TESTS

This chapter provides the teachers with step-by-step procedures in designing classroom tests
and standardized tests. Most of the explanation is summarized from Brown (2004).

A. Designing Classroom Tests


We, as teachers, need to put in a lot of effort and time to design and refine a good test
through trial and error. Here are some practical steps in constructing a classroom test, adapted from
Brown (2004).

Assessing Clear, Unambiguous Objectives


When we want to develop a good classroom test, we need to know as specifically as possible
what we want to test. We can do this by looking carefully at what we think the students should “know” or
“be able to do,” based on the material that the students are responsible for. In other words, we need to
examine the objectives for the unit we are testing.
Ideally, every curriculum has appropriately framed assessable objectives, that is, objectives
that are stated in terms of explicit performance by students. “Students will produce yes/no questions with
final rising intonation” is a good example of an objective because the acceptable level of students'
performance is specified and the objective can be tested. Unfortunately, however, not all objectives are
stated clearly. Then, we have to go back through a unit and formulate them by ourselves. Remember
that we have to state the performance elicited and the target linguistic domain. See Table 5.1

Table 5.1 Example of Selected Objectives for a unit in a low-intermediate integrated-skills course
(Brown, 2004:50)

Form-focused objectives (listening and speaking)


Students will
1. recognize and produce tag questions, with correct grammatical form and final intonation
pattern, in simple social conversations.
2. recognize and produce wh-information questions with correct final intonation pattern.

Communication skills (speaking)


Students will
3. state completed actions and events in a social conversation.
4. ask for confirmation in a social conversation.
5. give opinions about an event in a social conversation.
6. produce language with contextually appropriate intonation, stress, and rhythm.

Reading (simple essay or story)


Students will
7. recognize irregular past tense of selected verbs in a story or essay.

Writing skills (simple essay or story)


Students will
8. write a one paragraph story about a simple event in the past.
9. use conjunctions ‘so’ and ‘because’ in a statement of opinion.

In reviewing the objectives of a unit, we cannot possibly test each one. We will then need to
choose a possible subset of the objectives to test.

Drawing Up Test Specifications


Test specifications for classroom use can be a simple and practical outline of our test. Test
specifications will simply comprise (a) a broad outline of the test, (b) the skills being tested, and (c) test
items. The following test specification based on the objectives in Table 5.1 will provide a clearer
description of how to make it.

Table 5.2 Test Specifications


Speaking (5 minutes per person)
- Format : oral interview, teacher and students
- Tasks : teacher asks questions to students (objectives 3 and 5,
emphasis on 6)

Listening (10 minutes)


- Format : teacher makes audiotape in advance, with one other voice on it
- Tasks : a. minimal pair items, multiple choice (objective 1)
          b. 5 interpretation items, multiple choice (objective 2)

Reading (10 minutes)


- Format : cloze test items (10 total) in a story line
- Tasks : fill-in-the-blanks (objective 7)

Writing (10 minutes)


- Format : prompt for a topic: why I liked/disliked a recent TV sitcom
- Tasks : writing a short opinion paragraph (objective 9)

These informal, classroom-oriented specifications give us an indication of the topics
(objectives), the format of the test, the number of items in each section, and the time allocated for each.
Notice that not all objectives are tested (objectives 4 and 8 are omitted). This is because of time limitations.

Devising Test Tasks


For the oral interview, we have to draft the questions to conform to the accepted pattern of oral
interviews, which begin and end with nonscored items (warm-up and wind-down) designed to put
students at ease, and sandwich between them items intended to test the objectives (level check)
and a little beyond (probe).

Table 5.3 Oral Interview Form


A. Warm-up: questions and comments
B. Level-check questions (objectives 3, 5, and 6)
1. Tell me about what you did last weekend.
2. Tell me about an interesting trip you took last year.
3. How did you like the TV show we saw this week?
C. Probe (objectives 5 and 6)
1. What is your opinion about _____? (news event)
2. How do you feel about _____? (another news event)
D. Wind-down: comments and reassurance

Now, we are ready to draft other test items. To provide a sense of authenticity and interest, we
make the items based on the context of a recent TV sitcom that we have used in class. The sitcom
described a loud, noisy party with lots of small talk. Finally, we have the following samples of test items
for each section.

Table 5.4 Test items sample (first draft)


Listening, part a (sample item)
Directions: Listen to the sentence on the tape. Choose the sentence on your test page
that is closest in meaning to the sentence you heard.
- Voice : They sure made a mess, didn’t they?
- Students read : a. They didn’t make a mess, did they?
b. They did make a mess, didn’t they?

Listening, part b. (sample item)


Directions: Listen to the question on the tape. Choose the sentence on your test page
that is the best answer to the question.
- Voice : Where did George go after the party last night?
- Students read : a. Yes, he did.
b. Because he was tired.
c. To Elaine’s place for another party
d. He went home around eleven o’clock
Reading (sample items)
Direction: Fill in the correct tense of the verb (in parentheses) that should go in each
blank.
Then, in the middle of this loud party, they (hear) _____ the loudest thunder you have
ever heard! And then right away lightning (strike) _____ right outside their house!

Writing
Directions: Write a paragraph about what you liked or didn’t like about one of the
characters at the party in the TV sitcom we saw.
However, the above test items are quite traditional. It should be admitted that the format of
some of the items is unnatural, thus lowering the level of authenticity. Therefore, the above test items
need to be revised.
In revising our draft, we need to ask some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple choice item have appropriate distractors?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Does the sum of the items and the test as a whole adequately reflect the learning objectives?

In the current example that we have been analyzing, our revising process is likely to result in at
least four changes or additions:
1. In both the interview and writing sections, we recognize that a scoring rubric will be essential. For
the interview we decide to create a holistic scale, and for the writing section we devise a
simple analytic scale that captures only the objectives we have focused on (see the previous
chapter).
2. In the interview questions, we realize that follow-up questions may be needed for students
who give one-word or very short answers.
3. In the listening section, part b, we intend choice “c” as the correct answer, but we realize that
choice “d” is also acceptable. We need an answer that is unambiguously incorrect, so we shorten
it to “d. Around eleven o’clock.” We also note that providing the prompts for this section on audio
recording will be logistically difficult, and so we opt to read these items to the students.
4. In the writing prompt, we can see how some students might not use the words so or because,
which were in our objectives, so we re-word the prompt: “Name one of the characters at the
party in the TV sitcom we saw. Then use the word so at least once and the word because at
least once to tell why you liked or didn’t like that person.”
Ideally, we should try out all our tests on students who are not in our class before actually
administering them. But in our daily classroom teaching, the tryout phase is almost impossible.
Alternatively, we could enlist the aid of a colleague to look over our test.
In the final revision of our test, we should imagine that we are students taking the test, go through each
set of directions and all items slowly and deliberately, and preferably time ourselves. If the test should
be shortened or lengthened, we should make the necessary adjustments and be sure that
everything is in order.

B. Standardized Tests
A standardized test presupposes certain standard objectives, or criteria, that are held constant
from one form of the test to another. The criteria in large-scale standardized tests are designed to
apply to a broad band of competencies that are usually not exclusive to one particular curriculum. A
good standardized test is the product of a thorough process of empirical research and development. It
dictates standard procedures for administration and scoring. And finally, it is typically a norm-
referenced test, the goal of which is to place test-takers on a continuum across a range of scores and to
differentiate test-takers by their relative ranking.
Many people are under the incorrect impression that all standardized tests consist of items
presented in multiple-choice format. While it is true that many standardized tests conform to a multiple-
choice format for the sake of objective scoring, multiple choice is not a prerequisite characteristic of
standardized tests. Human-scored standardized tests also exist, such as the Test of
Spoken English (TSE) and the Test of Written English (TWE) produced by Educational Testing Service
(ETS).
Standardized tests have both advantages and disadvantages. The advantages of standardized
testing include a ready-made previously validated product that frees the teacher from having to spend
hours creating a test. Administration to large groups can be accomplished within reasonable time limits.
In the case of multiple-choice formats, the scoring procedures are easy.
The disadvantages of standardized tests center largely on their inappropriate use,
for example using an overall proficiency test as an achievement test simply because of the
convenience of standardization. Therefore, teachers should be careful in using standardized tests.

Developing a Standardized Test


While it is not likely that a classroom teacher would be in a position to develop a brand-new
standardized test of large-scale proportions, it is a virtual certainty that some day we will be in a position
to revise an existing test, to adapt or expand an existing test, and/or to create a smaller-scale
standardized test for a program we are teaching in. Here are some steps in developing a standardized
test, using TOEFL as an example.
1. Determine the purpose and objectives of the test. Most standardized tests are expected to
provide high practicality in administration and scoring without improperly compromising validity. The
initial outlay of time and money for such a test is significant. It is therefore important for its purpose
and objectives to be stated specifically. In the case of TOEFL, its purpose is to evaluate
the English proficiency of people whose native language is not English. More specifically, TOEFL
is designed to help institutions of higher learning make valid decisions concerning English language
proficiency in terms of their own requirements. As we can see, the objectives of TOEFL are
specific. The content of each test must be designed to accomplish those particular ends.
2. Design the test specifications. Decisions need to be made on how to go about structuring the
specifications of the test. A comprehensive program of research must be done to identify a set of
constructs underlying the test itself. This stage of laying the foundation stones can occupy
weeks, months, or even years of effort. Standardized tests that don't work are often the product of
short-sighted construct validation. In the case of TOEFL, its construct validation is carried out by
the TOEFL staff at Educational Testing Service (ETS) under the guidance of a Policy Council that
works with a Committee of Examiners composed of appointed external university faculty,
linguists, and assessment specialists. Because TOEFL is a proficiency test, the first step in the
developmental process is to define the construct of language proficiency. First, the term
'proficiency' should be made clear. According to Lowe, as quoted by Brown (2004:71), proficiency is a
holistic, unitary trait view of language ability. How we view language will make a difference in how
we assess language proficiency. After breaking language competence down into the subsets of
listening, speaking, reading, and writing, each performance mode can be examined on a continuum
of linguistic units: phonology (pronunciation) and orthography (spelling), words (lexicon), sentences
(grammar), and discourse and pragmatic (sociolinguistic, contextual, functional, cultural) features of
language. Finally, to make a very long story short, the TOEFL had for many years included three
types of performance in its organizational specifications: listening, structure, and reading, all of
which tested comprehension through standard multiple choice tasks. In 1996, a major step was
taken to include written production in the computer-based TOEFL.
3. Design, select, and arrange test tasks/items. Once the specifications for the standardized test
have been stipulated, the never-ending task of designing, selecting, and arranging items begins.
The process involves determining the number and types of items to be created. In the case of
TOEFL, the first thing to do is coding the content. Items are then designed by a team who select
and adapt items solicited from a bank of items that have been deposited by free-lance writers and
ETS staff. The content of the reading section, for example, usually consists of excerpts from authentic
general or academic reading that are edited for linguistic difficulty, culture bias, or other topic
biases. Items are designed to test overall comprehension, specific information, and
reference.
4. Make appropriate evaluations of different kinds of items. Indices of item facility (IF), item
discrimination (ID), and distractor analysis are a must for standardized multiple choice tests. For
production response formats, essay writing in this case, different forms of evaluation become
important, and the principles of practicality and reliability are prominent.
5. Specify scoring procedures and reporting formats. The process of developing standardized
tests should yield a test that can be scored accurately and reported back to test-takers and
institutions efficiently. TOEFL is known for having very straightforward scoring procedures.
Scores are calculated and reported for three sections of the TOEFL (the essay ratings are
combined with the Structure and Written Expression score). The total score (ranging from 40 to 300 on the
computer-based TOEFL and 310 to 677 on the paper-and-pencil TOEFL) is also calculated and reported.
A separate score for the essay (ranging from 0 to 6) is also provided on the examinee’s score record.
6. Perform ongoing construct validation studies. The last step to develop a standardized test is to
perform a systematic periodic validation of its effectiveness. ETS, as the producer of TOEFL, also
has sponsored many impressive programs of research, including examining the content
characteristics of the TOEFL from a communicative perspective.

Discussion
1. Following the steps of developing classroom tests, can you make your own English test for the
first grade of junior high school? Do it in groups!
2. In pairs or in small groups, compile a brief list of pros and cons of standardized testing!
3. Tell the class about the worst test experience you’ve ever had. Briefly analyze what made the
experience so unbearable.
CHAPTER VI
DESCRIBING, ANALYZING, AND INTERPRETING
TEST SCORES

This chapter deals with three things a teacher should do after test administration:
describing, analyzing, and interpreting test results.

A. Describing Test Scores


Right after a test is administered, the teacher should ideally score it based on the
scoring guide set in advance. Then, when the test scores are at hand, a description of the scores
should be provided. At a minimum, teachers should examine the descriptive statistics of the test.
Descriptive statistics are numerical representations of how a group of students performed on a test
(Brown, 1996:102). Two aspects are considered in descriptive statistics: the middle of the group
(central tendency) and how the individuals vary from it (dispersion).

Central Tendency
Central tendency describes the most typical behavior of a group. Four statistics are used for
estimating central tendency: the mean, the mode, the median, and the midpoint.

1. Mean
The mean is probably the single most important indicator of central tendency. The mean is
virtually the same as the average. It is symbolized in writing by X̄ (read “ex bar”). It is the sum of all the
scores divided by the number of scores. Thus the mean of 14, 34, 56, and 68 is (14 + 34 + 56 + 68)/4 = 43.
The formula is:

X̄ = ΣX / N

where
X̄ = mean
X = scores
N = number of scores
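As a quick check of the arithmetic above, here is a minimal Python sketch; the score list is the one used in the example.

scores = [14, 34, 56, 68]
mean = sum(scores) / len(scores)   # sum of all scores divided by the number of scores
print(mean)                        # prints 43.0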

2. Mode
Another indicator of central tendency is the mode. The mode is that score which occurs most
frequently. When the students’ scores are 77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 69, 68, 68, 67, 64, 64,
61, then the mode is 69. No statistical formula is necessary for this straightforward idea.

3. Median
The median is that point below which 50% of the scores fall and above which 50% fall. Thus in
the set of scores 100, 95, 83, 71, 61, 57, 30, the median is 71 because 71 has three scores above it (100,
95, and 83) and three scores below it (61, 57, and 30).
In real data, cases arise that are not so clear. For example, what is the median for these scores: 9, 12,
15, 16, 17, 27? In such a situation, when there is an even number of scores, the median is taken to be
midway between the two middle scores. In this case, 15 and 16 are the two middle scores, so the median is
15.5.

4. Midpoint
The midpoint in a set of scores is that point halfway between the highest score and the lowest
score on the test. The formula for calculating the midpoint is:

Midpoint = (High + Low) / 2

For example, if the lowest score on a test was 30 and the highest was 100, the midpoint would be (100
+ 30)/2 = 65.
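The other three central-tendency statistics can be computed in the same spirit. Here is a minimal Python sketch using the example data from this section; the variable names are illustrative.

from statistics import mode, median

scores = [77, 75, 72, 72, 70, 70, 69, 69, 69, 69, 69, 68, 68, 67, 64, 64, 61]
print(mode(scores))                      # 69, the most frequently occurring score
print(median([9, 12, 15, 16, 17, 27]))   # 15.5, midway between the two middle scores
print((max(scores) + min(scores)) / 2)   # midpoint = (high + low) / 2 = 69.0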

Dispersion
Dispersion is how the individual performances vary from the central tendency. Three indicators
of the dispersion are commonly used for describing distributions of test scores: the range, the standard
deviation, and the variance.
1. Range
Most teachers are already familiar with the concept of range from tests they have given in
class. The range is the number of points between the highest score on a measure and the lowest score,
plus one (one is added so that the range includes the scores at both ends). Thus, when the
highest score is 71 and the lowest score is 61, the range is 11 points (71 – 61 + 1 = 11). The range
provides some idea of how individuals vary from the central tendency.
2. Standard Deviation
The standard deviation is a sort of average of the differences of all scores from the mean
(Brown, 1989:107). This is not an exact statistical definition but rather one that will serve well for
conveying the meaning of this statistic. The formula is as follows:

S = √( Σ(X – X̄)² / N )

where
S = standard deviation
X = a score
X̄ = the mean
N = the number of scores

The standard deviation is a very flexible and useful statistic because it is a good indicator of the
dispersion of a set of scores around the mean. The standard deviation is usually better than the range
because it is the result of an averaging process.
Sometimes, a slightly different formula is used for the standard deviation:

S = √( Σ(X – X̄)² / (N – 1) )

This version (called the “N – 1” formula) is only appropriate if the number of students taking the
test is less than 30.

3. Variance
The variance is another descriptive statistic for dispersion. As indicated by its symbol, S², the
test variance is equal to the squared value of the standard deviation. So the formula for the test variance is
very much like the one for the standard deviation except that both sides of the equation are squared.
Thus, the formula for the variance is:

S² = Σ(X – X̄)² / N

Hence, the variance can easily be defined, with reference to this formula, as the average of the
squared differences of students' scores from the mean.
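A minimal Python sketch of the three dispersion statistics described above follows; the score list is illustrative, not taken from the text.

from math import sqrt

scores = [71, 70, 68, 66, 65, 63, 61]

rng = max(scores) - min(scores) + 1         # range = high - low + 1
mean = sum(scores) / len(scores)
ss = sum((x - mean) ** 2 for x in scores)   # sum of squared differences from the mean
sd_n = sqrt(ss / len(scores))               # standard deviation, N formula
sd_n1 = sqrt(ss / (len(scores) - 1))        # "N - 1" formula for small groups
variance = sd_n ** 2                        # variance = squared standard deviation

print(rng, round(sd_n, 2), round(sd_n1, 2), round(variance, 2))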

B. Analyzing Test Items


To help teachers understand and improve the effectiveness of item formats and content, it is
necessary to know the following statistical analyses: item facility analysis, item discrimination
analysis, and distractor efficiency analysis.
Item Facility Analysis
Item facility (IF) is a statistical index used to examine the percentage of students who correctly
answer a given item. To calculate the IF index, we add up the number of students who correctly
answered a particular item and divide that sum by the total number of students who took the test. As a
formula, it looks like this:

IF = N correct / N total

The result of this formula is an item facility value that can range from 0.00 to 1.00 for different
items. Teachers can interpret this value as the percentage of correct answer for a given item (by
moving the decimal point two places to the right). For example, the correct interpretation for an index of
0.27 would be that 27% of the students correctly answered the item. In most cases, an item with an IF
of 0.27 would be a very difficult question because many more students missed it than answered it
correctly.
Arranging the data in matrix as shown in Table 6.1 below will be very helpful.

Table 6.1 Item Analysis Data


Students   1 2 3 4 5 6 7 8 9 10   Total
A 1 1 1 1 1 1 0 1 1 0 80
B 1 0 0 1 1 1 1 1 0 1 70
C 1 1 0 1 0 1 1 1 0 1 70
D 1 1 0 1 1 1 0 1 0 0 60
E 1 1 1 1 0 0 1 1 0 0 60
F 1 0 1 1 1 0 0 0 1 1 60
G 1 1 0 1 0 0 1 1 0 0 50
H 1 0 0 1 1 1 0 0 0 1 50
I 1 0 0 1 0 0 1 0 1 0 40
J 0 1 0 1 0 0 0 0 1 1 40

The actual responses are recorded with a 1 for each correct answer and a 0 for each wrong answer.
Notice that student A answered the first item correctly – indeed, so did everyone else except poor J. This item
must have been very easy.
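A minimal Python sketch of the IF calculation follows, using four of the rows from Table 6.1 (1 = correct, 0 = wrong); the function and variable names are illustrative.

responses = {
    "A": [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
    "B": [1, 0, 0, 1, 1, 1, 1, 1, 0, 1],
    "I": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    "J": [0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
}

def item_facility(responses, item_index):
    # IF = number of students answering the item correctly / total number of students
    correct = sum(row[item_index] for row in responses.values())
    return correct / len(responses)

print(item_facility(responses, 0))  # item 1: 3 of these 4 students are correct, so IF = 0.75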

Item Discrimination Analysis


Item discrimination (ID) indicates the degree to which an item separates the students who
performed well from those who performed poorly. These two groups are sometimes referred to as the
high and the low scorers, or upper- and lower-proficiency students. The reason for identifying these two
groups is that ID allows teachers to contrast the performance of the upper-group students on the test
with that of the lower-group students. The process begins by determining which students fall in the top
group and which in the bottom group. To do this, begin by listing the students' names, their individual item
responses, and their total scores in descending order based on the total scores.
The upper and lower groups are sometimes defined as the upper and the lower third, or 33%.
Because groups of people do not always come in nice neat numbers that are divisible by three, the
solution is often like that found in Table 6.2, where three students have been assigned to the top and bottom
groups and four to the middle group.

Table 6.2 Upper and Lower Groups


Students   1 2 3 4 5 6 7 8 9 10   Total
A 1 1 1 1 1 1 0 1 1 0 80
B 1 0 0 1 1 1 1 1 0 1 70
C 1 1 0 1 0 1 1 1 0 1 70

D 1 1 0 1 1 1 0 1 0 0 60
E 1 1 1 1 0 0 1 1 0 0 60
F 1 0 1 1 1 0 0 0 1 1 60
G 1 1 0 1 0 0 1 1 0 0 50

H 1 0 0 1 1 1 0 0 0 1 50
I 1 0 0 1 0 0 1 0 1 0 40
J 0 1 0 1 0 0 0 0 1 1 40

Once the data are sorted into groups of students, calculation of the discrimination index is
easy. To do this, calculate the IF of the upper and lower groups separately for each item. Then, to
calculate the ID index, the IF for the lower group is subtracted from the IF for the upper group on each
item as follows.

ID = IF upper – IF lower

where ID = item discrimination for an individual item


IF upper = item facility for the upper group on that item
IF lower = item facility for the lower group on that item

For example, in Table 6.2, the IF for the upper group on item 8 is 1.00 because everyone in that
group answered it correctly. At the same time, the IF for the lower group on that item is 0.00 because
everyone in the lower group answered it incorrectly. The calculation of item discrimination for item 8
therefore results in 1.00 (1.00 – 0.00 = 1.00). An item discrimination index of 1.00 is very good because it
indicates the maximum contrast between the upper and the lower groups of students. See Table 6.3 for the
ID of each item.

Table 6.3 Item Statistics


Item Statistic   1    2    3    4    5    6    7    8    9    10
IF total 0.90 0.60 0.30 1.00 0.50 0.50 0.50 0.60 0.40 0.50
IF upper 1.00 0.67 0.33 1.00 0.67 1.00 0.67 1.00 0.33 0.67
IF lower 0.67 0.33 0.00 1.00 0.33 0.33 0.00 0.00 0.67 0.67
ID 0.33 0.34 0.33 0.00 0.34 0.67 0.67 1.00 -0.34 0.00

The calculations of ID tell us about the quality of a given item. This will
help us in deciding whether an item should be revised in the process of developing a good test
item. The ideal item in a test has an IF of around 0.50 and the highest available ID. Ebel, as
quoted by Brown (1989:70), has suggested the following guidelines for making decisions based on ID:

0.40 and up very good items


0.30 to 0.39 reasonably good but possibly subject to improvement
0.20 to 0.29 marginal items, usually needing and being subject to improvement
Below 0.19 poor items, to be rejected or improved by revision
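A minimal Python sketch of the ID calculation follows, assuming the students have already been sorted by total score and split into upper and lower groups; the response data are hypothetical, not the figures from Table 6.2.

upper = [          # item responses of the top-scoring students
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
]
lower = [          # item responses of the bottom-scoring students
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]

def group_if(group, item):
    # item facility within one group
    return sum(student[item] for student in group) / len(group)

for item in range(4):
    id_index = group_if(upper, item) - group_if(lower, item)   # ID = IF upper - IF lower
    print("item", item + 1, "ID =", round(id_index, 2))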

Distractor Efficiency Analysis


Even after careful selection of the items to be used in a revised and improved version of a test,
the job of improving the test may not be finished, particularly for multiple-choice items. Recall that the
parts of a multiple choice item include the stem (the main part of the item at the top) and the options
(the alternative choices presented to the students). The options consist of the correct answer
and the distractors (the options that will be counted as incorrect). The primary goal of
distractor efficiency analysis is to examine the degree to which the distractors are attracting the
students who do not know the correct answer. To do this for an item, the percentages of students who
chose each option are analyzed. If this analysis also gives the percentages choosing each option in
the upper, middle, and lower groups, the information will be even more interesting and useful.
Consider the distractor efficiency analysis results (for the same items previously shown in Table
6.2) given in Table 6.4 for items 1 through 10 (listed down the left side of the table).
Table 6.4 Distractor Efficiency

Item      IF      ID      Group     Options
Number                              a        b        c        d
1.        0.90    0.33    High      1.00*    0.00     0.00     0.00
                          Middle    1.00*    0.00     0.00     0.00
                          Low       0.67*    0.33     0.00     0.00

2.        0.60    0.34    High      0.00     0.33     0.67*    0.00
                          Middle    0.00     0.25     0.75*    0.00
                          Low       0.33     0.34     0.33*    0.00

3.        0.30    0.33    High      0.33     0.33*    0.00     0.33
                          Middle    0.25     0.50*    0.25     0.00
                          Low       0.33     0.00*    0.33     0.33

4.        1.00    0.00    High      1.00*    0.00     0.00     0.00
                          Middle    1.00*    0.00     0.00     0.00
                          Low       1.00*    0.00     0.00     0.00

5.        0.50    0.34    High      0.00     0.00     0.33     0.67*
                          Middle    0.25     0.00     0.25     0.50*
                          Low       0.33     0.33     0.00     0.33*

6.        0.50    0.67    High      1.00*    0.00     0.00     0.00
                          Middle    0.25*    0.25     0.25     0.25
                          Low       0.33*    0.33     0.33     0.00

7.        0.50    0.67    High      0.00     0.00     0.33     0.67*
                          Middle    0.00     0.00     0.50     0.50*
                          Low       0.33     0.33     0.33     0.00*

8.        0.60    1.00    High      0.00     1.00*    0.00     0.00
                          Middle    0.25     0.75*    0.00     0.00
                          Low       0.67     0.00*    0.33     0.00

9.        0.40   -0.34    High      0.33     0.00     0.33     0.33*
                          Middle    0.25     0.25     0.25     0.25*
                          Low       0.00     0.00     0.33     0.67*

10.       0.50    0.00    High      0.67*    0.33     0.00     0.00
                          Middle    0.25*    0.50     0.25     0.00
                          Low       0.67*    0.33     0.00     0.00

Notice that the table also provides the same item facility and discrimination indexes that were
previously shown in Table 6.3. In addition, Table 6.4 gives the proportion of students in the high, middle,
and low groups who chose each option. For example, in item 1 the correct answer is option a (indicated
by an asterisk), and options b, c, and d are the distractors. One student from the low group chose option
b, while no one chose option c or d. This indicates that options c and d are not good distractors because
they do not attract any students; therefore, options c and d need to be revised.
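
To show how the proportions in such a table can be produced, here is a minimal Python sketch (the
function name and the sample answers are hypothetical, not taken from the source) that tallies the
proportion of one group choosing each option of a single item:

from collections import Counter

def option_proportions(choices, options=("a", "b", "c", "d")):
    """Proportion of students in one group who chose each option of a single item."""
    counts = Counter(choices)
    return {opt: round(counts[opt] / len(choices), 2) for opt in options}

# Hypothetical raw answers of the low group to item 1 (one letter per student).
low_group_item1 = ["a", "a", "b"]
print(option_proportions(low_group_item1))
# {'a': 0.67, 'b': 0.33, 'c': 0.0, 'd': 0.0}

Running the same tally for every item and every group, with the correct option marked, would
reproduce the layout of Table 6.4.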

C. Interpreting Test Scores


The purpose of developing language tests, administering them, and sorting through the
resulting scores is to make decisions about the students. This sorting process is sometimes called test
score interpretation, and it is usually carried out in one of two ways: norm-referenced (NRT) or
criterion-referenced (CRT) interpretation.
Decisions based on an NRT are called relative decisions: the interpretation of the scores focuses
on each student’s position relative to the rest of the students with regard to some general ability. Thus,
the normal distribution, and each student’s position in that distribution, makes sense as a tool for score
interpretation.
The two most important characteristics of a normal distribution were covered previously: central
tendency and dispersion. A third useful characteristic is the notion of percentages in the distribution. In a
normal distribution, the mean, mode, median, and midpoint should all be the same. As mentioned earlier,
the median is the score below which 50% of the cases fall and above which 50% fall. Given these facts,
we can predict that 50% of our students’ scores will be above the median (or mean, or mode, or midpoint)
in a normal distribution. Since the distribution under discussion is normal and therefore bell-shaped, the
curve is symmetrical and can be divided into six areas, each one standard deviation wide: three below
the mean and three above it.
Let’s take an example in which 500 students take the test, the mean is 41, and the standard
deviation is 10. In percentage terms, approximately 34% of the scores will fall within one standard
deviation above the mean, and another 34% within one standard deviation below it, for a total of 68%.
That leaves 32% of the students not yet accounted for in the distribution. Notice in Figure 6.1 that
roughly 14% of the students score between one and two standard deviations above the mean (+1S to
+2S), that is, between 51 and 61 score points in this particular distribution. Likewise, about 14% will
usually score between one standard deviation below the mean (-1S) and two standard deviations below
the mean (-2S), that is, between 21 and 31 score points in this case.
At this point, 96% of the students in the distribution are accounted for. The remaining 4% of the
students are divided evenly above and below the mean: about 2% in the area between the second and
third standard deviations above the mean (+2S to +3S) and about 2% in the area between the second
and third standard deviations below the mean (-2S to -3S).
Based on this distribution, we can classify the 500 students into five groups according to these
percentages. Group 1 consists of 10 students (2%) who scored above 61; group 2 consists of 70 students
(14%) who scored between 51 and 61; group 3 consists of 340 students (68%) who scored between 31
and 51; group 4 consists of 70 students (14%) who scored between 21 and 31; and group 5 consists of
10 students (2%) who scored below 21. Finally, the teacher can assign a grade to each group of students
in the form of A, B, C, D, or E (or 4, 3, 2, 1, or 0), respectively.
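
A minimal sketch of this kind of norm-referenced grading, assuming the cut points fall exactly at one
and two standard deviations above and below the mean as in the example above (the function name
and default values are illustrative only):

def norm_referenced_grade(score, mean=41.0, sd=10.0):
    """Assign a letter grade from a student's position in the normal distribution."""
    if score > mean + 2 * sd:        # above 61: roughly the top 2%
        return "A"
    elif score > mean + 1 * sd:      # 51-61: next ~14%
        return "B"
    elif score > mean - 1 * sd:      # 31-51: middle ~68%
        return "C"
    elif score > mean - 2 * sd:      # 21-31: next ~14%
        return "D"
    else:                            # below 21: bottom ~2%
        return "E"

print([norm_referenced_grade(s) for s in (65, 55, 41, 28, 15)])
# ['A', 'B', 'C', 'D', 'E']

In practice a teacher might place the cut points elsewhere (for example at half standard deviations),
but the logic of reading grades off a student's distance from the mean stays the same.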
On the other hand, CRT decisions are labeled absolute because they focus not on a student’s
position relative to other students but rather on the percentage of the material that each student knows,
largely without reference to other students. Thus, at the beginning of a course, the distribution of scores
on a CRT is likely to be positively skewed if the students actually need to learn the material covered in
the course. However, at the end of the course, if the test actually reflects the course objectives, the
teacher hopes that all the students will score fairly high. In other words, on a good CRT the distribution
of scores at the end of instruction will be negatively skewed if reasonably efficient language teaching
and learning have taken place.
In interpreting scores in a CRT distribution, teachers should set certain passing criteria. Based
on these criteria, teachers can decide whether or not each student passes the test. It is not a problem if
all of the students pass the test or, in contrast, if no one passes; the decision rests entirely on the
criterion.
In setting the criteria, teachers ideally consider many things. For example, in a writing test,
students might be required to reach the level of good in content, organization, vocabulary, grammar, and
mechanics in order to pass (see Table 6.5). Each indicator is then given a weight so that the result can
be calculated easily. In this case, the calculation is done by dividing the total score by the maximum
score (25, since each of the five components is rated from 1 to 5) and multiplying the result by one
hundred. Finally, the student’s writing level can be determined; see Table 6.6.

Table 6.5 Scoring Guide of Writing Test


Content
5 – main ideas stated clearly and accurately, change of opinion very clear
4 – main ideas stated fairly clearly and accurately, change of opinion relatively clear
3 – main ideas somewhat unclear and inaccurate, change of opinion somewhat weak
2 – main ideas not clear or accurate, change of opinion weak
1 – main ideas not at all clear or accurate, change of opinion very weak
Organization
5 – well organized and perfectly coherent
4 – fairly well organized and generally coherent
3 – loosely organized but main ideas clear, logical but incomplete sequencing
2 – ideas disconnected, lacks logical sequencing
1 – no organization, incoherent
Vocabulary
5 – very effective choice of words and use of idioms and word forms
4 – effective choice of words and use of idioms and word forms
3 – adequate choice of words but some misuse of vocabulary, idioms and word forms
2 – limited range, confused use of words, idioms, and word forms
1 – very limited range, very poor knowledge of words, idioms, and word forms
Grammar
5 – no errors, full control of complex structure
4 – almost no errors, good control of structure
3 – some errors, fair control of structure
2 – many errors, poor control of structure
1 – dominated by errors, no control of structure
Mechanics
5 – mastery of spelling and punctuation
4 – few errors in spelling and punctuation
3 – fair number of spelling and punctuation errors
2 – frequent errors in spelling and punctuation
1 – no control over spelling and punctuation
n = (total score obtained / maximum score) x 100

Table 6.6 Criteria of Writing Test


No.   Grade   Qualification   Range of Scores
1.    A       Excellent       85 - 100
2.    B       Good            70 - 84
3.    C       Average         55 - 69
4.    D       Poor            50 - 54
5.    E       Very Poor       0 - 49
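
As a concrete illustration (a sketch, not part of the source), the calculation described before Table 6.5
and the grade bands in Table 6.6 could be combined as follows, assuming each of the five components
in Table 6.5 is rated 1 to 5, so that the maximum raw score is 25:

def writing_score(content, organization, vocabulary, grammar, mechanics):
    """Convert five 1-5 component ratings (Table 6.5) to a 0-100 score."""
    total = content + organization + vocabulary + grammar + mechanics
    max_score = 25  # five components, each rated 1-5
    return total / max_score * 100

def writing_grade(score):
    """Map a 0-100 writing score to the grade bands in Table 6.6."""
    if score >= 85:
        return "A (Excellent)"
    elif score >= 70:
        return "B (Good)"
    elif score >= 55:
        return "C (Average)"
    elif score >= 50:
        return "D (Poor)"
    else:
        return "E (Very Poor)"

# Example: a student rated 4, 4, 3, 4, 4 on the five components.
score = writing_score(4, 4, 3, 4, 4)   # 19 / 25 * 100 = 76.0
print(score, writing_grade(score))     # 76.0 B (Good)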

Discussion
1. How would you define central tendency? What are the four ways to estimate it?
2. What is dispersion? Which of the three indices for dispersion are most often reported?
3. Why should a teacher describe the students’ behavior on a measure in terms of both central
tendency and dispersion?
4. What is item facility index? How do you calculate it? How do you interpret the results of your
calculations?
5. What is item discrimination index? How do you calculate it? How do you interpret the results of
your calculations?
6. What is the distractor efficiency analysis? How do you do it? What can you learn from it in
terms of improving your test?
7. Draw an ideal normal distribution. Start by drawing two lines: an ordinate and an abscissa. Then
mark off a reasonable set of scores along the abscissa and some sort of frequency scale along
the ordinate. Make sure that you represent the mean, mode, median, and midpoint with a
vertical line down the middle of the distribution. Also include six lines to represent the three
standard deviations above and the three standard deviations below the mean.
REFERENCES
Allison, D. 1999. Language Testing: An Introductory Course. Singapore: Singapore University Press.
Brown, H. D. 2001. Teaching by Principles: An Interactive Approach to Language Pedagogy. White
Plains, NY: Addison Wesley Longman.
Brown, H. D. 2004. Language Assessment: Principles and Classroom Practices. White Plains, NY:
Pearson Education.
Brown, J. D. 1996. Testing in Language Programs. New Jersey: Prentice Hall.
Cohen, A. D. 1994. Assessing Language Ability in the Classroom. Boston: Heinle & Heinle Publishers.
Djiwandono, M. S. 2008. Tes Bahasa. Jakarta: Indeks.
Heaton, J. B. 1988. Writing English Language Tests. New York: Longman.
Heaton, J. B. 1990. Classroom Testing. New York: Longman.
Hughes, A. 1989. Testing for Language Teachers. Cambridge: Cambridge University Press.
Johnson, K. 2001. An Introduction to Foreign Language Learning and Teaching. Essex: Pearson
Education.
Weir, C. 1993. Understanding and Developing Language Tests. Hertfordshire: Prentice Hall International
(UK) Ltd.
