Topic 1 introduces the meanings of test, measurement, evaluation
and assessment, the basic historical development of language assessment,
and the changing trends of language assessment in the Malaysian context.

By the end of this topic, you will be able to:



define and explain the important terms of test, measurement,
evaluation, and assessment;


examine the historical development in Language Assessment;


describe the changing trends in Language Assessment in the
Malaysian context and discuss the contributing factors.






SESSION ONE (3 hours)


Assessment and examinations are viewed as highly important in most Asian
countries such as Malaysia. Language tests and assessment have also
become a prevalent part of our education system. Often, public examination
results are taken as important national measures of school accountability.
While schools are ranked and classified according to their students’
performance in major public examinations, scores from language tests are
used to infer individuals’ language ability and to inform decisions we make
about those individuals.
In this topic, let us discuss the concept of measurement and its
numerous definitions. We will also look into the historical development of
language assessment and the changing trends of language assessment in
our country.


1.4 DEFINITION OF TERMS – test, measurement, evaluation, and assessment

1.4.1 Test
The four terms above are frequently used interchangeably in
academic discussions. A test is a subset of assessment intended to measure
a test-taker's language proficiency, knowledge, performance or skills. Testing
is a type of assessment technique. It is a systematically prepared procedure
that happens at a point in time, when a test-taker gathers all his abilities to
achieve his ultimate performance because he knows that his responses are being
evaluated and measured. A test is, first, a method of measuring a test-taker’s
ability, knowledge or performance in a given area; second, it must measure.
Bachman (1990), who was also quoted by Brown, defined a test as a
process of quantifying a test-taker’s performance according to explicit
procedures or rules.

1.4.2 Assessment
Assessment is every so often a misunderstood term. Assessment is ‘a
comprehensive process of planning, collecting, analysing, reporting, and
using information on students over time’ (Gottlieb, 2006, p. 86). Mousavi
(2009) is of the opinion that assessment is ‘appraising or estimating the level
of magnitude of some attribute of a person’. Assessment is an important
aspect of the fields of language testing and educational measurement and,
perhaps, the most challenging part of them. It is an ongoing process in
educational practice, which involves a multitude of methodological techniques.
It can consist of tests, projects, portfolios, anecdotal information and student
self-reflection. Performance may be assessed formally or informally, subconsciously
or consciously, and incidentally or intentionally by an appraiser.

1.4.3 Evaluation
Evaluation is another confusing term, and many confuse
evaluation with testing. Evaluation does not necessarily entail testing. In
reality, evaluation is involved when the results of a test (or other assessment
procedure) are used for decision-making (Bachman, 1990, pp. 22-23).
Evaluation involves the interpretation of information. If a teacher simply
records numbers or makes check marks on a chart, it does not constitute
evaluation. When a tester or marker evaluates, s/he “values” the results in
such a way that the worth of the performance is conveyed to the test-taker.
This is usually done with some reference to the consequences, either good or
bad, of the performance. This is commonly practised in applied linguistics
research, where the focus is often on describing processes, individuals, and
groups, and the relationships among language use, the language use
situation, and language ability.

Test scores are an example of measurement, and conveying the
“meaning” of those scores is evaluation. However, evaluation can occur
without measurement. For example, when a teacher appraises a student’s correct
oral response with words like “Excellent insight, Lilly!”, that is evaluation.

1.4.4 Measurement
Measurement is the assigning of numbers to certain attributes of objects, events, or people according to a rule-governed system. For our purposes of language testing, we will limit the discussion to unobservable abilities or attributes, sometimes referred to as traits, such as grammatical knowledge, strategic competence or language aptitude. Measurement could be interpreted as the process of quantifying the observed performance of classroom learners. Bachman (1990) cautioned us to distinguish between quantitative and qualitative descriptions: simply put, the former involves assigning numbers (including rankings and letter grades) to observed performance, while the latter consists of written descriptions, oral feedback, and other non-quantifiable reports. Similar to other types of assessment, measurement must be conducted according to explicit rules and procedures as spelled out in test specifications, criteria, and procedures for scoring. The relationships among test, measurement and assessment, and their uses, are illustrated in Figure 1.

Figure 1: The relationship between tests, measurement and assessment (Source: Bachman, 1990)

2.0 Historical development in language assessment
From the mid-1960s through the 1970s, language testing practices reflected in large-scale institutional language testing and in most language testing textbooks of the time (e.g. Lado, 1961; Carroll, 1968) were informed essentially by a theoretical view of language ability as consisting of skills (listening, speaking, reading and writing) and components (e.g. vocabulary, grammar, pronunciation), and by an approach to test design that focused on testing isolated ‘discrete points’ of language, while the primary concern was with psychometric reliability. Language testing research was dominated largely by the hypothesis that language proficiency consisted of a single unitary trait, and by a quantitative, statistical research methodology (Oller, 1979).

The 1980s saw other areas of expansion in language testing, most importantly, perhaps, in the influence of second language acquisition (SLA) research, which spurred language testers to investigate not only the effect on language test performance of a wide variety of factors such as field independence/dependence (e.g. Chapelle, 1988; Hansen, 1984; Stansfield and Hansen, 1983), academic discipline and background knowledge (e.g. Alderson and Urquhart, 1985; Erickson and Molly, 1983; Hale, 1988) and discourse domains (Douglas and Selinker, 1985), but also the strategies involved in the process of test-taking itself (e.g. Cohen, 1984; Grotjahn, 1986). If the 1980s saw a broadening of the issues and concerns of language testing into other areas of applied linguistics, the 1990s saw a continuation of this trend. In this decade the field also witnessed expansions in a number of areas: a) research methodology; b) practical advances; c) factors that affect performance on language tests; d) authentic assessments; and e) concerns with the ethics of language testing and professionalising the field.

The beginning of the new millennium is another exciting time for anyone interested in language testing and assessment research. Current developments in the fields of applied linguistics, language learning and pedagogy, technological innovation, and educational measurement have opened up some rich new research avenues.

3.0 Changing trends in Language Assessment – Malaysian context
History has clearly shown that teaching and assessment should be intertwined in education. There has been considerable recent literature that has promoted assessment as something that is integrated with instruction, and not an activity that merely audits learning (Shepard, 2000). When assessment is integrated with instruction, it informs teachers about what activities and assignments will be most useful, what level of teaching is most appropriate, and how summative assessments provide diagnostic information. Just as assessment impacts student learning and motivation, it also influences the nature of instruction in the classroom.

One does not need to look very far to see how important testing and assessment have become in our education system. Assessment and examinations are viewed as highly important in Malaysia. Public examinations have long been the only measurement of students’ achievement. Often, public examination results are taken as important national measures of school accountability, and schools are ranked and classified according to their students’ performance in major public examinations. With this in mind, we have to look at the changing trends in assessment, particularly language assessment, in this country.

Starting from the year 1845, written tests in schools were introduced for a number of subjects. This trend in assessment continued with the intent to gauge the effectiveness of the teaching-learning process. In Malaysia, the development of formal evaluation and testing in education began after Independence, and it has been carried out mainly through the examination system until recent years. Figure 2 shows the stages/phases of development of the examination system in our country. The stages are as follows:
 Pre-Independence
 Razak Report
 Rahman Talib Report
 Cabinet Report
 Malaysia Education Blueprint (2013-2025)

On 3rd May 1956, the Examination Unit (later known as the Examination Syndicate) in the Ministry of Education (MOE) was formed on the recommendation of the Razak Report (1956). The main objective of the Malaysia Examination Syndicate (MES) was to fulfil one of the Razak Report’s recommendations, which was to establish a common examination system for all the schools in the country. According to the Malaysian Ministry of Education (MOE), students sit for common public examinations at the end of each level.

However, the current scenario is gradually changing. A new evaluation system known as School Based Assessment (SBA) was introduced in 2002 as a move away from traditional teaching, to keep abreast with changing trends of assessment and to gauge the competence of students by taking into consideration both academic and extra-curricular achievements. In line with the on-going transformation of the national educational system, continuous school-based assessment is administered at all grades and all levels. As emphasised in this innovation of student assessment, the new assessment system aims to promote a combination of centralised and school-based assessment. The role of teachers in the new assessment system is vital: teachers will be given empowerment in assessing their students. Additionally, the Malaysian Teacher Education Division (TED) is entrusted by the Ministry of Education to formulate policies and guidelines to prepare teachers for the new implementation of assessment.

The Malaysia Education Blueprint was launched in September this year, a three-wave initiative to revamp the education system over the next 12 years. One of its main focuses is to overhaul the national curriculum and examination system, widely seen as heavily content-based and un-holistic. It is a timely move, given our poor results in the 2009 Programme for International Student Assessment (PISA) tests. Based on the 2009 assessment, Malaysia lags far behind regional peers like Singapore, South Korea, Japan, and Hong Kong in every category.

Poor performance in PISA is normally linked to students not being able to demonstrate higher order thinking skills. To remedy this, the Ministry of Education has started to implement numerous changes to the examination system, and two out of the three nationwide examinations that we currently administer to primary and secondary students have gradually seen major changes. Generally, the policies are ideal and impressive, but there are still a few questions on feasibility that have been raised by concerned parties. Figure 2 below shows the development of educational evaluation in Malaysia from pre-independence until today.

Pre-Independence
Examinations were conducted according to the needs of the school or based on overseas examinations such as the Overseas School Certificate.

Implementation of the Razak Report (1956)
The Razak Report gave birth to the National Education Policy and the creation of the Examination Syndicate (LP). The LP conducted examinations such as the Cambridge examinations, the Malayan Secondary School Entrance Examination (MSSEE), and the Lower Certificate of Education (LCE) Examination.

Implementation of the Rahman Talib Report (1960)
The Rahman Talib Report recommended the following actions:
1. Automatic promotion to higher classes.
2. Multi-stream education (Aneka Jurusan).
3. Extend schooling age to 15 years old.
The following changes in examination were made:
- The introduction of the Standard 3 Diagnostic Test (UDT).
- The introduction of the Standard 5 Evaluation Examination.
- The entry of elective subjects in the LCE and SRP.
- The introduction of Malaysia's Vocational Education.

Implementation of the Cabinet Report (1979)
The implementation of the Cabinet Report resulted in the evolution of the education system to its present state, especially with KBSR and KBSM. Adjustments were made in examinations to fulfil the new curriculum's needs and to ensure they are in line with the National Education Philosophy.

Implementation of the Malaysia Education Blueprint (2013 – 2025)
The emphasis is on School-Based Assessment (SBA), a new system of assessment first introduced in 2002 and one of the new areas where teachers are directly involved. The national examination and school-based assessments will be revamped in stages, whereby by 2016, at least 40% of questions in Ujian Penilaian Sekolah Rendah (UPSR) and 50% in Sijil Pelajaran Malaysia (SPM) are to be higher order thinking skills questions.

Figure 2: The development of educational evaluation in Malaysia
Source: Malaysia Examination Board (MES), http://apps.emoe. … .htm

By and large, the role of MES is to complement and complete the implementation of the national education system. Among its achievements are:
i. Implementation of the Open Certificate system
ii. Pioneering the use of the computer in the country (1967)
iii. Taking over the work of the Cambridge Examination Syndicate
iv. Putting in place an examination system to meet national needs
v. Recognition of examination certificates
vi. Implementation of Malay Language as the National Language (1960)

Figure 3: The achievements of Malaysia Examination Syndicate (MES)
Source: Malaysia Examination Board (MES)
Read more: http://www.emoe. … .166386

Exercise
Describe the stages involved in the development of educational evaluation in Malaysia. Create and present your findings using a graphic organiser.

Tutorial question
Examine the contributing factors to the changing trends of language assessment.

TOPIC 2  ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING

2.0 SYNOPSIS
Topic 2 provides you with an insight into the reasons/purposes of assessment. It also looks at the different types of assessments and the classifications of tests according to their purpose.

2.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. explain the reasons/purposes of assessment;
2. distinguish between assessment of learning and assessment for learning; and
3. name and differentiate the different test types.

2.2 FRAMEWORK OF TOPICS
Role and Purposes of Assessment in Teaching and Learning:
- Reasons / Purposes of Assessment
- Assessment of Learning / Assessment for Learning
- Types of Tests: Proficiency, Achievement, Diagnostic, Aptitude, and Placement Tests

SESSION TWO (3 hours)

Reasons/Purposes of Assessment

Critical to educators is the use of assessment to both inform and guide
instruction. Using a wide variety of assessment tools allows a teacher to
determine which instructional strategies are effective and which need to be
modified. In this way, assessment can be used to improve classroom practice,
plan curriculum, and research one's own teaching practice. Of course,
assessment will always be used to provide information to children, parents,
and administrators. In the past, this information was primarily expressed by a
"grade". Increasingly, this information is being seen as a vehicle to empower
students to be self-reflective learners who monitor and evaluate their own
progress as they develop the capacity to be self-directed learners. In addition
to informing instruction and developing learners with the ability to guide their
own instruction, assessment data can be used by a school district to measure
student achievement, examine the opportunity for children to learn, and
provide the basis for the evaluation of the district's language programmes.
Assessment instruments, whether formal tests or informal
assessments, serve multiple purposes. Commercially designed and
administered tests may be used for measuring proficiency, placing students
into one of several levels of a course, or diagnosing students’ strengths and
weaknesses according to specific linguistic categories, among other
purposes. Classroom-based teacher-made tests might be used to diagnose
difficulty or measure achievement in a given unit of a course. Specifying the
purpose of an assessment instrument and stating its objectives are an
essential first step in choosing, designing, revising, or adapting the procedure
an educator will finally use.
We need to rethink the role of assessment in effective schools, where
“effective” means maximising learning for the most students. What uses of

assessment are most likely to maximise student learning and well-being? How
best can we use assessment in the service of student learning and wellbeing?
We have a traditional answer to these questions. Our traditional answer says
that to maximise student learning we need to develop rigorous standardised
tests given once a year to all students at approximately the same time. Then,
the results are used for accountability, identifying schools for additional
assistance, and certifying the extent to which individual students are “meeting
standards”.
Let us take a closer look at the two assessments below, i.e.
Assessment of Learning and Assessment for Learning.


Assessment of Learning
Assessment of learning is the use of a task or an activity to measure,

record, and report on a student’s level of achievement with regard to specific
learning expectations.
This traditional way of using assessment in the service of student
learning is assessment of learning - assessments that take place at a point in
time for the purpose of summarising the current status of student
achievement. This type of assessment is also known as summative assessment.
This summative assessment, the logic goes, will provide the focus to
improve student achievement, give everyone the information they need to
improve student achievement, and apply the pressure needed to motivate
teachers and students to work harder to teach and to learn.


Assessment for learning
Now compare this to assessment for learning. Assessment for

learning is roughly equivalent to formative assessment - assessment
intended to promote further improvement of student learning during the
learning process.

Assessment for learning is more commonly known as formative and
diagnostic assessments. Assessment for learning is the use of a task or an
activity for the purpose of determining student progress during a unit or block
of instruction. Teachers are now afforded the chance to adjust classroom
instruction based upon the needs of the students. Similarly, students are
provided valuable feedback on their own learning.
Formative assessment is not a new idea to us as educators. However,
during the past several years there has been an explosion of
applications linked to sound research. In this evolving conception, formative
assessment is more than testing frequently, although frequent information is
important. Formative assessment also involves actually adjusting teaching to
take account of these frequent assessment results. Nonetheless, formative
assessment is even more than using information to plan next
steps. Formative assessment seems to be most effective when students are
involved in their own assessment and goal setting.


Types of tests
The most common use of language tests is to identify strengths and

weaknesses in students’ abilities. For example, through testing we can
discover that a student has excellent oral abilities but a relatively low level of
reading comprehension. Information gleaned from tests also assists us in
deciding who should be allowed to participate in a particular course or
programme area. Another common use of tests is to provide information
about the effectiveness of programmes of instruction.
Henning (1987) identifies six kinds of information that tests provide about
students. They are:
o Diagnosis and feedback
o Screening and selection
o Placement
o Program evaluation
o Providing research criteria

o Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification scheme: they sort tests into these broad categories: proficiency, achievement, diagnostic, progress, and placement. Brown (2010), however, categorised tests according to their purpose, namely achievement tests, diagnostic tests, placement tests, proficiency tests, and aptitude tests.

Proficiency Tests
Proficiency tests are not based on a particular curriculum or language programme. They are designed to assess the overall language ability of students at varying levels. Their purpose is to describe what students are capable of doing in a language. They may also tell us how capable a person is in a particular language skill area. Proficiency tests are usually developed by external bodies such as examination boards like Educational Testing Service (ETS) or Cambridge ESOL. Some proficiency tests have been standardised for international use, such as the American TOEFL test, which is used to measure the English language proficiency of foreign college students who wish to study in North American universities, or the British-Australian IELTS test, designed for those who wish to study in the United Kingdom or Australia (Davies et al., 1999).

Achievement Tests
Achievement tests are similar to progress tests in that their purpose is to see what a student has learned with regard to stated course outcomes. The content of achievement tests is generally based on the specific course content or on the course objectives. Achievement tests are often cumulative, covering material drawn from an entire course or semester; they are usually administered at the mid-point and end of the semester or academic year.

Progress Tests
These tests measure the progress that students are making towards defined course or programme goals. Progress tests are generally teacher-produced and are narrower in focus than achievement tests because they cover a smaller amount of material and assess fewer objectives. They are administered at various stages throughout a language course to see what the students have learned, perhaps after certain segments of instruction have been completed.

Diagnostic Tests
Diagnostic tests seek to identify those language areas in which a student needs further help. Harris and McCann (1994, p. 29) point out that where “other types of tests are based on success, diagnostic tests are based on failure.” The information gained from diagnostic tests is crucial for further course activities and for providing students with remediation. Because diagnostic tests are difficult to write, placement tests often serve a dual function of both placement and diagnosis (Harris & McCann, 1994; Davies et al., 1999).

Aptitude Tests
This type of test no longer enjoys the widespread use it once had. An aptitude test is designed to measure general ability or capacity to learn a foreign language a priori (before taking a course) and ultimate predicted success in that undertaking. Language aptitude tests were seemingly designed to apply to the classroom learning of any language, apart from untutored language acquisition. In the United States, two common standardised aptitude tests once used were the Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is no research to show unequivocally that these kinds of tasks predict communicative success in a language, standardised aptitude tests are seldom used today, with the exception of identifying foreign language disability (Stansfield & Reed, 2004).

Placement Tests
These tests, on the other hand, are designed to assess students’ level of language ability for placement in an appropriate course or class. This type of test indicates the level at which a student will learn most effectively. The main aim is to create groups which are homogeneous in level. In designing a placement test, the test developer may choose to base the test content either on a theory of general language proficiency or on the learning objectives of the curriculum. In the former, students are placed according to their overall rank in the test results. In the latter, students are placed according to their level in each individual skill area. In some contexts, tests are based on aspects of the syllabus taught at the institution concerned. Elsewhere, institutions may choose to use a well-established proficiency test such as the TOEFL or IELTS exam and link it to curricular benchmarks. At other institutions, placement test scores are used to determine whether a student needs any further instruction in the language or could matriculate directly into an academic programme.

Discuss and present the various types of tests and assessment tasks that students have experienced. Discuss the extent to which tests or assessment tasks serve their purpose.

The end of the topic. Happy reading!

TOPIC 3  BASIC TESTING TERMINOLOGY

3.0 SYNOPSIS
Topic 3 provides input on basic testing terminology. It looks at the definitions, purposes and differences of various tests.

3.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. compare Norm-Referenced Tests and Criterion-Referenced Tests;
2. explain the meaning and purpose of different types of language tests.

3.2 FRAMEWORK OF TOPICS
Types of Tests:
- Norm-Referenced and Criterion-Referenced
- Formative and Summative
- Objective and Subjective

SESSION THREE (3 hours)

3.3 Norm-Referenced Test (NRT)
According to Brown (2010), an NRT is administered to compare an individual's performance with his peers’ and/or to compare a group with other groups. The purpose of such tests is to place test-takers along a mathematical continuum in rank order. In NRTs an individual test-taker’s score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank. In a test, scores are commonly reported back to the test-taker in the form of a numerical score (for example, 250 out of 300) and a percentile rank (for instance, 78 percent, which denotes that the test-taker’s score was higher than those of 78 percent of the total number of test-takers and lower than those of the remaining 22 percent in that administration). NRTs are used for summative evaluation, in the aspect of academic achievement tests, such as in the end-of-year examination for the streaming and selection of students.

3.4 Criterion-Referenced Test (CRT)
Gottlieb (2006), on the other hand, refers to criterion-referenced tests as the collection of information about student progress or achievement in relation to a specified criterion. The Curriculum Development Centre (2001) defines a CRT as an approach that provides information on a student’s mastery based on criteria determined by the teacher. These criteria are based on learning outcomes or objectives as specified in the syllabus. Following Glaser (1973), in the case of language proficiency tests, the word ‘criterion’ means the use of score values that can be accepted as the index of attainment of a test-taker. In a standards-based assessment model, the standards serve as the criteria or yardstick for measurement. CRTs are designed to provide feedback to test-takers, mostly in the form of grades, on specific course or lesson objectives. The main advantage of CRTs is that they allow the testers to make inferences about how much language proficiency, knowledge or skills test-takers/students originally have and their successive gains over time. As opposed to NRTs, CRTs focus on a student’s mastery of the subject matter (represented in the standards) along a continuum instead of ranking students on a bell curve. This is commonly practised in the School-Based Evaluation. Table 3 below shows the differences between the Norm-Referenced Test (NRT) and the Criterion-Referenced Test (CRT).

Table 3: The differences between Norm-Referenced Test (NRT) and Criterion-Referenced Test (CRT)

                 | Norm-Referenced Test                        | Criterion-Referenced Test
Definition       | A test that measures a student's achievement as compared to other students in the group | An approach that provides information on a student's mastery based on a criterion specified by the teacher
Purpose          | Determine performance difference among individuals and groups | Determine learning mastery based on specified criterion and standard
Test Item        | From easy to difficult level and able to discriminate the examinee's ability | Guided by minimum achievement in related objectives
Frequency        | Continuous assessment                       | Continuous assessment in the classroom
Appropriateness  | Summative evaluation                        | Formative evaluation
Example          | Public exams: UPSR, PMR, SPM, and STPM      | Mastery test: monthly test, project, coursework, exercises in the classroom

3.5 Formative Test
A formative test or assessment, as the name implies, is a kind of feedback teachers give students while the course is progressing. It is part of the instructional process, and formative assessment can be seen as assessment for learning. We can think of formative assessment as “practice.” The teachers point out what the students have done wrong and help them to get it right. With continual feedback the teachers may assist students to improve their performance. This can take place when teachers examine the results of achievement and progress tests. Based on the results of a formative test or assessment, the teachers can suggest changes to the focus of the curriculum or the emphasis on some specific lesson elements. Similarly, students may also need to change and improve. Due to the demanding nature of formative assessment, numerous teachers prefer not to adopt it, although giving back assessed homework or achievement tests presents both teachers and students with healthy and ultimate learning opportunities.

3.6 Summative Test
A summative test or assessment, on the other hand, refers to the kind of measurement that summarises what the student has learnt, or gives a one-off measurement. It is given after learning is supposed to occur, at a point in time, to measure student achievement in relation to a clearly defined set of standards. The results are then used to yield a school report and to determine what students know and do not know. In other words, summative assessment is assessment of student learning. It does not necessarily provide a clear picture of an individual’s overall progress or even his/her full potential, especially if s/he is hindered by the fear factor of physically sitting for a test. Students are more likely to experience assessment carried out individually, where they are expected to reproduce discrete language items from memory. Such assessment does not necessarily show the way to future progress, but it may provide straightforward and invaluable results for teachers to analyse. End-of-year tests in a course and other general proficiency or public exams are some examples of summative tests or assessments. Table 3.1 shows formative and summative assessments that are common in schools.

Table 3.1: Common formative and summative assessments in schools

Formative Assessment | Summative Assessment
Anecdotal records    | Final exams
Quizzes and essays   | National exams (UPSR, PMR, SPM, STPM)
Diagnostic tests     | Entrance exams
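The score-reporting statistics mentioned for norm-referenced tests (mean, median, standard deviation, percentile rank) can be illustrated with a short sketch. The scores below are hypothetical, and the percentile rank is computed here as the percentage of test-takers scoring strictly below a given score, which is one common convention:

```python
from statistics import mean, median, pstdev

# Hypothetical NRT scores for ten test-takers (out of 300)
scores = [210, 250, 180, 230, 250, 270, 190, 220, 240, 200]

def percentile_rank(score, all_scores):
    """Percentage of test-takers whose score falls below the given score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

print(f"mean score        = {mean(scores):.1f}")   # average score
print(f"median score      = {median(scores):.1f}") # middle score
print(f"standard deviation = {pstdev(scores):.1f}")# spread of scores
print(f"percentile rank of 250 = {percentile_rank(250, scores):.0f}")  # → 70
```

A test-taker scoring 250 here outscored 70 percent of the group; an NRT report would pair that percentile rank with the raw score.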

3.7 Objective Test
According to BBC Teaching English, an objective test is a test that consists of right or wrong answers or responses and thus can be marked objectively. Objective tests are popular because they are easy to prepare and take, quick to mark, and provide a quantifiable and concrete result. They tend to focus more on specific facts than on general ideas and concepts. Objective test items are receptive or selective response items: the test-taker chooses from a set of given responses rather than creating a response (the latter is commonly called a supply type of response). The types of objective tests include the following:
i. Multiple-choice items/questions;
ii. True-false items/questions;
iii. Matching items/questions; and
iv. Fill-in-the-blanks items/questions.

In this topic, let us focus on multiple-choice questions, which may look easy to construct but are in reality very difficult to build correctly. This is congruent with the viewpoint of Hughes (2003, pp. 76-78), who warns of the many weaknesses of multiple-choice questions:
 This technique tests only recognition knowledge.
 It may encourage guessing, which may have a considerable effect on test scores.
 This technique strictly limits what can be tested.
 It is very challenging to write successful items.
 It may limit beneficial washback.
 It may enable cheating among test-takers.

Let's look at some important terminology when designing multiple-choice questions.

A multiple-choice item comprises the following terminology:

1. Stem
Every multiple-choice item consists of a stem (the 'body' of the item that presents a stimulus). The stem is the question or assignment in an item, in a complete or open, positive or negative sentence form. A stem must be short, simple, compact and clear; however, it must not easily give away the right answer.

2. Options or alternatives
These are the list of possible responses to a test item. There are usually between three and five options/alternatives to choose from.

3. Key
This is the correct response, which can be either the correct answer or the best one. Usually, in a good item, the correct answer is not obvious when compared with the distractors.

4. Distractors
A distractor is a 'disturber' included to distract students from selecting the correct answer. An excellent distractor is almost the same as the correct answer, but it is not correct.

When building multiple-choice items for both classroom-based and large-scale standardised tests, consider the four guidelines below:
i. Design each item to measure a single objective.
ii. State both stem and options as simply and directly as possible.
iii. Make certain that the intended answer is clearly the only correct one.
iv. (Optional) Use item indices to accept, discard or revise items.
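The "item indices" mentioned in the optional fourth guideline usually mean item facility (IF, the proportion of test-takers answering correctly) and item discrimination (ID, how well the item separates strong from weak scorers). A minimal sketch of how a teacher might compute both, using invented classroom response data:

```python
# Responses to one multiple-choice item (1 = correct, 0 = wrong),
# listed from the highest to the lowest total test score.
# The numbers are invented for illustration.
responses = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

# Item facility: proportion of test-takers who answered correctly.
item_facility = sum(responses) / len(responses)

# Item discrimination: difference in facility between the top and
# bottom thirds of scorers, scaled to the size of a third.
third = len(responses) // 3
upper, lower = responses[:third], responses[-third:]
item_discrimination = (sum(upper) - sum(lower)) / third

print(f"IF = {item_facility:.2f}, ID = {item_discrimination:.2f}")
```

An item with a very low or very high IF, or a low (or negative) ID, would be a candidate to discard or revise.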

3.8 Subjective Test
Contrary to objective test items, which have only one answer or correct response, a subjective test is evaluated by giving an opinion, usually based on agreed criteria. Subjective tests include essay, short-answer, and take-home tests. Subjective test items allow subjectivity in the responses given by the test-takers; in this case, test-takers might provide acceptable alternative responses that the tester, teacher or test developer did not predict. Generally, a subjective test provides more opportunity for test-takers to show or demonstrate their understanding and in-depth knowledge and skills in the subject matter. Subjective tests assess the higher-order skills of analysis, synthesis and evaluation; in short, they enable students to be more creative and critical. Some students, however, become very anxious about these tests because they feel their writing skills are not up to par. Table 3.2 shows various types of objective and subjective assessments.

Objective Assessments: true/false items; multiple-choice items; multiple-response items; matching items.
Subjective Assessments: extended-response items; restricted-response items; essay.

Table 3.2: Various types of objective and subjective assessments

Some have argued that the distinction between objective and subjective assessments is neither useful nor accurate because, in reality, there is no such thing as an 'objective' assessment. In fact, all assessments are created with inherent biases built into decisions about relevant subject matter and content, as well as cultural (class, ethnic, and gender) biases.

Reflection
1. Describe in depth the multiple-choice test item.
2. Explain in detail the various types of subjective test items.

Discussion
1. Identify at least three differences between formative and summative assessment.
2. What are the strengths of multiple-choice items compared to essay items?
3. Compare and contrast Norm-Referenced Tests with Criterion-Referenced Tests.
4. Informal assessments are often unreliable, yet they are still important in classrooms. Explain why this is the case, and defend your explanation with examples.

TOPIC 4  BASIC PRINCIPLES OF ASSESSMENT

4.0 SYNOPSIS
Topic 4 defines the basic principles of assessment (reliability, validity, practicality, washback, and authenticity) and the essential sub-categories within reliability and validity.

4.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. define the basic principles of assessment (reliability, validity, practicality, washback, and authenticity) and the essential sub-categories within reliability and validity;
2. explain the differences between validity and reliability; and
3. distinguish the different types of validity and reliability in tests and other instruments in language assessment.

4.2 FRAMEWORK OF TOPICS
Basic principles of assessment: Reliability; Validity; Authenticity; Practicality; Washback Effect; Objectivity; Interpretability; Types of Tests.

CONTENT SESSION FOUR (3 hours)

4.3 INTRODUCTION
Assessment is a complex, iterative process requiring skills, understanding, and knowledge in the exercise of professional judgment. In a validity chain, reliability in planning, implementing, and scoring student performances gives rise to valid assessment. In this process, there are five important criteria that testers ought to look into for "testing a test": reliability, validity, practicality, washback and authenticity. Since these five principles are context dependent, there is no priority order implied in the order of presentation.

4.4 RELIABILITY
Reliability means the degree to which an assessment tool produces stable and consistent results; it essentially denotes 'consistency, stability, dependability, and accuracy of assessment results' (McMillan, 2001a, p. 65, in Brown, 2008). It is a concept which is easily misunderstood (Feldt & Brennan, 1989). Fundamentally, a reliable test is consistent and dependable: if a tester administers the same test to the same test-taker, or to matched test-takers, on two occasions, the test should give the same results.

If the scores used by the tester do not reflect accurately what the test-taker actually did, then these scores lack reliability, so test administrators need to be sure that scoring is carried out properly. Errors can occur in scoring in any number of ways: adding up marks wrongly, transcribing scores from the test paper to the database inaccurately, or giving Level 2 when another rater would give Level 4. Since there is tremendous variability from teacher or tester to teacher/tester that affects student performance, a mark awarded by one marker might not be awarded by another marker, or might not be received on a similar assessment; students may perform really well on the first half of an assessment and poorly on the second half due to fatigue; and so on.

4.4.1 Rater Reliability
When humans are involved in the measurement procedure, there is a tendency towards error, bias and subjectivity in determining the scores of the same test; this lack of reliability in the scores students receive is a threat to validity. There are two kinds of rater reliability, namely inter-rater reliability and intra-rater reliability.

Inter-rater reliability refers to the degree of similarity between different testers or raters: can two or more testers/raters, without influencing one another, give the same marks to the same set of scripts (contrast with intra-rater reliability)? Rater reliability is assessed by having two or more independent judges score the test; the scores are then compared to determine the consistency of the raters' estimates. One way to test inter-rater reliability is to have each rater assign each test item a score; for example, each rater might score items on a scale from 1 to 10. You would then calculate the correlation between the two sets of ratings to determine the level of inter-rater reliability. Another means of testing inter-rater reliability is to have raters determine which category each observation falls into and then calculate the percentage of agreement between the raters: if the raters agree 8 out of 10 times, the test has an 80% inter-rater reliability rate.

According to Brown (2010), a reliable test can be described as follows:
 Consistent in its conditions across two or more administrations;
 Gives clear directions for scoring/evaluation;
 Has uniform rubrics for scoring/evaluation;
 Lends itself to consistent application of those rubrics by the scorer;
 Contains items/tasks that are unambiguous to the test-taker.

Intra-rater reliability, by contrast, is an internal factor: its main aim is consistency within the rater. Here, scores on a test are rated by a single rater/judge at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. Some papers graded during the day may get our full and careful attention, while others graded towards the end of the day are quickly glossed over. If a rater (teacher) has many examination papers to mark and does not have enough time to mark them, s/he might take much more care with, say, the first ten papers than with the rest, so the first ten might get higher scores. This inconsistency will affect the students' scores. Intra-rater reliability thus concerns the consistency of grading by a single rater, while inter-rater reliability involves two or more raters. Both inter- and intra-rater reliability deserve close attention, in that test scores are likely to vary from rater to rater or even within the same rater (Clark, 1979).

4.4.2 Test Administration Reliability
A number of factors influence test administration reliability. Unreliability occurs due to outside interference such as noise, temperature variations, variations in photocopying, the amount of light in various parts of the room, and even the condition of desks and chairs. Brown (2010) stated that he once witnessed the administration of a test of aural comprehension in which an audio player was used to deliver items for comprehension, but, due to street noise outside the building, test-takers sitting next to open windows could not hear the stimuli clearly. According to him, that was a clear case of unreliability caused by the conditions of the test administration.

4.4.3 Factors Influencing Reliability
The outcome of a test is influenced by many factors. Fundamentally, a test is considered to be reliable if its scores are consistent and do not differ from the scores of another equivalent and reliable test. However, tests are not free from errors: the test items are only samples of the subject being tested, and variation between the samples in two equivalent tests can be one cause of unreliable test outcomes. Note also that a valid test is said to be reliable, but a reliable test need not be valid: a consistent score does not necessarily measure what is intended to be measured. Figure 4.3 summarises the factors that can affect the reliability of a test: test length factors, teacher and student factors, environment factors, test administration factors, and marking factors.

Figure 4.3: Factors that can affect the reliability of a test (Test Factor; Teacher and Student Factor; Environment Factor; Test Administration Factor; Marking Factor)

a. Test length factors
In general, longer tests produce higher reliabilities: the scores will be more accurate if the duration of the test is longer, because shorter tests depend more on coincidence and guessing. An objective test has higher consistency because it is not exposed to a variety of interpretations.

b. Teacher-student factors
In most tests, it is normal for teachers to construct and administer tests for their own students. Hence, a good teacher-student relationship helps increase the consistency of the results. Other factors that contribute positively to the reliability of a test include the teacher's encouragement, the students' familiarity with the test formats, positive mental and physical condition, and perseverance and motivation.

c. Environment factors
An examination environment certainly influences test-takers and their scores. A favourable environment, with comfortable chairs and desks, good ventilation, and sufficient light and space, will improve the reliability of the test. On the contrary, a non-conducive environment will affect test-takers' performance and test reliability.

d. Test administration factors
Because students' grades depend on the way tests are administered, test administrators should strive to provide clear and accurate instructions, sufficient time, and careful monitoring of tests to improve the reliability of their tests. A test-retest technique can be used to determine test reliability.

e. Marking factors
Unfortunately, we human judges have many opportunities to introduce error into our scoring of essays (Linn & Gronlund, 2000; Weigle, 2002). Brennan (1996) has reported that in large-scale, high-stakes marking panels that are tightly trained and monitored, marker effects are small; in low-stakes, small-scale marking, on the contrary, there is potentially a large error introduced by individual markers. It is possible that our scoring invalidates many of the interpretations we would like to make based on this type of assessment. It is also

common that different markers award different marks for the same answer, even with a prepared mark scheme, and a marker's assessment may vary from time to time and in different situations. Conversely, this does not happen with the objective type of test, since the responses are fixed; objectivity is thus a condition for reliability.

4.5 VALIDITY
The second characteristic of a good test is validity, which refers to whether the test is actually measuring what it claims to measure. Validity refers to the evidence base that can be provided about the appropriateness of the inferences, uses, and consequences that come from assessment (McMillan, 2001a). Appropriateness has to do with the soundness, trustworthiness, or legitimacy of the claims or inferences that testers would like to make on the basis of obtained scores. This is important because we do not want to make claims concerning what a student can or cannot do based on a test when the test is actually measuring something else. Validity is not a characteristic of a test or assessment, but a judgment, which can have varying degrees of strength; we have to evaluate the whole assessment process and its constituent parts by how soundly we can defend the consequences that arise from the inferences and decisions we make. Validity is usually determined logically, although several types of validity may use correlation coefficients.

According to Brown (2010), a valid test of reading ability actually measures reading ability, and not 20/20 vision, previous knowledge of a subject, or some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test is practical (easy to administer) and its scoring quite dependable (reliable), but it would not constitute a valid test of writing ability without taking into

account its comprehensibility, rhetorical discourse elements, and the organisation of ideas.

The important notion here is the purpose: it is fairly obvious that a valid assessment should have good coverage of the criteria (concepts, skills and knowledge) relevant to the purpose of the examination. The following are the different types of validity:
 Face validity: Do the assessment items appear to be appropriate?
 Content validity: Does the assessment content cover what you want to assess? Have satisfactory samples of language and language skills been selected for testing?
 Construct validity: Are you measuring what you think you're measuring? Is the test based on the best available theory of language and language use?
 Concurrent validity: Can you use the current test scores to estimate scores on other criteria? Does the test correlate with other existing measures?
 Predictive validity: Is it accurate for you to use your existing students' scores to predict future students' scores? Does the test successfully predict future outcomes?

Figure 4.5: Types of Validity (a. Face Validity; b. Content Validity; c. Construct Validity; d. Concurrent Validity; e. Predictive Validity)

4.5.1 Face validity
Face validity is validity which is "determined impressionistically, for example by asking students whether the examination was appropriate to their expectations" (Henning, 1987). It rests on the subjective judgement of the examinees who take the test, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers. Mousavi (2009) refers to face validity as the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure. It is pertinent that a test looks like a test, even at first impression. If students taking a test do not feel that the questions given to them are a test or part of a test, then the test may not be valid, as the students may not take the questions seriously; the test will then not be able to measure what it claims to measure.

4.5.2 Content validity
Content validity "is concerned with whether or not the content of the test is sufficiently representative and comprehensive for the test to be a valid measure of what it is supposed to measure" (Henning, 1987). The most important step in ensuring content validity is to make sure all content domains are represented in the test. Another method of verifying validity is to use a Table of Test Specifications, which gives detailed information on each content area: the number of items, the level of difficulty, the level of skills, and the item representation for each content area, skill or topic. We can quite easily imagine taking a test after going through an entire language course. How would you feel if, at the end of the course, your final examination consisted of only one question covering one element of language from the many that were introduced in the course? If the language course was a conversational course focusing on the different social situations one may encounter, how valid is a final examination that requires you to demonstrate your ability to place an order at a posh restaurant in a five-star hotel?

4.5.3 Construct validity
A construct is a psychological concept used in measurement. Proficiency, communicative competence and fluency are examples of linguistic constructs; self-esteem and motivation are psychological constructs. Fundamentally, every issue in language learning and teaching involves theoretical constructs. Construct validity is the most obvious reflection of whether a test measures what it is supposed to measure, as it directly addresses the issue of what it is that is being measured: it refers to whether the underlying theoretical constructs that the test measures are themselves valid. When you are assessing a student's oral proficiency, for instance, the test should, to possess construct validity, cover the various components of fluency: speed, rhythm,

juncture, (lack of) hesitations, and other elements within the construct of fluency. Tests are, in a manner of speaking, operational definitions of constructs, in that their test tasks are the building blocks of the entity that is being measured (see Davidson, Hudson, & Lynch, 1985; McNamara, 2000).

4.5.4 Concurrent validity
Concurrent validity is the use of another more reputable and recognised test to validate one's own test. Suppose, for example, that you come up with your own new test and would like to determine its validity. Using concurrent validity, you would look for a reputable test and compare your students' performance on your test with their performance on the reputable and acknowledged test. In concurrent validation, a correlation coefficient is obtained and used to generate an actual numerical value: a high positive correlation of 0.7 to 1 indicates that the learners' scores are relatively similar on the two tests or measures. Criterion-related evidence usually falls into one of two categories, concurrent and predictive validity. Thus a classroom test designed to assess mastery of a point of grammar in communicative use will have criterion validity if test scores are verified either by observed subsequent behaviour or by other communicative measures of the grammar point in question. For example, in a course unit whose objective is for students to be able to orally produce voiced and unvoiced stops in all possible phonetic environments, the results of one teacher's unit test might be compared with an independent assessment, such as a commercially produced test, of similar phonemic proficiency.

4.5.5 Predictive validity
Predictive validity is closely related to concurrent validity in that it, too, generates a numerical value. For example, the predictive validity

of a university language placement test can be determined several semesters later by correlating the scores on the test with the GPAs of the students who took it. A test with high predictive validity is therefore a test that would yield predictable results on a later measure. A simple example of tests that may be concerned with predictive validity is the trial national examinations conducted at schools in Malaysia, as they are intended to predict the students' performance on the actual SPM national examinations (Norleha Ibrahim, 2009).

As mentioned earlier, validity is a complex concept, yet it is crucial to the teacher's understanding of what makes a good test. It is good to heed Messick's (1989, p. 36) caution that validity is not an all-or-none proposition and that various forms of validity may need to be applied to a test in order to be satisfied with its overall effectiveness.

What are reliability and validity? What determines the reliability of a test? What are the different types of validity? Describe any three types and cite examples.

4.5.6 Practicality
Although practicality is an important characteristic of tests, it is by far a limiting factor in testing. There will be situations in which, after we have already determined what we consider to be the most valid test, we need to reconsider the format purely because of practicality issues. A valid test of spoken interaction, for example, would require that the examinees be relaxed, interact with peers, and speak on topics with which they are familiar and comfortable. This sounds like the kind of conversation that people have with their friends while sipping afternoon tea by the roadside stalls. Of course, such a situation would be a highly valid measure of spoken interaction, if we could set it up. Imagine if we even tried to do so: it would require hidden cameras as

well as a lot of telephone calls and money. Therefore, a more practical form of the test, especially if it is to be administered at the national level as a standardised test, is to have a short interview session of about fifteen minutes, using perhaps a picture or reading stimulus that the examinees would describe or discuss. Practicality issues can involve economics or costs, administration considerations such as time and scoring procedures, and the ease of interpretation. Tests are only as good as how well they are interpreted, so tests that cannot be easily interpreted will definitely cause many problems. Practicality issues, although limiting in a sense, cannot be dismissed if we are to come up with a useful assessment of language ability.

4.5.7 Objectivity
The objectivity of a test refers to the extent to which the teachers/examiners who mark the answer scripts award the same scores to the same answers. A test is said to have high objectivity when an examiner is able to give the same score to similar answers, guided by the mark scheme. An objective test has the highest level of objectivity, because its scoring is not influenced by the examiner's skills and emotions; a subjective test, meanwhile, is said to have the lowest objectivity. Various studies have found that different examiners tend to award different scores to an essay test. It is also possible that the same examiner would give different scores to the same essay if s/he were to re-check it at different times.

4.5.8 Washback effect
The term 'washback' or 'backwash' (Hughes, 2003, p. 1) refers to the impact that tests have on teaching and learning. Such impact is usually seen as being negative: tests are said to force teachers to do things they do not necessarily wish to do. However, some

have argued that tests are potentially also 'levers for change' in language education, the argument being that if a bad test has negative impact, a good test should or could have positive washback (Alderson, 1986b; Pearson, 1988). Washback is generally said to be either positive or negative. Cheng, Watanabe, and Curtis (2004) offered an entire anthology on the issue of washback, while Spratt (2005) challenged teachers to become agents of beneficial washback in their language classrooms.

In large-scale assessment, washback often refers to the effects that tests have on instruction in terms of how students prepare for the test. In classroom-based assessment, washback can have a number of positive manifestations, ranging from the benefit of preparing and reviewing for a test to the learning that accrues from feedback on one's performance. Brown (2010) discusses the factors that provide beneficial washback in a test. He mentions that such a test can positively influence what and how teachers teach and students learn; offer learners a chance to adequately prepare; give learners feedback that enhances their language development; be more formative in nature than summative; and provide conditions for peak performance by the learners.

The challenge to teachers is to create classroom tests that serve as learning devices through which washback is achieved. Teachers can provide information that "washes back" to students in the form of useful diagnoses of strengths and weaknesses. Students' incorrect responses can become a platform for further improvement, while their correct responses need to be complimented, especially when they represent accomplishments in a student's developing competence. Teachers can use various strategies in providing guidance or coaching. Washback enhances a number of basic principles of language acquisition, namely intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment.

Unfortunately, students and teachers tend to think of the negative effects of testing, such as "test-driven" curricula and only studying and learning "what they need to know for the test". Positive washback, or what we prefer to call "guided washback", can instead benefit teachers, students and administrators. Positive washback occurs when a test encourages good teaching practice. It assumes that testing and curriculum design are both based on clear course outcomes which are known to both students and teachers/testers. If students perceive that tests are markers of their progress towards achieving these outcomes, they have a sense of accomplishment. In short, tests must be part of learning experiences for all involved. Washback is particularly obvious when the tests or examinations in question are regarded as very important and as having a definite impact on the student's or test-taker's future. We would expect, for example, that national standardised examinations would have strong washback effects compared with a school-based or classroom-based test.

4.5.9 Authenticity
Another major principle of language testing is authenticity. It is a concept that is difficult to define, particularly within the art and science of evaluating and designing tests. Brown (2010), citing Bachman and Palmer (1996), defines authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task" (p. 23), and suggests an agenda for identifying those target language tasks and for transforming them into valid test items. Language learners are motivated to perform when they are faced with tasks that reflect real-world situations and contexts. Good testing or assessment strives to use formats and tasks that reflect the types of situation in which students would authentically use the target

language. Whenever possible, teachers should attempt to use authentic materials in testing language skills.

4.6.0 Interpretability
Test interpretation encompasses all the ways that meaning is assigned to scores. Proper interpretation requires knowledge about the test, which can be obtained by studying its manual and other materials, along with the current research literature on its use; no one should undertake the interpretation of scores on any test without such study. In any test interpretation, the following considerations should be taken into account.

A. Consider Validity: Proper test interpretation requires knowledge of the validity evidence available for the intended use of the test. The nature of the validity evidence required depends upon the test's use. Indeed, use of a measurement for a purpose for which it was not designed may constitute misuse; its validity for other uses is not relevant.

B. Consider Reliability: Reliability is important because it is a prerequisite to validity, and because the degree to which a score may vary due to measurement error is an important factor in its interpretation.

C. Scores, Norms, and Related Technical Features: The result of scoring a test or subtest is usually a number called a raw score, which by itself is not interpretable. Additional steps are needed to translate the number into either a verbal description (e.g. pass or fail) or a derived score (e.g. a standard score). Less than full understanding of these procedures is likely to produce errors in interpretation, and ultimately in counselling or other uses.

D. Administration and Scoring Variation: Stated criteria for score interpretation assume standard procedures for administering and

scoring the test. Departures from standard conditions and procedures modify and often invalidate these criteria.

Discussion
1. Based on samples of formative and summative assessments, discuss aspects of reliability/validity that must be considered in these assessments.
2. Discuss the importance of authenticity in testing. Study some commercially produced tests and evaluate the authenticity of these tests/test items.
3. Discuss measures that a teacher can take to ensure high validity of language assessment for the primary classroom.

TOPIC 5  DESIGNING CLASSROOM LANGUAGE TESTS

5.0 SYNOPSIS
Topic 5 exposes you to the stages of test construction, the preparation of a test blueprint/test specifications, the elements in the Test Specifications Guidelines, and the importance of following the guidelines for constructing test items. We then look at the various test formats that are appropriate for language assessment.

5.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. identify the different stages of test construction;
2. describe the features of a test specification;
3. draw up a test specification that reflects both the purpose and the objectives of the test;
4. categorise test items according to Bloom's taxonomy;
5. compare and contrast Bloom's taxonomy and the SOLO taxonomy;
6. discuss the elements of test items of high quality, reliability and validity;
7. identify the elements in the Test Specifications Guidelines;
8. demonstrate an understanding of the importance of following the guidelines for constructing test items; and
9. illustrate test formats that are appropriate and meet the requirements of the learning outcomes.

5.2 FRAMEWORK OF TOPICS
• Stages of Test Construction
• Preparing Test Blueprint / Test Specifications
• Guidelines for Constructing Test Items
• Bloom's and SOLO Taxonomies
• Test Format

CONTENT
SESSION FIVE (3 hours)

5.3 Stages of Test Construction
Constructing a test is not an easy task; it requires a variety of skills along with deep knowledge of the area for which the test is to be constructed. The steps include:
i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating

5.3.1 Determining
The essential first step in testing is to make oneself perfectly clear about what it is one wants to know and for what purpose. When we start to construct a test, the following questions have to be answered:
• Who are the examinees?
• What kind of test is to be made?
• What is the precise purpose?
• What abilities are to be tested?
• How detailed and how accurate must the results be?
• How important is the backwash effect?
• What constraints are set by the unavailability of expertise, facilities, and time (for construction, administration, and scoring)?
• What is the scope of the test?

5.3.2 Planning
The first form that the solution takes is a set of specifications for the test. This will include information on content, format and timing, criteria, levels of performance, and scoring procedures, as well as the nature of the population of examinees for whom the test is being designed. In this stage, the test constructor has to determine the content by addressing the following:
• Describing the purpose of the test.
• Describing the characteristics of the test takers.
• Defining the nature of the ability we want to measure.
• Determining the format and timing of the test.
• Determining levels of performance.
• Determining scoring procedures.
• Identifying resources and developing a plan for their allocation and management.
• Developing a plan for evaluating the qualities of test usefulness, i.e. the degree to which a test is useful for teachers and students; it includes six qualities: reliability, validity, authenticity, practicality, interactiveness, and impact.

5.3.3 Writing
Although writing items is time-consuming, writing good items is an art. No one can expect to be able consistently to produce perfect items. Some items will have to be rejected, others reworked. The best way to identify items that have to be improved or abandoned is through teamwork. Colleagues must really try to find fault, and, despite the seemingly inevitable emotional attachment that item writers develop to items that they have created, they must be open to, and ready to accept, the criticisms that are offered to them.

Another basic aspect in writing the items of a test is sampling. It is most unlikely that everything found under the heading of 'Content' in the specifications can be included in any one version of the test. Sampling means that test constructors choose widely from the whole area of the course content, so that the content of the test is a representative sample of the course material. One should not concentrate solely on elements known to be easy to test; rather, choices have to be made for content validity and for beneficial backwash. Test writers should also try to avoid test items which can be answered through test-wiseness. Test-wiseness refers to the capacity of examinees to utilise the characteristics and formats of the test to guess the correct answer.

5.3.4 Preparing
One has to understand the major principles and techniques of preparing test items, and have experience of doing so. Not every teacher can make a good tester. To construct different kinds of tests, the tester should observe some principles. Test item writers should possess the following characteristics:
• They have to be experienced in test construction.
• They have to be quite knowledgeable of the content of the test.
• They should have the capacity to use language clearly and economically.
• They have to be ready to sacrifice time and energy.
Good personal relations are a desirable quality in any test writing team.

5.3.5 Reviewing
Principles for reviewing test items:
• The test should not be reviewed immediately after its construction, but after some considerable time.
• Other teachers or testers should review it; in a language test, it is preferable if native speakers are available to review the test. In reviewing, we have to bear in mind that no comment is unnecessary.
• Numerical data (test results) should be collected to check the efficiency of the items.

5.3.6 Pre-testing
After reviewing the test, it should be submitted to pre-testing. The tester should administer the newly-developed test to a group of examinees similar to the target group. The purpose is to analyse every individual item as well as the whole test; the analysis should include item facility and discrimination.

5.3.7 Validating
Item Facility (IF) shows to what extent an item is easy or difficult. To measure the facility or easiness of an item, the following formula is used:

IF = number of correct responses (Σc) / total number of candidates (N)

and to measure item difficulty:

IF = number of wrong responses (Σw) / total number of candidates (N)

The results of such equations range from 0 to 1. An item with a facility index of 0 is too difficult, and one with an index of 1 is too easy; items should be neither too easy nor too difficult. A value less than 0.37 marks a difficult item and a value above 0.63 an easy one; the ideal item has a value of 0.5, and the acceptable range for item facility is 0.37 to 0.63. As noted in Topic 4, reliability is one of the complementary aspects of measurement, and tests which are too easy or too difficult for a given sample population often show low reliability.
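As a small illustrative sketch (not part of the original module; the data values are hypothetical), the item facility formula and the thresholds above can be written in a few lines of Python:

```python
def item_facility(num_correct: int, num_candidates: int) -> float:
    """IF = number of correct responses (Sigma-c) / total number of candidates (N)."""
    if num_candidates <= 0:
        raise ValueError("number of candidates must be positive")
    return num_correct / num_candidates

def classify_item(if_value: float) -> str:
    """Apply the thresholds given above: below 0.37 the item is too difficult,
    above 0.63 it is too easy, and [0.37, 0.63] is the acceptable range."""
    if if_value < 0.37:
        return "too difficult"
    if if_value > 0.63:
        return "too easy"
    return "acceptable"

# Hypothetical pre-test data: 18 of 40 examinees answered the item correctly.
if_value = item_facility(18, 40)
print(if_value, classify_item(if_value))  # 0.45 acceptable
```

An IF of 0.45 falls inside the acceptable band and is close to the ideal value of 0.5, so such an item would normally be retained after pre-testing.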

5.4 Preparing Test Blueprint / Test Specifications
Test specifications (specs) for classroom use can be an outline of your test: what it will "look like" (Brown, 2010). Consider your test specs as a blueprint of the test that includes the following:
• a description of its content;
• item types (methods such as multiple-choice, cloze, written essay, reading a short passage, etc.);
• tasks;
• skills to be included;
• how the test will be scored; and
• how it will be reported to students.
For classroom purposes, the specs are your guiding plan for designing an instrument that effectively fulfils your desired principles, especially validity (Davidson & Lynch, 2002). It is vital to note that for large-scale standardised tests like the Test of English as a Foreign Language (TOEFL® Test), the International English Language Testing System (IELTS), and the Michigan English Language Assessment Battery (MELAB), which are intended to be widely distributed and thus broadly generalised, test specifications are much more formal and detailed (Spaan, 2006). They are also usually confidential, so that the institution designing the test can ensure the validity of subsequent forms of the test.

However, what exactly is an item for a test? An item is a tool, an instrument, an instruction or a question used to get feedback from test-takers; that feedback is evidence of something that is being measured, which is useful information in measuring or asserting a construct. Items can be classified as recall and thinking items: a recall item requires one to recall in order to answer it, while a thinking item requires test-takers to use their thinking skills. Many language teachers claim that it is difficult to construct an item. In reality, it is rather easy to develop an item if we are committed to the planning of the measuring instruments used to evaluate students' achievement.

For instance, consider a grammar unit test that will be administered at the end of a three-week grammar course for high beginning adult learners (Level 2). The students will be taking a test that covers verb tenses and two integrated skills (listening/speaking and reading/writing).

The grammar class they attend serves to reinforce the grammatical forms that they have learnt in the two earlier classes. Based on the scenario above, the test specs that you design might consist of four sequential steps:
1. a broad outline of how the test will be organised;
2. which of the eight sub-skills you will test;
3. what the various tasks and item types will be; and
4. how results will be scored, reported to students, and used in future classes (washback).
Besides knowing the purpose of the test you are creating, you need to examine the objectives for the unit you are testing carefully. Do not conduct a test hastily; you are required to know as precisely as possible what it is you want to test.

5.5 Bloom's and SOLO Taxonomies
5.5.1 Bloom's Taxonomy (Revised)
Bloom's Taxonomy is a systematic way of describing how a learner's performance develops from simple to complex levels in the affective, psychomotor and cognitive domains of learning. The Original Taxonomy provided carefully developed definitions for each of the six major categories in the cognitive domain. The categories were Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation. With the exception of Application, each of these was broken into subcategories. The complete structure of the original Taxonomy is shown in Figure 5.1.

Figure 5.1: Original Terms of Bloom's Taxonomy
Retrieved from: http://www.kurwongbss.qld. .htm
Adapted from: Pohl (2000), Learning to Think, Thinking to Learn

In the original Taxonomy there are six stages, namely Knowledge, Comprehension, Application, Analysis, Synthesis and Evaluation. The categories were ordered from simple to complex and from concrete to abstract. Further, it was assumed that the original Taxonomy represented a cumulative hierarchy; that is, mastery of each simpler category was prerequisite to mastery of the next more complex one. Traditional education tends to base student learning in this cognitive domain.

Unfortunately, the Knowledge category embodied both noun and verb aspects. The noun, or subject matter, aspect was specified in Knowledge's extensive subcategories, while the verb aspect was included in the definition given to Knowledge, in that the student was expected to be able to recall or recognise knowledge. This brought uni-dimensionality to the framework at the cost of a Knowledge category that was dual in nature and thus different from the other Taxonomic categories. In the 1990s, Anderson (a former student of Bloom) eliminated this inconsistency in the revised Taxonomy by allowing these two aspects, the noun and the verb, to form separate dimensions: the noun providing the basis for the Knowledge dimension and the verb forming the basis for the Cognitive Process dimension, as shown in Figure 5.2.

Figure 5.2: Bloom's Revised Taxonomy
Retrieved from: http://www.kurwongbss.qld. .htm

In the revised Bloom's Taxonomy, the names of the six major categories were changed from noun to verb forms in order to better reflect the nature of the thinking defined in each category. As the taxonomy reflects different forms of thinking, and thinking is an active process, verbs were used instead of nouns. The Knowledge category was renamed: knowledge is an outcome or product of thinking, not a form of thinking per se, so the word knowledge was inappropriate to describe a category of thinking and was replaced with the word remembering instead. Comprehension and synthesis were retitled understanding and creating respectively. Besides this, the subcategories of the six major categories were also replaced by verbs, and some subcategories were reorganised. Table 3 below provides a summary of the above.

Table 3: The Cognitive Process Dimension

Level 1 (C1) Remember: retrieve knowledge from long-term memory.
• Recognising (identifying): locating knowledge in long-term memory that is consistent with presented material
• Recalling (retrieving): retrieving relevant knowledge from long-term memory

Level 2 (C2) Understand: construct meaning from instructional messages, including oral, written, and graphic communication.
• Interpreting (clarifying, paraphrasing, representing, translating): changing from one form of representation to another
• Exemplifying (illustrating, instantiating): finding a specific example or illustration of a concept or principle
• Classifying (categorising, subsuming): determining that something belongs to a category
• Summarising (abstracting, generalising): abstracting a general theme or major point(s)
• Inferring (concluding, extrapolating, interpolating, predicting): drawing a logical conclusion from presented information
• Comparing (contrasting, mapping, matching): detecting correspondences between two ideas, objects, and the like
• Explaining (constructing models): constructing a cause-and-effect model of a system

Level 3 (C3) Apply: apply a procedure to a familiar or an unfamiliar task.
• Executing (carrying out): applying a procedure to a familiar task
• Using: applying a procedure to an unfamiliar task

Level 4 (C4) Analyse: break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose.
• Differentiating (discriminating, distinguishing, focusing, selecting): distinguishing relevant from irrelevant parts, or important from unimportant parts, of presented material
• Organising (finding coherence, integrating, outlining, parsing, structuring): determining how elements fit or function within a structure
• Attributing (deconstructing): determining a point of view, bias, values, or intent underlying presented material

Level 5 (C5) Evaluate: make judgments based on criteria and standards.
• Checking (coordinating, detecting, monitoring, testing): detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented
• Critiquing (judging): detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem

Level 6 (C6) Create: put elements together to form a coherent or functional whole; reorganise elements into a new pattern or structure.
• Generating (hypothesising): coming up with alternative hypotheses based on criteria
• Planning (designing): devising a procedure for accomplishing some task
• Producing (constructing): inventing a product

The Knowledge Domain
• Factual Knowledge: the basic elements students must know to be acquainted with a discipline or solve problems in it
• Conceptual Knowledge: the interrelationships among the basic elements within a larger structure that enable them to function together
• Procedural Knowledge: how to do something; methods of inquiry; and criteria for using skills, algorithms, techniques, and methods
• Metacognitive Knowledge: knowledge of cognition in general, as well as awareness and knowledge of one's own cognition

5.5.2 SOLO Taxonomy
The SOLO taxonomy, on the other hand, is a systematic way of describing how a learner's performance develops from simple to complex levels in their learning. SOLO stands for the Structure of the Observed Learning Outcome; Biggs & Collis first introduced it in their 1982 study. SOLO is a means of classifying learning outcomes in terms of their complexity, enabling teachers to assess students' work in terms of its quality, not of how many bits of this and of that they got right. Students find learning more complex as it advances. There are five stages, namely Prestructural, Unistructural and Multistructural, which are in a quantitative phase, and Relational and Extended Abstract, which are in a qualitative phase. At first we pick up only one or a few aspects of the task (unistructural), then several aspects, but they are unrelated (multistructural); then we learn how to integrate them into a whole (relational); and finally, we are able to generalise that whole to as yet untaught applications (extended abstract).

The SOLO taxonomy thus maps the complexity of a student's work by linking it to one of five phases: little or no understanding (Prestructural), through a simple and then more developed grasp of the topic (Unistructural and Multistructural), to the ability to link the ideas and elements of a task together (Relational), and finally to understanding the topic for themselves, possibly going beyond the initial scope of the task (Extended Abstract) (Biggs & Collis, 1982; Hattie & Brown, 2004). In their later research into multimodal learning, Biggs & Collis noted that there was an 'increase in the structural complexity of their (the students') responses' (1991:64). The diagram below lists verbs typical of each level.

Figure 5.3: SOLO Taxonomy
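Since the diagram itself is not reproduced here, the following sketch illustrates the kind of verb-to-level mapping it shows. The verb lists and the helper `suggest_solo_level` are assumptions drawn from common summaries of the taxonomy, not from the module's figure:

```python
# Verbs commonly associated with each SOLO level (assumed lists, for illustration).
SOLO_VERBS = {
    "Unistructural": ["identify", "name", "define"],
    "Multistructural": ["describe", "list", "outline", "combine"],
    "Relational": ["compare", "contrast", "explain", "analyse", "relate"],
    "Extended Abstract": ["theorise", "generalise", "hypothesise", "reflect"],
}

def suggest_solo_level(task_prompt: str) -> str:
    """A first-pass guess at the SOLO level a task targets, based on its opening
    verb. Real classification looks at the student's response, not the prompt."""
    first_word = task_prompt.lower().split()[0]
    for level, verbs in SOLO_VERBS.items():
        if first_word in verbs:
            return level
    return "unclassified"

print(suggest_solo_level("Compare the two characters' motives."))  # Relational
```

A keyword match like this is only a rough aid for drafting tasks; as the discussion below notes, SOLO was designed primarily for judging the structure of completed student work.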

It may be useful to view the SOLO taxonomy as an integrated strategy to be used in lesson design, in task guidance, and in formative and summative assessment (Smith & Colby, 2007; Hattie, 2009; Nückles et al., 2009; Smith, 2011). The structure of the taxonomy encourages viewing learning as an ongoing process, moving from simple recall of facts towards a deeper understanding: learning is a series of interconnected webs that can be built upon and extended. A range of SOLO-based techniques exist to assist teachers and students:

• Use of constructive alignment (Biggs & Tang, 2007, cited in Allen, 2009) encourages teachers to be more explicit when creating learning objectives, focusing on what the student should be able to do and at which level. This is essential for a student to make progress, and allows for the creation of rubrics for use in class (Black & Wiliam, 2009; Hook & Mills, 2011), to make the process explicit to the student.

• HOTS (Higher Order Thinking Skills) maps (Hook & Mills, 2011) can be used in English to scaffold in-depth discussion, encouraging students to develop interpretations, use research and critical thinking effectively to develop their own answers, and write essays that engage with the critical conversation of the field (Linkon, 2005:247).

• SOLO may also be helpful in providing a range of techniques for differentiated learning (Anderson, 2009; Huang, 2012), thereby enabling deep understanding and long-term retention. This would help to develop Smith's (2011:92) "self-regulating, self-evaluating learners who were well motivated by learning." Nückles et al. (2009:261) elaborate:

Cognitive strategies such as organization and elaboration are at the heart of meaningful learning because they enable the learner to organize learning into a coherent structure and integrate new information with existing knowledge.

The SOLO taxonomy has a number of proponents. Hook & Mills (2011:5) refer to it as 'a model of learning outcomes that helps schools develop a common understanding', as does Moseley et al. (2005). Campbell et al. (2005:306) advocate its use as a 'framework for developing the quality of assessment', citing that it is 'easily communicable to students'. It can be effectively used for students to deconstruct exam questions to understand the marks awarded, and as a vehicle for self-assessment and peer-assessment. Hattie (2012:54), in his wide-ranging investigation into effective teaching and 'visible learning', outlines three levels of understanding: surface, deep and conceptual. He indicates that:

The most powerful model for understanding these three levels and integrating them into learning intentions and success criteria is the SOLO model.

The SOLO taxonomy can thus be used not only in designing the curriculum in terms of the learning outcomes intended, but also in assessment.

However, the taxonomy is not without critics. There are concerns, naturally, when complex processes such as human thought are categorised in this manner. Chick (1998:20) believes that 'there is potential to misjudge the level of functioning', and Chan et al. (2002:512) criticise its 'conceptual ambiguity', stating that the 'categorisation' is 'unstable'. However, in these two studies the SOLO taxonomy was used primarily for assessing completed work, so use throughout the teaching process may alleviate these issues. An additional criticism, in particular when the taxonomy is compared with that of Bloom (1956), is the SOLO taxonomy's structure. Biggs & Collis (1991) refer to the structure as a hierarchy. However, Campbell et al. (1992) explained the structure of the SOLO taxonomy as consisting of a series of cycles (especially between the Unistructural, Multistructural and Relational levels), which would allow for a development of breadth of knowledge as well as depth.

5.6 Guidelines for Constructing Test Items
Tests do not work without well-written test items; a good test is only as good as the quality of its items. If the individual test items are not appropriate and do not perform well, how can the test scores be meaningful? Test item development is therefore a critical step in building a test that properly meets certain standards, and it is not an easy task. Test items must be developed to precisely measure the objectives prescribed by the blueprint and to meet quality standards. The following presents the major characteristics of well-written test items.

5.6.1 Aim of the test
Test-takers should clearly understand what is needed in education and language assessment to prepare for the examination, and how much experience performing certain activities would help in preparation. Test-takers also appreciate clearly written questions that do not attempt to trick or confuse them into incorrect responses.

5.6.2 Range of the topics to be tested
A test must measure the test-takers' ability or proficiency in applying the knowledge and principles of the topics that they have learnt. The topic to be evaluated (the construct) and where the evaluation is done (the title/context) must be part of the curriculum; if it is evaluated outside the curriculum, the curricular validity of the item can be disputed. Ample opportunity must be given to students to learn the topics that are to be evaluated. This opportunity would include the availability of language teachers, well-equipped facilities, and the expertise of the language teachers in conducting the lessons and providing the skills and knowledge that will be evaluated.

5.6.3 Range of skills to be tested
Test item writers should always attempt to write test items that measure higher levels of cognitive processing. It should be a goal of the writer to ensure their items have cognitive characteristics exemplifying understanding, analysis, synthesis, critical thinking, problem-solving, evaluation and interpretation rather than just declarative knowledge. There are many theories that provide frameworks on levels of thinking, and Bloom's taxonomy is often cited as a tool to use in item writing. Always stick to writing important questions that represent, and can predict, whether a test-taker is proficient at high levels of cognitive processing.

5.6.4 Test format
Test items should always follow a consistent design so that the questioning process itself does not add unnecessary difficulty to answering the questions. A format that provides an initial starting structure to use in writing questions can be valuable for item writers; a logical and consistent stimulus format can help expedite the laborious process of writing test items, as well as supply a format for asking basic questions. When these formats are used, test-takers can quickly read and understand the questions, since the format is expected. This should be the road map that helps item writers create test items and helps test-takers understand what will be required of them to pass an examination. For example, to measure understanding of knowledge or facts, questions can begin with the following:
• What best defines ….?
• What is an example of ….?
• What is not a characteristic of ….?

5.6.5 Level of difficulty
A test has a planned number of questions at levels of difficulty and discrimination chosen to best determine mastery and non-mastery performance states. In any test item construction, we must ensure that weak students can answer the easy items, that intermediate language proficiency students can answer the easy and moderate items, and that high language proficiency students can answer the easy, moderate and advanced items. A reliable and valid test instrument should encompass all three levels of difficulty.
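As a small sketch (the item bank below is hypothetical, not from the module), a teacher could check that a draft paper covers all three difficulty levels before finalising it:

```python
from collections import Counter

# Hypothetical draft paper: (item id, planned difficulty level) pairs.
draft_items = [
    ("Q1", "easy"), ("Q2", "easy"), ("Q3", "moderate"),
    ("Q4", "moderate"), ("Q5", "advanced"),
]

# Tally the levels and flag any of the three required levels that are absent.
counts = Counter(level for _, level in draft_items)
missing = {"easy", "moderate", "advanced"} - set(counts)

print(dict(counts))
print("missing levels:", missing if missing else "none")
```

If `missing` is non-empty, the draft paper lacks one of the three levels and should be rebalanced before it is administered.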

5.6.6 International and Cultural Considerations (bias)
In standardised tests, when exams are distributed internationally, either in a single language or translated into other languages, always refrain from the use of slang, geographic references, or historical references or dates (holidays) that may not be understood by an international examinee. Tests need to be adapted for other societies so that meaning is fully and correctly translated and advantages are not given to any particular group. Steps should be taken to avoid item content that may bias gender, race or other cultural groups.

What are the good characteristics of a test item? Explain each characteristic of a test item in a graphic organiser.

http://books. .html?id=Ia3SGDfbaV0C&redir_esc=y

6.0 Test Format
What is the difference between test format and test type? For example, when you want to introduce a new kind of test item, organised a little differently from the existing test items, what do you say: test format or test type? Test format refers to the layout of questions on a test. For example, the format of a test could be two essay questions, 50 multiple-choice questions, a reading test, etc. For the sake of brevity, I will consider providing the outlines of some large-scale standardised tests.

UPSR
The Primary School Evaluation Test, also known as Ujian Penilaian Sekolah Rendah (commonly abbreviated UPSR; from Malay), is a national examination taken by all pupils in our country at the end of their sixth year in primary school, before they leave for secondary school. It is prepared and examined by the Malaysian Examinations Syndicate. This test consists of two papers, namely Paper 1 and Paper 2. Paper 1 uses multiple-choice questions answered on a standardised optical answer sheet that uses optical mark recognition for detecting answers, while Paper 2 comprises three sections, namely Sections A, B, and C.

TOEFL (Test of English as a Foreign Language)
The TOEFL test is administered in two ways: as an Internet-based test (TOEFL iBT™) and as a paper-based test (TOEFL PBT). The TOEFL iBT® test is given in English and administered via the Internet; most of the 4,500+ test sites in the world use the TOEFL iBT. There are four sections (listening, reading, speaking and writing), which take a total of about four and a half hours to complete.

IELTS
IELTS is a test of all four language skills: Listening, Reading, Writing and Speaking. Test-takers take the Listening, Reading and Writing tests all on the same day, one after the other, with no breaks in between. Depending on the examinee's test centre, one's Speaking test may be on the same day as the other three tests, or up to seven days before or after that. The total test time is under three hours. The test format is illustrated below.

Figure 6: IELTS Test Format .

TOPIC 6
6.0 ASSESSING LANGUAGE SKILLS AND LANGUAGE CONTENT

SYNOPSIS
Topic 6 focuses on ways to assess language skills and language content. It defines the types of test items used to assess language skills and language content. It also provides teachers with suggestions on ways a teacher can assess listening, speaking, reading and writing skills in a classroom, and discusses the concepts of and differences between discrete point tests, integrative tests and communicative tests.

6.1 LEARNING OUTCOMES
At the end of Topic 6, teachers will be able to:
• identify and carry out the different types of assessment to assess language skills and language content;
• understand and differentiate between objective and subjective testing; and
• understand and differentiate between discrete point tests, integrative tests and communicative tests in assessing language.

6.2 FRAMEWORK OF TOPICS
Assessing Language Skills and Language Content:
• Language skills: Listening, Speaking, Reading, Writing
• Language content
• Discrete test, Integrative test, Communicative test
• Objective and subjective testing

CONTENT
SESSION SIX (6 hours)

6.2.1 Types of test items to assess language skills

a. Listening
Basically there are two kinds of listening tests: tests that test specific aspects of listening, like sound discrimination, and task-based tests which test skills in accomplishing different types of listening tasks considered important for the students being tested. In addition to this, Brown (2010) identified four types of listening performance from which assessment could be considered:

i. Intensive: listening for perception of the components (phonemes, words, intonation, discourse markers, etc.) of a larger stretch of language.

ii. Responsive: listening to a relatively short stretch of language (a greeting, question, command, comprehension check, etc.) in order to make an equally short response.

iii. Selective: processing stretches of discourse, such as short monologues of several minutes, in order to "scan" for certain information. The purpose of such performance is not necessarily to look for global or general meaning, but to be able to comprehend designated information in a context of longer stretches of spoken language (such as classroom directions from a teacher, TV or radio news items, or stories). Assessment tasks in selective listening could ask students, for example, to listen for names, numbers, directions (in a map exercise), or certain facts and events.

iv. Extensive: listening to develop a top-down, global understanding of spoken language. Extensive performance ranges from listening to lengthy lectures to listening to a conversation and deriving a comprehensive message or purpose. Listening for the gist, or the main idea, and making inferences are all part of extensive listening.

b. Speaking
In the assessment of oral production, both discrete-feature objective tests and integrative task-based tests are used. The first type tests such skills as pronunciation, knowledge of what language is appropriate in different situations, and the language required to do different things, like describing, giving instructions, etc. The second type involves finding out if pupils can perform different tasks using spoken language that is appropriate for the purpose and the context. Task-based activities involve describing scenes shown in a picture, narrating a story, giving directions, participating in a discussion about a given topic, etc. As in the listening performance assessment tasks, Brown (2010) cited five categories for oral assessment.

1. Imitative. At one end of a continuum of types of speaking performance is the ability to imitate a word or phrase or possibly a sentence. Although this is a purely phonetic level of oral production, a number of prosodic (intonation, rhythm, etc.), lexical and grammatical properties of language may be included in the performance criteria. We are interested only in what is traditionally labelled "pronunciation"; no inferences are made about the test-taker's ability to understand or convey meaning or to participate in an interactive conversation. The only role of listening here is in the short-term storage of a prompt, just long enough to allow the speaker to retain the short stretch of language that must be imitated.

2. Intensive. The production of short stretches of oral language designed to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or phonological relationships. Examples of intensive assessment tasks include directed response tasks (requests for specific production of speech), reading aloud, sentence and dialogue completion, limited picture-cued tasks including simple sentences, and translation up to the simple sentence level.

3. Responsive. Responsive assessment tasks include interaction and test comprehension but at the somewhat limited level of very short conversations, standard greetings and small talk, simple requests and comments, etc. The stimulus is almost always a spoken prompt (to preserve authenticity) with one or two follow-up questions or retorts:

A. Liza : Excuse me, do you have the time?
   Don : Yeah. Six-fifteen.

B. Lan : Hey, Shan, how's it going?
   Shan : Not bad, and yourself?
   Lan : I'm good.
   Shan : Cool. Okay gotta go.

C. Jo : What is the most urgent social problem today?
   Sue : I would say bullying.

4. Interactive. The difference between responsive and interactive speaking is in the length and complexity of the interaction, which sometimes includes multiple exchanges and/or multiple participants. Interaction can be broken down into two types: (a) transactional language, which has the purpose of exchanging specific information, and (b) interpersonal exchanges, which have the purpose of maintaining social relationships. (In the three dialogues cited above, A and B are transactional, and C is interpersonal.)

5. Extensive (monologue). Extensive oral production tasks include speeches, oral presentations, and storytelling, during which the opportunity for oral interaction from listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether. Language style is more deliberative (planning is involved) and formal for extensive tasks. Extensive speaking can also include informal monologue such as casually delivered speech (e.g., recalling a vacation in the mountains or recounting the plot of a novel or movie).

c. Reading

Cohen (1994) discussed various types of reading and the meanings assessed. He describes skimming and scanning as two different types of reading. In the first, a respondent is given a lengthy passage and is required to inspect it rapidly (skim) or read to locate specific information (scan) within a short period of time. He also discusses receptive reading or intensive reading, which refers to "a form of reading aimed at discovering exactly what the author seeks to convey" (p. 218). This is the most common form of reading, especially in test or assessment conditions. Another type of reading is to read responsively, where respondents are expected to respond to some point in a reading text through writing or by answering questions.

A reading text can also convey various kinds of meaning, and reading involves the interpretation or comprehension of these meanings. First, grammatical meanings are meanings that are expressed through linguistic structures such as complex and simple sentences and the correct interpretation of those structures. A second meaning is informational meaning, which refers largely to the concepts or messages contained in the text. Compared to grammatical or syntactic meaning, informational meaning requires a more general understanding of a text rather than having to pay close attention to the linguistic structure of sentences. Respondents may be required to comprehend merely the information or content of the passage, and this may be assessed through various means such as summary and précis writing. A third meaning contained in many texts is discourse meaning. This refers to the perception of rhetorical functions conveyed by the text. One typical function is discourse marking, which adds cohesiveness to a text. These words, such as unless, thus, however and therefore, are crucial to the correct interpretation of a text, and students may be assessed on their ability to understand the discoursal

meaning that they bring to the passage. Finally, a fourth meaning which may also be an object of assessment in a reading test is the meaning conveyed by the writer's tone. The writer's tone – whether it is cynical, sad, or otherwise – is important in reading comprehension but may be quite difficult to identify, especially by less proficient learners. Nevertheless, there can be many situations where the reader is completely wrong in comprehending a text simply because he has failed to perceive the correct tone of the author.

d. Writing

Brown (2004) identifies three different genres of writing: academic writing, job-related writing and personal writing, each of which can be expanded to include many different examples. Fiction, for example, may be considered personal writing according to Brown's taxonomy. Brown (2010) identified four categories of written performance that capture the range of written production which can be used to assess writing skill:

1. Imitative. To produce written language, the learner must attain the skills in the fundamental, basic tasks of writing letters, words, punctuation, and brief sentences. This category includes the ability to spell correctly and to perceive phoneme-grapheme correspondences in the English spelling system. At this stage the learners are trying to master the mechanics of writing. Form is the primary focus while context and meaning are of secondary concern.

2. Intensive (controlled). Beyond the fundamentals of imitative writing are skills in producing appropriate vocabulary within a context, collocations and idioms, and correct grammatical features up to the length of a sentence. Meaning and context are important in determining correctness and appropriateness, but most assessment tasks are more concerned with a focus on form and are rather strictly controlled by the test design.

3. Responsive. Assessment tasks require learners to perform at a limited discourse level, connecting sentences into a paragraph and creating a logically connected sequence of two or three paragraphs. Genres of writing include brief narratives and descriptions, short reports, lab reports, summaries, brief responses to reading, and interpretations of charts and graphs. Tasks relate to pedagogical directives, lists of criteria, outlines, and other guidelines. Form-focused attention is mostly at the discourse level, with a strong emphasis on context and meaning.

4. Extensive. Extensive writing implies successful management of all the processes and strategies of writing for all purposes, up to the length of an essay, a term paper, a major research project report, or even a thesis. Focus is on achieving a purpose, organizing and developing ideas logically, using details to support or illustrate ideas, demonstrating syntactic and lexical variety, and in many cases, engaging in the process of multiple drafts to achieve a final product. Focus on grammatical form is limited to occasional editing and proofreading of a draft.

6.2.2 Objective and Subjective Tests

Tests have been categorized in many different ways. The most familiar terms regarding tests are the objective test and the subjective test. Objective tests are tests that are graded objectively, while subjective tests are thought to involve subjectivity in grading. We normally associate objective tests with multiple choice question type tests and subjective tests with essays; however, to be more accurate we will consider how the test is graded. There are many examples of each type of test. Objective type tests include the multiple choice test, true-false items and matching items, because each of these is graded objectively: there is only one correct response and the grader does not need to subjectively assess the response. Examples of the subjective test include essays and short answer

questions. However, some other common tests, such as the dictation test, fill-in-the-blank tests, interviews and role plays, cannot be so neatly classified as subjective or objective; they fall on a continuum on which some tests are more objective than others. As such, each of these tests would fall closer to one end of the continuum or the other.

Two other terms, select type tests and supply type tests are related
terms when we think of objective and subjective tests. In most
cases, objective tests are similar to select type tests where
students are expected to select or choose the answer from a list of
options. Just as a multiple choice question test is an objective type
test, it can also be considered a select type test. Similarly, tests
involving essay type questions are supply type as the students are
expected to supply the answer through their essay. How then
would you classify a fill in the blank type test? Definitely for this
type of test, the students need to supply the answer, but what is
supplied is merely a single word or a short phrase which differs
tremendously from an essay. It may therefore be helpful to once
again consider a continuum with supply type and select type items
at each end of the continuum respectively.

It is possible to now combine both continua as shown in Figure 6.1
with the two different test formats placed within the two continua:

Figure 6.1: Continua for different types of test formats

It is not by accident that we find there are few, if any, test formats that are
either supply type and objective or select type and subjective. Select type
tests tend to be objective while supply type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students are
expected to respond. These categories are the selected response tests, the
constructed response tests, and the personal response tests. Examples of
each of these types of tests are given in Table 6.1.

Table 6.1: Types of Tests According to Students’ Expected Response

Selected response     Constructed response     Personal response
True false            Short answer             Self and peer assessment
Multiple choice       Performance test

Selected response assessments, according to Brown and Hudson
(1998), are assessment procedures in which “students typically do not
create any language” but rather “select the answer from a given list” (p.
658). Constructed response assessment procedures require students to
“produce language by writing, speaking, or doing something else” (p.
Personal response assessments, on the other hand, require
students to produce language but also allow each student’s response to
be different from one another and for students to “communicate what
they want to communicate” (p. 663). These three types of tests,
categorised according to how students respond, are useful when we
wish to determine what students need to do when they attempt to
answer test questions.


Types of test items to assess language content

a. Discrete Point Test and Integrative Test
Language tests may also be categorised as either discrete point or
integrative. Discrete point tests examine one element at a time.

An integrative test, on the other hand, “requires the candidate to
combine many language elements in the completion of a
task” (Hughes, 1989: 16). It is a simultaneous measure of
knowledge and ability of a variety of language features, modes, or skills.

A multiple choice type test is usually cited as an example of a
discrete point test while essays are commonly regarded as the
epitome of integrative tests. However, both the discrete point test
and the integrative test are a matter of degree. A test may be more
discrete point than another and similarly a test may be more
integrative than another. Perhaps the more important aspect is to
be aware of the discrete point or integrative nature of a test as we
must be careful of what we believe the test measures.

This brings us to the question of just how discrete point a multiple
choice question item really is. While it is definitely more discrete point
than an essay, it may still require more than just one skill or ability
to complete. Let’s say you are interested in testing a
student’s knowledge of the relative pronoun and decide to do so by
using a multiple choice test item. If he fails to answer this test item
correctly, would you conclude that the student has problems with
the relative pronoun? The answer may not be as straightforward as
it seems. The test is presented in textual form and therefore
requires the student to read. As such, even the multiple choice test
item involves some integration of language skills, as this example
shows: in addition to the grammatical knowledge of relative
pronouns, the student must also be able to read and understand the
test item.

Perhaps a clearer way of viewing the distinction between the
discrete point and the integrative test is to examine the perspective
each takes toward language. In the discrete point test, language is
seen to be made up of smaller units and it may be possible to test language by testing each unit at a time. Testing knowledge of the relative pronoun, for example, is certainly assessing the students on a particular unit of language and not on the language as a whole. In an integrative test, on the other hand, the perspective of language is that of an integrated whole which cannot be broken up into smaller units or elements. Hence, the testing of language should maintain the integrity or wholeness of the language.

b. Communicative Test

As language teaching has emphasised the importance of communication through the communicative approach, it is not surprising that communicative tests have also been given prominence. A communicative emphasis in testing involves many aspects, two of which revolve around communicative elements in tests and meaningful content. Both these aspects are briefly addressed in the following subsections.

Integrating Communicative Elements into Examinations

Alderson and Banerjee (2002) report on various studies that seem to point to the difficulty of achieving authenticity in tests. They cite Spence-Brown (2001), who posits that “the very act of assessment changes the nature of a potentially authentic task and compromises authenticity” and that “authenticity must be related to the implementation of an activity, not to its design” (p. 99). In her study, students were required to interview native speakers outside the classroom and submit a tape-recording of the interview. While this activity seems quite authentic, the students were observed to prepare for the interview by rehearsing the interview, editing the results, and engaging in “spontaneous, but flawed discourse” (Alderson & Banerjee, 2002: 99), all of which are inauthentic when viewed in terms of real-life situations. Alderson himself argues that because candidates in language tests are interested not in communicating but in displaying their language abilities, the test situation is a communicative event in itself and therefore cannot be used to replicate any real-world event (p. 98).

The idea of bringing communicative elements into the language test is not a new one. Fulcher (2000), in his review of communicative tests, notes the descriptors of a communicative test as suggested by several theorists. The three principles of communicative tests that he highlights are that communicative tests:

 involve performance;
 are authentic; and
 are scored on real-life outcomes.

Fulcher finally points out that in a communicative test, “the only real criterion of success … is the behavioural outcome, or whether the learner was able to achieve the intended communicative effect” (p. 493). Chalhoub-Deville (2003) argues for tests that take context into consideration. She believes that there should be a “shift in focus of our measurement from traditional examinations of the construct in terms of response consistency … to investigations that systematically explore inconsistent (which does not mean random) performances across contexts” (p. 378).

In short, the kinds of tests that we should expect more of in the future will be communicative tests in which candidates actually have to produce the language in an interactive setting involving some degree of unpredictability, which is typical of any language interaction situation. These tests would also take the communicative purpose of the interaction into consideration and require the student to interact with language that is actual and unsimplified for the learner. In the future, tests will also need to integrate elements of communication such as topic initiation, topic maintenance, and topic change, besides context, in order for the test to become more authentic and realistic.

It is obvious from this description that the communicative test may not be easily developed and implemented. Practical reasons may hinder some of its demands, involving especially the amount of time and extent of organisation required to allow for such communicative elements to emerge, and it will not be an easy task to achieve. Nevertheless, a solution to this problem has to be found in the near future in order to have valid language tests that are purposeful and can stimulate positive washback in teaching and learning.

Exercise 1

1. Describe the different types of writing performance as suggested by Brown (2004) and relate them to academic writing, job-related writing and personal writing.

2. In your opinion and based on your teaching experience, how would you conduct the testing of reading, writing and speaking skills of your own students? What are the methods that you employ? Share this with your classmates and exchange ideas.

TOPIC 7

7.0 SCORING, GRADING AND ASSESSMENT CRITERIA

SYNOPSIS

Topic 7 focuses on scoring, grading and assessment criteria. It provides teachers with brief descriptions of the different approaches to scoring, namely objective, holistic and analytic.

7.1 LEARNING OUTCOMES

By the end of Topic 7, teachers will be able to:
 Identify and differentiate the different approaches used in scoring
 Use the different approaches to scoring in assessing language

7.2 FRAMEWORK OF TOPICS

Approaches to scoring: objective, holistic, analytic

CONTENT

SESSION SEVEN (3 hours)

7.2.1 Objective approach

One type of scoring approach is the objective scoring approach. This scoring approach relies on quantified methods of evaluating students’ writing. A sample of how objective scoring is conducted is given by Bailey (1998) as follows:

 Establish standardization by limiting the length of the assessment: count the first 250 words of the essay.

 Identify the elements to be assessed: go through the essay up to the 250th word, underlining every mistake – from spelling and mechanics through verb tenses, vocabulary, morphology, etc. Include every error that a literate reader might note.

 Operationalise the assessment: assign a weight score to each error, from 3 to 1. A score of 3 is a severe distortion of readability or flow of ideas, 2 is a moderate distortion, and 1 is a minor error that does not affect readability in any significant way.

 Quantify the assessment: calculate the essay Correctness Score as a fraction with the 250 words as the numerator and the sum of all the error scores as the denominator (Bailey, 1998: 187).

7.2.2 Holistic approach

In holistic scoring, the reader reacts to the student’s composition as a whole and a single score is awarded to the writing. Normally this score is on a scale of 1 to 4, or 1 to 6, or even 1 to 10. Each score on the scale is accompanied by general descriptors of ability. The following is an example of a holistic scoring scheme based on a 6 point scale.
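Bailey's four steps reduce to a single division once the errors have been marked and weighted. A minimal sketch in Python (the error tally below is invented for illustration; in practice it comes from underlining the first 250 words of a real essay):

```python
# Objective scoring sketch after Bailey: Correctness Score = 250 / (sum of error weights).

def correctness_score(error_weights, word_limit=250):
    """Each weight is 3 (severe), 2 (moderate) or 1 (minor)."""
    total = sum(error_weights)
    if total == 0:
        return float(word_limit)  # an error-free sample
    return word_limit / total

# Hypothetical essay with 4 severe, 6 moderate and 10 minor errors:
errors = [3] * 4 + [2] * 6 + [1] * 10
print(correctness_score(errors))  # 250 / 34
```

A higher score therefore means fewer, or less severe, errors per 250 words.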

Table 7.1: Holistic Scoring Scheme

Rating 5-6:
 Vocabulary is precise, varied, and vivid.
 Organization is appropriate to writing assignment and contains clear introduction, development of ideas, and conclusion.
 Transition from one idea to another is smooth and provides reader with clear understanding that topic is changing.
 Meaning is conveyed effectively.
 A few mechanical errors may be present but do not disrupt communication.
 Shows a clear understanding of writing and topic development.

Rating 4:
 Vocabulary is adequate for grade level.
 Events are organized logically, but some part of the sample may not be fully developed.
 Some transition of ideas is evident.
 Meaning is conveyed but breaks down at times.
 Mechanical errors are present but do not disrupt communication.
 Shows a good understanding of writing and topic development.

Rating 3:
 Vocabulary is simple.
 Organization may be extremely simple or there may be evidence of disorganization.
 There are a few transitional markers or repetitive transitional markers.
 Meaning is frequently not clear.
 Mechanical errors affect communication.
 Shows some understanding of writing and topic development.

Rating 2:
 Vocabulary is limited and repetitious.
 Sample is comprised of only a few disjointed sentences.
 No transitional markers.
 Meaning is unclear.
 Mechanical errors cause serious disruption in communication.
 Shows little evidence of discourse understanding.

Rating 1:
 Responds with a few isolated words.
 No complete sentences are written.
 No evidence of concepts of writing.

Rating 0:
 No response.

Source: S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown University, Washington

The 6 point scale above includes broad descriptors of what a student’s essay reflects for each band. It is quite apparent that graders using this scale are expected to pay attention to vocabulary, organisation, transitions, meaning, mechanics, and topic development and communication.

Bailey also describes another type of scoring related to the holistic approach which she refers to as primary trait scoring. In primary trait scoring, a particular functional focus is selected based on the purpose of the writing, and grading is based on how well the student is able to express that function. This technique of grading emphasises functional and communicative ability rather than discrete linguistic ability and accuracy. For example, if the function is to persuade, scoring would be on how well the author has been able to persuade the grader rather than on how well organised the ideas were or how grammatical the structures in the essay were. Mechanics such as punctuation are secondary to communication.

7.2.3 Analytic approach

Analytical scoring is a familiar approach to many teachers. In analytical scoring, raters assess students’ performance on a variety of categories which are hypothesised to make up the skill of writing. Content, for example, is often seen as an important aspect of writing – i.e., is there substance to what is written? Is the essay meaningful? Similarly, we may also want to consider the organisation of the essay. Does the writer begin the essay with an appropriate topic sentence? Are there good transitions between paragraphs? Other categories that we may want to consider include vocabulary, language use and mechanics. The following are some possible components used in assessing writing ability using an analytical scoring approach, with the suggested weightage assigned to each:

Components        Weight
Content           30 points
Organisation      20 points
Vocabulary        20 points
Language Use      25 points
Mechanics          5 points

The points assigned to each component reflect the importance of each of the components.
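An analytic total is simply the sum of the component marks, each capped by its weight. A minimal sketch using the weightings above (the marks awarded to the sample essay are invented):

```python
# Analytic scoring sketch: each component is marked out of its weight
# and the total is the sum (maximum 100).

WEIGHTS = {
    "content": 30,
    "organisation": 20,
    "vocabulary": 20,
    "language_use": 25,
    "mechanics": 5,
}

def analytic_total(marks):
    for component, mark in marks.items():
        assert 0 <= mark <= WEIGHTS[component], f"{component} exceeds its weight"
    return sum(marks.values())

# Hypothetical marks awarded to one essay:
essay = {"content": 24, "organisation": 15, "vocabulary": 16,
         "language_use": 20, "mechanics": 4}
print(analytic_total(essay))  # 79
```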

Comparing the Three Approaches

Each of the three scoring approaches claims its own advantages and disadvantages. These can be illustrated by Table 7.2.

Table 7.2: Comparison of the Advantages and Disadvantages of the Three Approaches to Scoring Essays

Holistic
  Advantages:
   Quickly graded
   Provides a public standard that is understood by teachers and students alike
   Relatively higher degree of rater reliability
   Applicable to the assessment of many different topics
   Emphasises the students’ strengths rather than their weaknesses
  Disadvantages:
   The single score may actually mask differences across individual compositions
   Does not provide a lot of diagnostic feedback

Analytical
  Advantages:
   Allows the graders to consciously address important aspects of writing
   Provides clear guidelines in grading in the form of the various components
   Emphasises the students’ strengths rather than their weaknesses
  Disadvantages:
   Writing ability is unnaturally split up into components
   Still some degree of subjectivity involved

Objective
  Disadvantages:
   Accentuates negative aspects of the learner’s writing without giving credit for what they can do well

EXERCISE

1. Based on your understanding, draw a mind map to indicate the advantages and disadvantages of the three approaches to scoring essays.

TOPIC 8

8.0 ITEM ANALYSIS AND INTERPRETATION

SYNOPSIS

Topic 8 focuses on item analysis and interpretation. It provides teachers with brief descriptions of basic statistics terminologies such as mode, median, mean, standard deviation, standard score and interpretation of data. It will also look at item analysis that deals with item difficulty and item discrimination. Teachers will also be introduced to distractor analysis in language assessment.

8.1 LEARNING OUTCOMES

By the end of Topic 8, teachers will be able to:
 Identify and differentiate some basic statistics terminologies used;
 Determine how well items discriminate using item discrimination; and
 Analyse how well a distractor in a test item performs.

8.2 FRAMEWORK OF TOPICS

ITEM ANALYSIS AND INTERPRETATION
 BASIC STATISTICS: mode, median, mean, standard deviation, standard score, interpretation of data
 ITEM ANALYSIS: item difficulty, item discrimination, distractor analysis

CONTENT

SESSION EIGHT (6 hours)

8.2.1 Basic Statistics

Let us assume that you have just graded the test papers for your class. You now have a set of scores. If a person were to ask you about the performance of the students in your class, it would be very difficult to give all the scores in the class. Instead, you may prefer to cite only one score, or perhaps you would like to report on the performance by giving some values that would help provide a good indication of how the students in your class performed. What values would you give? In this section, we will look at two kinds of measures, namely measures of central tendency and measures of dispersion. Both these types of measures are useful in score reporting. Central tendency measures the extent to which a set of scores gathers around a central value. There are three major measures of central tendency: the mode, median and mean.

MODE

The mode is the most frequently occurring raw score in a set of scores. The following is a set of scores:

15, 12, 13, 17, 13, 14, 12, 13, 16, 18

What is the mode for this set of scores? If you said 13, then you are correct, as it occurs more often than the others. It is possible to have more than one mode in a set of scores. If there are two modes, the set of scores is referred to as being bimodal.

MEDIAN

The median refers to the score that is in the middle of the set of scores when the scores are arranged in ascending or descending order. Look at the following set of scores:

47, 54, 52, 45, 65, 50, 51

Always remember that when we wish to find the median, we have to first arrange the scores in either ascending or descending order of value. If we arrange them in order based on value, we get 45, 47, 50, 51, 52, 54, 65. There are seven scores in the set of scores above, and the median is 51 as it is the middle score: there are three scores lower than it and an equal number of scores higher than it.

What happens when there is an even number of scores? Let’s take the following set of scores as an example:

45, 47, 50, 51, 52, 54, 65, 65

As there is no one score that is in the middle, we need to take the two in the middle, add them up and divide by two. In this set of scores, the median is 51.5, as (51 + 52)/2 or 103/2 = 51.5.

MEAN

The mean of a set of test scores is the arithmetic mean or average and is calculated as ΣX/N, where Σ (sigma) refers to the sum of, X refers to the raw or observed scores, and N is the number of observed scores. The mean for the set of scores 47, 54, 52, 45, 65, 50, 51 is 364/7 = 52.
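The three measures can be checked directly with Python's standard statistics module, using the score sets above:

```python
import statistics

scores = [15, 12, 13, 17, 13, 14, 12, 13, 16, 18]
print(statistics.mode(scores))      # 13, the most frequent score

odd_set = [47, 54, 52, 45, 65, 50, 51]
print(statistics.median(odd_set))   # 51, the middle score once sorted

even_set = [45, 47, 50, 51, 52, 54, 65, 65]
print(statistics.median(even_set))  # (51 + 52) / 2 = 51.5

print(statistics.mean(odd_set))     # 364 / 7 = 52
```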

2. we come up with the following table: Table 8. There are two methods of calculating standard deviation which are the deviation method and raw score method which are illustrated by the following formulae. 25. To illustrate this.8.30. Using standard deviation method. we will use 20. we can come up with the following: .2 Standard deviation Standard deviation refers to how much the scores deviate from the mean.1:Calculating the Standard Deviation Using the Deviation Method Using the raw score method.

Table 8.2: Calculating the Standard Deviation Using the Raw Score Method

Both methods result in the same final value of 5. If you are calculating standard deviation with a calculator, it is suggested that the deviation method be used when there are only a few scores and the raw score method when there are many scores. This is because when there are many scores, it will be tedious to calculate the square of the deviations and their sum.

8.2.3 Standard score

Standardised scores are necessary when we want to make comparisons across tests and measurements. A standardised score can be computed for every raw score in a set of scores for a test. Z scores and T scores are the more common forms of standardised scores, although you may come up with your own standardised score.

i. The Z score

The Z score is the basic standardised score. It is referred to as the basic form because other computations of standardised scores must first calculate the Z score. The formula used to calculate the Z score is as follows:

Z = (X − X̄) / s

Table 8.3: Calculating the Z Score for a Set of Scores

Z score values are very small and usually range only from –2 to 2. Such small values make the Z score inappropriate for score reporting, especially for those unaccustomed to the concept. Imagine what a parent may say if his child comes home with a report card showing a Z score of 0.47 in English Language! Fortunately, there is another form of standardised score – the T score – with values that are more palatable to the relevant parties.

ii. The T score

The T score is a standardised score which can be computed using the formula 10(Z) + 50. As such, the T scores for students A, B, C, and D in Table 8.3 are 10(−1.28) + 50, 10(−0.23) + 50, 10(0.47) + 50, and 10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively.
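Both formulas take only a line each. A minimal sketch (the class mean of 60, standard deviation of 8 and the raw scores are invented for illustration):

```python
def z_score(raw, mean, sd):
    # Z expresses a raw score as a number of standard deviations from the mean.
    return (raw - mean) / sd

def t_score(z):
    # T = 10Z + 50 rescales Z onto a friendlier scale centred on 50.
    return 10 * z + 50

# Hypothetical class: mean 60, standard deviation 8.
for raw in (52, 60, 68):
    z = z_score(raw, 60, 8)
    print(z, t_score(z))  # -1.0 -> 40.0, 0.0 -> 50.0, 1.0 -> 60.0
```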

These values seem perfectly appropriate compared to the Z score. The T score average or mean is always 50 (i.e., a Z score of 0), which connotes an average ability and the midpoint of a 100 point scale.

8.2.4 Interpretation of data

The standardised score is actually a very important score if we want to compare performance across tests and between students. Let us take the following scenario as an example. How can En. Abu solve this problem? He would have to have standardised scores in order to decide. This would require the following information:

Test 1 : X̄ = 42, standard deviation = 7
Test 2 : X̄ = 47, standard deviation = 8

Using the information above, En. Abu can find the Z score for each raw score reported as follows:

Table 8.4: Z Score for Form 2A

Based on Table 8.4, both Ali and Chong have a negative Z score as their total score for both tests. However, Chong has a higher Z score total (i.e., –1.07 compared to –1.34) and therefore performed better when we take the performance of all the other students into consideration.
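En. Abu's comparison can be reproduced with the same Z-score formula. In this sketch the two pupils' raw scores are invented stand-ins, while the test means and standard deviations are those given above:

```python
tests = [(42, 7), (47, 8)]  # (mean, standard deviation) for Test 1 and Test 2

def z_total(raws):
    # Convert each raw score to a Z score and sum across the two tests.
    return sum((raw - mean) / sd for raw, (mean, sd) in zip(raws, tests))

# Hypothetical raw scores for the two pupils on Test 1 and Test 2:
ali = [35, 43]
chong = [38, 42]

print(round(z_total(ali), 2))    # -1.5
print(round(z_total(chong), 2))  # -1.2
# The pupil with the higher (less negative) total performed better
# relative to the rest of the class, even though both are below average.
```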

THE NORMAL CURVE

The normal curve is a hypothetical curve that is supposed to represent all naturally occurring phenomena. It is assumed that if we were to sample a particular characteristic, such as the height of Malaysian men, then we will find that while most will have an average height of perhaps 5 feet 4 inches, there will be a few who will be relatively shorter and an equal number who are relatively taller. By plotting the heights of all Malaysian men according to frequency of occurrence, it is expected that we would obtain something similar to a normal distribution curve. Similarly, test scores that measure any characteristic, such as the intelligence, language proficiency or writing ability of a specific population, are also expected to provide us with a normal curve. The following is a diagram illustrating how the normal curve would look:

Figure 8.1: The normal distribution or Bell curve

The normal curve in Figure 8.1 is partitioned according to standard deviations (i.e., –4s, –3s, ..., +3s, +4s), which are indicated on the horizontal axis. The area of the curve between standard deviations is indicated as a percentage on the diagram. For example, the area between the mean (0 standard deviation) and +1 standard deviation is 34.13%.
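The percentage areas marked on the curve can be verified from the standard normal cumulative distribution, which is available in Python through the error function:

```python
from math import erf, sqrt

def cdf(z):
    # Cumulative area under the standard normal curve up to z.
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area between the mean (z = 0) and z = +1:
print(round(100 * (cdf(1) - cdf(0)), 2))   # 34.13 (percent)

# Area between z = -1 and z = +1:
print(round(100 * (cdf(1) - cdf(-1)), 2))  # 68.27 (percent)
# (The 68.26% quoted in such diagrams comes from adding the rounded
# 34.13% figure twice.)
```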

Similarly, the area between the mean and –1 standard deviation is also 34.13%, so the area between –1 and +1 standard deviations is 68.26%.

In using the normal curve, it is important to make a distinction between standard deviation values and standard deviation scores. A standard deviation value is a constant and is shown on the horizontal axis of the diagram above. The standard deviation score, on the other hand, is the score obtained when we use the standard deviation formula provided earlier. So, if we find the score to be 5 as in the earlier example, then the score for the standard deviation value of 1 is 5, for the value of 2 it is 5 x 2 = 10, for the value of 3 it is 15, and so on. Standard deviation values of –1, –2, and –3 will have corresponding negative scores of –5, –10, and –15.

8.2.5 Item analysis

a. Item difficulty
Item difficulty refers to how easy or difficult an item is. It involves finding out how many students answered an item correctly and dividing that number by the number of students who took the test. The formula is therefore:

Item difficulty (p) = number of students who answered the item correctly / total number of students who took the test

For example, if twenty students took a test and 15 of them correctly answered item 1, then the item difficulty for item 1 is 15/20 or 0.75. Item difficulty is always reported in decimal points and can range from 0 to 1. An item difficulty of 0 refers to an extremely difficult item with no students getting the item correct, while an item difficulty of 1 refers to an easy item which all students answered correctly. The appropriate difficulty level will depend on the purpose of the test. According to Anastasi & Urbina (1997), if the test is to assess mastery, then items with a difficulty level of 0.8 can be accepted.
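The item difficulty formula can be expressed as a short Python sketch, using the figures from the worked example (15 correct answers out of 20 test takers):

```python
# Item difficulty (p): the proportion of test takers who answered the item
# correctly, ranging from 0 (nobody correct) to 1 (everybody correct).

def item_difficulty(num_correct, num_takers):
    return num_correct / num_takers

print(item_difficulty(15, 20))  # 0.75
```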

However, they go on to describe that if the purpose of the test is for
selection, then we should utilise items whose difficulty values come
closest to the desired selection ratio – for example, if we want to select
20%, then we should choose items with a difficulty index of 0.20.

b. Item discrimination
Item discrimination is used to determine how well an item is able to
discriminate between good and poor students. Item discrimination
values range from –1 to 1. A value of –1 means that the item
discriminates perfectly, but in the wrong direction. This value would tell
us that the weaker students performed better on an item than the better
students. This is hardly what we want from an item and if we obtain
such a value, it may indicate that there is something not quite right with
the item. It is strongly recommended that we examine the item to see
whether it is ambiguous or poorly written. A discrimination value of 1
shows positive discrimination with the better students performing much
better than the weaker ones – as is to be expected.

Let’s use the following instance as an example. Suppose you have just
conducted a twenty item test and obtained the following results:

Table 8.5: Item Discrimination

As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and the lower group will each
consist of 4 students. Based on their total scores, the upper group would
consist of students L, A, E, and G while the lower group would consist
of students J, H, D and I.
We now need to look at the performance of these students for each
item in order to find the item discrimination index of each item.
For item 1, all four students in the upper group (L, A, E, and G)
answered correctly while only student H in the lower group answered
correctly. Using the formula described earlier, we can plug in the
numbers as follows: D = (4 – 1) / 4 = 0.75
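Assuming the usual upper/lower-group formula (the difference between the numbers of correct answers in the two groups, divided by the size of one group), the computation for Item 1 can be sketched in Python:

```python
# Item discrimination (D): difference between the number of upper-group
# and lower-group students answering correctly, divided by the group size.
# D ranges from -1 (discriminates in the wrong direction) to +1.

def discrimination_index(upper_correct, lower_correct, group_size):
    return (upper_correct - lower_correct) / group_size

# Item 1: all 4 upper-group students correct, 1 lower-group student correct.
print(discrimination_index(4, 1, 4))  # 0.75
```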

Two points should be noted. First, item discrimination is especially
important in norm referenced testing and interpretation as in such
instances there is a need to discriminate between good students who
do well in the measure and weaker students who perform poorly. In

criterion referenced tests, item discrimination does not have as
important a role. Secondly, the use of 33.3% of the total number of
students who took the test in the formula is not fixed, as it is
possible to use any percentage between 27.5% and 35% as the value.

c. Distractor analysis
Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test takers select
the correct answer, but how the distractors were able to function
effectively by drawing the test takers away from the correct answer.
The number of times each distractor is selected is noted in order to
determine the effectiveness of the distractor. We would expect that the
distractor is selected by enough candidates for it to be a viable
distractor.

What exactly is an acceptable value? This depends to a large extent on
the difficulty of the item itself and what we consider to be an acceptable
item difficulty value for test items. If we are to assume that 0.7 is an
appropriate item difficulty value, then we should expect that the
remaining 0.3 be about evenly distributed among the distractors.

Let us take the following test item as an example:
In the story, he was unhappy because_____________________________
A. it rained all day
B. he was scolded
C. he hurt himself
D. the weather was hot

Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in
their role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.

Therefore, the value of the difficulty index formula for the distractors must be interpreted in relation to the indices for the other distractors. The number of times a distractor is selected can be quantified in the same way as a difficulty index: if 10 out of 100 test takers selected a distractor, its effectiveness is 10/100 or 0.1. This technique is similar to a difficulty index, although the result does not indicate the difficulty of the item, but rather the effectiveness of the distractor. In the earlier example where A is the key with an item difficulty of 0.7, the ideal situation would be for each of the three distractors to be selected by an equal number of the students who did not get the answer correct; options A, B, C and D would then have indices of 0.7, 0.1, 0.1, and 0.1 respectively.

From a different perspective, the item discrimination formula can also be used in distractor analysis. Each distractor can have its own item discrimination value in order to analyse how the distractors work and ultimately refine the effectiveness of the test item itself. The concept of upper groups and lower groups would still remain, but the analysis and expectation would differ slightly from the regular item discrimination that we have looked at earlier. Instead of expecting a positive value, we should logically expect a negative value, as more students from the lower group should select the distractors.

Table 8.6: Selection of Distractors

        Distractor A  Distractor B  Distractor C  Distractor D
Item 1      8*            3             1             0
Item 2      2             8*            2             0
Item 3      4             8*            0             0
Item 4      1             3             8*            0
Item 5      5             0             0             7*

* indicates key

For Item 1, we know that all the students in the upper group answered this item correctly and only one student from the lower group did so.
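The two distractor statistics just described can be sketched together in Python, using the figures from the text (10 of 100 test takers choosing a distractor; 0 upper-group and 3 lower-group students choosing a distractor, with groups of 4):

```python
# Distractor analysis: quantify how often a distractor is chosen, and how
# choices split between the upper and lower groups. A working distractor
# should have a discrimination value between -1 and 0.

def distractor_effectiveness(times_chosen, num_takers):
    return times_chosen / num_takers

def distractor_discrimination(upper_chosen, lower_chosen, group_size):
    return (upper_chosen - lower_chosen) / group_size

print(distractor_effectiveness(10, 100))   # 0.1
print(distractor_discrimination(0, 3, 4))  # -0.75
```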

If we assume that the three remaining students from the lower group all selected distractor B, then the discrimination index for Item 1, distractor B, will be (0 – 3) / 4 = –0.75. This negative value indicates that more students from the lower group selected the distractor compared to students from the upper group. This result is to be expected of a distractor, and a value of –1 to 0 is preferred.

EXERCISE

1. Calculate the mean, mode, median and range of the following set of scores: 23, 26, 24, 28, 23, 25, 27, 22, 23, 23, 24.

2. What is a normal curve and what does this show? Does the final result always show a normal curve and how does this relate to standardised tests?
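One way to check your answers to Exercise 1 is with Python's standard `statistics` module; this sketch assumes the score set listed in the exercise:

```python
# Descriptive statistics for the score set in Exercise 1, using only
# Python's standard library.
from statistics import mean, median, mode

scores = [23, 26, 24, 28, 23, 25, 27, 22, 23, 23, 24]

print(round(mean(scores), 2))     # 24.36
print(median(scores))             # 24
print(mode(scores))               # 23
print(max(scores) - min(scores))  # 6  (the range)
```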

TOPIC 9  REPORTING OF ASSESSMENT DATA

9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with brief descriptions of the purposes of reporting and the reporting methods.

9.1 LEARNING OUTCOMES
By the end of Topic 9, teachers will be able to:
 Understand the purposes of reporting assessment data
 Understand and use the different reporting methods in language assessment

9.2 FRAMEWORK OF TOPICS
REPORTING OF ASSESSMENT DATA: PURPOSES OF REPORTING; REPORTING METHODS

CONTENT
SESSION NINE (3 hours)

9.2.1 Purposes of reporting

We can say that the main purpose of tests is to obtain information concerning a particular behaviour or characteristic. Kubiszyn & Borich (2000) mention eight different types of decisions made on the basis of information obtained from tests. These educational decisions are shown in Figure 9.1.

Figure 9.1: Eight Types of Decisions

Instructional decisions are made based on test results when, for example, teachers decide to change or maintain their instructional approach. If a teacher finds out that most of his class have failed his test, there are many possible reactions he can have. The teacher could evaluate the effectiveness of his own teaching or instructional approach and implement the necessary changes.

Grading decisions follow from the fact that tests yield scores, and teachers will have to make decisions in terms of the kind of grades to give students. Traditionally, teachers need to decide whether a student deserves a high grade – perhaps an A – on the basis of some form of assessment. Grades are indicators of student performance and will perhaps remain so for a long time to come.

Diagnostic decisions are made when we give tests to find out the strengths and weaknesses of our students.

Selection and placement decisions are somewhat similar. A selection decision relates to whether or not a student is selected for a programme or for admission into an institution based on a test score. Tests such as TOEFL and IELTS are often used by universities to decide whether a candidate is suitable, and hence selected for admission. A placement decision, however, deals with where a candidate should be placed based on performance on the test. A clear example is the language placement examination for newly admitted students commonly administered by many local and foreign universities. Based on their performance on such a test, students are placed into different language classes that are arranged according to proficiency levels.

Counselling and guidance decisions are also made by relevant parties such as counsellors and administrators on the basis of exam results. Counsellors often give advice in terms of appropriate vocations for some of their students. This advice is likely to be made on the basis of the students' own test scores.

Programme or curriculum decisions reflect the kinds of changes made to the educational programme or curriculum based on examination results. Finally, there are also administrative policy decisions that need to be made, which are likewise greatly influenced by test scores. Decisions related to selection, counselling and guidance, programme or curriculum, and administrative policy are all made at levels higher than the classroom; administrators, educational agencies and institutions may be involved in these decisions.

9.2.2 Reporting methods

Student achievement and progress can be reported by comparing:

i. Norm-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in comparison to other students.

ii. Criterion-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in comparison to predetermined criteria.

iii. An Outcomes-Approach
Acknowledges that students, regardless of their class or grade, can be working towards syllabus outcomes anywhere along the learning continuum. An outcomes-approach to assessment will provide information about student achievement to enable reporting against a standards framework. Syllabus outcomes in stages will describe the standard against which student achievement is assessed and reported.

Principles of effective and informative assessment and reporting

Effective and informative assessment and reporting practice:

 Has clear, direct links with outcomes
The assessment strategies employed by the teacher in the classroom need to be directly linked to and reflect the syllabus outcomes.

 Is integral to teaching and learning
Effective and informative assessment practice involves selecting strategies that are naturally derived from well structured teaching and learning activities. These strategies should provide information concerning student progress and achievement that helps inform ongoing teaching and learning as well as the diagnosis of areas of strength and need.

 Is balanced, comprehensive and varied
Effective and informative assessment practice involves teachers using a variety of assessment strategies that give students multiple opportunities, in varying contexts, to demonstrate what they know, understand and can do in relation to the syllabus outcomes. Where values and attitudes are expressed in syllabus outcomes, these too should be assessed as part of student learning. Effective and informative reporting of student achievement takes a number of forms including traditional reporting, comments in workbooks, annotations on student work, portfolios, student profiles, Basic Skills Tests, parent and student interviews, and certificates and awards.

 Is valid
Assessment strategies should accurately and appropriately assess clearly defined aspects of student achievement. If a strategy does not accurately assess what it is designed to assess, then its use is misleading. Valid assessment strategies are those that reflect the actual intention of teaching and learning activities.

 Is fair
Effective and informative assessment strategies are designed to ensure equal opportunity for success regardless of students' age, gender, culture, background language, socio-economic status, physical or other disability, or geographic location.

 Engages the learner
Effective and informative assessment practice is student centred. The syllabus outcomes and the assessment processes to be used should be made explicit to students. Ideally there is a cooperative interaction between teacher and students, and among the students themselves. Students should participate in the negotiation of learning tasks and actively monitor and reflect upon their achievements and progress.

 Values teacher judgement
Good assessment practice involves teachers making judgements, based on syllabus outcomes and on the weight of assessment evidence, about student progress towards the achievement of outcomes. Teacher judgement based on well defined standards is a valuable and rich form of student assessment. Teachers can be confident a student has achieved an outcome when the student has successfully demonstrated that outcome a number of times, and in varying contexts. The reliability of teacher judgement is enhanced when teachers cooperatively develop a shared understanding of what constitutes achievement of an outcome. This is developed through cooperative programming and discussing samples of student work and achievements within and between schools.

 Is time efficient and manageable
Effective and informative assessment practice is time efficient and supports teaching and learning by providing constructive feedback to the teacher and student that will guide further learning. Teachers need to plan carefully the timing, frequency and nature of their assessment strategies. Good planning ensures that assessment and reporting is manageable and maximises the usefulness of the strategies selected (for example, by addressing several outcomes in one assessment task).

 Recognises individual achievement and progress
Effective and informative assessment practice acknowledges that students are individuals who develop differently. All students must be given appropriate opportunities to demonstrate achievement. Effective and informative assessment and reporting practice is sensitive to the self-esteem and general well-being of students, providing honest and constructive feedback. Values and attitudes outcomes are an important part of learning that should be assessed and reported; they are distinct from knowledge, understanding and skill outcomes.

 Involves a whole school approach
An effective and informative assessment and reporting policy is developed through a planned and coordinated whole school approach. Decisions about assessment and reporting cannot be taken independently of issues relating to curriculum, timetabling, class groupings, programming and resource allocation.

 Actively involves parents
Schools and their communities are responsible for jointly developing assessment and reporting practices and policies according to their local needs and expectations. Schools should ensure full and informed participation by parents in the continuing development and review of the school policy on reporting processes.

 Conveys meaningful and useful information
Reporting of student achievement serves a number of purposes, for a variety of audiences. Students, teachers, parents, other schools and employers are potential audiences. The form of the report must clearly serve its intended purpose and audience. Good reporting practice takes into account the expectations of the school community and system requirements, particularly the need for information about standards that will enable parents to know how their children are progressing. It is important for schools and parents to explore which methods of reporting will provide the most meaningful and useful information. Student achievement and progress can be reported by comparing students' work against a standards framework of syllabus outcomes, comparing their prior and current learning achievements, or comparing their achievements to those of other students. Reporting can involve a combination of these methods. Effective and informative reporting acknowledges that students can be demonstrating progress and achievement of syllabus outcomes across stages, not just within stages. Schools can use student achievement information at a number of levels including individual, class, grade or school. This information helps identify students for targeted intervention and can inform school improvement programs.

TOPIC 10  ISSUES AND CONCERNS RELATED TO ASSESSMENT IN MALAYSIAN PRIMARY SCHOOLS

10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in Malaysian primary schools. It will look at how assessment is viewed and used in Malaysia.

10.1 LEARNING OUTCOMES
By the end of Topic 10, teachers will be able to:
 Understand some issues and concerns regarding assessment in Malaysian primary schools
 Understand Chapter 4 of the Malaysian Education Blueprint 2013-2025
 Use the different types of assessment in assessing language in school (cognitive-level, school-based and alternative assessment)

10.2 FRAMEWORK OF TOPICS
Issues and Concerns in Malaysian Schools: Exam-Oriented System; School-based Assessment; Alternative Assessment; Cognitive Levels of Assessment

CONTENT
SESSION TEN (3 hours)

10.3 Exam-oriented System

The educational administration in Malaysia is highly centralised with four hierarchical levels: federal, state, district and, at the lowest level, the school. Major decision- and policy-making take place at the federal level, represented by the Ministry of Education (MoE), which consists of the Curriculum Development Centre, the school division, and the Malaysian Examination Syndicate (MES).

Like most Asian countries (e.g. Choi 1999; Gang 1996; Lim and Tan 1999), Malaysia so far has focused on public examination results as important determinants of students' progression to higher levels of education or occupational opportunities (Chiam 1984). The Malaysian education system requires all students to sit for public examinations at the end of each level of schooling. There are four public examinations from primary to post-secondary education. These are the Primary School Achievement Test (UPSR) at the end of six years of primary education, the Lower Secondary Examination (PMR) at the end of another three years' schooling, the Malaysian Certificate of Education (SPM) at the end of 11 years of schooling, and the Malaysian Higher School Certificate Examination (STPM) or the Higher Malaysian Certificate for Religious Education (STAM) at the end of 13 years' schooling (MoE 2004). The current education system in Malaysia is too examination-oriented and over-emphasizes rote-learning, with institutions of higher learning fast becoming mere diploma mills.

Malaysia Education Blueprint 2013-2025

"In October 2011, the Ministry of Education launched a comprehensive review of the education system in Malaysia in order to develop a new National Education Blueprint. This decision was made in the context of rising international education standards, the Government's aspiration of better preparing Malaysia's children for the needs of the 21st century, and increased public and parental expectations of education policy. Over the course of 11 months, the Ministry drew on many sources of input, from education experts at UNESCO, the World Bank, OECD, and six local universities, to principals, teachers, parents, and students from every state in Malaysia. The result is a preliminary Blueprint

that evaluates the performance of Malaysia's education system against historical starting points and international benchmarks. The Blueprint also offers a vision of the education system and students that Malaysia both needs and deserves, and suggests 11 strategic and operational shifts that would be required to achieve that vision. The Ministry hopes that this effort will inform the national discussion on how to fundamentally transform Malaysia's education system, and will seek feedback from across the community on this preliminary effort before finalising the Blueprint in December 2012."

The Examined Curriculum

In public debate, the issue of teaching to the test has often translated into debates over whether the UPSR, PMR, and SPM examinations should be abolished. Summative national examinations should not in themselves have any negative impact on students. The challenge is that these examinations do not currently test the full range of skills that the education system aspires to produce. An external review by Pearson Education Group of the English examination papers at UPSR and SPM level noted that these assessments would benefit from the inclusion of more questions testing higher-order thinking skills, such as application, analysis, synthesis and evaluation. For example, their analysis of the 2010 and 2011 English Language UPSR papers showed that approximately 70% of the questions tested basic skills of knowledge and comprehension.

In 2011, in parallel with the KSSR, the LP rolled out the new PBS format that is intended to be more holistic, robust, and aligned to the new standard-referenced curriculum. LP has started a series of reforms to ensure that, as per policy, assessments are evaluating students holistically. There are four components to the new PBS:

▪ School assessment refers to written tests that assess subject learning. The test questions and marking schemes are developed, administered, scored, and reported by school teachers based on guidance from LP.

▪ Central assessment refers to written tests, project work, or oral tests (for languages) that assess subject learning. LP develops the test questions and marking schemes. The tests are, however, administered and marked by school teachers.

▪ Psychometric assessment refers to aptitude tests and a personality inventory to assess students' skills, interests, aptitude, attitude and personality. Aptitude tests are used to assess students' innate and acquired abilities, for example in thinking and problem solving. The personality inventory is used to identify key traits and characteristics that make up the students' personality. LP develops these instruments and provides guidelines for use. Schools are, however, not required to comply with these guidelines.

▪ Physical, sports, and co-curricular activities assessment refers to assessments of student performance and participation in physical and health education, sports, uniformed bodies, clubs, and other non-school sponsored activities. Schools are given the flexibility to determine how this component will be assessed.

In 2014, a student's UPSR grade will no longer be derived from a national examination alone, but from a combination of PBS and the national examination. In 2016, the PMR national examinations will be replaced with school and centralised assessment. The format of the SPM remains the same, with most subjects assessed through the national examination, and some subjects through a combination of examinations and centralised assessments.

These changes are hoped to reduce the overall emphasis on teaching to the test, so that teachers can focus more time on delivering meaningful learning as stipulated in the curriculum. The new format enables students to be assessed on a broader range of output over a longer period of time. It also provides teachers with more regular information to take the appropriate remedial actions for their students.

10.4 Cognitive Levels of Assessment

Bloom's Taxonomy of Cognitive Levels
 Knowledge
 Comprehension
 Application
 Analysis
 Synthesis
 Evaluation

Knowledge
Recalling memorized information. May involve remembering a wide range of material, from specific facts to complete theories, but all that is required is the bringing to mind of the appropriate information. Represents the lowest level of learning outcomes in the cognitive domain. Learning objectives at this level: know common terms, know specific facts, know methods and procedures, know basic concepts, know principles.
Question verbs: Define, name, identify, label, list, state; who? when? where? what?

Comprehension
The ability to grasp the meaning of material. This may involve translating material from one form to another (words to numbers), interpreting material (explaining or summarizing), and estimating future trends (predicting consequences or effects). Goes one step beyond the simple remembering of material, and represents the lowest level of understanding. Learning objectives at this level: understand facts and principles, interpret verbal material, interpret charts and graphs, translate verbal material to mathematical formulae, estimate the future consequences implied in data.
Question verbs: Explain, interpret, translate, summarize, paraphrase, infer, predict; give an example of x.

Application
The ability to use learned material in new and concrete situations, applying rules, methods, concepts, principles, laws, and theories. Learning outcomes in this area require a higher level of understanding than those under comprehension. Learning objectives at this level: apply concepts and principles to new situations, apply laws and theories to practical situations, solve mathematical problems, construct graphs and charts, demonstrate the correct usage of a method or procedure.
Question verbs: Demonstrate, solve, modify, change, construct, make use of; how could x be used to y? how would you show or apply x to conditions y?

Analysis
The ability to break down material into its component parts: identifying parts, analysing the relationships between parts, and recognising the organizational principles involved. Learning outcomes here represent a higher intellectual level than comprehension and application because they require an understanding of both the content and the structural form of the material. Learning objectives at this level: recognize unstated assumptions, recognize logical fallacies in reasoning, distinguish between facts and inferences, evaluate the relevancy of data, analyze the organizational structure of a work (art, music, writing).
Question verbs: Differentiate, compare/contrast, distinguish x from y; how does x affect or relate to y? why? how? what piece of x is missing/needed?

Synthesis
(By definition, synthesis cannot be assessed with multiple-choice questions. It appears here to complete Bloom's taxonomy.) The ability to put parts together to form a new whole. This may involve the production of a unique communication (theme or speech), a plan of operations (research proposal), or a set of abstract relations (scheme for classifying information). Learning outcomes in this area stress creative behaviors, with major emphasis on the formulation of new patterns or structure. Learning objectives at this level: write a well organized paper, give a well organized speech, write a creative short story (or poem or music), propose a plan for an experiment, integrate learning from different areas into a plan for solving a problem, formulate a new scheme for classifying objects (or events, or ideas).
Question verbs: Design, formulate, construct, imagine, create, develop; write a short story and label the following elements.

Evaluation
The ability to judge the value of material (statement, novel, poem, research report) for a given purpose. The judgments are to be based on definite criteria, which may be internal (organization) or external (relevance to the purpose). The student may determine the criteria or be given them. Learning outcomes in this area are highest in the cognitive hierarchy because they contain elements of all the other categories, plus conscious value judgments based on clearly defined criteria. Learning objectives at this level: judge the logical consistency of written material, judge the adequacy with which conclusions are supported by data, judge the value of a work (art, music, writing) by the use of internal criteria, judge the value of a work (art, music, writing) by use of external standards of excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given criteria; which option would be better/preferable to party y?

10.5 School-based Assessment

The traditional system of assessment no longer satisfies the educational and social needs of the third millennium. In the past few decades, many countries have made profound reforms in their assessment systems. Several educational systems have in turn introduced school-based assessment as part of, or instead of, external assessment in their certification. Izard (2001) as well as Raivoce and Pongi (2001) explain that school-based assessment (SBA) is often perceived as the process put in place to collect evidence of what students have achieved. In the debate on school-based assessment, the issue of 'why' has been widely written about and there is general agreement on the principles of validity of this form of assessment. While examination bodies acknowledge the immense potential of school-based assessment in terms of validity and flexibility, they have at the same time to guard against or deal with difficulties related to reliability, quality control and quality assurance, especially for important learning outcomes that do not easily lend themselves to pen-and-paper tests. As Raivoce and Pongi (2001) suggest, the validity of SBA depends to a large extent on the various assessment tasks students are required to perform.

Daugherty (1994) clarifies that this type of assessment has been recommended:

…because of the gains in the validity which can be expected when students' performance on assessed tasks can be judged in a greater range of contexts and more frequently than is possible within the constraints of timed, written examinations.

Burton (1992) provides the following five rules of thumb that may be applied in the planning stage of school-based assessment:
1. The assessment should be appropriate to what is being assessed.
2. The style of assessment should blend with the learning pattern so it contributes to it.
3. The criteria for successful performance should be clear to all concerned.
4. The assessment should be appropriate to all persons being assessed.
5. The assessment should enable the learner to demonstrate positive achievement and reflect the learner's strengths.

In the Malaysian SBA context, PBS is characterised as:
• Assessment for and of learning
• Standard-referenced assessment
• Holistic
• Integrated
• Balanced
• Robust

Components of SBA/PBS
1. Academic:
• School Assessment (using Performance Standards)
• Centralised Assessment
2. Non-academic:
• Physical Activities, Sports and Co-curricular Assessment (Pentaksiran Aktiviti Jasmani, Sukan dan Kokurikulum, PAJSK)
• Psychometric/Psychological Tests

In the Malaysian SBA context, the two academic components are described as follows:

School Assessment
• The emphasis is on collecting first-hand information about pupils' learning based on curriculum standards.
• Teachers plan the assessment, prepare the instrument and administer the assessment during the teaching and learning process.
• Teachers mark pupils' responses and report their progress continuously.

Centralised Assessment
• Conducted and administered by teachers in schools using instruments, guidelines, rubrics, timelines and procedures prepared by LP.
• Monitoring and moderation are conducted by the PBS Committee at the School, District and State Education Department levels.

10.6 Alternative Assessment

Alternative assessments are assessment procedures that differ from the traditional notions and practice of tests with respect to format or implementation. As the term indicates, alternative assessments are assessment proposals that present "alternatives" to the more traditional examination formats. They have become more popular of late because of doubts raised regarding the ability of traditional assessment to elicit a fair and accurate measure of a student's performance. It is likely that alternative assessment found its roots in writing assessment, because of the need to provide continuous assessment rather than a single impromptu evaluation (Alderson & Banerjee, 2001). Alternative assessment brings with it a complete set of perspectives that contrast against traditional tests and assessments. Table 10.1 illustrates some of the major differences between traditional and alternative assessments.

Table 10.1: Contrasting Traditional and "Alternative" Assessment

Traditional Assessment            Alternative Assessment
One-shot tests                    Continuous, longitudinal assessment
Indirect tests                    Direct tests
Inauthentic tests                 Authentic assessment
Individual projects               Group projects
No feedback to learners           Feedback provided to learners
Speeded exams                     Power exams
Decontextualised test tasks       Contextualised test tasks
Norm-referenced score reporting   Criterion-referenced score reporting
Standardised tests                Classroom-based tests
Summative                         Formative
Product of instruction            Process of instruction
Intrusive                         Integrated
Judgmental                        Developmental
Teacher proof                     Teacher mediated

Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)

In discussing alternative assessments, Herman et al. (1992: 6) list several of their common characteristics. They describe alternative assessments as performing the following:

• ask the students to perform, create, produce, or do something;
• tap higher-level thinking and problem-solving skills;
• use tasks that represent meaningful instructional activities;
• invoke real-world applications;
• have people, not machines, do the scoring, using human judgment; and
• require new instructional and assessment roles for teachers.

Alternative assessments are suggested largely due to a growing concern that traditional assessments are not able to accurately measure the ability we are interested in. Tannenbaum (1996), in advocating alternative assessment, comments that alternative assessments focus on documenting individual strengths and development, which would assist in the teaching and learning process. They are also seen to be more student-centred, as they cater for different learning styles, cultural and educational backgrounds, as well as language proficiencies.

Nevertheless, several shortcomings of alternative assessments have been noted. Perhaps one of the major limitations is that accounts of the benefits of alternative assessment tend to be "descriptive and persuasive, rather than research-based" (Alderson & Banerjee, 2001: 229). Alternative assessments are also said to be limited to the classroom and have not become part of mainstream assessment, although they are compatible with the contemporary emphases on the process as well as the product of learning (Croker, 1999).

Despite these limitations, alternative assessments present a viable and exciting option in eliciting and assessing students' actual abilities. Brown and Hudson (1998) seem to have taken a safer approach by suggesting the term "alternatives in assessment". They believe that educators should be familiar with all possible formats of assessment and decide on the format that best measures the ability or construct that they are interested in. Hence, these alternatives would include all possible assessment formats, both traditional and informal. A number of test formats are considered alternative assessment formats, including the following:
• Physical demonstrations
• Pictorial products
• Reading response logs
• K-W-L (what I know/what I want to know/what I've learned) charts
• Dialogue journals
• Checklists
• Teacher-pupil conferences
• Interviews
• Performance tasks
• Portfolios
• Self-assessment
• Peer assessment

Portfolios

A well-known and commonly used alternative assessment is portfolio assessment. Bailey (1998, p. 218) describes a portfolio as containing four primary elements:

• First, it should have an introduction to the portfolio itself, which provides an overview of the content of the portfolio.
• Secondly, she argues that portfolios should have what she refers to as an academic works section. This section is meant to demonstrate the students' "improvement or achievement in the major skill areas" (p. 218).
• The third section is described as a personal section, in which students may wish to include their journals, as well as photographs and other items that illustrate their experiences with, as well as achievements in, the English language.
• Finally, an assessment section may contain evaluations made by peers and teachers, self-evaluations, and score reports of tests that they have sat for. Bailey even suggests that this section include a reflective essay by the student, to help express the student's thoughts and feelings about the portfolio, perhaps explaining strengths and possible weaknesses, as well as why certain pieces are included.

Table 10.2: Contents of a Portfolio

Introductory Section:   • Overview  • Reflective essay
Academic Works Section: • Samples of best work  • Samples of work demonstrating development
Personal Section:       • Journals  • Photographs  • Personal items
Assessment Section:     • Evaluation by peers  • Self-evaluation  • Score reports

Source: Adapted from Bailey (1998: 218)

The portfolio can be said to be a student's personal documentation that helps demonstrate his or her ability and successes in the language. The contents of the portfolio become evidence of abilities, much like how we would use a test to measure the abilities of our students. It may even require students to consciously select items that can document their own progress as learners.
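For teachers who keep track of pupils' portfolios electronically, Bailey's four sections can be modelled as a simple checklist. The Python sketch below is illustrative only: the class, field and method names are invented for this example and do not come from Bailey (1998) or from any real assessment tool.

```python
from dataclasses import dataclass, field

# Illustrative sketch: Bailey's (1998) four portfolio sections as a checklist.
# All names here are invented for this example.
@dataclass
class Portfolio:
    overview: str = ""                    # Introductory: overview of the contents
    reflective_essay: str = ""            # Introductory: student's reflective essay
    academic_works: list = field(default_factory=list)  # best work, development samples
    personal_items: list = field(default_factory=list)  # journals, photographs, etc.
    assessments: list = field(default_factory=list)     # peer/self evaluations, score reports

    def missing_sections(self):
        """Name the sections a pupil still needs to compile."""
        checks = {
            "overview": bool(self.overview),
            "reflective essay": bool(self.reflective_essay),
            "academic works": bool(self.academic_works),
            "personal items": bool(self.personal_items),
            "assessments": bool(self.assessments),
        }
        return [name for name, filled in checks.items() if not filled]

p = Portfolio(overview="My Form 2 English portfolio")
p.academic_works.append("Best essay: 'A Visit to Melaka'")
print(p.missing_sections())  # lists the sections still to be compiled
```

A teacher could run such a checklist before each review conference; the point is only that the four sections, not a test score, define completeness.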

The actual compilation of the content of the portfolio is in itself a learning experience. Some suggest that students should attach a short reflection on each piece or item placed in the portfolio. Portfolio assessment, therefore, is both a learning and an assessment experience. This dual function can be considered one of the benefits of portfolio assessment.

Brown and Hudson (1998) summarise several other advantages of using portfolios in assessment. They discuss these advantages according to how the portfolio strengthens students' learning, enhances the teacher's role and improves the testing process. With respect to testing, the advantages of using the portfolio as an assessment instrument are listed as follows (pp. 664-665):

• enhances student and teacher involvement in assessment;
• provides opportunities for teachers to observe students using meaningful language to accomplish various authentic tasks in a variety of contexts and situations;
• permits the assessment of the multiple dimensions of language learning;
• provides opportunities for both students and teachers to work together and reflect on what it means to assess students' language growth;
• increases the variety of information collected on students; and
• makes teachers' ways of assessing student work more systematic.

Self-Assessment and Peer Assessment

Two other common forms of alternative assessment are the self-assessment and peer-assessment procedures. Both forms of assessment are strongly advocated by Puhl (1997), as she believes they are essential to continuous assessment, a cornerstone of alternative assessment. The benefits of self- and peer assessment are especially found in the formative stages of assessment, in which the development of the students' abilities is emphasised. Self-appraisals are also thought to be quite accurate and are said to increase student motivation.

In language teaching and learning, self-assessment is relevant in assessing all the language skills. An example of self-assessment of the listening skill, specifically the comprehension of questions asked, is suggested by Cohen (1994), as follows:

Comprehension of questions asked:
5. I can always understand the questions with no difficulties and without having to ask for repetition
4. I can usually understand questions, but I might occasionally ask for repetition
3. I have difficulty with some questions, but I generally get the meaning
2. I have difficulty understanding most questions even after repetition
1. I don't understand questions well at all

These questions are useful in the formative stages of assessment, as they help students identify their own strengths and weaknesses and respond accordingly. Through asking these types of self-assessment questions, the students are expected to become more sensitive to their own learning and ultimately perform better in the final summative evaluation at the end of the instructional programme. Puhl (1997) describes a case study in which she believes self-assessment forced the students to reread their essays, and thereby make the necessary edits and corrections, before handing them in. Nevertheless, in order for self-assessment to be useful and not a futile exercise, learners need to be trained and initially guided in performing their self-assessment. This training involves providing students with the rationale for self-assessment, how it is intended to work, and how it is capable of helping them.

Peer assessment differs from self-assessment in that it involves the social and emotional dimensions to a much greater extent.
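When a whole class returns ratings on a five-band scale of this kind, a teacher can tally them to see where the group stands before deciding what to reteach. The short Python sketch below is illustrative only: the function name and the abbreviated band labels are invented for this example and are not taken from Cohen (1994).

```python
from collections import Counter

# Five bands for self-assessed comprehension of questions, abbreviated
# from the scale in the text; band numbers match that scale.
BANDS = {
    5: "always understand; no repetition needed",
    4: "usually understand; occasional repetition",
    3: "some difficulty, but generally get the meaning",
    2: "difficulty with most questions, even after repetition",
    1: "don't understand questions well at all",
}

def summarise_self_ratings(ratings):
    """Tally a class's self-ratings; return (distribution, mean band)."""
    if not ratings:
        return Counter(), 0.0
    dist = Counter(ratings)
    mean = sum(ratings) / len(ratings)
    return dist, mean

# A hypothetical class of eight pupils rating their own listening.
dist, mean = summarise_self_ratings([5, 4, 4, 3, 2, 4, 5, 3])
for band in sorted(dist, reverse=True):
    print(f"Band {band} ({BANDS[band]}): {dist[band]} pupil(s)")
print(f"Mean self-rating: {mean:.2f}")  # 3.75 for this hypothetical class
```

Such a tally serves a formative purpose only: it shows the spread of pupils' own perceptions, not a measured proficiency level.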

Peer assessment can be defined as a response in some form to other learners' work (Puhl, 1997). It can be given by a group or an individual, and it can take "any of a variety of coding systems: the spoken word, the written word, nonverbal symbols, numbers along a scale, colours, checklists, questionnaires, etc." (p. 8). Peer assessment requires that a student take up the role of "a critical friend" to another student in order to "support, challenge, and extend each other's learning" (Brooks, 2002: 73). Among the reported benefits of peer assessment are that it:

• reminds learners they are not working in isolation;
• helps create a community of learners;
• improves the product ("Two heads are better than one");
• improves the process;
• helps learners be reflective;
• motivates, even inspires; and
• stimulates meta-cognition.

EXERCISE

In your opinion, what are the advantages of using portfolios as a form of alternative assessment?

REFERENCES

Alderson, J. C. (1986b). Innovations in language testing? In M. Portal (Ed.), Innovations in language testing. Windsor: NFER/Nelson.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge, UK: Cambridge University Press.

Allen, L. (2011). Repriviledging reading: The negotiation of uncertainty. Pedagogy: Critical Approaches to Teaching Literature, Language, Composition, and Culture, 1(1). Available at: http://pedagogy.dukejournals.org. doi: 10.1215/153142001416540 (Retrieved September 26, 2013)

Anderson, K. (2007). Differentiating instruction to include all students. Preventing School Failure, 51(3), 49-54.

Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (Eds.). (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives (Complete ed.). New York: Longman.

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge, UK: Cambridge University Press.

Biggs, J. B., & Collis, K. F. (1982). Evaluating the quality of learning: The SOLO taxonomy. New York, NY: Academic Press.

Biggs, J. B., & Collis, K. F. (1991). Multimodal learning and the quality of intelligent behaviour. In H. Rowe (Ed.), Intelligence: Reconceptualization and measurement (pp. 57-75). Hillsdale, NJ: Lawrence Erlbaum.

Biggs, J., & Tang, C. (2009). Applying constructive alignment to outcomes-based teaching and learning. Training material, "Quality Teaching for Learning in Higher Education" Workshop for Master Trainers, Ministry of Higher Education, Kuala Lumpur. Available at: http://eprints. (Retrieved 23 August 2013)

Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5-31.
Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: David McKay.

Brennan, R. L. (1996). Generalizability of performance assessments. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment (NCES 96-802) (pp. 19-58). Washington, DC: National Center for Education Statistics.

Brown, G., & Yule, G. (1983). Teaching the spoken language. Cambridge: Cambridge University Press.

Brown, H. D. (1994). Teaching by principles: An interactive approach to language pedagogy. Englewood Cliffs, NJ: Prentice Hall Regents.

Brown, H. D., & Abeywickrama, P. (2010). Language assessment: Principles and classroom practices. New York, NY: Pearson Education.

Campbell, K. J., Watson, J. M., & Collis, K. F. (1992). Volume measurement and intellectual development. Journal of Structural Learning, 11, 279-298.

Carroll, J. B., & Sapon, S. (1958). Modern Language Aptitude Test. New York, NY: The Psychological Corporation.

Cheng, L., Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in language testing: Research contexts and methods. Mahwah, NJ: Lawrence Erlbaum Associates.

Chick, H. (1998). Cognition in the formal modes: Research mathematics and the SOLO taxonomy. Mathematics Education Research Journal, 10(2), 4-26.

Clark, J. L. D. (1979). Direct vs. semi-direct tests of speaking ability. In E. Brière & F. Hinofotis (Eds.), Concepts in language testing: Some recent studies (pp. 35-49). Washington, DC: TESOL.

Davidson, F., Hudson, T., & Lynch, B. (1985). Language testing: Operationalization in classroom measurement and L2 research. In M. Celce-Murcia (Ed.), Beyond basics: Issues and research in TESOL (pp. 137-152). Rowley, MA: Newbury House.

Brown, G. (2008). An introduction to educational assessment, measurement and evaluation. Australia: Pearson Education New Zealand.

Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher's guide to writing and using language test specifications. New Haven, CT: Yale University Press.

Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Dictionary of language testing. Cambridge: University of Cambridge Local Examinations Syndicate and Cambridge University Press.

Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York, NY: Macmillan.

Gottlieb, M. (2006). Assessing English language learners: Bridges from language proficiency to academic achievement. USA: Corwin Press.

Grotjahn, R. (1986). Test validation and cognitive psychology: Some methodological considerations. Language Testing, 3, 158-85.

Hattie, J. (2009). Visible learning. Abingdon: Routledge.

Hattie, J. (2012). Visible learning for teachers: Maximizing impact on learning. New York: Routledge.

Hattie, J., & Brown, G. T. L. (2004). Cognitive processes in asTTle: The SOLO taxonomy (asTTle Technical Report 43). University of Auckland/Ministry of Education.

Hook, P., & Mills, J. (2011). SOLO taxonomy: A guide for schools. Book 1: A common language of learning. UK: Essential Resources Educational Publishers.

Huang, … (2012). English Teaching: Practice and Critique, 11(4), 99-119.

Hughes, A. (2003). Testing for language teachers (2nd ed.). Cambridge: Cambridge University Press.

Linn, R. L., & Gronlund, N. E. (2000). Measurement and assessment in teaching (8th ed.). Upper Saddle River, NJ: Merrill/Prentice Hall.

McMillan, J. H. (2001a). Classroom assessment: Principles and practice for effective instruction (2nd ed.). Boston, MA: Allyn & Bacon.

McNamara, T. (2000). Language testing. Oxford, UK: Oxford University Press.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York, NY: Macmillan.

Ministry of Education Malaysia. (2013). Malaysia Education Blueprint 2013-2025.

Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S., Miller, J., & Newton, D. P. (2005). Frameworks for thinking: A handbook for teaching and learning. Cambridge: Cambridge University Press.

Mousavi, S. A. (2009). An encyclopedic dictionary of language testing (4th ed.). Tehran: Rahnama Publications.

Norleha Ibrahim. Management of measurement and evaluation module. Selangor: Open University Malaysia.

Nückles, M., Hübner, S., & Renkl, A. (2009). Enhancing self-regulated learning by writing learning protocols. Learning and Instruction, 19(3), 259-271. Available at: http://linkinghub.elsevier. (Retrieved March 26, 2013)

Oller, J. W. (1979). Language tests at school: A pragmatic approach. London: Longman.

Pearson, I. (1988). Tests as levers for change. In D. Chamberlain & R. Baumgardner (Eds.), ESP in the classroom: Practice and evaluation (Vol. 128, pp. 98-107). London: Modern English Publications.

Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New York, NY: Harcourt, Brace & World.

Shepard, L. A. (2000). The role of assessment in a learning culture. Paper presented at the Annual Meeting of the American Educational Research Association.

Smith, A. (2011). High performers: The secrets of successful schools. Carmarthen: Crown House Publishing.

Smith, T. W., & Colby, S. A. (2007). Teaching for deep learning. The Clearing House, 80(5), 205-211.

Spaan, M. (2006). Test and item specifications development. Language Assessment Quarterly, 3(1), 71-79.

Spratt, M. (2005). Washback and the classroom: The implications for teaching and learning of studies of washback from exams. Language Teaching Research, 9(1), 5-29.

Stansfield, C. W., & Reed, D. J. (2004). The story behind the Modern Language Aptitude Test: An interview with John B. Carroll (1916-2003). Language Assessment Quarterly, 1(1), 43-56.

Websites

http://myenglishpages.com/pages/Introduction-to-Test-Items.htm (Retrieved 10.8.2013)
http://www.catforms. (Retrieved 12.8.2013)
http://assessment. /learning/Concepts/Concept/Reliability-and-validity (Retrieved 10.8.2013)

PANEL OF MODULE WRITERS
PROGRAM PENSISWAZAHAN GURU, DISTANCE EDUCATION MODE (PRIMARY EDUCATION)

NAME: NURLIZA BT OTHMAN (othmannurliza@yahoo.com)
QUALIFICATIONS:
• M.A. TESL, University of North Texas, USA
• B.A. (Hons) English, North Texas State University, USA
• Sijil Latihan Perguruan Guru Siswazah / Graduate Teacher Training Certificate (Ministry of Education Malaysia)
WORK EXPERIENCE:
• 4 years as a secondary school teacher
• 21 years as a lecturer at a Teacher Education Institute (IPG)

NAME: ANG CHWEE PIN (chweepin819@yahoo.com)
QUALIFICATIONS:
• M.Ed. TESL, Universiti Teknologi Malaysia
• B.Ed. (Hons) Science/TESL, Universiti Pertanian Malaysia
WORK EXPERIENCE:
• 23 years as a secondary school teacher
• 7 years as a lecturer at a Teacher Education Institute (IPG)