CHAIRPERSON
NIMHANS
DATE: 31/08/2010
TIME: 3:00 PM
INDEX
1. INTRODUCTION
2. EDUCATIONAL EVALUATION
2.1. EDUCATION
2.2. EVALUATION
2.3. EDUCATIONAL EVALUATION
3. TYPES OF EVALUATION
4. INTERPRETING TEST SCORES
4.1. CRITERION-REFERENCED TESTS (CRTs)
4.1.1. DEFINITION
4.1.2. SETTING PASS/FAIL SCORE FOR CRTs
4.1.3. RELATIONSHIP OF CRTs WITH MASTERY AND DEVELOPMENTAL LEVEL LEARNING
4.1.3.1. MASTERY LEVEL LEARNING
4.1.3.2. DEVELOPMENTAL LEVEL LEARNING
4.1.5. STEPS IN CONSTRUCTING CRTs
4.1.6. CHARACTERISTICS OF CRITERION-REFERENCED TEST
4.1.7. USES OF CRITERION-REFERENCED TEST
4.1.8. LIMITATIONS OF CRITERION-REFERENCED TEST
4.1.9. ADMINISTRATION
4.1.10. TEST LAYOUT
4.1.11. RELIABILITY
4.1.12. CUT-OFF SCORES
4.1.13. VALIDITY
4.2. NORM-REFERENCED TESTS (NRTs)
4.2.1. DEFINITION
4.2.2. SETTING PASS/FAIL SCORE FOR NRTs
4.2.3. STEPS IN CONSTRUCTING NRTs
4.2.4. CHARACTERISTICS OF NORM-REFERENCED TESTS
4.2.5. USES OF NORM-REFERENCED TEST
4.2.6. LIMITATIONS OF NORM-REFERENCED TEST
4.2.7. RELIABILITY
4.2.8. VALIDITY
5. CRITERION-REFERENCED TEST vs NORM-REFERENCED TEST
5.1. SIMILARITIES
5.2. DIFFERENCES
6. RESEARCH ABSTRACT
7. CONCLUSION
8. REFERENCE
2. EDUCATIONAL EVALUATION
2.1. Education
Education in the largest sense is any act or experience that has a formative effect on the mind, character
or physical ability of an individual. In its technical sense, education is the process by which society
deliberately transmits its accumulated knowledge, skills and values from one generation to another.6
According to Pestalozzi (1819), “Education is the natural, harmonious and progressive development of
man’s innate powers”.10
As per John Dewey (1916), “education is the development of all those capacities in the individual which
will enable him to control his environment and fulfill his responsibilities”.10
According to Mahatma Gandhi (1927), “Education is the all-round drawing out of the best in child and
man – body, mind and spirit”.10
2.2. Evaluation
The term evaluation is derived from the French word ‘valoir’, which means to be worthy. Thus evaluation is the
process of judging the value or worth of an individual’s achievements or characteristics. Evaluation is
systematic determination of merit, worth, and significance of something or someone using criteria against
a set of standards.18
“Evaluation is the determination of the worth of a thing. It includes obtaining information for use in
judging the worth of a program, product, procedure or objective or the potential utility of alternative
approaches to attain specific objectives” (Worthen & Sanders, 1974)5
Educational evaluation is concerned with judging the value or worth of the goals attained by the
education system.
“Evaluation is the process of determining to what extent the educational objectives are being
realized” (Ralph Tyler, 1950)10
“Evaluation is the systematic process of determining the degree to which changes in the behavior of
students are actually taking place” (Tyler, 1951)6
3. Types of evaluation
A raw test score is meaningless without a framework for interpretation. When people take an
assessment, it is important for them to understand the implications of their scores, particularly when
passing or failing makes a major difference in their lives. There are two ways to score an assessment,
referred to as criterion-referenced and norm-referenced. The raw test score is given
meaning only within the instructional content domain it represents. Criterion-referenced tests assess an
individual’s performance based on the percentage of content mastered. Norm-referenced tests define an
individual’s performance by comparing it to the performance of others. Although both types of interpretation can be applied to
the same test, the interpretation is more meaningful when the test is specifically designed for a desired
interpretation (Linn & Gronlund, 2000).5
4.1. Criterion-referenced tests (CRTs)
The concept of criterion-referenced measurement was introduced by Glaser (1963) and Popham &
Husek (1969) to highlight the need for tests that can describe the position of a learner on a performance
continuum, rather than the learner’s rank within a group of learners. The word criterion in CRT refers
to a domain of behaviors, and in criterion referencing one is interested in referencing an examinee’s test
performance to a well-defined domain of behavior measuring an objective or skill.5
4.1.1. Definition
“Criterion-referenced tests are constructed to provide information about the level of an examinee’s
performance in relation to clearly defined domain of content and/or behaviors”.
- Popham, 1978.5
Criterion-referenced tests interpret a student’s raw score using a preset standard established by the
faculty. Thus each student’s competency in relation to the preset standard is measured without reference
to any other student. Student scores are then reported as the percentage correct, with each student’s
performance level determined by the preset, or absolute, standard. An example of a criterion-referenced
score interpretation is described below.5
4.1.2. Setting pass/fail score for CRTs
Consider a score distribution curve showing the number of people who took the assessment and the
scores they achieved. The bottom scale runs from test scores of zero up to 100, while the left-hand
scale shows the number of people who achieved a particular score. The cut score has been determined
to be around 70 percent, probably set by subject matter experts who determined the competency
required to pass the exam.
With a criterion-referenced score interpretation, more or fewer people will qualify from examination event
to examination event, since each sitting will include candidates with more or less knowledge. What is
important, however, is that a benchmark has been established for the standard required for a particular
job. For example, a driving test uses a criterion-referenced score interpretation, as a certain level of
knowledge and skill has been determined to be acceptable for passing the test.2,3,4
Normally, performance standards are set on the test score scale so that examinees’ scores on a CRT
can be classified into performance categories, such as “failing”, “basic”, “proficient”, and
“advanced”. Today extensive use is made of CRTs, and students’ scores from these tests are often
sorted into three to five performance levels or categories on state proficiency tests. With still other
CRTs, test scores are combined into a single “pass” or “fail” score. For example, the criterion may be
“students should be able to correctly add two single-digit numbers”, and the cut-off score may be that
students should correctly answer a minimum of 80% of the questions to pass.5
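The 80% cut-off rule and the classification into performance categories can be sketched in a few lines of code. This is an illustrative sketch only, not from the source: the category boundaries and the helper names (percent_correct, classify) are invented for the example.

```python
# Illustrative sketch of criterion-referenced scoring.
# Category boundaries below are hypothetical, not from the text.

def percent_correct(answers, key):
    """Raw score as the percentage of the content domain answered correctly."""
    correct = sum(a == k for a, k in zip(answers, key))
    return 100.0 * correct / len(key)

def classify(score, cuts=((80, "advanced"), (60, "proficient"), (40, "basic"))):
    """Map a score to a performance category; below every cut is 'failing'."""
    for cut, label in cuts:
        if score >= cut:
            return label
    return "failing"

PASS_CUT = 80.0  # e.g. "answer at least 80% of the questions correctly to pass"

score = percent_correct("ABCDA", "ABCDB")  # 4 of 5 items correct
print(score, classify(score), "pass" if score >= PASS_CUT else "fail")
```

Note that the performance of other examinees never enters the calculation: the cut scores are fixed in advance, which is the defining feature of criterion-referenced interpretation.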
One method is to compare students who are assumed not to know much about a subject, to subject
matter experts (SMEs) who do.
These SMEs perform well on the job, and the overall results show that fewer students
performed well compared to the experts. For students the bell curve sits toward the
lower end of the score scale, while for subject matter experts the scores shift toward the higher end. One
technique for setting the pass/fail score is to choose the score where these two curves intersect; in the
example above that would be at about 80%. The score at the intersection minimizes the number of test
takers in these groups who were misclassified by the test, i.e., it minimizes the number of non-experts
who passed and the number of SMEs who failed.2,3,4
If the pass/fail decision is determined only by a single score, then a single set of results can be used as
evidence. However, if multiple cut score levels are used for the various topics in the test then multiple
sets of scores must be examined.
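The contrasting-groups idea described above, choosing the cut score where the two distributions cross so that misclassifications are minimized, can be sketched as a simple search over candidate cut scores. The score lists below are invented for illustration.

```python
# Sketch of the contrasting-groups method: for every candidate cut score,
# count the misclassifications (non-experts at or above the cut, plus SMEs
# below it) and keep the cut that minimizes the total.
# The score data are hypothetical.

def best_cut(student_scores, sme_scores, candidates=range(0, 101)):
    def misclassified(cut):
        false_pass = sum(s >= cut for s in student_scores)  # novices who pass
        false_fail = sum(s < cut for s in sme_scores)       # experts who fail
        return false_pass + false_fail
    return min(candidates, key=misclassified)

students = [55, 60, 65, 70, 72, 75, 78, 82]   # novice group, lower end of scale
smes     = [74, 79, 81, 84, 87, 90, 93, 96]   # expert group, higher end of scale
print(best_cut(students, smes))
```

With these invented data the search lands near the region where the two groups overlap, mirroring the intersection of the two curves in the text.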
Because CRTs measure a student’s attainment of a set of learning outcomes, no attempt should be
made to eliminate easy items. Today CRTs go by many names: “domain-referenced tests”,
“competency tests”, “basic skills tests”, “mastery tests”, “performance assessments”, “authentic tests”,
“proficiency tests”, “standards-based tests”, “licensure exams” and “certification exams”. CRTs today
typically use a wide array of item types, ranging from multiple-choice and true-false items to essays,
complex performance tasks and simulations.
CRTs are often teacher-made and are closely tied to the objectives and curriculum. They are most
meaningful when they are specifically designed to measure student ability in a particular area (Gronlund,
1973). Competency is such a critical requirement in nursing education that CRT is often the preferred
form of classroom testing in nursing education (Reilly & Oermann, 1990). With criterion-referenced
clinical evaluation, student performance is compared against preset criteria. In some nursing courses
these criteria are the clinical objectives to be met in the course.5
Gronlund (1973) describes the relationship of criterion-referenced testing to the two levels of learning:
Mastery and Developmental.
The concept of developmental learning applies to constructs that represent complex, higher-order
thinking, such as critical thinking. The abilities associated with these levels are continuously
developing throughout life. Developmental level objectives represent goals to work toward, with
emphasis focused on continuous development, rather than a complete mastery of a set of
predetermined skills (Gronlund, 1973).
Learning outcomes at the developmental level represent degrees of progress towards an objective.
Because it is impossible to identify all the behaviors that represent a complex construct, only a
sample of the behaviors associated with instructional objectives at this level can be identified as
learning outcomes; these behaviors should define the construct and provide a representative sample
of student performance that will be accepted as evidence of the appropriate progress towards the
attainment of the ultimate objective.
Students are not expected to fully achieve objectives at the developmental level. However, they are
required to demonstrate the behaviors represented by the learning outcomes, and they are also
encouraged to strive for their personal level of maximum achievement towards the ultimate objective:
their personal best. At this level, instructional objectives can be designed to show the development of
students as they progress through an instructional program. For example, the same general
objectives can be used in every course in a nursing programme, with the learning outcomes
becoming more complex as students progress through the program.
Gronlund (1973) asserts that the use of criterion-referenced tests is limited in the assessment of
developmental learning. While test preparation should follow mastery-level procedures, he suggests
that, in order to adequately describe student performance beyond minimal essentials, tests at the
developmental level should include items of varying difficulty and allow for both criterion- and norm-
referenced interpretations.5,6
When CRT was introduced by Glaser (1963) and Popham & Husek (1969), the goal was to assess
the examinee’s performance in relation to a set of behavioral objectives. Over the years, it became
clear that behavioral objectives did not have the specificity needed to guide instruction or to serve as
targets for test development and test score interpretation (Popham, 1978). Numerous attempts were
made to increase the clarity of behavioral objectives, including the development of detailed domain
specifications that included a clearly written objective, a sample test item or two, detailed
specifications for appropriate content, and details on the construction of relevant assessment
materials (Hamilton, 1998). A more recent trend in CRT practice has been to write objectives focused on
the most important educational outcomes (fewer instructional and assessment targets seem to be
preferable) and then to offer a couple of sample assessments, preferably samples showing the
diversity of approaches that might be used for assessment (Popham, 2000).
8. Test assembly.
4.1.7. Uses of criterion-referenced test
Today CRTs are widely used in education, credentialing, the armed services, and industry.
1. Criterion-referenced measures are particularly useful when the purpose of testing is to ascertain
whether an individual has attained critical clinical competencies or minimum requirements, such
as for practice or for admission to a specific educational program or course. Special educators
use CRTs with individual education programme to monitor student progress and achievement.
2. Class room teachers use CRT, both in their day-to-day management of student progress and in
their evaluation of instructional approaches.
3. In credentialing, CRTs are used to identify persons who have met the test performance
requirement for a license or certificate to practice in a profession (e.g., medical practice, nursing,
teaching and accountancy) or for language certification (e.g., IELTS, TOEFL). The NCLEX
examination is probably the most commonly used criterion-referenced test in nursing: a standard is
set, and each individual must score at that level or above in order to attain nursing licensure.
4. In educational programs criterion-referenced measurement is best used when there is a need for
tests to examine student progress towards attainment of a designated skill or knowledge level.
5. The application of criterion-referenced measurement for ascertaining clinical skills requires each
student to demonstrate critical behaviors before performance would be considered satisfactory.
6. Criterion-referenced tests are used to determine which students are eligible for promotion to the
next grade or graduation.
7. In clinical practice, criterion-referenced measures are sometimes used to determine a client’s ability
to perform specific tasks and skills and to categorize clients with regard to their health status or
diagnosis. For example, the intensity of a heart murmur can be classified as grade 1, 2, 3, 4, 5 or 6;
the reflexes measured during a neurological exam may be categorized as 0, 1+, 2+, 3+ or 4+; or a
patient may be categorized as hypotensive, normotensive or hypertensive based on blood pressure
level. The criterion standards applied during the classification process have been made explicit and
incorporated into these procedures so that results will be as accurate as possible.11,12
8. In the armed forces & industry, CRTs are used to identify the training needs of individuals, to
judge people’s job competence, and to determine whether or not trainees have successfully
completed training programs.
4.1.8. Limitations of criterion-referenced test
1. A criterion-referenced test tells only whether a learner has reached proficiency in a task area, but
does not show how good or poor the learner’s level of ability is.
2. Tasks included in a criterion-referenced test may be highly influenced by a given teacher’s
interests or biases, leading to a general validity problem.
3. Only some areas readily lend themselves to listing the specific behavioral objectives around which
criterion-referenced tests can be built, and this may be an obstacle for teachers.
4. Criterion-referenced tests address only a small fraction of important educational
achievements. The promotion and assessment of various skills, by contrast, is a very important
function and requires norm-referenced testing.1,8
4.1.9. Administration
1. The test manual should specify the role and responsibilities of the examiner.
2. The test administrators should have adequate information relating to the purpose, time limits,
answer sheets and scoring of the test.
3. The directions of the test should be clear.
4. The test should be easy to score.1
4.1.11. Reliability
1. The test length should be sufficient to ensure test score reliability.
2. The sample of examinees used in establishing reliability should be adequate and representative.
3. Reliability information should be provided for each intended use of the test score.
4. The reliability information provided should be appropriate for the intended use of the test score.1
4.1.12. Cut-off scores
1. There should be a rationale for the selection of the method used to determine the cut-off score.
2. There should be evidence for the validity of the chosen cut-off score.8
4.1.13. Validity
1. The validity evidence should be adequate for the intended use of the test score.
2. The test manual should provide an appropriate discussion of the factors affecting the validity of the
scores.8
4.2. Norm-referenced tests (NRTs)
While CRTs measure a student’s achievement without reference to other students, the aim of an NRT
(norm-referenced test) is to compare a student’s achievement with that of the
student’s peer group. The word norm in NRT refers to a designated standard of average performance
of people of a given age, background, etc. The representative group is known as the ‘norm group’. A norm
group may be made up of examinees at the local, district, state or national level.5
4.2.1. Definition
“Norm-referenced test is a test designed to measure the growth in a student’s attainment and to compare
his level of attainment with the levels reached by other students and norm group”
- Bormuth (1970)5
- Gronlund (1976)16
A norm referenced test compares the scores of examinees against the scores of other examinees. Often,
typical scores achieved by identified groups in the population of test takers, i.e., the so-called norms, are
published for these tests. Norm-referenced tests are used to make “selection decisions.” For example, a
nursing college entrance exam might be designed to select applicants for 100 available seats in a
college. College decision makers use the test scores to determine the 100 best people of those taking the
test to fill those positions. Some years a higher-quality group of students will qualify, and some years a
lower-quality group. The key, however, is that the test will spread the examinees’ scores out from one
another so that the 100 best performers will be readily identifiable.2,3,4
Setting the pass/fail score for a norm-referenced test is fairly simple. First, determine how many people
should pass. A report shows how many people reached each score, enabling test administrators to select
the top set of candidates. For example, using the table below, if 1,000 students should pass for
graduation to the next level of their studies, a passing score of 78.6% would achieve that result.
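The selection step just described (rank all examinees and read off the score of the last candidate admitted) can be sketched as follows. The score list and quota are invented for illustration; the function name is hypothetical.

```python
# Sketch of setting a norm-referenced pass/fail score: sort scores from
# highest to lowest and take the score of the n_pass-th ranked examinee
# as the cut. Data below are hypothetical.

def norm_referenced_cut(scores, n_pass):
    """Return the lowest score that still places an examinee in the top n_pass."""
    ranked = sorted(scores, reverse=True)
    return ranked[n_pass - 1]

scores = [91.2, 85.0, 78.6, 78.6, 74.3, 69.9, 66.0, 61.5]
print(norm_referenced_cut(scores, n_pass=4))
```

One practical caveat: ties at the cut score can allow more than the intended number of candidates to pass, so in practice the quota or the cut may need adjusting when several examinees share the boundary score.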
NRTs are designed to discriminate between strong and weak students. The tests are designed to provide
a wide range of scores so that the identification of students at different achievement levels is possible.
Therefore, items that all students are likely to answer correctly are eliminated.2,3,4
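One common way developers screen items for this discriminating power is the item discrimination index: the proportion of a top-scoring group answering the item correctly minus the proportion of a bottom-scoring group doing so. Items with an index near zero or negative do not separate high from low achievers and are candidates for removal. The sketch below uses the conventional upper/lower 27% split; the data and function name are invented for illustration.

```python
# Sketch of the item discrimination index D = p_high - p_low, where p_high
# and p_low are the proportions of the top- and bottom-scoring groups that
# answered the item correctly. Data below are hypothetical.

def discrimination_index(item_correct, total_scores, fraction=0.27):
    """item_correct[i] is True if examinee i answered this item correctly;
    total_scores[i] is examinee i's total test score."""
    n = max(1, round(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    low, high = order[:n], order[-n:]
    p_high = sum(item_correct[i] for i in high) / n
    p_low = sum(item_correct[i] for i in low) / n
    return p_high - p_low

# An item answered correctly only by high scorers discriminates strongly:
item_correct = [True, True, True, True, False, True, False, False, False, False]
total_scores = [95, 90, 85, 80, 75, 70, 65, 60, 55, 50]
print(discrimination_index(item_correct, total_scores))
```

An item every examinee answers correctly would yield D = 0, which is exactly why such items are eliminated from NRTs even though they would be kept in a CRT.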
The NRT format is commonly used on national standardized tests. These tests have
generalized content that is commonly taught in different areas. The scores provide a general indication of
strengths and weaknesses of the students in a particular institution, and afford the faculty an external
reference point for comparing their curriculum to a composite national curriculum. Because NRTs are not
concerned with the level of individual student achievement, they are usually not appropriate for classroom
use. When assessing developmental learning, Gronlund (1973) suggests using NRTs to rank the
students with the addition of criterion-referenced interpretations applied to the test to assess degrees of
student progress towards an achievement. In clinical settings, norm-referenced interpretations compare a
student’s clinical performance with those of a group of learners, indicating that the student has more or
less clinical competence than others in the group.5
L.M. Carey (1988) has described the following stages for the development of an NRT.
4.2.5. Uses of norm-referenced test
1. Used to make comparisons within the groups or with external groups and to use the data for
predictive purposes, such as admission criteria to an educational institution.
2. In aptitude testing for making differential prediction.
3. To get reliable rank ordering of the pupils with respect to the achievement we are measuring.
4. To identify the pupils who have mastered the essentials of the course more than others.
5. To select the best of the applicants for a particular programme.
6. To find out how effective a programme is in comparison with other possible programmes.
7. To select nursing staff for a hospital.12
4.2.7. Reliability
Test length affects reliability: other things being equal, the reliability of a test can be increased by
increasing its length. Items of similar content also increase reliability, as do items of moderate difficulty,
which produce the greatest spread of scores.
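The effect of test length on reliability is conventionally estimated with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k with comparable items. The formula is standard; the numbers below are illustrative.

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened by a factor k using items comparable to the existing ones.
#   rho_new = k * rho / (1 + (k - 1) * rho)

def spearman_brown(rho, k):
    """rho: current reliability; k: factor by which the test is lengthened."""
    return k * rho / (1 + (k - 1) * rho)

# Doubling a test whose reliability is 0.60:
print(round(spearman_brown(0.60, 2), 2))  # -> 0.75
```

The formula also shows the diminishing returns of lengthening: each doubling raises reliability by less than the one before, so beyond a point, adding items is not worth the testing time.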
4.2.8. Validity
J.C. Stanley and K.D. Hopkins (1972) have stated: “The word criterion in CRT denotes an instructional
objective, an expected post-instructional outcome, an intended level of student performance, an
acceptable level of learner achievement or a desired standard of product or performance.” By norm-
referenced testing is meant the measurement of a student’s achievement in terms of a group (a class, a
school or a state) which is taken as the referent for interpreting students’ scores and for passing judgments.
Thus the typical performance of a norm group is used as the basis for judging individual student learning.
In CRT the emphasis is on the improvement of student achievement; in NRT, the emphasis is on the
measurement of achievement.16
5.1. Similarities
5.2. Differences
Many educators and members of the public fail to grasp the distinctions between criterion-referenced and
norm-referenced testing. It is common to hear the two types of testing referred to as if they serve the
same purposes or share the same characteristics. Much confusion can be eliminated if the basic
differences are understood. Popham (1975) compared CRTs and NRTs as follows:1,8
Dimension: Purpose
- Criterion-referenced tests: To determine whether each student has achieved specific skills or
concepts; to find out how much students know before instruction begins and after it has finished.
- Norm-referenced tests: To rank each student with respect to the achievement of others in broad
areas of knowledge; to discriminate between high and low achievers.

Dimension: Item characteristics
- Criterion-referenced tests: Each skill is tested by at least four items in order to obtain an adequate
sample of student performance and to minimize the effect of guessing; the items which test any given
skill are parallel in difficulty.
- Norm-referenced tests: Each skill is usually tested by fewer than four items; items vary in difficulty;
items are selected that discriminate between high and low achievers.

Dimension: Score interpretation
- Criterion-referenced tests: Each individual is compared with a preset standard for acceptable
achievement; the performance of other examinees is irrelevant; a student's score is usually expressed
as a percentage; student achievement is reported for individual skills.
- Norm-referenced tests: Each individual is compared with other examinees and assigned a score,
usually expressed as a percentile, a grade equivalent score, or a stanine; student achievement is
reported for broad skill areas, although some norm-referenced tests do report student achievement for
individual skills.

Source: Popham, J. W. (1975). Educational evaluation. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.17
6. RESEARCH ABSTRACT
In an article published by the National League for Nursing in Nursing Education Perspectives in May
2004, the authors Deanna L. Reising and Lynn E. Devich state: “With the inception of a new
competency-based nursing curriculum, faculty in a baccalaureate nursing program developed a
comprehensive laboratory and clinical evaluation program aimed at progressive, criterion-based
evaluation across four semesters of the nursing program. This article provides background for the
development of the program, the resources needed, and specific evaluation activities for the four
semesters targeted. Course content and program year competencies, progressively built from one
semester to the next, guided the design of the practicum evaluations. Faculty report satisfaction with
the ability of this program to determine whether student performance is consistent with competency
achievement. Refinements have been made to address student stress and evaluator consistency.”
A paper presented at the Learning Communities and Assessment Cultures Conference organized by the
EARLI Special Interest Group on Assessment and Evaluation, University of Northumbria, in August 2002
by Lee Dunn, Sharon Parry and Chris Morgan states that “Over the past decade, traditional norm-
referenced methods of assessment have come into question, and criterion referenced assessment in
undergraduate education has gathered considerable momentum as a method of marking, grading and
reporting students' achievements. The value of criterion referencing lies in its capacity to achieve greater
7. CONCLUSION
Evaluation is the process of making judgments about student learning and achievement, clinical
performance, employee competence and educational programs, based on the assessment data.
Broadfoot (2007) emphasized that evaluation centres on making judgments about quality. In nursing
education, evaluation typically takes the form of judging student attainment of the educational objectives
and goals in the classroom and the quality of student performance in the clinical settings. With this
evaluation, learning outcomes are measured, further educational needs are identified, and additional
instruction can be provided to assist the students in their learning and in developing competencies for
practice.6
8. REFERENCE
1. Bharat Singh. Educational measurement and evaluation system. New Delhi: Anmol Publishers; 2006.
256 – 282.
2. Bloxham, S. & Boyd, P. Developing effective assessment in higher education: A practical guide.
London: McGraw-Hill Education and Open University Press; 2007.
6. Oermann, Marilyn H. & Gaberson, Kathleen B. Evaluation and testing in nursing education. 3rd edition.
New York: Springer Publishing Company; 2009. 6 – 9.
7. Phillips, Patricia Pulliam. ASTD handbook of measuring and evaluating training. USA: The American
Society for Training and Development; 2010. 73 – 85.
9. Reynolds, Cecil R. & Kamphaus, Randy W. Handbook of psychological and educational assessment
of children: Intelligence, aptitude and achievement. New York: Guilford Press; 2003. 375 – 404.
10. Sankaranarayanan, B. & Sindhu, B. Learning and teaching nursing. Second edition. Calicut: Brainfill;
2008. 2 – 4, 204 – 214.
11. Shrock, Sharon A. & Coscarelli, William C. Criterion-referenced test development: Technical and
legal guidelines for corporate training. San Francisco: John Wiley & Sons; 2007. 25 – 36.
12. Waltz, Carolyn Feher & Strickland, Ora Lea. Measurement in nursing and health research. New York:
Springer Publishing Company; 2010. 124 – 128.
13. www.712educators.com
14. www.altalang.com
15. www.chiron.valdosta.edu
16. www.leeds.ac.uk
17. www.qualityresearchinternational.com
18. www.sciencedirect.com
19. www.scribd.com
20. www.wikipedia.org