
Chapter 1

Assessment, Measurement, and Evaluation

Lesson 1: Assessment in the Classroom Context


Grades
- reflect a combination of the different forms of assessment that both the teacher and the student have conducted
- are based on a variety of information that the student and teacher gathered in order to objectively arrive at a value that reflects the student's performance
- also serve to measure how well students have accomplished the learning goals intended for them in a particular subject, course, or training

Assessment - The process of collecting various information needed to arrive at an overall judgment that reflects the attainment of goals and purposes. It is integrated into all parts of the teaching and learning process.

Assessment can take place before instruction, during instruction, and after instruction.

Before instruction, teachers can use assessment results as the basis for the objectives and instruction in their plans. These assessment results come from students' achievement tests from the previous year, students' grades from the previous year, assessment results from the previous lesson, or pretest results before instruction takes place. Knowing the assessment results from different sources prior to planning the lesson helps teachers design instruction that better fits the kind of learners they will handle, set objectives appropriate for the learners' developmental level, and think of better ways of assessing students to effectively measure the skills learned.

During instruction, there are many ways of assessing student performance. While class discussion is conducted, teachers can ask questions and students can answer them orally to assess whether students can recall, understand, apply, analyze, evaluate, and synthesize the facts presented. During instruction, teachers can also provide seatwork and worksheets for every unit of the lesson to determine whether students have mastered the skill needed before moving to the next lesson. Assignments are also provided to reinforce student learning inside the classroom. Assessment done during instruction serves a formative function: it prepares students before they are finally assessed through major exams and tests. After instruction has taken place and students are ready to be assessed, they are assessed on the variety of skills they were trained in, which then serves as a summative form of assessment.

Final assessments come in the form of final exams, long tests, and final performance assessments, which cover a larger scope of the lesson and require more complex skills to be demonstrated. Assessments conducted at the end of instruction are more structured and are announced, since students need time to prepare.
Lesson 2
The Role of Measurement and Evaluation in Assessment
The Nature of Measurement

Measurement is an important part of assessment. Measurement has the features of quantification, abstraction, and further analysis that are typical of the process of science. Some assessment results come in the form of quantitative values that enable further analysis.

Obtaining evidence of different phenomena in the world can be based on measurement. A statement
can be accepted as true or false if the event can be directly observed.

If measurement is carefully done, then the process meets the requirements of scientific inquiry.

Objects per se are not measured; what are measured are the characteristics or traits of objects.

Variables - measurable characteristics or traits


Examples of variables that are studied in the educational setting are:
intelligence, achievement, aptitude, interest, attitude, temperament, and others.

Nunnally (1970) defined measurement as consisting of "rules for assigning numbers to objects in such a way as to represent quantities of attributes." Measurement is used to quantify characteristics of objects.

ADVANTAGES OF QUANTIFICATION OF CHARACTERISTICS OR ATTRIBUTES

1. Quantifying characteristics or attributes determines the amount of that attribute present.
2. Quantification facilitates accurate information.
3. Quantification allows objective comparison of groups.
4. Quantification allows classification of groups.
5. Quantification makes the data available for further analysis.

Quantifying characteristics or attributes determines the amount of that attribute present. If a student is placed at the 10th percentile rank on an achievement test, it means that the student has achieved less relative to the reference group. A student who gets a perfect score on a quiz about the facts of Jose Rizal's life has remembered enough information about Jose Rizal.
Quantification facilitates accurate information. If a student gets a standard score of -2 on a standardized test (standard scores range from -3 to +3, where 0 is the mean), it means that the student is below average on that test. If a student gets a stanine score of 8 on a standardized test (stanine scores range from 1 to 9, where 5 is the average), it means that the student is above average or has demonstrated superior ability on the trait measured by the standardized test.
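As a rough illustration of these conversions, here is a minimal Python sketch that turns a raw score into a z-score and an approximate stanine. The class scores are hypothetical, and the stanine formula (2z + 5, clamped to 1-9) is the usual normal-curve approximation rather than a prescribed rule.

```python
# Minimal sketch: converting a raw score into a z-score and an approximate stanine.
# The scores below are hypothetical; the stanine rule is the common 2z + 5 approximation.
from statistics import mean, stdev

scores = [72, 80, 85, 90, 95, 78, 88, 91, 83, 76]   # hypothetical class scores
m, s = mean(scores), stdev(scores)

def z_score(raw):
    return (raw - m) / s

def stanine(raw):
    return max(1, min(9, round(2 * z_score(raw) + 5)))

raw = 91
print(f"raw={raw}, z={z_score(raw):.2f}, stanine={stanine(raw)}")
```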
Quantification allows objective comparison of groups. Suppose that male and female students are tested on their math ability using the same test for both groups, and the mean of the males' math scores is 92.3 while the mean of the females' math scores is 81.4. If the difference is tested for significance, it can be said that the males performed better on the math test than the females.
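A minimal sketch of such a comparison, assuming hypothetical score lists and using an independent-samples t-test from SciPy, might look like this.

```python
# Minimal sketch: comparing two groups' mean math scores with an
# independent-samples (Welch's) t-test. The score lists are hypothetical.
from scipy import stats

male_scores = [95, 90, 94, 88, 92, 96, 91]
female_scores = [80, 84, 79, 83, 82, 81, 80]

t, p = stats.ttest_ind(male_scores, female_scores, equal_var=False)
if p < 0.05:
    print(f"Significant difference (t={t:.2f}, p={p:.3f})")
else:
    print(f"No significant difference (t={t:.2f}, p={p:.3f})")
```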
Quantification allows classification of groups. The common way of categorizing sections or classes is based on students' general average grades from the last school year. This is especially true if there are designated top sections within a level. In the process, students' grades are ranked from highest to lowest and the necessary cut-offs are made depending on the number of students that can be accommodated in a class.
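A small sketch of this sectioning step, with hypothetical names, general averages, and an assumed section size, could be written as follows.

```python
# Minimal sketch: ranking students by general average and cutting off the top
# section. The names, averages, and section size are hypothetical.
averages = {"Ana": 92.5, "Ben": 88.0, "Carla": 95.1, "Dino": 90.3, "Ella": 86.7}
section_size = 3

ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
top_section = ranked[:section_size]
print("Top section:", [name for name, avg in top_section])
```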
Quantification makes the data available for further analysis. When data are quantified, teachers, guidance counselors, researchers, administrators, and other personnel can obtain different results to summarize and make inferences about the data. The data may be presented in charts, graphs, and tables showing means and percentages. The quantified data can be further analyzed using inferential statistics, such as when comparing groups, benchmarking, or assessing the effectiveness of an instructional program.

The process of measurement in the physical sciences (physics, chemistry, biology) is similar to that in education and the social sciences. Both use instruments or tools to arrive at measurement results. The only difference lies in the variables of interest being measured. In the physical sciences, measurement is more accurate and precise because physical data are directly observable and the variables involved are tangible. In education, psychology, and the behavioral sciences, the data are subject to measurement error and large variability because of individual differences and the inability to control variations in measurement conditions. Nevertheless, in education, psychology, and the behavioral sciences, there are statistical procedures for estimating measurement error, such as reporting standard deviations, standard errors, and variances.

Measurement facilitates objectivity in observation. Through measurement, extreme differences in results are avoided, provided that there is uniformity in conditions and individual differences are controlled. This implies that when two persons measure a variable following the same conditions, they should be able to get consistent results. Although there may be slight differences (especially if the variable measured is psychological in nature), the results should at least be consistent. Repeating the measurement process several times and obtaining consistent results indicates that the procedure undertaken is objective.
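One simple way to check this kind of consistency, sketched below with hypothetical ratings from two raters, is to correlate their scores; a high positive correlation suggests the procedure is being applied objectively (a fuller treatment would use a formal reliability coefficient).

```python
# Minimal sketch: checking consistency between two raters' scores with a
# Pearson correlation. The ratings are hypothetical.
from statistics import correlation  # available in Python 3.10+

rater_a = [8, 7, 9, 6, 8, 7, 9, 5]
rater_b = [8, 6, 9, 6, 7, 7, 9, 5]

r = correlation(rater_a, rater_b)
print(f"Inter-rater correlation: r = {r:.2f}")
```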

The process of measurement involves abstraction. Before a variable is measured using an instrument, the variable's nature needs to be clarified and studied well. The variable needs to be defined conceptually and operationally to identify how it is going to be measured. Knowing the conceptual definition based on several references reveals the theory or conceptual framework that fully explains the variable. The framework shows whether the variable is composed of components or specific factors. These specific factors that comprise the variable then need to be measured. A characteristic that is composed of several factors or components is called a latent variable. The components are usually called factors, subscales, or manifest variables. An example of a latent variable is "achievement." Achievement is composed of factors that include different subject areas in school such as math, general science, English, and social studies. Once the variable is defined and its underlying factors are identified, the appropriate instrument that can measure it can be selected. When the instrument or measure for achievement is selected, it becomes easy to operationally define the variable. An operational definition includes the procedures on how a variable will be measured or made to occur. For example, "achievement" can be operationally defined as measured by the Graduate Record Examination (GRE), which is composed of verbal, quantitative, analytical, biology, mathematics, music, political science, and psychology components.

When a variable is composed of several factors, it is said to be multidimensional. This means that a multidimensional variable requires an instrument with several subtests in order to directly measure the underlying factors. A variable that does not have underlying factors is said to be unidimensional. A unidimensional variable measures an isolated, unitary attribute. Examples of unidimensional measures are the Rosenberg Self-Esteem Scale and the Penn State Worry Questionnaire (PSWQ). Examples of multidimensional measures are various ability tests and personality tests that are composed of several factors. The 16 PF is a personality test composed of 16 components (reserved, more intelligent, affected by feelings, assertive, sober, conscientious, venturesome, tough-minded, suspicious, practical, shrewd, placid, experimenting, self-sufficient, controlled, and relaxed).

The common tools used to measure variables in the educational setting are tests, questionnaires, inventories, rubrics, checklists, surveys, and others. Tests are usually used to determine student achievement and aptitude and serve a variety of purposes such as entrance exams, placement tests, and diagnostic tests. Rubrics are used to assess the performance of students in presentations such as speeches, essays, songs, and dances. Questionnaires, inventories, and checklists are used to identify certain attributes of students, such as their attitude toward studying, attitude toward math, feedback on the quality of food in the canteen, feedback on the quality of service during enrollment, and other aspects.

The Nature of Evaluation

Evaluation is arrived at when the necessary measurement and assessment have taken place. In order to evaluate whether a student will be retained or promoted to the next level, different aspects of the student's performance, such as grades and conduct, are carefully assessed and measured. To evaluate whether a remedial program in math is effective, the students' improvement in math, the teachers' teaching performance, and the students' change in attitude toward math should be carefully assessed. Different measures are used to assess different aspects of the remedial program to come up with an evaluation. According to Scriven (1967), evaluation is "judging the worth or merit" of a case (e.g., a student), program, policy, process, event, or activity. These objective judgments derived from evaluation enable stakeholders (persons or groups with a direct interest, involvement, or investment in the program) to make further decisions about the cases, programs, policies, processes, events, and activities evaluated.

In order to come up with a good evaluation, Fitzpatrick, Sanders, and Worthen (2004) indicated that there
should be standards for judging quality and deciding whether those standards should be relative or absolute.
The standards are applied to determine the value, quality, utility, effectiveness, or significance of the case
evaluated. In evaluating whether a university has a good reputation and offers quality education, it can be compared with a standard university that topped the world university rankings. The features of the university being evaluated should be similar to those of the standard university selected. A standard can also take the form of ideal objectives, such as the ones set by the Philippine Accrediting Association of Schools, Colleges and Universities (PAASCU). A university is then evaluated on whether it meets the necessary standards set by the external evaluators.

Fitzpatrick, Sanders, and Worthen (2004) clarified the aims of evaluation in terms of its purpose, outcome,
implication, setting of agenda, generalizability, and standards. The purpose of evaluation is to help those
who hold a stake in whatever is being evaluated. Stakeholders consist of many groups such as students,
teachers, administrators, and staff. The outcome of evaluation leads to judgment whether a program is
effective or not, whether to continue or stop a program, whether to accept or reject a student in the school.
The implication of evaluation is to describe programs, policies, organizations, products, and individuals. In setting the agenda for evaluation, the questions for evaluation come from many sources,
including the stakeholders. In making generalizations, a good evaluation is specific to the context in which
the evaluation object rests. The standards of a good evaluation are assessed in terms of its accuracy, utility,
feasibility, and propriety.

A good evaluation adheres to the four standards of accuracy, utility, feasibility, and propriety set by the
‘Joint Committee on Standards for Educational Evaluation’ headed by Daniel Stufflebeam in 1975 at
Western Michigan University's Evaluation Center. These four standards are now referred to as the
‘Standards for Evaluation of Educational Programs, Projects, and Materials.’ Table 1 presents the
description of the four standards.

Table 1
Standards for Evaluation of Educational Programs, Projects, and Materials

Utility
Summary: Intended to ensure that an evaluation will serve the information needs of its intended users.
Components: Stakeholder identification, evaluator credibility, information scope and selection, values identification, report clarity, report timeliness and dissemination, evaluation impact

Feasibility
Summary: Intended to ensure that an evaluation will be realistic, prudent, diplomatic, and frugal.
Components: Practical procedures, political viability, cost effectiveness

Propriety
Summary: Intended to ensure that an evaluation will be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation as well as those affected by its results.
Components: Service orientation, formal agreements, rights of human subjects, human interaction, complete and fair assessment, disclosure of findings, conflict of interest, fiscal responsibility

Accuracy
Summary: Intended to ensure that an evaluation will reveal and convey technically adequate information about the features that determine the worth or merit of the program being evaluated.
Components: Program documentation, context analysis, described purposes and procedures, defensible information sources, valid information, reliable information, systematic information, analysis of quantitative information, analysis of qualitative information, justified conclusions, impartial reporting, metaevaluation

Forms of Evaluation
Owen (1999) classified evaluation according to its form. He said that evaluation can be proactive,
clarificative, interactive, monitoring, and impact.
1. Proactive. Proactive evaluation is conducted before a program begins and ensures that all critical areas are addressed in the evaluation process. It assists stakeholders in making decisions about the type of program needed. It usually starts with a needs assessment to identify the needs of stakeholders that the program will address. A review of literature is conducted to determine best practices and to create benchmarks for the program.
2. Clarificative. This is conducted during program development. It focuses on the evaluation of all aspects of the program. It determines the intended outcomes and how the program design will achieve them. Determining how the program will achieve its goals involves identifying the strategies that will be implemented.
3. Interactive. This evaluation is conducted during program implementation. It focuses on improving the program. It identifies what the program is trying to achieve, whether delivery is consistent with the plan, and how the program can be changed to make it more effective.
4. Monitoring. This evaluation is conducted when the program has settled. It aims to justify and fine-tune the program. It focuses on whether the outcomes of the program have been delivered to its intended stakeholders. It determines whether the program is reaching the target population, whether the implementation meets the benchmarks, and what can be changed to make the program more efficient.
5. Impact. This evaluation is conducted when the program is already established. It focuses on the outcome. It evaluates whether the program was implemented as planned, whether the needs were served, whether goal achievement is attributable to the program, and whether the program is cost effective.

These forms of evaluation are appropriate at certain time frames and stages of a program. The illustration below shows when each evaluation is appropriate.

Program Duration
Planning and Development Phase: Proactive, Clarificative
Implementation: Interactive and Monitoring
Settled: Impact

Models of Evaluation

Evaluation is also classified according to the models and frameworks used. The classifications of the models of evaluation are objectives-oriented, management-oriented, consumer-oriented, expertise-oriented, participant-oriented, and theory-driven.

1. Objectives-oriented. This model of evaluation determines the extent to which the goals
of the program are met. The information that results in this model of evaluation can be used to
reformulate the purpose of the program evaluated, the activity itself, and the assessment
procedures used to determine the purpose or objectives of the program. In this model there should
be a set of established program objectives and measures are undertaken to evaluate which goals
were met and which goals were not met. The data are compared with the goals. The specific models for the objectives-oriented approach are the Tylerian Evaluation Approach, Metfessel and Michael's
Evaluation Paradigm, Provus Discrepancy Evaluation Model, Hammond’s Evaluation Cube, and
Logic Model (see Fitzpatrick, Sanders, & Worthen, 2004).

2. Management-oriented. This model is used to aid administrators, policy-makers, boards, and practitioners in making decisions about a program. The system is structured around inputs,
process, and outputs to aid in the process of conducting the evaluation. The major target of this
type of evaluation is the decision-maker. This form of evaluation provides the information needed
to decide on the status of a program. The specific models of this evaluation are the CIPP (Context,
Input, Process, and Product) by Stufflebeam, Alkin’s UCLA Evaluation Model, and Patton’s
Utilization-focused evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

3. Consumer-oriented. This model is useful in evaluating whether a product is feasible, marketable, and significant. A consumer-oriented evaluation can be undertaken to determine whether there will be many enrollees in a school to be built at a designated location, whether there will be takers of a proposed graduate program, and whether a course is producing employable graduates.
Specific models for this evaluation are Scriven’s Key Evaluation Checklist, Ken Komoski’s EPIE
Checklist, Morrisett and Stevens Curriculum Materials Analysis System (CMAS) (see Fitzpatrick,
Sanders, & Worthen, 2004).

4. Expertise-oriented. This model of evaluation uses external experts to judge an institution's program, product, or activity. In the Philippine setting, the accreditation of schools is based on this model. A group of professional experts makes evaluations based on existing school documents. The members of this group of experts should complement each other in producing a sound judgment
of the school’s standards. This model comes in the form of formal professional reviews (like
accreditation), informal professional reviews, ad hoc panel reviews (like funding agency review,
blue ribbon panels), ad hoc individual reviews, and educational connoisseurship (see Fitzpatrick,
Sanders, & Worthen, 2004).

5. Participant-oriented. The primary concern of this model is to serve the needs of those
who participate in the program such as students and teachers in the case of evaluating a course.
This model depends on the values and perspectives of the recipients of an educational program.
The specific models for this evaluation are Stake's Responsive Evaluation, Patton's Utilization-focused Evaluation, and Fetterman's Empowerment Evaluation (see Fitzpatrick, Sanders, & Worthen, 2004).

6. Program Theory. This evaluation is conducted when stakeholders and evaluators intend to understand both the merits of a program and how its transformational processes can be exploited to improve the intervention (Chen, 2005). The effectiveness of a program in a theory-driven evaluation takes into account the causal mechanism and its implementation processes. Chen (2005) identified three strengths of program theory evaluation: (1) it serves accountability and program improvement needs, (2) it establishes construct validity for parts of the evaluation process, and (3) it increases internal validity. Program theory measures the effect of the program intervention on the outcome as mediated by determinants. For example, in a program that implemented instruction and training for public school students on proper waste disposal, the quality of the training is assessed. The determinants for the stakeholders are then identified, such as adaptability, learning strategies, patience, and self-determination, and these factors are measured as determinants. The outcome measures are then identified, such as the reduction of wastes, improvement of waste disposal practices, attitude change, and ratings of environmental sanitation. The effect of the intervention on the determinants is assessed, as well as the effect of the determinants on the outcome measures. The direct effect of the intervention on the outcome is also assessed. The model of this evaluation is illustrated in Figure 1, and a minimal regression sketch of the mediation logic follows the figure.

Figure 1
Implicit Theory for Proper Waste Disposal

Intervention (quality of instruction and training) → Determinants (adaptability, learning strategies, patience, and self-determination) → Outcome (reduction of wastes, improvement of waste disposal practices, attitude change, and rating of environmental sanitation)
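As referenced above, the following is a minimal sketch of the mediation logic using simulated (hypothetical) data: one regression estimates the intervention's effect on a determinant, and a second estimates the determinant's effect on the outcome while controlling for the intervention. The variable names and coefficients are illustrative only, not results from an actual evaluation.

```python
# Minimal sketch of the mediation logic in a theory-driven evaluation:
# intervention -> determinant -> outcome. Data are simulated, not real.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
intervention = rng.integers(0, 2, n).astype(float)        # 0 = no training, 1 = training
determinant = 0.8 * intervention + rng.normal(size=n)     # e.g., self-determination
outcome = 0.6 * determinant + 0.2 * intervention + rng.normal(size=n)  # e.g., waste reduction

# Path a: effect of the intervention on the determinant
path_a = sm.OLS(determinant, sm.add_constant(intervention)).fit()

# Paths b and c': effect of the determinant and the intervention on the outcome
X = sm.add_constant(np.column_stack([determinant, intervention]))
path_bc = sm.OLS(outcome, X).fit()

print(path_a.params)   # intercept, a (intervention -> determinant)
print(path_bc.params)  # intercept, b (determinant -> outcome), c' (direct effect)
```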
Table 2
Integration of the Forms and Models of Evaluation

Proactive
Focus: Is there a need? What do we and others know about the problems to be addressed? What are the best practices?
Models: Consumer-oriented; identifying the context in CIPP

Clarificative
Focus: What is the program trying to achieve? Is delivery working and consistent with the plan? How could the program or organization be changed to be more effective?
Models: Setting goals in Tyler's Evaluation Approach

Interactive
Focus: What is the program trying to achieve? Is delivery working and consistent with the plan? How could the program or organization be changed to be more effective?
Models: Stake's Responsive Evaluation; objectives-oriented

Monitoring
Focus: Is the program reaching the target population? Is implementation meeting benchmarks? Are there differences across sites and over time? How and what can be changed to be more efficient and effective?
Models: CIPP

Impact
Focus: Is the program implemented as planned? Are stated goals achieved? Are needs served? Can goal achievement be attributed to the program? Are there unintended outcomes? Is the program cost effective?
Models: CIPP; objectives-oriented; program theory
Table 3
Implementing procedures of the Different Models of Evaluation

Model of Evaluation / Specific Model / Implementing Procedures

Objectives-oriented
Tylerian Evaluation Approach: 1. Establish broad goals; 2. Classify the goals; 3. Define objectives in behavioral terms; 4. Find situations in which achievement of objectives can be shown; 5. Develop measurement techniques; 6. Collect performance data; 7. Compare performance data with behaviorally stated objectives.
Metfessel and Michael's Evaluation Paradigm: 1. Involve stakeholders as facilitators in program evaluation; 2. Formulate goals; 3. Translate objectives into communicable forms; 4. Select instruments to furnish measures; 5. Carry out periodic observation; 6. Analyze data; 7. Interpret data using standards; 8. Develop recommendations for further implementation.
Provus Discrepancy Evaluation Model: 1. Agree on standards; 2. Determine whether a discrepancy exists between performance and standards; 3. Use information on discrepancies to decide whether to improve, maintain, or terminate the program.
Hammond's Evaluation Cube: 1. Needs of stakeholders; 2. Characteristics of the clients; 3. Source of service.
Logic Model: 1. Inputs; 2. Service; 3. Outputs; 4. Immediate, intermediate, long-term, and ultimate outcomes.

Management-oriented
CIPP (Context, Input, Process, and Product) by Stufflebeam: 1. Context evaluation; 2. Input evaluation; 3. Process evaluation; 4. Product evaluation.
Alkin's UCLA Evaluation Model: 1. Systems assessment; 2. Program planning; 3. Program implementation; 4. Program improvement; 5. Program certification.
Patton's Utilization-focused Evaluation: 1. Identify relevant decision makers and information users; 2. Determine what information is needed by various people; 3. Collect and provide the information.

Consumer-oriented
Scriven's Key Evaluation Checklist: 1. Evidence of achievement; 2. Follow-up results; 3. Secondary and unintended effects; 4. Range of utility; 5. Moral considerations; 6. Costs.
Morrisett and Stevens Curriculum Materials Analysis System (CMAS): 1. Describe characteristics of the product; 2. Analyze rationale and objectives; 3. Consider antecedent conditions; 4. Consider content; 5. Consider instructional theory; 6. Form an overall judgment.

Expertise-oriented
Formal professional reviews: accreditation.
Informal professional reviews: peer reviews.
Ad hoc panel reviews: funding agency reviews, blue ribbon panels.
Ad hoc individual reviews: consultation.
Educational connoisseurship: critics.

Participant-oriented
Stake's Responsive Evaluation: 1. Intents; 2. Observations; 3. Standards; 4. Judgments.
Fetterman's Empowerment Evaluation: 1. Training; 2. Facilitation; 3. Advocacy; 4. Illumination; 5. Liberation.

Program Theory
Focus: a determinant mediating the relationship between the intervention and the outcome; a relationship between program components that is conditioned by a third factor.
Procedures: 1. Establish a common understanding between stakeholders and the evaluator; 2. Clarify the stakeholders' theory; 3. Construct the research design.
Lesson 3
The Process of Assessment

The previous lesson clarified the distinction between measurement and evaluation. After learning about the process of assessment in this lesson, you should know how measurement and evaluation are used in assessment.
Assessment goes beyond measurement. Evaluation can be involved in the process of
assessment. Some definitions from assessment references show the overlap between assessment
and evaluation. But Popham (1998), Gronlund (1993), and Huba and Freed (2000) defined
assessment without overlap with evaluation. Take note of the following definitions:

1. Classroom assessment can be defined as the collection, evaluation, and use of information to help teachers make better decisions (McMillan, 2001).
2. Assessment is a process used by teachers and students during instruction that provides
feedback to adjust ongoing teaching and learning to improve students’ achievement of intended
instructional outcomes (Popham, 1998).
3. Assessment is the systematic process of determining educational objectives, gathering,
using, and analyzing information about student learning outcomes to make decisions about
programs, individual student progress, or accountability (Gronlund, 1993).
4. Assessment is the process of gathering and discussing information from multiple and
diverse sources in order to develop a deep understanding of what students know, understand, and
can do with their knowledge as a result of their educational experiences; the process culminates
when assessment results are used to improve subsequent learning (Huba & Freed, 2000).

Cronbach (1960) identified three important features of assessment that make it distinct from evaluation: (1) the use of a variety of techniques, (2) reliance on observation in structured and unstructured situations, and (3) the integration of information. These three features emphasize that assessment is not based on a single measure but on a variety of measures. In the classroom, a student's grade is composed of quizzes, assignments, recitations, long tests, projects, and final exams. These sources are assessed through formal and informal structures and integrated to come up with an overall assessment as represented by the student's final grade. In Lesson 1, assessment was defined as "the process of collecting various information needed to arrive at an overall judgment that reflects the attainment of goals and purposes." There are three critical characteristics of this definition:

1. Process of collecting various information. A teacher arrives at an assessment after having conducted several measures of a student's performance. Such sources are recitations, long tests, final exams, and projects. Likewise, a student is proclaimed gifted after having been tested with a battery (a set) of intelligence and ability tests. A student to be diagnosed with Attention Deficit Disorder (ADD) needs to be assessed with several attention-span and cognitive tests together with a series of clinical interviews by a skilled clinical psychologist. A variety of information is needed in order to arrive at accurate and valid conclusions.

2. Integration of overall information. Coming up with an integrated assessment from various sources needs to consider many aspects. The results of individual measures should be consistent with each other to meaningfully contribute to the overall assessment. In such cases, a battery of intelligence tests should yield the same results in order to determine the overall ability of a case. In cases where some results are inconsistent, the synthesis of the overall assessment should indicate that the results of some measures do not support the overall assessment. (A simple sketch of how weighted components can be combined into a final grade is given after this list.)

3. Attainment of goals and purposes. Assessment is conducted based on specified goals. Assessment processes are framed around specified objectives to determine whether they are met. Assessment results are the best way to determine the extent to which a student has attained the intended objectives.
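As referenced in item 2, the sketch below is a minimal illustration of integrating several assessment components into one grade. The component scores and weights are hypothetical, not an official grading scheme.

```python
# Minimal sketch: combining several assessment components into one grade.
# The component scores (0-100) and weights are hypothetical.
components = {
    "quizzes":      (85, 0.20),
    "assignments":  (90, 0.10),
    "recitations":  (88, 0.10),
    "long_tests":   (82, 0.25),
    "project":      (93, 0.15),
    "final_exam":   (87, 0.20),
}

final_grade = sum(score * weight for score, weight in components.values())
print(f"Final grade: {final_grade:.1f}")
```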

The Process of Assessment

The process of assessment was summarized by Bloom (1970). He indicated that there are two processes involved in assessment:

1. Assessment begins with an analysis of the criterion. The identification of the criterion includes the expectations, demands, and other forms of learning targets (goals, objectives, expectations, etc.).

2. It proceeds to the determination of the kind of evidence that is appropriate about the
individuals who are placed in the learning environment such as their relevant strengths and
weaknesses, skills, and abilities.

In the classroom context, it was explained in Lesson 1 that assessment takes place before, during, and after instruction. This process emphasizes that assessment is embedded in the teaching and learning process. Assessment generally starts in the planning of learning processes, when learning objectives are stated. A learning objective is defined in measurable terms to provide an empirical way of testing it. Specific behaviors are stated in the objectives so that they correspond with some form of assessment. During the implementation of the lesson, assessment can occur. A teacher may provide feedback based on student recitations, exercises, short quizzes, and classroom activities that allow students to demonstrate the skill intended in the objectives. The assessment done during instruction should be consistent with the skills required by the objectives of the lesson. The final assessment is then conducted after enough assessment has demonstrated students' mastery of the lesson and their skills. The final assessment conducted can be the basis for the objectives of the next lesson. The figure below illustrates the process of assessment.

Figure 1
The Process of Assessment in the Teaching and Learning Context

Learning Objectives → Learning Experiences → Assessment (with assessment taking place at each stage)
Forms of Assessment

Assessment comes in different forms. It can be classified as qualitative or quantitative, structured or unstructured, and objective or subjective.

Quantitative and Qualitative

Assessment is not limited to quantitative values; assessment can also be qualitative. Examples of qualitative assessments are anecdotal records, written reports, and written observations in narrative form. Qualitative assessments provide a narrative description of students' attributes, such as their strengths and weaknesses, areas that need to be improved, and specific incidents that support the areas of strength and weakness. Quantitative values use numbers to represent attributes. The advantages of quantification were described in Lesson 2. Quantitative results in assessment facilitate accurate interpretation. Assessment can be a combination of both qualitative and quantitative results.

Structured vs. Unstructured

Assessment can come in the form of structured or unstructured ways of gathering data. Structured forms of assessment are controlled and formal and involve careful planning and organized implementation. An example of a formal assessment is a final exam, where the exam is announced, students are given enough time to study, the coverage is provided, and the test items are reviewed. A graded recitation can be a structured form of assessment when it is announced, the questions are prepared, and students are informed of the way their answers will be graded. Unstructured assessment can be informal in terms of its processes. Examples would be a short unannounced quiz just to check whether students remember the past lesson, informal recitation during discussion, and assignments arising from the discussion.

Objective vs. Subjective

Assessment can be objective or subjective. Objective assessment has less variation in results, such as objective tests, seatwork, and performance assessment with rubrics that have right and wrong answers. Subjective assessment, on the other hand, results in larger variation in results, such as essays and reaction papers. Careful procedures should be undertaken to ensure objectivity as much as possible in assessing essays and reaction papers.

Components of Classroom Assessment

Tests

Tests are basically tools that measure a sample of behavior. Generally, there is a variety of tests provided inside the classroom. They can come in the form of quizzes, long tests (usually covering smaller units or chapters of a lesson), and final exams. The majority of tests for students are teacher-made tests. These tests are tailored for students depending on the lessons covered by the syllabus. The tests are usually checked by colleagues to ensure that items are properly constructed.
Teacher-made tests vary in form, such as unit, chapter, or long tests. These generally assess how much a student has learned within a unit or chapter. They are summative in the sense that they are given after instruction. The coverage is only what has been taught in a given chapter or tackled within a given unit.
Tests also come in the form of a quiz, which is a short form of assessment. It usually measures how much the student has acquired within a given period or class. The questions are usually drawn from what has been taught within the lesson for the day or a topic tackled in a short period of time, say a week. A quiz can be summative or formative: summative if it aims to measure learning from instruction, or formative if it aims to test how much students already know prior to instruction. The results of a quiz can be used by the teacher to know where to start the lesson (for example, if students already know how to add single digits, the teacher can proceed to adding double digits). It can also determine whether the objectives for the day are met.

Recitation

A recitation is a verbal way of assessing students' expression of their answers to some stimuli provided in the instruction or by the teacher. It is a kind of assessment in which oral participation of the student is expected. It serves many functions, such as probing students' prior knowledge about a topic before instruction. It can also be done during instruction, wherein the teacher solicits ideas from the class regarding the topic. It can also be done after instruction to assess how much the students learned from the lesson for the day.
Recitations are facilitated by questions provided by the teacher and are meant to make students think in order to answer the questions. There are many purposes of recitation. A recitation is given if teachers want to assess whether students can recall facts and events from the previous lesson. A recitation can be done to check whether a student understands the lesson or can go further into higher cognitive skills. Measuring higher-order cognitive skills during recitation depends on the kind of questions that the teacher provides. Appraising a recitation can be structured or unstructured. Some teachers announce the recitation and its coverage beforehand to allow students to prepare; the questions are prepared and a system of scoring the answers is provided as well. Informal recitations are simply noted by the teacher. Effective recitations inside the classroom are marked by all students having an equal chance of being called. Some concerns of teachers regarding the recitation process are as follows:

Should the teacher call more on the students who are silent most of the time in class?
Should the teacher ask students who could not comprehend the lesson easily more often?
Should recitation be a surprise?
Are the difficult questions addressed to disruptive students?
Are easy questions only for students who are not performing well in class?

Projects

Projects can come in a variety of forms depending on the objectives of the lesson; a reaction paper, a drawing, or a class demonstration can all be considered projects depending on the purpose. The features of a project should include: (1) tasks that are more relevant to real-life settings, (2) higher-order cognitive skills, (3) the opportunity to assess and demonstrate affective and psychomotor skills that supplement instruction, and (4) the application of theories taught in class.

Performance Assessment

Performance assessment is a form of assessment that requires students to perform a task rather than select an answer from a ready-made list. Examples would be students demonstrating their communication skills through a presentation, building a diorama, or performing a dance number showing different stunts in a physical education class. Performance assessment can take the form of extended-response exercises, extended tasks, and portfolios. Extended-response exercises are usually open-ended, where students are asked to report their insights on an issue, their reactions to a film, or their opinions on an event. Extended tasks are more focused, requiring specific skills and time, like writing an essay, composing a poem, planning and creating a script for a play, or painting a vase. These tasks are usually extended as assignments if the time in school is not sufficient. Portfolios are collections of students' works. For an art class, the students compile all paintings made; for a music class, all compositions are collected; for a drafting class, all drawings are compiled. Table 4 shows the different tasks using performance assessment.

Table 4
Outcomes Requiring Performance Assessment

Skills: Speaking, writing, listening, oral reading, performing experiments, drawing, playing a musical instrument, gymnastics, work skills, study skills, and social skills
Work habits: Effectiveness in planning, use of time, use of equipment and resources, and the demonstration of such traits as initiative, creativity, persistence, and dependability
Social attitudes: Concern for the welfare of others, respect for laws, respect for the property of others, sensitivity to social issues, concern for social institutions, and desire to work toward social improvement
Scientific attitudes: Open-mindedness, willingness to suspend judgment, sensitivity to cause-effect relations, and an inquiring mind
Interests: Expressed feelings toward various educational, mechanical, aesthetic, scientific, social, recreational, and vocational activities
Appreciations: Feelings of satisfaction and enjoyment expressed toward music, art, literature, physical skill, and outstanding social contributions
Adjustments: Relationship to peers, reaction to praise, criticism, and authority, emotional stability, and social adaptability

Assignments

An assignment is a kind of assessment which extends classroom work. It is usually a take-home task which the student completes. It may vary from reading material, problem solving, and research to other tasks that are accomplishable in a given time. Assignments are used to supplement a learning task or to prepare for the next lesson.
Assignments are meant to reinforce what is taught inside the classroom. Tasks on the
assignment are specified during instruction and students carry out these tasks outside of the
school. When the student comes back, the assignment should have helped the student learn the
lesson better.

Paradigm Shifts in the Practice of Assessment

Over the years, the practice of assessment has changed due to improvements in teaching and learning principles. These principles are the result of research that called for more information on how learning takes place. The shift from old practices to what should be ideal in the classroom is shown below.

From → To
Testing → Alternative assessment
Paper and pencil → Performance assessment
Multiple choice → Supply
Single correct answer → Many correct answers
Summative → Formative
Outcome only → Process and outcome
Skill focused → Task-based
Isolated facts → Application of knowledge
Decontextualized task → Contextualized task
External evaluator → Student self-evaluation
Outcome oriented → Process and outcome

The old practice of assessment focuses on traditional forms such as paper-and-pencil tests with a single correct answer, usually conducted at the end of the lesson. In the contemporary perspective, assessment is not necessarily in the form of paper-and-pencil tests because there are skills that are better captured through performance assessment such as presentations, psychomotor tasks, and demonstrations. Contemporary practice welcomes a variety of answers from students, who are allowed to make interpretations of their own learning. It is now accepted that assessment is conducted concurrently with instruction and does not only serve a summative function. There is also a shift toward assessment tasks that are contextualized and have more utility. Rather than asking for the definitions of verbs, nouns, and pronouns, students are required to make an oral or written communication about their favorite book. It is also important that students assess their own performance to facilitate self-monitoring and self-evaluation.

Uses of Assessment

Assessment results have a variety of applications, from selection to appraisal to aiding the decision-making process. These functions of assessment vary within the educational setting, whether assessment is conducted for human resources, counseling, instruction, research, or learning.

1. Appraising

Assessment is used for appraisal. Forms of appraisal are grades, scores, ratings, and feedback. Appraisals are used to provide feedback on an individual's performance to determine how much improvement can be made. A low appraisal or negative feedback indicates that performance still has room for improvement, while a high appraisal or positive feedback means that performance should be maintained.

2. Clarifying Instructional Objectives

Assessment results are used to improve the succeeding lessons. Assessment results point out whether the objectives of a specific lesson are met. The outcomes of assessment are used by teachers in planning the next lesson. If teachers find that the majority of students failed a test or quiz, the teacher assesses whether the objectives are too high or not appropriate for the students' cognitive development. Objectives are then reformulated to approximate students' ability and performance within their developmental stage. Assessment results also have implications for the objectives of the succeeding lessons. Since the teacher is able to determine the students' performance and difficulties, the teacher designs the necessary intervention to address them. A teacher being able to address students' deficiencies based on assessment results is reflective of effective teaching performance.

3. Determining and reporting pupil achievement of education objectives

The basic function of assessment is to determine students' grades and report their scores after major tests. The reported grade communicates students' performance to many stakeholders, such as teachers, parents, guidance counselors, administrators, and other concerned personnel. The reported standing of students in their learning shows how much they have attained the instructional objectives set for them. The grade is a reflection of how much they have accomplished the learning goals.

4. Planning, directing, and improving learning experiences

Assessment results are the basis for improving the implementation of instruction. Assessment results from students serve as feedback on the effectiveness of the instruction or the learning experience provided by the teacher. If the majority of students have not mastered the lesson, the teacher needs to come up with more effective instruction to target mastery for all students.

5. Accountability and program evaluation

Assessment results are used for evaluation and accountability. In making judgments about individuals or educational programs, multiple sources of assessment information are used. The results of evaluations make the administrators or those who implemented the program accountable to the stakeholders and other recipients of the program. This accountability ensures that program implementation is improved based on the recommendations from the evaluations conducted. Improvement takes place when assessment coincides with accountability.

6. Counseling

Counseling also uses a variety of assessment results. Variables such as study habits, attention, personality, and dispositions are assessed in order to help students improve them. Students who are assessed to be easily distracted inside the classroom can be helped by the school counselor by focusing the counseling session on devising ways to improve the student's attention. A student who is assessed to have difficulties with classroom tasks is taught to self-regulate during the counseling session. Students' personality and vocational interests are also assessed to guide them toward future courses suitable for them to take.

7. Selecting

Assessment is conducted in order to select students for the honor roll and pilot sections. Assessment is also conducted to select from among student enrollees who will be accepted into a school, college, or university. Recipients of scholarships and other grants are also selected based on assessment results.
Chapter 2
The Learning Intents
Lesson 1: The Taxonomic Tools

Having learned about measurement, assessment, and evaluation, this chapter will bring you
to the discussion on the learning intents, which refer to the objectives or targets the teacher sets as the competencies to build in the students. This is the target skill or capacity that you want students
to develop as they engage in the learning episodes. The same competency is what you will soon
assess using relevant tools to generate quantitative and qualitative information about your students’
learning behavior.
Prior to designing your learning activities and assessment tasks, you first have to formulate
your learning intents. These intents exemplify the competency you wish students will develop in
themselves. At this point, your deep understanding of how learning intents should be formulated is
very useful. As you go through this chapter, your knowledge about the guidelines in formulating
these learning intents will help you understand how assessment tasks should be defined.
In formulating learning intents, it is helpful to be aware that appropriate targets of learning
come in different forms because learning environments differ in many ways. What is crucial is the
identification of which intents are more important than the others so that they are given appropriate
priority. When you formulate statements of learning intents, it is important that you have a strong
grasp of some theories of learning as these will aid you in determining what competency could
possibly be developed in the students. If you are familiar with Bloom's taxonomy, dust off your understanding of it so that you can make good use of it.

Figure 1
Bloom's Taxonomy (from the lowest to the highest level): Knowledge, Comprehension, Application, Analysis, Synthesis, Evaluation
Figure 1 shows a guide for teachers in stating learning intents based on six dimensions of cognitive process. Knowledge, the level with the lowest degree of complexity, includes simple cognitive activities such as recall or recognition of information. The cognitive activity in
comprehension includes understanding of the information and concepts, translating them into
other forms of communication without altering the original sense, interpreting, and drawing
conclusions from them. For application, emphasis is on students’ ability to use previously
acquired information and understanding, and other prior knowledge in new settings and applied
contexts that are different from those in which it was learned. For learning intents stated at the
Analysis level, tasks require identification and connection of logic, and differentiation of concepts
based on logical sequence and contradictions. Learning intents written at this level indicate
behaviors that indicate ability to differentiate among information, opinions, and inferences.
Learning intents at the synthesis level are stated in ways that indicate students’ ability to produce
a meaningful and original whole out of the available information, understanding, contexts, and
logical connections. Evaluation includes students' ability to make judgments and sound decisions based on defensible criteria. Judgments include the worth, relevance, and value of some information,
ideas, concepts, theories, rules, methods, opinions, or products.
Comprehension requires knowledge as information is required in understanding it. A good
understanding of information can facilitate its application. Analysis requires the first three
cognitive activities. Both synthesis and evaluation require knowledge, comprehension, application,
and analysis. Evaluation does not require synthesis, and synthesis does not require evaluation
either.
About 45 years after the birth of Bloom's original taxonomy, a revised version, developed by Anderson and Krathwohl, has come into teaching practice. Statements
that describe intended learning outcomes as a result of instruction are framed in terms of some
subject matter content and the action required with the content. To eliminate the anomaly of
unidimensionality of the statement of learning intents in their use of noun phrases and verbs
altogether, Figure 3 shows two separate dimensions of learning: the knowledge dimension and the
cognitive process dimension.

The Knowledge Dimension has four categories, three of which include the subcategories of
knowledge in the original taxonomy. The fourth, however, is a new one, something that was not
yet gaining massive popularity at the time when the original taxonomy was conceived. It is new
and, at the same time, important in that it includes strategic knowledge, knowledge about cognitive
tasks, and self-knowledge.
Factual knowledge. This includes knowledge of specific information, its details and other
elements therein. Students make use of this knowledge to familiarize themselves with the subject matter or propose
solutions to problems within the discipline.
Conceptual knowledge. This includes knowledge about the connectedness of information
and other elements to a larger structure of thought so that a holistic view of the subject matter or
discipline is formed. Students classify, categorize, or generalize ideas into meaningful structures
and models.
Procedural knowledge. This category includes knowledge of how to carry out procedural tasks that require specific skills and methods. Students also know the criteria for determining when it is appropriate to use the procedures.
Metacognitive knowledge. This involves cognition in general as well as the awareness and
knowledge of one’s own cognition. Students know how they are thinking and become aware of the
contexts and conditions within which they are learning.

Figure 3. Sample Objectives Using the Revised Taxonomy

The knowledge dimension (rows) is crossed with the cognitive process dimension (Remember, Understand, Apply, Analyze, Evaluate, Create):
Factual knowledge: objective #1
Conceptual knowledge: objectives #2 and #3
Procedural knowledge: (none in this sample)
Metacognitive knowledge: objective #4

#1: Remember the characters of the story, "Family Adventure."
#2: Compare the roles of at least three characters of the story.
#3: Evaluate the story according to specific criteria.
#4: Recall personal strategies used in understanding the story.

The Cognitive Process Dimension is where specific behaviors are pegged, using active verbs. For consistency in describing specific learning behaviors, the categories of the original taxonomy, which were labeled in noun form, are now replaced with their verb counterparts. Synthesis (now Create) changed places with Evaluation (now Evaluate), and both are now stated in verb form.

Remember. This includes recalling and recognizing relevant knowledge from long-term
memory.
Understand. This is the determination of the meanings of messages from oral, written or
graphic sources.
Apply. This involves carrying out procedural tasks, executing or implementing them in
particular realistic contexts.
Analyze. This includes breaking concepts down into clusters or chunks of ideas and meaningfully relating them to one another and to other dimensions.
Evaluate. This is making judgments relative to clear standards or defensible criteria to
critically check for depth, consistency, relevance, acceptability, and other areas.
Create. This includes putting together ideas, concepts, information, and other elements to produce a complex, original, and meaningful whole as an outcome.
The use of the revised taxonomy in different programs has benefited both teachers and
students in many ways (Ferguson, 2002; Byrd, 2002). The benefits generally come from the fact
that the revised taxonomy provides clear dimensions of knowledge and cognitive processes on which to focus the instructional plan. It also allows teachers to set targets for metacognition
concurrently with other knowledge dimensions, which is difficult to do with the old taxonomy.

Lesson 2: Assessment in the Classroom Context

Both the Bloom’s taxonomy and the revised taxonomy are not the only existing taxonomic
tools for setting our instructional targets. There are other equally useful taxonomies. One of these
is developed by Robert M. Gagne. In his theory of instruction, Gagne desires to help teachers make
sound educational decisions so that the probability that the desired results in learning are achieved
is high. These decisions necessitate the setting of intentional goals that assure learning.

In stating learning intents using Gagne’s taxonomy, we can focus on three domains. The
cognitive domain includes Declarative (verbal information), Procedural (intellectual skills), and
Conditional (cognitive strategies) knowledge. The affective domain includes affective
knowledge (attitudes). The psychomotor domain involves the use of physical movement (motor
skills).
Verbal Information includes a vast body of organized knowledge that students acquire through
formal instructional processes and other media, such as television. Students
understand the meaning of concepts rather than just memorizing them. This condition of learning
lumps together the first two cognitive categories of Bloom’s taxonomy. Learning intents must
focus on differentiation of contents in texts and other modes of communication; chunking the
information according to meaningful subsets; remembering and organizing information.
Intellectual Skills include procedural knowledge that ranges from Discrimination, to Concrete
Concepts, to Defined Concepts, to Rules, and to Higher Order Rules.
Discrimination involves the ability to distinguish objects, features, or symbols. Detection
of difference does not require naming or explanation.
Concrete Concepts involve the identification of classes of objects, features, or events, such
as differentiating objects according to concrete features, such as shape.
Defined Concepts include classifying new and contextual examples of ideas, concepts, or
events by their definitions. Here, students make use of labels or terms denoting defined
concepts for certain events or conditions.
Rules apply a single relationship to solve a group of problems. The problem to be solved
is simple, requiring conformance to only one simple rule.
Higher order rules include the application of a combination of rules to solve a complex
problem. The problem to be solved requires the use of complex formula or rules so that
meaningful answers are arrived at.
Learning intents stated at this level of the cognitive domain must give attention to abilities to
spot distinctive features, use information from memory to respond to intellectual tasks in various
contexts, make connections between concepts and relate them to appropriate situations.

Cognitive Strategies consist of a number of ways to make students develop skills in guiding and
directing their own thinking, actions, feelings, and their learning process as a whole. Students
create and hone their metacognitive strategies. These processes help them regulate and oversee
their own learning, and consist of planning and monitoring their cognitive activities, as well as
checking the outcomes of those activities. Learning intents should emphasize abilities to describe
and demonstrate original and creative strategies that students have tried out in various conditions.
Attitudes are internal states of being that are acquired through earlier experience of task
engagement. These states influence the choice of personal response to things, events, persons,
opinions, concepts, and theories. Statements of learning intents must establish a degree of success
associated with desired attitude, call for demonstration of personal choice for actions and
resources, and allow observation of real-world and human contexts.
Motor Skills are well defined, precise, smooth and accurately timed execution of performances
involving the use of the body parts. Some cognitive skills are required for the proper execution of
motor activities. Learning intents drawn at this domain should focus on the execution of fine and
well-coordinated movements and actions relative to the use of known information, with acceptable
degree of mastery and accuracy of performance.
Another taxonomic tool is one developed by Stiggins & Conklin (1992), which involves
categories of learning as bases in stating learning intents.

Knowledge This includes simple understanding and mastery of a great deal of subject matter,
processes, and procedures. Very fundamental to the succeeding stages of learning
is the knowledge and simple understanding of the subject matter. This learning may
take the form of remembering facts, figures, events, and other pertinent
information, or describing, explaining, and summarizing concepts and citing examples.
Learning intents must endeavor to develop mastery of facts and information as well
as simple understanding and comprehension of them.

Reasoning This indicates the ability to use deep knowledge of subject matter and procedures to
reason defensibly and solve problems efficiently. Tasks under this
category include critical and creative thinking, problem solving, making judgments
and decisions, and other higher order thinking skills. Learning intents must,
therefore, focus on the use of knowledge and simple understanding of information
and concepts to reason and solve problems in contexts.

Skills This highlights the ability to demonstrate skills to perform tasks with acceptable
degree of mastery and adeptness. Skills involve overt behaviors that show
knowledge and deep understanding. For this category, learning intents have to
take particular interest in the demonstration of overt behaviors or skills in actual
performance that requires procedural knowledge and reasoning.
Products    In this area, the ability to create and produce outputs for submission or oral
presentation is given importance. Because outputs generally represent mastery of
knowledge, deep understanding, and skills, they must be considered as products
that demonstrate the ability to use that knowledge and deep understanding and to
employ skills in a strategic manner so that tangible products are created. For the
statement of learning intents, teachers must state expected outcomes, either
process- or product-oriented.

Affect      Focus is on the development of values, interests, motivation, attitudes, self-
regulation, and other affective states. In stating learning intents in this category, it
is important that clear indicators of affective behavior can easily be drawn from the
expected learning tasks. Although many teachers find it difficult to determine
indicators of affective learning, it is inspiring to realize that it is not impossible
to assess.

These categories of learning by Stiggins and Conklin are helpful especially if your intents
focus on complex intellectual skills and the use of these skills in producing outcomes to increase
self-efficacy among students. In attempting to formulate statements of learning outcome at any
category, you can be clear about what performance you want to see at the end of the instruction.
In terms of assessment, you would know exactly what to do and what tools to use in assessing
learning behaviors based on the expected performance. Although stating learning outcomes in the
affective category is not as easy as in the knowledge and skill categories, trying it can
help you approximate the degree of engagement and motivation required to perform what is
expected. Or, if you would like to give prominence to this category without stating another
learning intent that particularly focuses on the affective states, you might just look for some
indicators in the cognitive intents. This is possible because knowledge, skills, and attitudes are
embedded in every single statement of learning intent.

Another alternative guide for setting learning targets is one introduced by Robert J. Marzano
in his Dimensions of Learning (DOL). As a taxonomic tool, the DOL
provides a framework for assessing various types of knowledge as well as different aspects of
processing, which comprise six levels of learning in a taxonomic model called the new taxonomy
(Marzano & Kendall, 2007). These levels of learning are categorized into different systems.

The Cognitive System


The cognitive system includes those cognitive processes that effectively use or manipulate
information, mental procedures and psychomotor procedures in order to successfully complete a
task. It indicates the first four levels of learning, such as:

Level 1: Retrieval. In this level of the cognitive system, students engage in some mental
operations for recognition and retrieval of information, mental procedure, or psychomotor
procedure. Students engage in recognizing, where they identify the characteristics, attributes,
qualities, aspects, or elements of information, mental procedure, or psychomotor procedure;
recalling, where they remember relevant features of information, mental procedure, or
psychomotor procedure; or executing, where they carry out a specific mental or psychomotor
procedure. Neither the understanding of the structure and value of information nor the how’s and
why’s of the mental or psychomotor procedure is necessary.
Level 2: Comprehension. As the second level of the cognitive system, comprehension
includes students’ ability to represent and organize information, mental procedure or psychomotor
procedure. It involves symbolizing where students create symbolic representation of the
information, concept, or procedures with a clear differentiation of its critical and noncritical
aspects; or integrating, where they put together pieces of information into a meaningful structure
of knowledge or procedure, and identify its critical and noncritical aspects.

Level 3: Analysis. This level of the cognitive system includes more manipulation of
information, mental procedure, or psychomotor procedure. Here students engage in analyzing
errors, where they spot errors in the information, mental procedure, or psychomotor procedure,
and in its use; classifying the information or procedures into general categories and their
subcategories; generalizing by formulating new principles or generalizations based on the
information, concept, mental procedure, or psychomotor procedure; matching components of
knowledge by identifying important similarities and differences between the components; and
specifying applications or logical consequences of the knowledge in terms of what predictions can
be made and proven about the information, mental procedure, or psychomotor procedure.
Level 4: Knowledge Utilization. The highest level of the cognitive system involves the appropriate
use of knowledge. At this level, students put the information, mental procedure, or psychomotor
procedure to appropriate use in various contexts. It allows for investigating a phenomenon using
certain information or procedures, or investigating the information or procedure itself; using
information or procedures in experimenting with knowledge in order to test hypotheses, or generating
hypotheses from the information or procedures; problem solving, where students use the
knowledge to solve a problem, or solving a problem about the knowledge itself; and decision
making, where the use of information or procedures help arrive at a decision, or decision is made
about the knowledge itself.

The Metacognitive System


The metacognitive system involves students’ personal agency of setting appropriate goals
of their learning and monitoring how they go through the learning process. Being the 5th level of
the new taxonomy, the metacognitive system includes such learning targets as specifying goals,
where students set goals in learning the information or procedures, and make a plan of action for
achieving those goals; process monitoring, where students monitor how they go about the action
they decided to take, and find out if the action taken effectively serves their plan for learning the
information or procedures; clarity monitoring, where students determine how much clarity has
been achieved about the knowledge in focus; and accuracy monitoring, where students see how
accurately they have learned about the information or procedures.

The Self System


Placed at the highest level in the new taxonomy, the Self System is the level of learning
that sustains students’ engagement by activating some motivational resources, such as their self-
beliefs in terms of personal competence and the value of the task, emotions, and achievement-
related goals. At this level, students reason about their motivational experiences. They reason about
the value of knowledge by examining importance of the information or procedures in their personal
lives; about their perceived competence by examining efficacy in learning the information or
procedures; about their affective experience in learning by examining emotional response to the
knowledge under study; about their overall engagement by examining motivation in learning the
information or procedures.
In each system, three dimensions of knowledge are involved, such as information, mental
procedures, and psychomotor procedures.

Information
The domain of informational knowledge involves various types of declarative knowledge
that are ordered according to levels of complexity. From its most basic to more complex levels, it
includes vocabulary knowledge in which meaning of words are understood; factual knowledge, in
which information constituting the characteristics of specific facts are understood; knowledge of
time sequences, where understanding of important events between certain time points is obtained;
knowledge of generalizations of information, where pieces of information are understood in terms
of their warranted abstractions; and knowledge of principles, in which causal or correlational
relationships of information are understood. The first three types of informational knowledge
focus on knowledge of informational details, while the next two types focus on informational
organization.

Mental Procedures
The domain of mental procedures involves those types of procedural knowledge that make
use of the cognitive processes in a special way. In its hierarchic structure, mental procedures could
be as simple as the use of single rule in which production is guided by a small set of rules that
requires a single action. If single rules are combined into general rules and are used in order to
carry out an action, the mental procedures are already of tactical type, or an algorithm, especially
if specific steps are set for specific outcomes. Macroprocedures are at the top of the hierarchy of
mental procedures; they involve the execution of multiple interrelated processes and procedures.

Psychomotor Procedures
The domain of psychomotor procedures involves those physical procedures for completing
a task. In the new taxonomy, psychomotor procedures are considered a dimension of knowledge
because, very similar to mental procedures, they are regulated by the memory system and develop
in a sequence from information to practice, then to automaticity (Marzano & Kendall, 2007).
In summary, the new taxonomy of Marzano & Kendall (2007) provides us with a
multidimensional taxonomy where each system of thinking comprises three dimensions of
knowledge that will guide us in setting learning targets for our classrooms. Table 2a shows the
matrix of the thinking systems and dimensions of knowledge.
Systems of Thinking                                   Dimensions of Knowledge
                                                      Information    Mental Procedure    Psychomotor Procedure
Level 6 (Self System)
Level 5 (Metacognitive System)
Level 4: Knowledge Utilization (Cognitive System)
Level 3: Analysis (Cognitive System)
Level 2: Comprehension (Cognitive System)
Level 1: Retrieval (Cognitive System)
Now, if you wish to explore other alternative tools for setting your learning objectives,
here is another aid for making our learning intents target the more complex learning
outcomes, this one from Edward de Bono (1985). There are six thinking hats, each of which is
named for a color that represents a specific perspective. When these hats are “worn” by the student,
information, issues, concepts, theories, and principles are viewed in ways that are descriptive of
mnemonically associated perspectives of the different hats. Let’s say that your learning intent
necessitates students to mentally put on a white hat whose descriptive mental processes include
gathering of information and thinking how it can be obtained, and the emotional state is neutral,
then learning behaviors may be classifying facts and opinions, among others. It is essential to be
conscious that each hat that represents a particular perspective involves a frame of mind as well as
an emotional state. Therefore, the perspective held by the students when a hat is mentally worn,
would be a composite of mental and emotional states. Below is an attempt to summarize these six
thinking hats.
THE HATS

WHITE
  Perspective: Observer
  Representation: White paper, neutral
  Descriptive behavior: Looking for needed objective facts and information, including how these can be obtained

RED
  Perspective: Self & others
  Representation: Fire, warmth
  Descriptive behavior: Presenting views, feelings, emotions, and intuition without explanation or justification

BLACK
  Perspective: Self & others
  Representation: Stern judge wearing a black robe
  Descriptive behavior: Judging with a logical negative view, looking for wrongs, and playing the devil’s advocate

YELLOW
  Perspective: Self & others
  Representation: Sunshine, optimism
  Descriptive behavior: Looking for benefits and productivity with a logical positive view, seeing what is good in anything

GREEN
  Perspective: Self & others
  Representation: Vegetation
  Descriptive behavior: Exploring possibilities and making hypotheses, composing new ideas with creative thinking

BLUE
  Perspective: Observer
  Representation: Sky, cool
  Descriptive behavior: Establishing control of the process of thinking and engagement, using metacognition

Figure 5
Summative Map of the Six Thinking Hats

These six thinking hats are beneficial not only in our teaching episodes but also in the
learning intents that we set for our students. If qualities of thinking, creative thinking,
communication, decision-making, and metacognition are some of those that you want to develop
in your students, these six thinking hats could help you formulate statements of learning intents
that clearly set the direction of learning. An added benefit is that when your intents are stated
along the perspectives of these hats, the learning episodes can be defined easily. Consequently, assessment is
made more meaningful.
Lesson 3: Specificity of the learning intent

Learning intents usually come in relatively specific statements of desired learning behavior
or performance we would like to see in our students at the end of the instructional process. In
making these intents facilitate relevant assessment, it is important that they are stated with very
active verbs, those that represent clear actions or behaviors so that indicators of performance are
easily identified. These active verbs are an essential part of the statement of learning intents
because they specify what the students actually do within and at the end of a specified period of
time. In this case, assessment becomes convenient to do because it can specifically focus on the
indicated behaviors or actions.

Gronlund (in McMillan, 2005) uses the term instructional objectives to mean
intended learning outcomes. He emphasizes that instructional objectives should be
stated in terms of specific, observable, and measurable student responses.

In writing statements of learning intents for the course we teach, we aim to state behavior
outcomes to which our teaching efforts are devoted, so that, from these statements, we can design
specific tasks in the learning episodes for our students to engage in. However, we need to make
sure that these statements are set at the proper level of generality so that they do not
oversimplify or overcomplicate the outcome.
A statement of intent could have a rather long range of generality so that many sub-
outcomes may be indicated. Learning intents that are stated in general terms will need to be defined
further by a sample of the specific types of student performance that characterize the intent. In
doing this, assessment will be easy because the performance is clearly defined. Unlike the general
statements of intent that may permit the use of not-so-active verbs such as know, comprehend,
understand, and so on, the specific ones use active verbs in order to define specific behaviors that
will soon be assessed. The selection of these verbs is very vital in the preparation of a good
statement of learning intent. Three points to remember might help select active verbs.

1. See that the verb clearly represents the desired learning intent.
2. Note that the verb precisely specifies acceptable performance of the student.
3. Make sure that the verb clearly describes relevant assessment to be made within or at the
end of the instruction.

The statement, students know the meaning of terms in science is general. Although it gives
us an idea of the general direction of your class towards the expected outcome, we might be
confused as to what specific behaviors of knowing will be assessed. Therefore, it is necessary that
we draw some representative sample of specific learning intent so that we will let students:
 write a definition of particular scientific term

 identify the synonym of the word

 give the term that fits a given description

 present an example of the term

 represent the term with a picture

 describe the derivation of the term

 identify symbols that represent the term

 match the term with concepts

 use the term in a sentence

 describe the relationship of terms

 differentiate between terms

 use the term in


If these behaviors are stated completely as specific statements of learning intent, we can
have a number of specific outcomes. To make specifically defined outcomes, the use of active
verbs is helpful. If more specificity is desired, statements of condition and criterion level can be
added to the learning intents. If you think that the statement, student can differentiate between
facts and opinions, needs more specificity, then you might want to add a condition so that it will
now sound like this:

Given a short selection, the student can identify statements of facts and of opinions.
If more specificity is still desired, you might want to add a statement of
criterion level. This time, the statement may sound like this:
Given a short selection, the student can correctly identify at least 5 statements of
facts and 5 statements of opinion in no more than five minutes without the aid of
any resource materials.

The lesson plan may allow the use of moderately specific statements of learning intents,
with condition and criterion level briefly stated. In doing assessment, however, these intents will
have to be broken down to their substantial details, such that the condition and criterion level are
specifically indicated. Note that it is not necessarily about choosing which one statement is better
than the other. We can use them in planning for our teaching. Take a look at this:
Learning Intent: Student will differentiate between facts and opinions from written texts.

Assessment: Given a short selection, the student can correctly identify at least 5
statements of facts and 5 statements of opinion in no more than five
minutes without the aid of any resource materials.
If you insert in the text the instructional activities or learning episodes in well described
manner as well as the materials needed (plus other entries specified in your context), you can now
have a simple lesson plan.

Should the statement of learning intent be stated in terms of teacher performance or student
performance that is to be demonstrated after the instruction? How do these two differ from each other?
Should it be stated in terms of the learning process or the learning outcome? How do these two
differ from one another?
Should it be subject-matter oriented or competency-oriented?
Chapter 3
Characteristics of an Assessment Tool

Lesson 2
Validity

Validity indicates whether an assessment tool is measuring what it intends to measure.


Validity estimates indicate whether the latent variable shared by the items in a test is in fact the
target variable of the test developer. Validity is also reflected in a scale or test’s ability to predict
relevant events, its relationships with other measures, and the representativeness of its item content.

Content Validity

Content validity is the systematic examination of the test content to determine whether it
covers a representative sample of the behavior domain to be measured. For affective measures, it
concerns whether the items are enough to manifest the behavior measured. For cognitive tests, it
concerns whether the items cover all contents specified in an instruction.
Content validity is more appropriate for cognitive tests like achievement tests and teacher-made
tests. In these types of tests, there is a specified domain to be covered by the test.
The content covered is found in the instructional objectives in the lesson plan,
syllabus, table of specifications, and textbooks.
Content validity is conducted through consultation with experts. In the process, the
objectives of the instruction, table of specifications, and items of the test are shown to the
consulting experts. The experts check whether the items are enough to cover the content of the
instruction provided, whether the items are measuring the objectives set, and if the items are
appropriate for the cognitive skill intended. The process also involves checking whether the items
are appropriately phrased for the level of the students who will take the test and whether the items
are relevant to the subject area tested.
Details on constructing Table of Specifications are explained in the next chapters.

Criterion-Prediction Validity

Criterion-prediction involves prediction from the test to any criterion situation over a time
interval. For example, to assess the predictive validity of an entrance exam, it will be correlated
later with the students’ grades after a trimester/semester. The criterion in this case would be the
students’ grade which will come in the future.
Criterion-prediction is used for hiring job applicants, selecting students for admission to
college, assigning military personnel to occupational training programs. For selecting job
applicants, the pre-employment tests are correlated with the obtained supervisor rating in the
future. In assigning military personnel for training, the aptitude test administered before training
will be correlated with the future post assessment in the training. Positive and high correlation
coefficients should be obtained in these cases to adequately say that the test has predictive
validity.
Generally, the analysis involves correlating the test scores with other criterion measures, for
example, mechanical aptitude and job performance as a machinist.
Construct Validity

Construct validity is the extent to which the test may be said to measure a theoretical
construct or trait. This is usually conducted for measures that are multidimensional or contain
several factors. The goal of construct validity is to show that the factors of the measure
hold as posited by the theory used.
There are several methods for analyzing the constructs of a measure. One way is to
correlate a new test with a similar earlier test that measures approximately the same general
behavior. For example, a newly constructed measure of temperament is correlated with an
existing measure of temperament. If high correlations are obtained between the two measures, it
means that the two tests are measuring the same constructs or traits.
Another widely used technique to study the factor structure of a test is factor analysis,
which can be exploratory or confirmatory. Factor analysis is a mathematical technique that
identifies sources of variation among the variables involved. These sources of variation are
usually called factors or components (as explained in Chapter 1). Factor analysis reduces the
number of variables and detects the structure in the relationships between variables, or classifies
variables. A factor is a set of highly intercorrelated variables. In using Principal Components
Analysis as a method of factor analysis, the process involves extracting the possible groups that
can be formed through the eigenvalues, which measure how much variance each successive factor
extracts. The first factor is generally more highly correlated with the variables than the second
factor. This is to be expected because the factors are extracted successively and will account
for less and less variance overall. Factor extraction stops when factors begin to yield low
eigenvalues. An example of the extraction showing eigenvalues is illustrated below in the study
by Magno (2008), where a scale measuring parental closeness was developed with 49 items and
four hypothesized factors (bonding, support, communication, interaction).

Plot of Eigenvalues
(Scree plot: eigenvalue on the y-axis against the number of eigenvalues on the x-axis)
The scree plot shows that 13 factors can be used to classify the 49 items. The number of
factors is determined by counting the eigenvalues that are greater than 1.00. But having 13
factors is not good because it does not further reduce the variables. One technique in the scree
test is to assess the place where the smooth decrease of eigenvalues appears to level off to the
right of the plot. To the right of this point, presumably, one finds only "factorial scree" - "scree"
is the geological term referring to the debris which collects on the lower part of a rocky slope. In
applying this technique, the fourth eigenvalue shows a smooth decrease in the graph. Therefore,
four factors can be considered in the test.
The items that belong under each factor are determined by assessing the factor
loadings of each item. In the process, each item will load on each factor extracted. The item that
loads highly on a factor will technically belong to that factor because it is highly correlated with
the other items in that factor or group. A factor loading of .30 means that the item contributes
meaningfully to the factor. A factor loading of .40 means that the item contributes highly to the
factor. An example of a table with factor loadings is illustrated below.

Item      Factor 1    Factor 2    Factor 3    Factor 4
item1 0.032 0.196 0.172 0.696
item2 0.13 0.094 0.315 0.375
item3 0.129 0.789 0.175 0.068
item4 0.373 0.352 0.35 0.042
item5 0.621 -0.042 0.251 0.249
item6 0.216 -0.059 0.067 0.782
item7 0.093 0.288 0.307 0.477
item8 0.111 0.764 0.113 0.085
item9 0.228 0.315 0.144 0.321
item10 0.543 0.113 0.306 -0.01

In the table above, the items that highly loaded to a factor should have a loading of .40 and
above. For example, item 1 highly loaded on factor 4 with a factor loading of .696 as compared
with the other loadings .032, .196, and 0.172 for factors 1, 2, and 3 respectively. This means that
item 1 will be classified under factor 4 together with item 6 and item 7 because they all highly
load under the fourth factor. Factor loadings are best assessed when the items are rotated
(Consult scaling theory references for details on factor rotation).
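To make these two decisions concrete (counting the eigenvalues greater than 1.00 and assigning each item to the factor on which it loads highest), here is a minimal Python sketch. It is not taken from the cited studies; the function names, the sample loadings, and the .40 cutoff argument are only illustrative.

import numpy as np

def kaiser_count(responses):
    # Count eigenvalues > 1.00 of the item correlation matrix (the rule used to
    # read the scree plot above); responses has examinees as rows, items as columns.
    corr = np.corrcoef(responses, rowvar=False)
    return int(np.sum(np.linalg.eigvalsh(corr) > 1.0))

def assign_items(loadings, cutoff=0.40):
    # Map each row (item) to the factor where its absolute loading is highest,
    # keeping only loadings at or above the cutoff.
    assignment = {}
    for position, row in enumerate(loadings, start=1):
        factor = int(np.argmax(np.abs(row)))
        if abs(row[factor]) >= cutoff:
            assignment[position] = factor + 1   # report factors as 1-based
    return assignment

# Rows taken from the loadings of items 1, 6, and 7 in the table above:
loadings = np.array([[0.032, 0.196, 0.172, 0.696],
                     [0.216, -0.059, 0.067, 0.782],
                     [0.093, 0.288, 0.307, 0.477]])
print(assign_items(loadings))   # {1: 4, 2: 4, 3: 4} -> all three rows load on factor 4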

Another way of proving the factor structure of a construct is through Confirmatory Factor
Analysis (CFA). In this technique, there is a developed and specific hypothesis about the
factorial structure of a battery of attributes. The hypothesis concerns the number of common
factors, their pattern of intercorrelation, and pattern of common factor weights. It is used to
indicate how well a set of data fits the hypothesized structure. The CFA is done as follow-up to a
standard factor analysis. In the analysis, the parameters of the model are estimated, and the
goodness of fit of the solution to the data is evaluated. For example, the study of Magno
(2008) confirmed the factor structure of parental closeness (bonding, support, communication,
succorance) after a series of principal components analyses. The parameter estimates and the
goodness of fit of the measurement model were then analyzed.
Figure 1
Measurement Model of Parental Closeness using Confirmatory Factor Analysis

The model estimates in the CFA show that all the factors of parental closeness have
significant parameters (8.69*, 5.08*, 5.04*, 1.04*). The delta errors are used (28.83*, 18.02*,
18.08*, 2.58*), and each factor has a significant estimate as well. Having a good fit reflects
having all factor structures significant for the construct of parental closeness. The goodness of fit
using chi-square is rather good (χ²=50.11, df=2). The goodness of fit based on the root
mean square standardized residual (RMS=0.072) shows that there is little error, the value being
close to 0. Using noncentrality fit indices, the values show that the four-factor solution has a
good fit for parental closeness (McDonald Noncentrality Index=0.910, Population Gamma
Index=0.914).
Confirmatory Factor Analysis can also be used to assess the best factor structure of a
construct. For example, the study of Magno, Tangco, and Sy (2007) assessed the factor
structure of metacognition (awareness of one’s learning) and its effect on critical thinking
(measured by the Watson-Glaser Critical Thinking Appraisal). Two factor structures of
metacognition were assessed. The first model of metacognition includes two factors,
regulation of cognition and knowledge of cognition (see Schraw and Dennison). The second
model tested metacognition with eight factors: Declarative knowledge, procedural knowledge,
conditional knowledge, planning, information management, monitoring, debugging strategy, and
evaluation of learning.
Model 1. Two Factors of Metacognition

Model 2: Eight Factors of Metacognition


The results of the analysis using CFA showed that Model 1 has a better fit compared
to Model 2. This indicates that metacognition is better viewed with two factors (knowledge of
cognition and regulation of cognition) than with eight factors.
Principal Components Analysis and Confirmatory Factor Analysis can be conducted
using available statistical software such as Statistica and SPSS.

Convergent and Divergent Validity

According to Anastasi and Urbina (2002), the method of convergent and divergent
validity is used to show that a test correlates with the variables with which it should theoretically
correlate (convergent) and does not correlate with the variables from which it should differ
(divergent).
In convergent validity, the intercorrelations among constructs that are theoretically related
should be high and positive. For example, in the study of Magno (2008) on parental closeness,
when the factors of parental closeness were intercorrelated (bonding, support, communication,
and succorance), positive magnitudes were obtained, indicating convergence of these
constructs.

Factors of Parental Closeness (1) (2) (3) (4)


(1) Bonding 1.00 0.70** 0.62** 0.44**
(2) Communication 1.00 0.57** 0.28**
(3) Support 1.00 0.59**
(4) Succorance 1.00
**p<.05
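As a rough illustration of how such an intercorrelation matrix can be produced, here is a small Python sketch. The factor scores below are hypothetical values, not data from the cited study:

import numpy as np

# Hypothetical factor scores for six respondents; in practice these are the
# summed item scores for each factor of the scale.
factor_scores = {
    "Bonding":       [18, 22, 15, 30, 27, 21],
    "Communication": [16, 20, 14, 28, 25, 19],
    "Support":       [17, 19, 13, 29, 24, 20],
    "Succorance":    [12, 15, 11, 22, 18, 14],
}

data = np.array(list(factor_scores.values()))   # rows are factors
r = np.corrcoef(data)                           # Pearson intercorrelations of the factors
for name, row in zip(factor_scores, r):
    print(name, np.round(row, 2))               # high positive values suggest convergence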

For divergent validity, a construct should inversely correlate with its opposite factors.
For example, the study by Magno, Lynn, Lee, and Kho (in press) constructed a scale that
measures mothers’ involvement with their grade school and high school children. The factors of
mothers’ involvement in school-related activities were intercorrelated. Observe that these factors
belong to the same test, but controlling was negatively related with permissive, and loving was
negatively related with autonomy. This indicates divergence of the factors within the same
measure.

Factors of Mother’s Involvement    Controlling    Permissive    Loving    Autonomy
Controlling                        ---
Permissive                         -0.05          ---
Loving                             0.05           0.17*         ---
Autonomy                           0.14*          0.41*         -0.36*    ---
Summary on Validity

Content Validity
  Nature: Systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured.
  Use: More appropriate for achievement tests and teacher-made tests.
  (Statistical) Procedure: Items are based on instructional objectives, course syllabi, and textbooks; consultation with experts; making a table of specifications.

Criterion-Prediction Validity
  Nature: Prediction from the test to any criterion situation over a time interval.
  Use: Hiring job applicants, selecting students for admission to college, assigning military personnel to occupational training programs.
  (Statistical) Procedure: Test scores are correlated with other criterion measures (e.g., mechanical aptitude and job performance as a machinist).

Construct Validity
  Nature: The extent to which the test may be said to measure a theoretical construct or trait.
  Use: Used for personality tests; measures that are multidimensional.
  (Statistical) Procedure: Correlate a new test with a similar earlier test that measures approximately the same general behavior; factor analysis; comparison of the upper and lower groups; point-biserial correlation (pass and fail with total test score); correlate each subtest with the entire test.

Convergent Validity
  Nature: The test should correlate significantly with variables to which it is related.
  Use: Commonly for personality measures.
  (Statistical) Procedure: Multitrait-multimethod matrix.

Divergent Validity
  Nature: The test should not correlate significantly with variables from which it should differ.
  Use: Commonly for personality measures.
  (Statistical) Procedure: Multitrait-multimethod matrix.
Exercise

Give the best type of validity or reliability to use in the following cases.

1. A scale measuring motivation was correlated with a scale measuring laziness; a negative coefficient was expected.

2. An achievement test on personality theories was administered to psychology majors, and the same test was administered to engineering students who have not taken the course. It is expected that there would be a significant difference in the mean scores of the two groups.

3. The 16 PF, which measures 16 personality factors, was intercorrelated with the 12 factors of the Edwards Personal Preference Schedule (EPPS). Both instruments are measures of personality but contain different factors.

4. The Multifactorial Metamemory Questionnaire (MMQ) arrived at three factors when factor analysis was conducted. It had a total of 57 items that originally belonged to 5 factors.

5. The scores on a depression diagnostic scale were correlated with the Minnesota Multiphasic Personality Inventory (MMPI). It was found that clients who were diagnosed as depressive had high scores on the factors of the MMPI.

6. The scores on Mike’s mental ability test, taken during fourth year high school, were used to determine whether he would qualify to enter the college where he wants to study.

7. Maria, who went for drug rehabilitation, was assessed using a self-concept test, and her records from the company where she was working, which contain her previous security scale scores, were requested. The two tests were compared.

8. Mrs. Ocampo, a math teacher, constructs a table of specifications before preparing her test, and after the items were written, they were checked by her subject area coordinator.

9. In an experiment, the self-disclosure of participants was obtained by having three raters listen to the recordings of a counseling session between a counselor and a client. The raters used an ad hoc self-disclosure inventory, and later their ratings were compared using the coefficient of concordance. The concordance indicates whether the three raters agree on their ratings.

10. A test measuring “sensitivity” was constructed. In order to establish its reliability, the scores for each item were entered in a spreadsheet to determine whether the responses for each were consistent.

11. The items of a newly constructed personality test measuring Carl Jung’s psychological functions used a Likert scale. The scores for each item were correlated with all possible combinations.

12. A test on science was made by Ms. Asuncion, a science teacher. After scoring each test, she determined the internal consistency of the items.

13. In a battery of tests, the section A class received both the Strong Vocational Interest Blank (SVIB) and the Jackson Vocational Interest Survey (JVIS). Both are measures of vocational interest, and the scores were correlated to determine if the two measure the same construct.

14. The Work Values Inventory (WVI) was separated into two forms, and two sets of scores were generated. The two sets of scores were correlated to see if they measure the same construct.

15. Children’s moral judgment was studied to see whether it would change over time. The measure was administered during the first week of classes and again at the end of the first quarter.

16. The Study of Values was designed to measure 6 basic interests, motives, or evaluative attitudes: theoretical, economic, aesthetic, social, political, and religious. These six factors were derived after a validity analysis.

17. When the EPPS items were presented in a free-choice format, the scores correlated quite highly with the scores obtained with the regular forced-choice form of the test.

18. The two scales of the MMPI (the F and K scales) were correlated to detect faking or response sets.

19. In a study by Miranda, Cantina, and Cagandahan (2004), the 15 factors of the Edwards Personal Preference Inventory were intercorrelated.
Lesson 3
Item Difficulty and Item Discrimination

Students are usually keen on determining whether an item is difficult or easy and
whether a test is good or bad based on their own judgment. The degree to which a test item is
easy or difficult is referred to as item difficulty, and the degree to which it separates good from
poor performers is referred to as item discrimination. Identifying a test item’s difficulty and discrimination is
referred to as item analysis. Two approaches will be presented in this chapter on item analysis:
Classical Test Theory (CTT) and Item Response Theory (IRT). A detailed discussion on the
difference between the CTT and IRT is found at the end of Lesson 3.

Classical Test Theory

Classical Test Theory is also regarded as the “True Score Theory.” Responses of examinees are
assumed to be due only to variation in the ability of interest. All other potential sources of
variation existing in the testing materials, such as
external conditions or internal conditions of examinees are assumed either to be constant
through rigorous standardization or to have an effect that is nonsystematic or random by
nature. The focus of CTT is the frequency of correct responses (to indicate question difficulty);
frequency of responses (to examine distracters); and reliability of the test and item-total
correlation (to evaluate discrimination at the item level).

Item Response Theory

Item Response Theory is synonymous with latent trait theory, strong true score theory, or modern
mental test theory. It is more applicable to tests with right and wrong (dichotomous) responses. It is an approach to
testing based on item analysis considering the chance of getting particular items right or wrong.
In IRT, each item on a test has its own item characteristic curve that describes the probability of
getting each particular item right or wrong given the ability of the test takers (Kaplan &
Saccuzzo, 1997).

Item difficulty is the percentage of examinees responding correctly to each item in the
test. Generally, an item is difficult if a large percentage of the test takers are not able
to answer it correctly. On the other hand, an item is easy if a large percentage of the test takers
are able to answer it correctly (Payne, 1992).

Item discrimination refers to the relation of performance on each item to performance
on the total score (Payne, 1992). An item can discriminate if most of the high-scoring test
takers are able to answer the item correctly; an item will have low discriminating power if
the low-scoring test takers can answer the test item correctly as often as the high-scoring
test takers.
Procedure for Determining Index of Item Difficulty and Discrimination

1. Arrange the test papers in order from highest to lowest.


2. Identify the high- and low-scoring groups by getting the upper 27% and lower 27%. For example,
if there are 20 test takers, 27% of the 20 test takers is 5.4, and rounding it off gives 5 test takers.
This means that the top 5 (high-scoring) and the bottom 5 (low-scoring) test takers will be
included in the item analysis.
3. Tabulate the correct and incorrect responses of the high and low test-takers for each item. For
example, in the table below there are 5 test takers in the high group (test takers 1 to 5) and 5 test
takers in the low group (test takers 6 to 10). Test taker 1 and 2 in the high group got a correct
response for items 1 to 5. Test taker 3 was wrong in item 5 marked as “0.”

                         Item 1   Item 2   Item 3   Item 4   Item 5   Total
High test takers group
  Test taker 1             1        1        1        1        1        5
  Test taker 2             1        1        1        1        1        5
  Test taker 3             1        1        1        1        0        4
  Test taker 4             1        0        1        1        0        4
  Test taker 5             1        1        1        0        0        3
  Total                    5        4        5        4        2
Low test takers group
  Test taker 6             1        1        0        0        0        2
  Test taker 7             0        1        1        0        0        2
  Test taker 8             1        1        0        0        0        2
  Test taker 9             1        0        0        0        0        1
  Test taker 10            0        0        0        1        0        1
  Total                    3        3        1        1        0

4. Get the total correct response for each item and convert it into a proportion. The proportion is
obtained by dividing the total correct response of each item to the total number of test takers
in the group. For example, in item 2, 4 is the total correct response and dividing it by 5 which
is the total test takers in the high group will give a proportion of .8. The procedure is done for
the high and low group.

pH = Total correct responses of the high group / N per group
pL = Total correct responses of the low group / N per group

                         Item 1   Item 2   Item 3   Item 4   Item 5   Total
High test takers group
  Test taker 1             1        1        1        1        1        5
  Test taker 2             1        1        1        1        1        5
  Test taker 3             1        1        1        1        0        4
  Test taker 4             1        0        1        1        0        4
  Test taker 5             1        1        1        0        0        3
  Total                    5        4        5        4        2
  Proportion of the high group (pH)   1   .8   1   .8   .4
Low test takers group
  Test taker 6             1        1        0        0        0        2
  Test taker 7             0        1        1        0        0        2
  Test taker 8             1        1        0        0        0        2
  Test taker 9             1        0        0        0        0        1
  Test taker 10            0        0        0        1        0        1
  Total                    3        3        1        1        0
  Proportion of the low group (pL)    .6   .6   .2   .2   0

5. Obtain the item difficulty by adding the proportion of the high group (pH) and proportion of
the low group (pL) and dividing by 2 for each item.

pH  pL
Item difficulty 
2

                                   Item 1      Item 2      Item 3        Item 4        Item 5
Proportion of the high group (pH)    1           .8          1             .8            .4
Proportion of the low group (pL)     .6          .6          .2            .2            0
Item difficulty                      .8          .7          .6            .5            .2
Interpretation                       Easy item   Easy item   Average item  Average item  Difficult item

The table below is used to interpret the index of difficulty. Given the table below, items 1 and 2
are easy items because they have high correct response proportions for both high and low group.
Items 3 and 4 are average items because the proportions are within the .25 and .75 middle bound.
Item 5 is a difficult item considering that there are low proportions correct for the high and low
group. In the case of item 5, only 40% are able to answer in the high group and none got it
correct in the low group (0). Generally, the closer the index of difficulty gets to “0,” the more
difficult the item is; the closer it gets to “1,” the easier it becomes.

Difficulty Index Remark


.76 or higher Easy Item
.25 to .75 Average Item
.24 or lower Difficult Item

6. Obtain the item discrimination by getting the difference between the proportion of the high
group and proportion of the low group for each item.
Item discrimination=pH – pL

                                   Item 1          Item 2                 Item 3          Item 4          Item 5
Proportion of the high group (pH)    1               .8                     1               .8              .4
Proportion of the low group (pL)     .6              .6                     .2              .2              0
Item discrimination                  .4              .2                     .8              .6              .4
Interpretation                       Very good item  Reasonably good item   Very good item  Very good item  Very good item

The table below is used to interpret the index of discrimination. Generally, the larger the difference
between the proportions of the high and low groups, the better the item, because it shows a
large gap in correct responses between the high and low groups, as shown by items 1, 3, 4, and
5. In the case of item 2, a large proportion of the low group (60%) got the item correct as
contrasted with the high group (80%), resulting in a small difference (.20) and making the item
only reasonably good.

Index discrimination Remark


.40 and above Very good item
.30 - .39 Good item
.20 - .29 Reasonably Good item
.10 - .19 Marginal item
Below .10 Poor item
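The whole procedure in steps 1 to 6 can be carried out in a few lines. The sketch below is only an illustration (the function name and data layout are not from the text); the group fraction is a parameter so that the 10-examinee example above, which uses the top 5 and bottom 5, can be reproduced:

import numpy as np

def item_analysis(responses, fraction=0.27):
    # responses: rows = examinees, columns = items, entries 1 (correct) or 0 (wrong).
    totals = responses.sum(axis=1)
    order = np.argsort(-totals)                 # highest scorers first
    k = max(1, round(fraction * len(totals)))   # size of each extreme group
    p_high = responses[order[:k]].mean(axis=0)  # proportion correct, high group (pH)
    p_low = responses[order[-k:]].mean(axis=0)  # proportion correct, low group (pL)
    difficulty = (p_high + p_low) / 2
    discrimination = p_high - p_low
    return difficulty, discrimination

# The 10-examinee, 5-item data from the tables above (high group first):
data = np.array([
    [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 0, 1, 1, 0], [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0], [0, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0],
])
diff, disc = item_analysis(data, fraction=0.5)
print(np.round(diff, 2))   # approximately 0.8, 0.7, 0.6, 0.5, 0.2
print(np.round(disc, 2))   # approximately 0.4, 0.2, 0.8, 0.6, 0.4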

Analyzing Item Distracters

Analyzing item distracters involves determining whether the options in a multiple-response item
type are effective. In multiple-response types such as multiple choice, the test taker chooses the
correct answer from among the options or distracters. In creating distracters, the test developer
ensures that they belong to the same category and are close to the answer. For example:

What cognitive skill is demonstrated in the objective “Students will compose a five paragraph
essay about their reflection on modern day heroes”?

a. Understanding
b. Evaluating
c. Applying
d. Creating

Correct answer: d

The distracters for the given item are all cognitive skills in Bloom’s revised taxonomy where all
can be a possible answer but there is one best answer. In analyzing whether the distracters are
effective, the frequency of examinees selecting each option is reported.
Group   Group size    a    b    c    d*    Total no. correct   Difficulty Index   Discrimination Index
High        15        1    3    1    10           17                 .57                  .20
Low         15        1    6    1    7
* Correct answer

For the given item with the correct answer of letter d, the majority of the examinees in both the
high and low groups preferred option “d,” which is the correct answer. Among the high group,
distracters a, b, and c are not effective distracters because very few examinees selected them.
For the low group, option “b” can be an effective distracter because 40% of the examinees (6
examinees) selected it as their answer, as opposed to 47% (7 examinees) who got the correct
answer. In this case, distracters “a” and “c” need some revision by making them closer to the
answer so that they become more attractive to test takers.
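A quick way to do this tally is shown in the sketch below; the lists of choices simply reproduce the option counts in the table above and are not a prescribed data format:

from collections import Counter

high_choices = ["d"] * 10 + ["b"] * 3 + ["a"] + ["c"]   # 15 high-group examinees
low_choices  = ["d"] * 7 + ["b"] * 6 + ["a"] + ["c"]    # 15 low-group examinees
key = "d"                                               # correct answer

for label, choices in [("High", high_choices), ("Low", low_choices)]:
    counts = Counter(choices)
    print(label, dict(counts), "proportion correct:", round(counts[key] / len(choices), 2))
# Options chosen by almost no one (here "a" and "c") are flagged for revision.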

Item Response Theory: Obtaining Item difficulty Using the Rasch Model

It is said that the IRT is an approach to testing based on item analysis considering the
chance of getting particular items right or wrong. In IRT, each item on a test has its own item
characteristic curve that describes the probability of getting each particular item right or wrong
given the ability of the test takers (Kaplan & Saccuzzo, 1997). This will be illustrated in the
computational procedure in a later section.
In using the Rasch model as an approach for determining item difficulty, the calibration
of test item difficulty is independent of the persons used for the calibration, unlike in the classical
test theory approach where it is dependent on the group. In this method of test calibration, it does
not matter whose responses to the items are used for comparison; it gives the same results
regardless of who takes the test. The scores persons obtain on the test can be used to remove the
influence of their abilities from the estimation of item difficulty. Thus, the result is a sample-free
item calibration.
Rasch (1960), the proponent who derived the technique, intended to eliminate
references to populations of examinees in the analysis of tests, unlike in classical test theory where
norms are used to interpret test scores. According to him, test analysis would only be
worthwhile if it were individual-centered, with separate parameters for the items and the
examinees (van der Linden & Hambleton, 2004).
The Rasch model is a probabilistic unidimensional model which asserts that: (1) The
easier the question the more likely the student will respond correctly to it, and (2) the more able
the student, the more likely he/she will pass the question compared to a less able student. When
the data fit the Rasch model, the relative difficulties of the questions are independent of the
relative abilities of the students, and vice versa (Rasch, 1977).
As shown in the graph below (Figure 1), a function of ability (θ) which is a latent trait
forms the boundary between the probability areas of answering an item incorrectly and
answering the item correctly.
Figure 1
Item Characteristic Curves of an 18-item Mathematical Problem Solving Test

(The plotted curves are labeled as easy items and difficult items.)

In the item characteristic curves, the y-axis represents the probability of a correct response given
ability (θ) and the x-axis is the range of item difficulties in log units. It can be noticed that items
1, 7, 14, 2, 8, and 15 do not require high ability to be answered correctly as compared to items 5,
12, 18, and 11, which require high ability. The item characteristic curves are judged at the 50%
probability level and a cutoff of “0” on item difficulty. The curves to the left of the “0” item
difficulty, as marked at the 50% level, are easy items and the ones on the right side are difficult
items. The program called WINSTEPS was used to produce the curves.
The IRT Rasch model basically identifies the location of a person’s ability along a set of
items for a given test. The test items have a predefined set of difficulties, and the person’s position
should reflect that his or her ability is matched with the difficulty of the items. The ability
of the person is symbolized by θ and the item difficulties by δ. In the figure below, there are 10 items (δ1 to
δ10), and the location of the person’s ability (θ) is between δ7 and δ8. In the continuum, the
items are arranged from the easiest (at the left) to the most difficult (at the right). If the
position of the person’s ability is between δ7 and δ8, then it is expected that the person taking the
test should be able to answer items δ1 to δ6 (“1” correct response, “0” incorrect response), since
these items are answerable given the level of ability of the person. This kind of calibration is said
to fit the Rasch model, where the position of the person’s ability is within a defined line of item
difficulties.

Case 1

In Case 2, the person is able to answer four difficult items and unable to respond correctly to
the easy items. There is now difficulty in locating the person on the continuum. If the items are
valid measures of ability, then the easy items should be more answerable than the difficult ones.
This means that the items are not suited to the person’s ability. This case does not fit the Rasch
model.

Case 2

The Rasch model allows person ability (θ) to be estimated from the person’s score on the test and
item difficulty (δ) from the item’s correct responses separately; that is why it is considered to be
test-free and sample-free.
In different cases, it can be encountered that the person’s response (θ) to the test is higher
than the specified item difficulty (δ), so their difference (θ–δ) is greater than zero. But when the
ability or response (θ) is less than the specified item difficulty (δ), their difference (θ–δ) is less
than 0 as in Case 2. When the ability of the person (θ) is equivalent to the item’s difficulty (δ),
the difference (θ–δ) is 0, as in Case 1. This variation in person responses and item difficulty is
represented in an Item Characteristic Curve (ICC), which shows the way the item elicits responses
from persons of every ability (Wright & Stone, 1979).

Figure 1
ICC of a Given Ability and Item Difficulty

An estimate of the response x is obtained when a person with ability (θ) acts on an item with
difficulty (δ). The model specifies that, in the interaction between ability (θ) and item
difficulty (δ), when the ability is greater than the difficulty, the probability of getting the correct
answer is more than .5 or 50%. When the ability is less than the difficulty, the probability of
getting the correct answer is less than .5 or 50%. The variation of these estimates of the
probability of getting a correct response is illustrated in Figure 1. The mathematical units for θ
and δ are defined in logistic functions (ln) to produce a linear scale and generality of measure.
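The relationship just described can be written as the Rasch logistic function, in which the probability of a correct response depends only on the difference between ability and difficulty. A minimal sketch of that function (the function name is illustrative):

import math

def rasch_probability(theta, delta):
    # P(correct) = exp(theta - delta) / (1 + exp(theta - delta))
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

print(rasch_probability(1.0, 1.0))   # ability equals difficulty -> 0.5
print(rasch_probability(2.0, 1.0))   # ability above difficulty  -> about 0.73
print(rasch_probability(0.0, 1.0))   # ability below difficulty  -> about 0.27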
The next section guides you in estimating the calibration of item difficulty and person
ability measure.

Procedure for the Rasch Model

The Rasch model will be used for the responses of 10 students on a 25-item problem
solving test. In determining item difficulty in the Rasch model, all participants who took the
test are included, unlike in classical test theory where only the upper and lower 27% are
included in the analysis.
ITEM NUMBER
Examinees 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 total
9 0 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 20
10 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 13
5 1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 1 11
3 0 0 1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 10
8 1 0 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 10
1 1 0 1 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 9
6 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 9
7 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 9
4 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0 0 0 0 0 1 0 1 8
2 0 0 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 7
Total 5 3 6 1 3 4 8 5 2 4 3 3 6 5 6 1 4 2 6 5 2 5 5 2 10

Grouped Distribution of Different Item Scores

1. Code each response for each item as “1” for a right answer and “0” for a wrong answer.
2. Arrange the scores (persons) from highest to lowest.
3. Remove items where all respondents got the answer correct.
4. Remove items where all respondents got the answer wrong.
5. Rearrange the scores (persons) from highest to lowest.
6. Group the items with the same total item score (si).
7. Indicate the frequency (fi) of items for each group of items.
8. Divide each total item score (si) by the number of examinees (N) to obtain the proportion correct: ρi = si / N
9. Subtract ρi from 1 to obtain the proportion incorrect: 1 – ρi
10. Divide the proportion incorrect by the proportion correct and take the natural log (using a scientific calculator) to obtain the logit incorrect: xi = ln[(1 – ρi) / ρi]
11. Multiply the frequency (fi) by the logit incorrect (xi): fixi
12. Square xi and multiply by each fi: fi(xi)²
13. Compute the mean logit: x̄ = Σfixi / Σfi
14. To get the initial item calibration (d°i), subtract the mean logit (x̄) from the logit incorrect (xi): d°i = xi – x̄
15. Estimate the value of U, which will be used later in the final estimates:
    U = [Σfi(xi)² – (Σfi)(x̄)²] / (Σfi – 1)
Table 1
Grouped Distribution of the 7 Different Item Scores of 10 Examinees

item score                                item     item        proportion   proportion   logit        frequency       frequency        initial item
group index   item name                   score    frequency   correct      incorrect    incorrect    × logit         × logit²         calibration
(i)                                       (si)     (fi)        (ρi)         (1 – ρi)     (xi)         (fixi)          (fi(xi)²)        (doi = xi – x̄)

1             7                           8        1           0.8          0.2          -1.39        -1.39           1.92             -1.87
2             3, 13, 15, 19               6        4           0.6          0.4          -0.41        -1.62           0.66             -0.89
3             1, 8, 14, 20, 22, 23        5        6           0.5          0.5          0.00         0.00            0.00             -0.48
4             6, 10, 17                   4        3           0.4          0.6          0.41         1.22            0.49             -0.07
5             2, 5, 11, 12                3        4           0.3          0.7          0.85         3.39            2.87             0.37
6             9, 18, 21, 24               2        4           0.2          0.8          1.39         5.55            7.69             0.91
7             4, 16                       1        2           0.1          0.9          2.20         4.39            9.66             1.72

                                                   Σfi = 24                                           Σfixi = 11.54   Σfi(xi)² = 23.29

x̄ = Σfixi / Σfi = 11.54 / 24 = 0.48

U = [Σfi(xi)² – (Σfi)(x̄)²] / (Σfi – 1) = [23.29 – (24)(0.48)²] / (24 – 1) = 0.77
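
The grouped computations in Table 1 can be reproduced in a few lines. The rough Python sketch below follows steps 8 to 15 using the grouped item scores from Table 1; variable names are illustrative, and small differences from the tabled values can arise from rounding.

```python
import math

N = 10  # number of examinees
# grouped item scores from Table 1: (item score si, item frequency fi)
groups = [(8, 1), (6, 4), (5, 6), (4, 3), (3, 4), (2, 4), (1, 2)]

rows = []
for si, fi in groups:
    p = si / N                     # step 8: proportion correct
    x = math.log((1 - p) / p)      # step 10: logit incorrect
    rows.append((si, fi, x))

sum_f = sum(fi for _, fi, _ in rows)
sum_fx = sum(fi * x for _, fi, x in rows)            # step 11 summed over groups
sum_fx2 = sum(fi * x ** 2 for _, fi, x in rows)      # step 12 summed over groups

x_bar = sum_fx / sum_f                               # step 13
U = (sum_fx2 - sum_f * x_bar ** 2) / (sum_f - 1)     # step 15

for si, fi, x in rows:
    # step 14: initial item calibration doi = xi - x_bar
    print(si, fi, round(x, 2), round(x - x_bar, 2))

print("x_bar =", round(x_bar, 2), "U =", round(U, 2))  # about 0.48 and 0.77
```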

Grouped Distribution of Observed Person Scores

16. List the possible scores (r) on the L-item test, from the lowest to the highest observed person total score.
17. Count the number of persons at each possible score to get the person frequency (nr).
18. Divide each possible score by the number of items to get the proportion correct: ρr = r / L
19. Obtain the proportion incorrect by subtracting the proportion correct from 1: 1 – ρr
20. Determine the logit correct (yr) by taking the natural log of the quotient between the proportion correct (ρr) and the proportion incorrect (1 – ρr): yr = ln[ρr / (1 – ρr)]
21. Multiply the logit correct (yr) by each person frequency (nr): nryr
22. Square the logit correct (yr²) and multiply by the person frequency (nr): nr(yr)²
23. The logit correct (yr) is the initial person measure: bro = yr
24. Compute the values of ȳ and V, to be used later in the final estimates: ȳ = Σnryr / Σnr and V = [Σnr(yr)² – (Σnr)(ȳ)²] / (Σnr – 1)

A corresponding code sketch follows Table 2.
Table 2
Grouped Distribution of Observed Examinee Scores on the 24 Item Mathematical Problem
Solving Test

possible      person       proportion    logit       frequency       frequency        initial person
score         frequency    correct       correct     × logit         × logit²         measure
(r)           (nr)         (ρr)          (yr)        (nryr)          (nr(yr)²)        (bro = yr)

7             1            0.29          -0.89       -0.89           0.79             -0.89
8             1            0.33          -0.69       -0.69           0.48             -0.69
9             3            0.38          -0.51       -1.53           0.78             -0.51
10            2            0.42          -0.34       -0.67           0.23             -0.34
11            1            0.46          -0.17       -0.17           0.03             -0.17
12            0            0.50          0.00        0.00            0.00             0.00
13            1            0.54          0.17        0.17            0.03             0.17
14            0            0.58          0.34        0.00            0.00             0.34
15            0            0.63          0.51        0.00            0.00             0.51
16            0            0.67          0.69        0.00            0.00             0.69
17            0             0.71         0.89        0.00            0.00             0.89
18            0             0.75         1.10        0.00            0.00             1.10
19            0             0.79         1.34        0.00            0.00             1.34
20            1             0.83         1.61        1.61            2.59             1.61

              Σnr = 10                               Σnryr = -2.18   Σnr(yr)² = 4.92

ȳ = Σnryr / Σnr = -2.18 / 10 = -0.22

V = [Σnr(yr)² – (Σnr)(ȳ)²] / (Σnr – 1) = [4.92 – (10)(-0.22)²] / (10 – 1) = 0.49
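
The person-side computations of Table 2 follow the same pattern. The illustrative Python sketch below applies steps 16 to 24 to the observed score frequencies; scores with zero frequency contribute nothing to the sums but are still listed, as in the table.

```python
import math

L = 24  # number of items remaining in the test
# observed person score frequencies from Table 2
score_freq = {7: 1, 8: 1, 9: 3, 10: 2, 11: 1, 13: 1, 20: 1}

rows = []
for r in range(min(score_freq), max(score_freq) + 1):
    nr = score_freq.get(r, 0)          # step 17: person frequency
    p = r / L                          # step 18: proportion correct
    y = math.log(p / (1 - p))          # step 20: logit correct (initial measure bro)
    rows.append((r, nr, y))

sum_n = sum(nr for _, nr, _ in rows)
sum_ny = sum(nr * y for _, nr, y in rows)
sum_ny2 = sum(nr * y ** 2 for _, nr, y in rows)

y_bar = sum_ny / sum_n                                  # step 24
V = (sum_ny2 - sum_n * y_bar ** 2) / (sum_n - 1)
print("y_bar =", round(y_bar, 2), "V =", round(V, 2))   # about -0.22 and 0.49
```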

Final Estimates of Item Difficulty

25. Compute the expansion factor (Y):

Y = √[(1 + V/2.89) / (1 – UV/8.35)] = √[(1 + 0.49/2.89) / (1 – (0.77)(0.49)/8.35)]     Y = 1.11

where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).

26. Multiply the expansion factor (Y) by the initial item calibration (doi) to obtain the corrected calibration (di = Ydoi). The item score group index, item name, and initial item calibration are taken from Table 1.
27. Compute the standard error (SE) for each item score:

SE(di) = Y √[N / (si(N – si))]

A code sketch for these final item estimates follows Table 3.

Table 3
Final Estimates of Item Difficulties from 10 Examinees

item score                                initial item   expansion   corrected        sample spread   calibration
group index   item name                   calibration    factor      calibration      item score      standard error
(i)                                       (doi)          (Y)         (di = Ydoi)      (si)            SE(di)

1             7                           -1.87          1.11        -2.07            8               0.878
2             3, 13, 15, 19               -0.89          1.11        -0.98            6               0.717
3             1, 8, 14, 20, 22, 23        -0.48          1.11        -0.53            5               0.702
4             6, 10, 17                   -0.07          1.11        -0.08            4               0.717
5             2, 5, 11, 12                0.37           1.11        0.41             3               0.766
6             9, 18, 21, 24               0.91           1.11        1.01             2               0.878
7             4, 16                       1.72           1.11        1.91             1               1.170

N = 10
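
The final item estimates in Table 3 can be obtained from U, V, and the initial calibrations. The Python sketch below assumes the expansion-factor and standard-error formulas given in steps 25 to 27; outputs may differ slightly from the table because the table uses Y rounded to 1.11.

```python
import math

N = 10             # number of examinees
U, V = 0.77, 0.49  # from Tables 1 and 2

# step 25: expansion factor for item calibrations
Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))   # about 1.11

def final_item_estimate(d_initial, s_i):
    """Steps 26-27: corrected calibration di = Y * doi and its standard error."""
    d = Y * d_initial
    se = Y * math.sqrt(N / (s_i * (N - s_i)))
    return d, se

print(final_item_estimate(-1.87, 8))  # compare with row 1 of Table 3 (-2.07, SE 0.878)
print(final_item_estimate(1.72, 1))   # compare with row 7 of Table 3 (1.91, SE 1.170)
```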

Final Estimates of Person Measures

28. Compute the expansion factor (X):

X = √[(1 + U/2.89) / (1 – UV/8.35)] = √[(1 + 0.77/2.89) / (1 – (0.77)(0.49)/8.35)]     X = 1.18

where V = 0.49 (from Table 2) and U = 0.77 (from Table 1).

29. Multiply the expansion factor (X) by each initial measure (bro) to obtain the corrected measure (br = Xbro). The possible score and initial measure are taken from Table 2.
30. Compute the standard error (SE):

SE(br) = X √[L / (r(L – r))]

A code sketch for the final person estimates follows the table below.

possible      initial      test width           corrected          measure          person
score         measure      expansion factor     measure            standard error   frequency
(r)           (bro)        (X)                  (br = Xbro)        SE(br)           (nr)
7 -0.89 1.18 -1.05 0.53 1
8 -0.69 1.18 -0.82 0.51 1
9 -0.51 1.18 -0.60 0.50 3
10 -0.34 1.18 -0.40 0.49 2
11 -0.17 1.18 -0.20 0.48 1
12 0.00 1.18 0.00 0.48 0
13 0.17 1.18 0.20 0.48 1
14 0.34 1.18 0.40 0.49 0
15 0.51 1.18 0.60 0.50 0
16 0.69 1.18 0.82 0.51 0
17 0.89 1.18 1.05 0.53 0
18 1.10 1.18 1.30 0.56 0
19 1.34 1.18 1.57 0.59 0
20 1.61 1.18 1.90 0.65 1
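
The corrected person measures and their standard errors in the table above follow steps 28 to 30. The sketch below is illustrative and takes X = 1.18 as reported in the worked example rather than recomputing it.

```python
import math

L = 24      # number of items in the test
X = 1.18    # test-width expansion factor reported in step 28

def final_person_estimate(b_initial, r):
    """Steps 29-30: corrected measure br = X * bro and its standard error."""
    b = X * b_initial
    se = X * math.sqrt(L / (r * (L - r)))
    return round(b, 2), round(se, 2)

print(final_person_estimate(-0.89, 7))   # (-1.05, 0.53), the row for a raw score of 7
print(final_person_estimate(1.61, 20))   # (1.9, 0.65), the row for a raw score of 20
```
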
Figure 3
Item Map for the Calibrated Item Difficulty and Person Ability
items (item group: calibration di)            logit     persons (score r: frequency, corrected measure br)

Group 1 (item 7)                              -2.07
                                              -1.05     score 7: 1 person (Case 2)
Group 2 (items 3, 13, 15, 19)                 -0.98
                                              -0.82     score 8: 1 person (Case 4)
                                              -0.60     score 9: 3 persons (Cases 1, 6, 7)
Group 3 (items 1, 8, 14, 20, 22, 23)          -0.53
                                              -0.40     score 10: 2 persons (Cases 3, 8)
                                              -0.20     score 11: 1 person (Case 5)
Group 4 (items 6, 10, 17)                     -0.08
                                              0.00      score 12: 0 persons
                                              0.20      score 13: 1 person (Case 10)
                                              0.40      score 14: 0 persons
Group 5 (items 2, 5, 11, 12)                  0.41
                                              0.60      score 15: 0 persons
                                              0.82      score 16: 0 persons
Group 6 (items 9, 18, 21, 24)                 1.01
                                              1.05      score 17: 0 persons
                                              1.30      score 18: 0 persons
                                              1.57      score 19: 0 persons
                                              1.90      score 20: 1 person (Case 9)
Group 7 (items 4, 16)                         1.91

Items with low calibrations do not require high ability (δ < θ for most persons); near the middle of the scale, item difficulty matches person ability (δ = θ); items with high calibrations require high ability (δ > θ). The original figure also marks a fit value of z = .4 for Item 1.
Figure 3 shows the item map of calibrated item difficulties (left side) and person abilities (right
side) across their logit values. Observe that as the items become more difficult (increasing
logits), the person with the highest score (high ability) is matched closely with the more difficult items. This
match is termed goodness of fit in the Rasch model. A good fit indicates that difficult
items require high ability to be answered correctly. More specifically, the match between the logits of
person ability and item difficulty indicates goodness of fit. In this case the goodness of fit of
the item difficulties is estimated using the z value. Low and nonsignificant z values
indicate a good fit between item difficulty and person ability.
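
One common way to obtain such z values is from standardized residuals: the difference between an observed response and its Rasch expectation divided by the binomial standard deviation, with squared residuals then aggregated into item or person fit statistics. The sketch below is a minimal illustration under that assumption; the exact fit statistic used for the map may be computed differently.

```python
import math

def rasch_probability(theta, delta):
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

def standardized_residual(x, theta, delta):
    """z for one observed response x (1 = correct, 0 = incorrect):
    (observed - expected) / binomial standard deviation."""
    p = rasch_probability(theta, delta)
    return (x - p) / math.sqrt(p * (1 - p))

# A correct answer to an item whose difficulty is close to the person's ability
# gives a modest residual; an incorrect answer to a much easier item gives a
# larger (more misfitting) residual.
print(round(standardized_residual(1, -0.60, -0.53), 2))   # about 1.04
print(round(standardized_residual(0, -0.60, -2.07), 2))   # about -2.08
```
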
SPECIAL TOPIC

A Review of Psychometric Theory


Carlo Magno

This special topic presents the nature of psychometrics, including the issues in psychological measurement, its
relevant theories, and its current practice. The basic scaling models are discussed, since scaling is the process
that enables the quantification of psychological constructs. The issues and research trends in classical
test theory and item response theory, with its different models and their implications for test construction, are
explained.

The Nature of Psychometrics and Issues in Measurement

Psychometrics concerns itself with the science of measuring psychological constructs such as
ability, personality, affect, and skills. Psychological measurement methods are crucially important for basic
research in psychology, which involves the measurement of variables in order to conduct
further analysis. In the past, obtaining adequate measurement of psychological constructs was considered an
issue in the science of psychology. Some references indicate that there are psychological constructs that
are deemed unobservable and difficult to quantify. This issue is reinforced by the fact that
psychological theories are filled with variables that either cannot be measured at all at the present time or
can be measured only approximately (Kaplan & Saccuzzo, 1997), such as anxiety, creativity, dogmatism,
achievement, motivation, attention, and frustration. Moreover, according to Immanuel Kant, “it is
impossible to have a science of psychology because the basic data could not be observed and measured.”
The field of psychological measurement has nonetheless advanced, and practitioners in the field of
psychometrics have been able to deal with these issues and devise methods on the basic premise of scientific
observation and measurement. Most psychological constructs involve subjective experiences such
as feelings, sensations, and desires; when individuals make judgments, state their preferences, and
talk about these experiences, measurement can take place, and it thus meets the
requirements of scientific inquiry. It is very much possible to assign numbers to psychological constructs so as
to represent quantities of attributes and even to formulate rules for standardizing the measurement process.
Standardizing psychological measurement requires a process of abstraction
in which psychological attributes are observed in relation to other constructs such as attitude and
achievement (Magno, 2003). This process allows researchers to establish associations among variables, as in
construct validation and criterion-predictive processes. Also, emphasizing the measurement of psychological
constructs forces researchers and test developers to consider carefully the nature of the construct before
attempting to measure it. This involves a thorough literature review on the conceptual definition of an
attribute before constructing valid items for a test. It is also common practice in psychometrics to use
numerical scores to communicate the amount of an attribute an individual possesses. Quantification is
closely intertwined with the concept of measurement. In the process of quantification, mathematical systems
and statistical procedures are used, making it possible to examine the internal relationships among data obtained
through a measure. Such procedures enable psychometrics to build theories and to consider itself part of the
system of science.
Branches of Psychometric Theory

There are two branches of psychometric theory: classical test theory and item response
theory. Both theories make it possible to predict outcomes of psychological tests by identifying parameters of item
difficulty and the ability of test takers. Both are concerned with improving the reliability of psychological tests.

Classical Test Theory

Classical test theory is regarded in the literature as the “true score theory.” The theory starts from the
assumption that systematic differences between the responses of examinees are due only to variation in the ability of
interest. All other potential sources of variation existing in the testing materials, such as external conditions
or internal conditions of examinees, are assumed either to be held constant through rigorous standardization or
to have an effect that is nonsystematic or random by nature (van der Linden & Hambleton, 2004). The
central model of classical test theory is that the observed test score (TO) is composed of a true score (T)
and an error score (E), where the true and the error scores are independent. The variables were established
by Spearman (1904) and Novick (1966) and are best illustrated in the formula:

TO = T + E

The classical theory assumes that each individual has a true score that would be obtained if there
were no errors in measurement. However, because measuring instruments are imperfect, the score
observed for each person may differ from the individual’s true ability. The difference between the true score
and the observed test score results from measurement error. Using a variety of justifications, error is often
assumed to be a random variable having a normal distribution. The implication of classical test theory
for test takers is that tests are fallible, imprecise tools. The score achieved by an individual is rarely the
individual’s true score. The true score for an individual will not change with repeated
applications of the same test; the observed score is almost always the true score influenced by some
degree of error, which pushes the observed score higher or lower. Theoretically, the standard
deviation of the distribution of random errors for each individual tells about the magnitude of measurement
error. It is usually assumed that the distribution of random errors will be the same for all individuals.
Classical test theory uses the standard deviation of errors as the basic measure of error, usually
called the standard error of measurement. In practice, the standard deviation of the observed scores and the
reliability of the test are used to estimate the standard error of measurement (Kaplan & Saccuzzo, 1997).
The larger the standard error of measurement, the less certain is the accuracy with which an attribute is
measured. Conversely, a small standard error of measurement indicates that an individual’s score is probably close
to the true score. The standard error of measurement is calculated with the formula:

Sm = S √(1 – r)

where S is the standard deviation of the observed scores and r is the reliability of the test.

Standard errors of measurement are used to create confidence intervals around specific observed scores
(Kaplan & Saccuzzo, 1997). The lower and upper bounds of the confidence interval approximate the value
of the true score.
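
As a brief illustration of the two preceding paragraphs, the sketch below computes the standard error of measurement and an approximate 95% confidence interval around an observed score. The numbers are hypothetical, and the 1.96 multiplier is the usual normal-curve convention.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Sm = S * sqrt(1 - r): observed-score standard deviation times the
    square root of one minus the test's reliability."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem

# A test with SD = 10 and reliability = .91 has Sm = 3.0, so an observed
# score of 50 is bracketed by roughly 44.1 and 55.9.
print(confidence_interval(50, 10, 0.91))
```
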
Traditionally, methods of analysis based on classical test theory have been used to evaluate such
tests. The focus of the analysis is on the total test score: the frequency of correct responses (to indicate
question difficulty), the frequency of responses (to examine distractors), the reliability of the test, and the item-total
correlation (to evaluate discrimination at the item level) (Impara & Plake, 1997). Although these statistics
have been widely used, one limitation is that they relate to the sample under scrutiny, and thus all the
statistics that describe items and questions are sample dependent (Hambleton, 2000). This critique may
not be particularly relevant where successive samples are reasonably representative and do not vary
across time, but this needs to be confirmed, and complex strategies have been proposed to overcome
this limitation.

Item Response Theory

Another branch of psychometric theory is item response theory (IRT). IRT may be regarded as
roughly synonymous with latent trait theory. It is sometimes referred to as strong true score theory or
modern mental test theory because IRT is a more recent body of theory and makes stronger assumptions
than classical test theory. This approach to testing, based on item analysis, considers the chance
of getting particular items right or wrong. In this approach, each item on a test has its own item
characteristic curve that describes the probability of getting the particular item right or wrong given the
ability of the test takers (Kaplan & Saccuzzo, 1997). The Rasch model is appropriate for modeling
dichotomous responses and models the probability of an individual's correct response on a dichotomous
item. The logistic item characteristic curve, a function of ability, forms the boundary between the probability
areas of answering an item incorrectly and answering it correctly. This one-parameter logistic model
assumes that the discriminations of all items are equal to one (Maier, 2001).
Another fundamental feature of this theory is that item performance is related to the estimated
amount of the respondent’s latent trait (Anastasi & Urbina, 2002). A latent trait
is symbolized as theta (θ), which refers to a statistical construct. In cognitive tests, the latent trait is called
the ability measured by the test. The total score on a test is taken as an estimate of that ability, and a person
of specified ability (θ) succeeds with some probability on an item of specified difficulty.
There are various approaches to the construction of tests using item response theory. Some
approaches use two dimensions: item discriminations and item difficulties are plotted. Other
approaches use a third dimension for the probability of test takers with very low levels of ability getting a
correct response (as demonstrated in Figure 2). Other approaches use only the difficulty parameter (one
dimension), such as the Rasch model. All these approaches characterize the item in relation to the
probability that those who do well or poorly on the exam will have different levels of performance.

Two – Parameter Model/Normal – Ogive Model. The ogive model postulates a normal
cumulative distribution function as the response function for an item. The model demonstrates that an item’s
difficulty is the point on the ability scale where an examinee has a probability of success on the item of .50 (van
der Linden & Hambleton, 2004). In the model, the difficulty of each item can be defined by the 50%
threshold, which is customary in establishing sensory thresholds in psychophysics. The discriminative
power of each item, represented by a curve in the graph, is indicated by its steepness. The steeper the
curve, the higher the correlation of item performance with total score and the higher the discriminative
index.
The original idea of the model traces back to Thurstone’s use of the normal model in his
discriminal dispersion theory of stimulus perception (Thurstone, 1927). Researchers in psychophysics
study the relation between the physical properties of stimuli and their perception by human
subjects. Stimulus scaling is presented in detail further in this paper. In the process, a stimulus is
presented to the subject, who reports whether the stimulus is detected. Detection increases as the
stimulus intensity increases. With this pattern, the cumulative distribution with its parametrization was
used as a function.
Three – Parameter Model/Logistic Model. In plotting ability (θ) against the probability of a
correct response Pi(θ) in a three-parameter model, the slope of the curve itself indicates the item
discrimination. The higher the value of the item discrimination, the steeper the slope. In the model,
Birnbaum (1950) proposed a third parameter to account for the nonzero performance of low-ability
examinees on multiple-choice items. The nonzero performance is due to the probability of guessing correct
answers to multiple-choice items (van der Linden & Hambleton, 2004).
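
For reference, the three-parameter logistic model is commonly written with a discrimination parameter a, a difficulty parameter b, and a pseudo-guessing parameter c. The Python sketch below shows this standard form (without the 1.7 scaling constant that is sometimes included to approximate the normal ogive); the example values are hypothetical.

```python
import math

def three_parameter_logistic(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model.

    a: item discrimination (slope of the curve)
    b: item difficulty (location of the inflection point)
    c: pseudo-guessing parameter (lower asymptote)
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A very low-ability examinee still answers correctly at roughly the guessing
# rate c, while the probability approaches 1.0 as ability exceeds b.
print(three_parameter_logistic(theta=-3.0, a=1.2, b=0.0, c=0.2))  # about 0.22
print(three_parameter_logistic(theta=2.0, a=1.2, b=0.0, c=0.2))   # about 0.93
```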

Figure 2. Hypothetical Item Characteristic Curves for Three Items.

The item difficulty parameters (b1, b2, b3) correspond to the locations on the ability axis at which the
probability of a correct response is .50. The curves show that item 1 is easier and that items 2 and 3 have
the same difficulty at the .50 probability of a correct response. Estimates of item parameters and ability are
typically computed through successive approximation procedures, where approximations are repeated until
the values stabilize.

One – Parameter Model/Rasch Model. The Rasch model is based on the assumption that
both guessing and item differences in discrimination are negligible. In constructing tests, proponents
of the Rasch model frequently discard those items that do not meet these assumptions (Anastasi & Urbina,
2002). Rasch began his work in educational and psychological measurement in the late 1940s. Early in the
1950s he developed his Poisson models for reading tests and a model for intelligence and
achievement tests, later called the “structure models for items in a test,” which is today known as
the Rasch model.
Rasch’s (1960) main motivation for his model was to eliminate references to populations of
examinees in analyses of tests. According to him, test analysis would only be worthwhile if it were
individual centered, with separate parameters for the items and the examinees (van der Linden &
Hambleton, 2004). His work marked IRT with its probabilistic modeling of the interaction between an
individual item and an individual examinee. The Rasch model is a probabilistic unidimensional model which
asserts that (1) the easier the question, the more likely the student will respond correctly to it, and (2) the
more able the student, the more likely he/she will pass the question compared to a less able student.
The Rasch model was derived from the initial Poisson model illustrated in the formula:

ξ = f(θ, δ)

where ξ is a function of parameters describing the ability of the examinee and the difficulty of the test, θ
represents the ability of the examinee, and δ represents the difficulty of the test, which is estimated from the
summation of errors in a test. Furthermore, the model was enhanced to assume that the probability that a
student will correctly answer a question is a logistic function of the difference between the student's ability
(θ) and the difficulty of the question (δ) (i.e., the ability required to answer the question correctly), and only a
function of that difference, giving way to the Rasch model, in which the probability of a correct response is
P(x = 1) = e^(θ – δ) / [1 + e^(θ – δ)].

From this, the expected pattern of responses to questions can be determined given the estimated θ
and δ. Even though each response to each question must depend upon the students' ability and the
questions' difficulty, in the data analysis, it is possible to condition out or eliminate the student's abilities (by
taking all students at the same score level) in order to estimate the relative question difficulties (Andrich,
2004; Dobby & Duckworth, 1979). Thus, when data fit the model, the relative difficulties of the questions
are independent of the relative abilities of the students, and vice versa (Rasch, 1977). The further
consequence of this invariance is that it justifies the use of the total score (Wright & Panchapakesan,
1969). In the current analysis this estimation is done through a pair-wise conditional maximum likelihood
algorithm.
According to Fischer (1974) the Rasch model can be derived from the following assumptions:
(1) Unidimensionality. All items are functionally dependent upon only one underlying continuum.
(2) Monotonicity. All item characteristic functions are strictly monotonic in the latent trait, u. The
item characteristic function describes the probability of a predefined response as a function of the latent
trait.
(3) Local stochastic independence. Every person has a certain probability of giving a predefined
response to each item and this probability is independent of the answers given to the preceding items.
(4) Sufficiency of a simple sum statistic. The number of predefined responses is a sufficient statistic
for the latent parameter.
(5) Dichotomy of the items. For each item there are only two different responses, for example
positive and negative. The Rasch model requires that an additive structure underlies the observed data.
This additive structure applies to the logit of Pij, where Pij is the probability that subject i will give a
predefined response to item j; the logit is the sum of a subject scale value ui and an item scale value vj, i.e.,
ln[Pij / (1 – Pij)] = ui + vj.
There are various applications of the Rasch model in test construction, such as the item-mapping
method (Wang, 2003) and the hierarchical measurement method (Maier, 2001).
Rasch Standard-setting Through Item-mapping. According to Wang (2003), it is logical
to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure
and certification examinations. In the study, the researcher wanted to determine a score that decides a
passing level of competency using the Angoff procedure as a standard-setting method within the Rasch model. The Angoff
(1971) procedure with various modifications is the most widely used for multiple-choice licensure and
certification examinations (Plake, 1998). As part of the Angoff standard-setting process, judges are asked
to estimate the proportion (or percentage) of minimally competent candidates (MCC) who will answer an
item correctly. These item performance estimates are aggregated across items and averaged across
judges to yield the recommended cut score. As noted (Chang, 1999; Impara & Plake, 1997; Kane, 1994),
the adequacy of a judgmental standard-setting method depends on whether the judges adequately
conceptualize the minimal competency of candidates, and whether judges accurately estimate item
difficulty based on their conceptualized minimal competency. A major criticism of the Angoff method is
that judges' estimates of item difficulties for minimal competency are more likely to be inaccurate, and
sometimes inconsistent and contradictory (Bejar, 1983; Goodwin, 1999; Mills & Melican, 1988; National
Academy of Education [NAE], 1993; Reid, 1991; Shepard, 1995). Studies found that judges are able to
rank order items accurately in terms of item difficulty, but they are not particularly accurate in
estimating item performance for target examinee groups (Impara & Plake, 1998; National Research
Council, 1999; Shepard, 1995). A fundamental flaw of the Angoff method is that it requires judges to
perform the nearly impossible cognitive task of estimating the probability of MCCs answering each
item in the pool correctly (Berk, 1996; NAE, 1993).
An item-mapping method, which applies the Rasch IRT model to the standard setting process, has
been used to remedy the cognitive deficiency in the Angoff method for multiple-choice licensure and
certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each
individual item while they make an immediate judgment of item performance for MCCs. In contrast, the
item-mapping method presents a global picture of all items and their estimated difficulties in the form of a
histogram chart (item map), which serves to guide and simplify the judges' process of decision making
during the cut score study. The item difficulties are estimated through application of the Rasch IRT model.
Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate
ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference
between a candidate's ability and an item's difficulty determines the probability of a correct response
(Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to
the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's
success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to
determine whether an MCC has this probability of answering an item correctly. By utilizing the Rasch
model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables
judges to determine the passing score at the point where the item difficulty equals the MCC's ability level.
The item-mapping method incorporates item performance in the standard-setting process by
graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in
columns, with each column in the graph representing a different item difficulty. The columns of items are
ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph,
and very hard items toward the right end of the graph. Item difficulties in log odds units are estimated
through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric
familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled
difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130.
Figure 3. Example of Item Map.

In the example, the abscissa of the graph represents the rescaled item difficulty. Any one column has items
within two points of each other. For example, the column labeled "80" has items with scaled difficulties
ranging from 79 to values less than 81. Using the scaling equation, this column of items would have a
range of logit difficulties from -2.1 to values less than -1.9, yielding a 0.2 logit difficulty range for items in
this column. Similarly, the next column on its right has items with scaled difficulties ranging from 81 to
values less than 83 and a range of logit difficulties from -1.9 to values less than -1.7. In fact, there is a 2-
point range (1 point below the labeled value and 1 point above the labeled value) for all the columns on the
item map. Within each column, items are displayed in order by item ID numbers and can be identified by
color and symbol-coded test content areas. By marking item content areas of the items on the map, a
representative sample of items within each content area can be rated in the standard-setting process. The
goal of item mapping is to locate a column of items on the histogram where judges can reach consensus
that the MCC has a .50 chance of answering the items correctly.
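
The conversion between logit difficulties and the scaled metric used on the item map is the linear transformation given above; the short sketch below simply applies it in both directions.

```python
def logit_to_scaled(logit_difficulty):
    """scaled difficulty = (logit difficulty x 10) + 100"""
    return logit_difficulty * 10 + 100

def scaled_to_logit(scaled_difficulty):
    return (scaled_difficulty - 100) / 10

# The column labeled "80" spans scaled values from 79 to just under 81,
# i.e., logit difficulties from -2.1 to just under -1.9.
print(logit_to_scaled(-2.1))   # 79.0
print(scaled_to_logit(81))     # -1.9
```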

Rasch Hierarchical Measurement Method. In a study by Maier (2001), a hierarchical
measurement model is developed that enables researchers to measure a latent trait variable and model
the error variance corresponding to multiple levels. The Rasch hierarchical measurement model (HMM)
results when a Rasch IRT model and a one-way ANOVA with random effects are combined. Item response
theory models and hierarchical linear models can be combined to model the effect of multilevel
covariates on a latent trait.
Through the combination, researchers may wish to examine relationships between person-ability estimates
and person-level and contextual-level characteristics that may affect these ability estimates. Alternatively, it
is also possible to model data obtained from the same individuals across repeated questionnaire
administrations. It is also made possible to study the effect of person characteristics on ability estimates
over time.

Advantages of the IRT

One benefit of item response theory is its treatment of reliability and error of measurement
through item information functions, which are computed for each item (Lord, 1980). These functions provide a
sound basis for choosing items in test construction. The item information function takes all item
parameters into account and shows the measurement efficiency of the item at different ability levels.
Another advantage of item response theory is the invariance of item parameters, which pertains to the
sample-free nature of its results. In the theory, the item parameters are invariant when computed in groups
of different abilities. This means that a uniform scale of measurement can be provided for use in different
groups. It also means that groups as well as individuals can be tested with different sets of items
appropriate to their ability levels, and their scores will be directly comparable (Anastasi & Urbina, 2002).

Scaling Models

“Measurement essentially is concerned with the methods used to provide quantitative descriptions
of the extent to which individuals manifest or possess specified characteristics” (Ghiselli, Campbell, &
Zedeck, 1981, p. 2). “Measurement is the assigning of numbers to individuals in a systematic way as a
means of representing properties of the individuals” (Allen & Yen, 1979, p. 2). “‘Measurement’ consists of
rules for assigning symbols to objects so as to (1) represent quantities of attributes numerically (scaling) or
(2) define whether the objects fall in the same or different categories with respect to a given attribute
(classification)” (Nunnally & Bernstein, 1994, p. 3).
There are important aspects to consider in the process of measurement in psychometrics. First, the
attribute of interest must be quantified; that is, numbers are used to designate how much (or how little) of
the attribute an individual possesses. Second, the attribute of interest must be quantified in a consistent and
systematic way (i.e., standardization), so that the measurement process is systematic
enough for meaningful replication to be possible. Finally, it is the attributes of individuals (or objects) that are measured,
not the individuals per se.

Levels of Measurement

As the definition of Nunnally and Bernstein (1994) suggests, by systematically measuring the
attribute of interest individuals can either be classified or scaled with regard to the attribute of interest.
Engaging in classification or scaling depends in large part on the level of measurement used to assess a
construct. For example, if the attribute is measured on a nominal scale of measurement, then it is only
possible to classify individuals as falling into one or another mutually exclusive category (Agresti & Finlay,
2000). This is because the different categories (e.g., men versus women) represent only qualitative
differences. Nominal scales are used as measures of identity (Downie & Heath, 1984). When gender is
coded, such as males coded 0 and females coded 1, this does not mean that these values have any quantitative
meaning. They are simply labels for gender categories. At the nominal level of measurement, there are a
variety of sorting techniques. In this case, subjects are asked to sort the stimuli into different categories
based on some dimension.
Some data reflect the rank order of individuals or objects, such as a scale evaluating the
beauty of a person from highest to lowest (Downie & Heath, 1984). This represents an ordinal scale of
measurement, where objects are simply rank ordered. An ordinal scale does not indicate how much more of
the attribute one object has than another (for example, how much hotter one object is than another), but it can
be determined that A is hotter than B if A is ranked higher than B. At the
ordinal level of measurement, the Q-sort method, paired comparisons, Guttman’s Scalogram, Coombs’
unfolding technique, and a variety of rating scales can be used. The major task of the subject is to rank order
items from highest to lowest or from weakest to strongest.
The interval scale of measurement has equal intervals between degrees on the scale. However,
the zero point on the scale is arbitrary; 0 degrees Celsius represents the point at which water freezes at
sea level, so zero on the scale does not represent “true zero,” which in this case would mean a
complete absence of heat. In determining the area of a table, a ratio scale of measurement is used because
zero does represent “true zero.”
When the construct of interest is measured at the nominal (i.e., qualitative) level of measurement,
objects are only classified into categories. As a result, the types of data manipulation and statistical
analysis that can be performed on the data are very limited. For descriptive statistics, it is possible to
compute frequency counts or determine the modal response (i.e., category), but not much else. However, if
it is at least possible to rank order the objects based on the degree to which they possess the construct of
interest, then it is possible to scale the construct. In addition, higher levels of measurement allow for more
in-depth statistical analyses. With ordinal data, for example, statistics such as the median, range, and
interquartile range can be computed (Downie & Heath, 1984). When the data are at the interval level, it is possible
to calculate statistics such as means, standard deviations, variances, and the various statistics of shape
(e.g., skewness and kurtosis). With interval-level data, it is important to know the shape of the distribution,
as different-shaped distributions imply different interpretations for statistics such as the mean and standard
deviation.
At the interval and ratio levels of measurement, there are direct estimation, the method of bisection,
and Thurstone’s methods of comparative and categorical judgments. With these methods, subjects are
asked not only to rank order items but also to help determine the magnitude of the differences
among items. With Thurstone’s method of comparative judgment, subjects compare every possible pair of
stimuli and select the item within the pair that is the better item for assessing the construct. Thurstone’s
method of categorical judgment, while less tedious for subjects when there are many stimuli to assess in
that they simply rate each stimulus (not each pair of stimuli), does require more cognitive energy for each
rating provided. This is because the subject matter expert (SME) must now estimate the actual value of the stimulus.

Unidimensional Scaling Models

Psychological measurement is typically most interested in scaling some characteristic, trait, or
ability of a person. It determines how much of an attribute of interest a given person possesses. This
allows the degree of inter-individual and intra-individual differences among the subjects on the
attribute of interest to be estimated. There are various ways of scaling, such as scaling the stimuli given to
individuals as well as the responses that individuals provide.

Scaling for Stimuli (Psychophysics)

Scaling of stimuli is more prominent in the area of psychophysics or sensory/perception psychology,
which focuses on physical phenomena and whose roots date back to mid-19th century Germany. It was not
until the 1920s that Thurstone began to apply the same scaling principles to scaling psychological attitudes.
In psychophysical scaling, one factor is held constant (e.g., responses), a second is collapsed
across (e.g., stimuli), and the third factor (e.g., individuals) is scaled. With psychological scaling,
however, it is typical to ask participants to provide their professional judgment of the particular stimuli,
regardless of their personal feelings or attitudes toward the topic or stimulus. This may include ratings of
how well different stimuli represent the construct and at what level of intensity the construct is represented.
In scaling for stimuli, research issues frequently concern the exact nature of the functional relations
between scalings of the stimuli in different circumstances (Nunnally, 1970).
There are a variety of ways of scaling stimuli through psychophysical methods. Psychophysical
methods examine the relationship between the placement of objects on the two scales and attempt to
establish principles or laws that connect the two (Roberts, 1999). The psychophysical methods, which include
rank order, constant stimuli, and successive categories, are as follows:
(1) Method of Adjustment - An experimental paradigm which allows the subject to make small
adjustments to a comparison stimulus until it matches a standard stimulus. The intensity of the stimulus is
adjusted until target is just detectable.
(2) Method of Limits – Adjust the intensity in discrete steps until the observer reports that the stimulus is just
detectable.
(3) Method of Constant Stimuli – The experimenter has control of the stimuli. Several stimulus
values are chosen to bracket the assumed threshold. Each stimulus is presented many times in random order,
and the psychometric function is derived from the proportion of detection responses.
(4) Staircase Method – Determines a threshold as quickly as possible; a compromise between the
method of limits and the method of constant stimuli.
(5) Method of Forced Choice (2AFC) – Observer must choose between two or more options. Good
for cases where observers are less willing to guess.
(6) Method of Average Error – The subject is presented with a standard stimulus. The subject then
undergoes trials attempting to match the stimulus presented.
(7) Rank order – requires the subject to rank stimuli from most to least with respect to some
attribute of judgment or sentiment.
(8) Paired comparison – a subject is required to rank stimuli two at a time in all possible pairs.
(9) Successive categories – the subject is asked to sort a collection of stimuli into a number of
distinct piles or categories, which are ordered with respect to a specified attribute.
(10) Ratio judgment – The experimenter selects a standard stimulus and a number of variable
stimuli that differ quantitatively from the standard stimulus on a given characteristic. The subject selects,
from the range of variable stimuli, the stimulus whose amount of the given characteristic corresponds to the
ratio value.
(11) Q sort – subjects are required to sort the stimuli into an approximately normal distribution, with
the number of stimuli to be placed in each category specified in advance.

Scaling for People (Psychometrics)

Many issues arise when performing a scaling study. One important factor is who is selected to
participate in the study. Many scaling problems involve some psychological (latent) dimension of people
without any connection to a direct "physical" counterpart dimension. When people are scaled (psychometrics),
it is typical to obtain a random sample of individuals from the population to which one wishes to generalize. With
psychometrics, participants are asked to provide their individual feelings, attitudes, and/or personal ratings
toward a particular topic. In doing so, one is able to determine how individuals differ on the construct of
interest. With stimulus scaling, however, the researcher sums across raters within a given stimulus
(e.g., a question) in order to obtain a rating of each stimulus. Only once the researcher is confident that each
stimulus did, in fact, tap into the construct, and has some estimate of the level at which it did so,
should the researcher feel confident in presenting the now scaled stimuli to a random sample of relevant
participants for psychometric purposes. Thus, with psychometrics, items (i.e., stimuli) are summed across
within an individual respondent in order to obtain his or her score on the construct.
The major requirement in scaling for people is that the variables should be monotonically related to
each other. A relationship is monotonic if higher scores on one scale correspond to higher scores on
another scale, regardless of the shape of the curve (Nunnally, 1970). In scaling for people, many items on a
test are used to minimize measurement error. The specificity of individual items is averaged out when they are
combined, and by combining items one can make relatively fine distinctions between people. The problem of
scaling people with respect to attributes is then one of collapsing responses to a number of items so as to
obtain one score for each person.
One variety of scaling for people is the deterministic model, which assumes that there is no error in
item trace lines. The trace lines show that a person with a high level of ability would have a probability close to 1.0 of
giving the correct response. The model assumes that up to a point on the attribute, the probability of
response alpha is zero, and beyond that point the probability of response alpha is 1.0. Each item has a
biserial correlation of 1.0 with the attribute, and consequently each item perfectly discriminates at a
particular point of the attribute.
The varieties of scaling models for people include Thurstone scaling, the Likert scale, the Guttman
scale, and semantic differential scaling.
(1) Thurstone scaling – Around 300 judges rate 100 statements on a particular issue on
an 11-point scale. A subset of statements is then shown to respondents, and their score is the mean of the
ratings for the statements they select.
(2) Likert scale – Respondents are requested to state their level of agreement with a series of
attitude statements. Each scale point is given a value (say, 1 to 5) and the person is given the score
corresponding to their degree of agreement. Often a set of Likert items is summed to provide a total score
for the attitude.
(3) Guttman scale – It involves producing a set of statements that form a natural hierarchy. A positive
answer to an item at one point on the hierarchy assumes positive answers to all the statements below it
(e.g., a disability scale). This gets over the problem of item totals being formed by different sets of responses.

Scaling Responses

The third facet, the responses, which is typically held constant, also needs to be
identified. That is, a decision must be made about the fashion in which subjects will respond to the stimuli. Such response
options may include requiring participants to make comparative judgments (e.g., which is more important, A
or B?), subjective evaluations (e.g., strongly agree to strongly disagree), or absolute judgments (e.g., how
hot is this object?). Different response formats may well influence how one writes and edits stimuli. In addition,
they may also influence how one evaluates the quality or the “accuracy” of the responses. For example,
with absolute judgments, standards of comparison are used, especially if subjects are being asked to rate
physical characteristics such as weight, height, or intensity of sound or light. With attitudes and
psychological constructs, such “standards” are hard to come by. There are a few options (e.g., Guttman’s
Scalogram and Coombs’ unfolding technique) for simultaneously scaling people and stimuli, but more often
than not only one dimension is scaled at a time. However, stimuli are usually scaled first (or a well-
established measure is sought) before one can have confidence in scaling individuals on the stimuli.

Multidimensional Scaling Models

With unidimensional scaling, as described previously, subjects are asked to respond to stimuli with regard to a particular
dimension. With multidimensional scaling (MDS), however, subjects are typically asked to give just their general
impression or broad rating of the similarities or differences among stimuli. Subsequent analyses, using Euclidean spatial
models, would “map” the products in multidimensional space. The multiple dimensions would then be
“discovered” or “extracted” with multivariate statistical techniques, thus establishing which dimensions the consumer
is using to distinguish the products. MDS can be particularly useful when subjects are unable to articulate “why” they like a
stimulus, yet they are confident that they prefer one stimulus to another.