Introduction
An assessment tool is a technique or method of evaluating information to
determine how much a person knows and whether this knowledge aligns with the bigger
picture of a theory or framework. The assessment tool comprises the assessment
instrument and the context and conditions of assessment. An assessment tool can also
contain the administration, recording and reporting requirements of the assessment.
The validity of an assessment boils down to how well it measures the different
criteria being tested. In other words, it is the idea that the test measures what it intends
to measure. This means your assessment method should be relevant to the specific
context. For example, if you're testing physical strength, you should not send out a
written test. Instead, your tests should include physical exercises like pushups and
weightlifting. Likewise, a test of reading comprehension should not require mathematical
ability.
3. Objectivity
4. Practicality
5. Equitability
A good assessment tool is equitable, which means it does not favor or disfavor
any participant. Fair assessments imply that students are tested using methods and
procedures most appropriate to them. Every participant must be familiar with the test
context so they can put up an acceptable performance.
Fair assessment also means that the test involves similar content and format and is administered and scored in the same way for everyone. For example, students should get the same instructions, perform identical or similar tasks, have the same time limit, and work under the same constraints. In the same manner, students' responses should be scored as consistently as possible.
B. Types of Teacher-made Tests
Teacher-made tests are normally prepared and administered for testing the classroom achievement of students, evaluating the method of teaching adopted by the teacher, and evaluating other curricular programmes of the school. The teacher-made test is one of the most valuable instruments in the hands of the teacher for serving these purposes.
1. Selected response
In selected response items, students choose a response provided by the
teacher or test developer, rather than construct one in their own words or by their
own actions. Tests with these items are called objective because the results are
not influenced by scorers' judgments or interpretations and so are often machine
scored. Examples of this type are alternative response (true or false), multiple
choice and matching type tests.
2. Constructed response
We are concerned in this chapter with the development of tests for assessing the attainment of educational objectives. For this reason, we restrict our attention to the guidelines for constructing the following types of tests: true or false items, multiple-choice items, matching type items, fill in the blanks or completion items, and essays.
Binomial-choice tests are tests that have only two options, such as true or false, right or wrong, good or bad, and so on.
1. Do not give a hint in the body of the question.
2. Avoid using the words always, never, often, and other adverbs that tend to make statements either always true or always false.
3. Avoid long sentences as these tend to be true. Keep sentences short.
4. Avoid trick statements with some minor misleading word or spelling anomaly,
misplaced phrases, etc.
5. Avoid quoting verbatim from reference materials or textbooks.
6. Avoid specific determiners or give-away qualifiers.
2. Do not use modifiers that are vague and whose meanings can differ from one person to the next, such as: much, often, usually, etc.
3. Avoid complex or awkward word arrangements.
4. Do not use negatives or double negatives as such statements tend to be
confusing.
5. Each item stem should be as short as possible.
6. Distracters should be equally plausible and attractive.
7. All multiple choice options should be grammatically consistent with the stem.
8. The length, explicitness or degree of technicality of alternatives should not be
the determinants of the correctness of the answer.
9. Avoid stems that reveal the answer to another item.
10. Avoid alternatives that are synonymous with others or those that include or
overlap others.
11. Avoid presenting sequenced items in the same order as in the text.
12. Avoid use of assumed qualifiers that many examinees may not be aware of.
13. Avoid use of unnecessary words or phrases, which are not relevant to the
problem at hand.
14. Avoid use of non-relevant sources of difficulty such as requiring a complex
calculation when only knowledge of a principle is being tested.
15. Avoid extreme specificity requirements in responses.
16. Include as much of the item as possible in the stem.
17. Use the "none of the above" option only when the keyed answer is totally correct.
18. Note that use of "all of the above" may allow credit for partial knowledge.
a. Restricted essay
b. Non-restricted/Extended essay
C. Learning Target and Assessment Method Match
Knowledge: The students must be able to identify the subject and the verb in a
given sentence.
Comprehension: The students must be able to determine the appropriate form
of a verb to be used given the subject of a sentence.
Application: The students must be able to write sentences observing rules on
subject-verb agreement.
Analysis: The students must be able to break down a given sentence into its
subject and predicate.
Synthesis/Evaluation: The students must be able to formulate rules to be
followed regarding subject-verb agreement.
The test draft is tried out on a group of pupils or students. The purpose of the try out is to determine: a) the item characteristics through item analysis, and b) the characteristics of the test itself: validity, reliability, and practicality.
It is important to determine how the data will be collected and who will be
responsible for data collection. Results are always reported in aggregate format
to protect the confidentiality of the students assessed.
A test item is a specific task test takers are asked to perform. Test items can
assess one or more points or objectives, and the actual item itself may take on a
different form depending on the context. For example, an item may test one
point (understanding of a given vocabulary word) or several points (the ability to obtain
facts from a passage and then make inferences based on the facts). Likewise, a given
objective may be tested by a series of items. For example, there could be five items all
testing one grammatical point (e.g., tag questions). Items of a similar kind may also be
grouped together to form subtests within a given test. A multiple-choice item, for
example, is objective in that there is only one right answer. A free composition may be
more subjective in nature if the scorer is not looking for any one right answer, but rather
for a series of factors (creativity, style, cohesion and coherence, grammar, and
mechanics).
3. Item analysis
The difficulty index of an item is the proportion of examinees who answer it correctly. The following arbitrary rule is often used to decide, on the basis of this index, whether the item is too difficult or too easy.
The index of discrimination tells whether the item can discriminate between those who know and those who do not know the answer. It is the difference between the difficulty index of the upper group (DU) and that of the lower group (DL):
Index of discrimination = DU – DL
Example: Obtain the index of discrimination of an item if the upper 25% of the class had a
difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while the
lower 25% of the class had a difficulty index of 0.20.
Here, DU = 0.60 while DL = 0.20, thus index of discrimination = .60 - .20 = .40.
Interpretation:
The index of discrimination can range from –1.0 to 1.0. When the index of discrimination is equal to –1.0, it means that all of the lower 25% of the students got the correct answer while all of the upper 25% got the wrong answer. Such an item discriminates between the two groups, but in the wrong direction, so the item itself is highly questionable. On the other hand, when the index of discrimination is 1.0, it means that all of the lower 25% failed to get the correct answer while all of the upper 25% got the correct answer. This is a perfectly discriminating item and is the ideal item that should be included in the test.
Example: Consider a multiple choice type of test from which the following data
were obtained:
Item 1                   Options
                   A      B*     C      D
Total              0      40     20     20
Upper 25%          0      15     5      0
Lower 25%          0      5      10     5
Here, the correct answer is B. Let us compute the difficulty index and the index of discrimination:
DU = (no. of students in the upper 25% with correct answer) / (total number of students in the upper 25%) = 15/20 = .75 or 75%
DL = (no. of students in the lower 25% with correct answer) / (total number of students in the lower 25%) = 5/20 = .25 or 25%
Difficulty index = (no. of students with correct answer) / (total number of students) = 40/80 = .50
Discrimination index = DU – DL = .75 – .25 = .50, so the item has "good discriminating power".
It is also important to note that distracter A is not an effective distracter since it was never selected by any student. Distracters C and D appear to have good appeal as distracters.
Procedural steps for performing item analysis:
1. Arrange the scores in descending order.
2. Separate the test papers into two sub-groups.
3. Take the 25% of the scores at the top and the 25% of the scores falling at the bottom.
4. Count the number of right answers in the highest group and count the number of right answers in the lowest group.
5. Count the no-response (N.R.) examinees.
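As an illustrative sketch only (not part of the source), the item-analysis computations above can be expressed in Python, using the option counts from the worked example (correct answer B, 20 students in each of the upper and lower 25% groups):

```python
# Counts of students choosing each option, taken from the worked example above.
upper = {"A": 0, "B": 15, "C": 5, "D": 0}   # upper 25% group (20 students)
lower = {"A": 0, "B": 5, "C": 10, "D": 5}   # lower 25% group (20 students)
key = "B"                                    # the keyed (correct) answer

def difficulty(group, key):
    """Proportion of the group that chose the keyed answer."""
    return group[key] / sum(group.values())

du = difficulty(upper, key)        # 15/20 = 0.75
dl = difficulty(lower, key)        # 5/20  = 0.25
discrimination = du - dl           # 0.75 - 0.25 = 0.50

print(f"DU = {du:.2f}, DL = {dl:.2f}, discrimination = {discrimination:.2f}")

# A distracter that no examinee selects in either group is ineffective.
ineffective = [opt for opt in upper
               if opt != key and upper[opt] + lower[opt] == 0]
print("Ineffective distracters:", ineffective)  # ['A'] in this example
```

The sketch reproduces the values obtained by hand above (DU = .75, DL = .25, discrimination = .50) and flags distracter A as ineffective.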
4. Validation
After performing the item analysis and revising the items which need revision, the
next step is to validate the instrument. The purpose of validation is to determine the
characteristics of the whole test itself; namely the validity and reliability of the test.
Validation is the process of collecting and analysing evidence to support the
meaningfulness and usefulness of the test.
Validity refers to the extent to which a test measures what it purports to measure, or to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results.
Types of Validity
a. Face validity
Face validity refers to whether a test appears to be valid or not (e.g., whether, from its external appearance, the items appear to measure the required aspect). If a test appears to measure what the test author desires to measure, we say that the test has face validity. Thus, face validity refers not to what the test measures, but to what the test 'appears to measure'. The content of the test should not obviously appear to be inappropriate or irrelevant. For example, a test to measure "skill in addition" should contain only items on addition. When one goes through the items and feels that all the items appear to measure the skill in addition, then it can be said that the test has face validity.
Although face validity is not an efficient method of assessing the validity of a test, and as such is not usually relied upon, it can be used as a first step in validating the test. Once the test is validated at face, we may proceed further to compute a validity coefficient.
Moreover, this method helps a test maker revise the test items to suit the purpose. When a test is to be constructed quickly, or when there is an urgent need for a test and there is no time or scope to determine the validity by other, more efficient methods, face validity can be determined. This type of validity is not adequate, as it operates only at the facial level, and hence should be used only as a last resort.
b. Content validity
Content validity is a process of matching the test items with the instructional objectives. Content validity is the most important criterion for the usefulness of a test, especially of an achievement test. It is also called Rational Validity, Logical Validity, Curricular Validity, Internal Validity, or Intrinsic Validity.
Content validity refers to the degree or extent to which a test consists of items representing the behaviours that the test maker wants to measure. The extent to which the items of a test are a true representative sample of the whole content and the objectives of the teaching is called the content validity of the test.
Content validity is estimated by evaluating the relevance of the test items; i.e., the test items must duly cover all the content and behavioural areas of the trait to be measured. It gives an idea of the subject matter or the change in behaviour. In this way, content validity refers to the extent to which a test contains items representing the behaviour that we are going to measure. The items of the test should include every relevant characteristic of the whole content area and objectives in the right proportion. Before constructing the test, the test maker prepares a two-way table of content and objectives, popularly known as a "Specification Table".
Suppose an achievement test in Mathematics is prepared. It must contain items from Algebra, Arithmetic, Geometry, Mensuration, and Trigonometry, and moreover the items must measure the different behavioural objectives like knowledge, understanding, skill, application, etc. So it is imperative that due weightage be given to the different content areas and objectives.
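A specification table of this kind can be represented as a simple two-way structure. The sketch below is hypothetical: the content areas are taken from the example above, but the item counts per cell and the choice of three objectives are illustrative assumptions, not from the source.

```python
# Hypothetical table of specifications: content areas x behavioural objectives.
# The item counts are illustrative only.
spec = {
    "Algebra":      {"Knowledge": 4, "Understanding": 3, "Application": 3},
    "Geometry":     {"Knowledge": 3, "Understanding": 4, "Application": 3},
    "Trigonometry": {"Knowledge": 3, "Understanding": 3, "Application": 4},
}

# Total number of items and the weight (share) of each content area.
total_items = sum(sum(row.values()) for row in spec.values())
print(f"Total items: {total_items}")
for area, row in spec.items():
    n = sum(row.values())
    print(f"{area:<13} {n:>2} items ({n / total_items:.0%})")
```

Checking that the cell counts add up to the planned test length is exactly the "due weightage" check the paragraph above describes.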
c. Criterion-Related Validity
There are two different types of criterion validity: concurrent and predictive.
1. Concurrent Validity
The dictionary meaning of the term 'concurrent' is 'existing' or 'done at the same time'. Thus the term 'concurrent validity' is used to indicate the process of validating a new test by correlating its scores with some existing or available source of information (criterion), which might have been obtained shortly before or shortly after the new test is given. Concurrent validity occurs when criterion measures are obtained at the same time as test scores, indicating the ability of test scores to estimate an individual's current state.
Concurrent validity refers to the extent to which the test scores correspond to already established or accepted performance, known as the criterion. To know the validity of a newly constructed test, it is correlated or compared with some available information. For example, a test designed to measure the mathematics ability of students has concurrent validity if it correlates highly with a standardized mathematics achievement test (the external criterion).
2. Predictive Validity
Predictive validity is concerned with the predictive capacity of a test. It
indicates the effectiveness of a test in forecasting or predicting future outcomes
in a specific area. The test user wishes to forecast an individual's future
performance. Test scores can be used to predict future behaviour or
performance. Examples of tests with predictive validity are career or aptitude
tests, which are helpful in determining who is likely to succeed or fail in certain
subjects or occupations.
In order to find predictive validity, the tester correlates the test scores with the testee's subsequent performance, technically known as the "criterion". The criterion is an independent, external, and direct measure of that which the test is designed to predict or measure.
Predictive validity differs from concurrent validity in that for the former we must wait for the future to obtain the criterion measure, whereas in the case of concurrent validity we need not wait for a long gap.
Apart from the use of a correlation coefficient (obtained using a statistical tool) in measuring criterion-related validity, Gronlund suggested using a so-called "expectancy table". This table is easy to construct and consists of test (predictor) categories listed on the left-hand side and criterion categories listed horizontally along the top of the chart. For example, suppose that a mathematics achievement test is constructed and the scores are categorized as high, average, and low. The criterion measure used is the final average grades of the students in high school: Very Good, Good, and Needs Improvement. The two-way table lists the number of students falling under each of the possible (test category, grade) pairs.
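An expectancy table of this sort can be tallied directly from (test category, grade) pairs. The sketch below uses hypothetical data for ten students; the counts are illustrative assumptions, not results from the source.

```python
from collections import Counter

# Hypothetical (test category, final grade) pairs for 10 students.
pairs = [
    ("High", "Very Good"), ("High", "Very Good"), ("High", "Good"),
    ("Average", "Good"), ("Average", "Good"), ("Average", "Needs Improvement"),
    ("Low", "Good"), ("Low", "Needs Improvement"),
    ("Low", "Needs Improvement"), ("Low", "Needs Improvement"),
]

# Each cell of the expectancy table is a count of students in that pair.
table = Counter(pairs)

rows = ["High", "Average", "Low"]                 # predictor categories (left side)
cols = ["Very Good", "Good", "Needs Improvement"]  # criterion categories (top)
print(f"{'Test':<8}" + "".join(f"{c:>19}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[(r, c)]:>19}" for c in cols))
```

Reading across a row shows what grade a student with a given test category can "expect", which is the intuition behind Gronlund's table.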
5. Reliability
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 – .90        Very good for a classroom test.
.70 – .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
Below .50        Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.