Introduction
An assessment tool is a technique or method of evaluating information to
determine how much a person knows and whether this knowledge aligns with the bigger
picture of a theory or framework. The assessment tool comprises the assessment
instrument and the context and conditions of assessment. An assessment tool can also
contain the administration, recording and reporting requirements of the assessment.
The validity of an assessment boils down to how well it measures the different
criteria being tested. In other words, it is the idea that the test measures what it intends
to measure. This means your assessment method should be relevant to the specific
context. For example, if you're testing physical strength, you should not send out a
written test. Instead, your tests should include physical exercises like pushups and
weightlifting. Likewise, a test of reading comprehension should not require mathematical
ability.
3. Objectivity
4. Practicality
5. Equitability
A good assessment tool is equitable, which means it does not favor or disfavor
any participant. Fair assessments imply that students are tested using methods and
procedures most appropriate to them. Every participant must be familiar with the test
context so they can put up an acceptable performance.
Fair assessment also means that the test involves similar content and format and is administered and scored in the same way for everyone. For example, students should get the same instructions, perform identical or similar tasks, have the same time limit, and work under the same constraints. In the same manner, students' responses should be scored as consistently as possible.
B. Types of Teacher-made Tests
Teacher-made tests are normally prepared and administered for testing the classroom achievement of students, evaluating the method of teaching adopted by the teacher, and evaluating other curricular programmes of the school. The teacher-made test is one of the most valuable instruments in the hands of the teacher for serving these purposes.
1. Selected response
In selected response items, students choose a response provided by the
teacher or test developer, rather than construct one in their own words or by their
own actions. Tests with these items are called objective because the results are
not influenced by scorers' judgments or interpretations and so are often machine
scored. Examples of this type are alternative response (true or false), multiple
choice and matching type tests.
2. Constructed response
We are concerned in this chapter with the development of tests for assessing the attainment of educational objectives. For this reason, we restrict our attention to the guidelines for constructing the following types of tests: true or false items, multiple-choice items, matching type items, fill in the blanks or completion items, and essays.
Binomial-choice tests are tests that have only two options, such as true or false, right or wrong, good or bad, and so on.
1. Do not give a hint in the body of the question.
2. Avoid using the words always, never, often, and other adverbs that tend to make statements either always true or always false.
3. Avoid long sentences as these tend to be true. Keep sentences short.
4. Avoid trick statements with some minor misleading word or spelling anomaly,
misplaced phrases, etc.
5. Avoid quoting verbatim from reference materials or textbooks.
6. Avoid specific determiners or give-away qualifiers.
2. Do not use modifiers that are vague and whose meanings can differ from one person to the next, such as: much, often, usually, etc.
3. Avoid complex or awkward word arrangements.
4. Do not use negatives or double negatives as such statements tend to be
confusing.
5. Each item stem should be as short as possible.
6. Distracters should be equally plausible and attractive.
7. All multiple choice options should be grammatically consistent with the stem.
8. The length, explicitness or degree of technicality of alternatives should not be
the determinants of the correctness of the answer.
9. Avoid stems that reveal the answer to another item.
10. Avoid alternatives that are synonymous with others or those that include or
overlap others.
11. Avoid presenting sequenced items in the same order as in the text.
12. Avoid use of assumed qualifiers that many examinees may not be aware of.
13. Avoid use of unnecessary words or phrases, which are not relevant to the
problem at hand.
14. Avoid use of non-relevant sources of difficulty such as requiring a complex
calculation when only knowledge of a principle is being tested.
15. Avoid extreme specificity requirements in responses.
16. Include as much of the item as possible in the stem.
17. Use the "none of the above" option only when the keyed answer is totally correct.
18. Note that use of "all of the above" may allow credit for partial knowledge.
a. Restricted essay
b. Non-restricted/Extended essay
C. Learning Target and Assessment Method Match
Knowledge: The students must be able to identify the subject and the verb in a
given sentence.
Comprehension: The students must be able to determine the appropriate form
of a verb to be used given the subject of a sentence.
Application: The students must be able to write sentences observing rules on
subject-verb agreement.
Analysis: The students must be able to break down a given sentence into its
subject and predicate.
Synthesis/Evaluation: The students must be able to formulate rules to be
followed regarding subject-verb agreement.
The test draft is tried out on a group of pupils or students. The purpose of the try out is to determine: a) the item characteristics through item analysis, and b) the characteristics of the test itself: validity, reliability, and practicality.
It is important to determine how the data will be collected and who will be
responsible for data collection. Results are always reported in aggregate format
to protect the confidentiality of the students assessed.
A test item is a specific task test takers are asked to perform. Test items can
assess one or more points or objectives, and the actual item itself may take on a
different form depending on the context. For example, an item may test one
point (understanding of a given vocabulary word) or several points (the ability to obtain
facts from a passage and then make inferences based on the facts). Likewise, a given
objective may be tested by a series of items. For example, there could be five items all
testing one grammatical point (e.g., tag questions). Items of a similar kind may also be
grouped together to form subtests within a given test. A multiple-choice item, for
example, is objective in that there is only one right answer. A free composition may be
more subjective in nature if the scorer is not looking for any one right answer, but rather
for a series of factors (creativity, style, cohesion and coherence, grammar, and
mechanics).
3. Item analysis
The difficulty index of an item is the proportion of examinees who answer it correctly. The following arbitrary rule is often used to decide, on the basis of this index, whether the item is too difficult or too easy.
The index of discrimination tells whether the item can discriminate between those who know and those who do not know the answer. It is the difference between the difficulty index of the upper group (DU) and that of the lower group (DL):
Index of discrimination = DU – DL
Example: Obtain the index of discrimination of an item if the upper 25% of the class had a
difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while the
lower 25% of the class had a difficulty index of 0.20.
Here, DU = 0.60 while DL = 0.20, thus index of discrimination = .60 - .20 = .40.
Interpretation:
The index of discrimination can range from –1.0 to 1.0. When the index of discrimination is equal to –1.0, it means that all of the lower 25% of the students got the correct answer while all of the upper 25% got the wrong answer. Such an item discriminates between the two groups, but in the wrong direction, so the item itself is highly questionable. On the other hand, when the index of discrimination is 1.0, it means that all of the lower 25% failed to get the correct answer while all of the upper 25% got the correct answer. This is a perfectly discriminating item and is the ideal item that should be included in the test.
Example: Consider a multiple choice type of test from which the following data
were obtained:
Item 1                   Options
                   A      B*     C      D
Total              0      40     20     20
Upper 25%          0      15     5      0
Lower 25%          0      5      10     5
Here, the correct answer is B. Let us compute the difficulty index and the index of discrimination:
DU = (no. of students in the upper 25% with correct answer) / (total number of students in the upper 25%) = 15/20 = .75 or 75%
DL = (no. of students in the lower 25% with correct answer) / (total number of students in the lower 25%) = 5/20 = .25 or 25%
Difficulty index = (no. of students with correct answer) / (total number of students) = 40/80 = .50
Discrimination index = DU – DL = .75 – .25 = .50, so the item has "good discriminating power".
It is also important to note that distracter A is not an effective distracter since it was never selected by any student. Distracters C and D appear to have good appeal as distracters.
Procedural steps for performing item analysis:
1. Arrange the scores in descending order.
2. Separate the test papers into two sub-groups.
3. Take the 25% of the scores at the top and the 25% of the scores falling at the bottom.
4. Count the number of right answers in the highest group and count the number of right answers in the lowest group.
5. Count the no-response (N.R.) examinees.
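As an illustrative sketch only (not part of the source), the item-analysis computations above can be expressed in Python, using the option counts from the worked example (correct answer B, 20 students in each of the upper and lower 25% groups):

```python
# Counts of students choosing each option, taken from the worked example above.
upper = {"A": 0, "B": 15, "C": 5, "D": 0}   # upper 25% group (20 students)
lower = {"A": 0, "B": 5, "C": 10, "D": 5}   # lower 25% group (20 students)
key = "B"                                    # the keyed (correct) answer

def difficulty(group, key):
    """Proportion of the group that chose the keyed answer."""
    return group[key] / sum(group.values())

du = difficulty(upper, key)        # 15/20 = 0.75
dl = difficulty(lower, key)        # 5/20  = 0.25
discrimination = du - dl           # 0.75 - 0.25 = 0.50

print(f"DU = {du:.2f}, DL = {dl:.2f}, discrimination = {discrimination:.2f}")

# A distracter that no examinee selects in either group is ineffective.
ineffective = [opt for opt in upper
               if opt != key and upper[opt] + lower[opt] == 0]
print("Ineffective distracters:", ineffective)  # ['A'] in this example
```

The sketch reproduces the values obtained by hand above (DU = .75, DL = .25, discrimination = .50) and flags distracter A as ineffective.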
4. Validation
After performing the item analysis and revising the items which need revision, the
next step is to validate the instrument. The purpose of validation is to determine the
characteristics of the whole test itself; namely the validity and reliability of the test.
Validation is the process of collecting and analysing evidence to support the
meaningfulness and usefulness of the test.
Validity refers to the extent to which a test measures what it purports to measure, or to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results.
Types of Validity
a. Face validity
Face validity refers to whether a test appears to be valid or not (e.g., whether, from its external appearance, the items appear to measure the required aspect). If a test appears to measure what the test author desires to measure, we say that the test has face validity. Thus, face validity refers not to what the test measures, but to what the test 'appears to measure'. The content of the test should not obviously appear to be inappropriate or irrelevant. For example, a test to measure "skill in addition" should contain only items on addition. When one goes through the items and feels that all the items appear to measure the skill in addition, then it can be said that the test has face validity.
Although face validity is not an efficient method of assessing the validity of a test, and as such is not usually relied upon, it can be used as a first step in validating the test. Once the test is validated at face, we may proceed further to compute a validity coefficient.
Moreover, this method helps a test maker revise the test items to suit the purpose. When a test is to be constructed quickly, or when there is an urgent need for a test and there is no time or scope to determine the validity by other, more efficient methods, face validity can be determined. This type of validity is not adequate, as it operates only at the facial level, and hence should be used only as a last resort.
b. Content validity
Content validity is a process of matching the test items with the instructional objectives. Content validity is the most important criterion for the usefulness of a test, especially of an achievement test. It is also called Rational Validity, Logical Validity, Curricular Validity, Internal Validity, or Intrinsic Validity.
Content validity refers to the degree or extent to which a test consists of items representing the behaviours that the test maker wants to measure. The extent to which the items of a test are a true representative sample of the whole content and the objectives of the teaching is called the content validity of the test.
Content validity is estimated by evaluating the relevance of the test items; i.e., the test items must duly cover all the content and behavioural areas of the trait to be measured. It gives an idea of the subject matter or the change in behaviour. In this way, content validity refers to the extent to which a test contains items representing the behaviour that we are going to measure. The items of the test should include every relevant characteristic of the whole content area and objectives in the right proportion. Before constructing the test, the test maker prepares a two-way table of content and objectives, popularly known as a "Specification Table".
Suppose an achievement test in Mathematics is prepared. It must contain items from Algebra, Arithmetic, Geometry, Mensuration, and Trigonometry, and moreover the items must measure the different behavioural objectives like knowledge, understanding, skill, application, etc. So it is imperative that due weightage be given to the different content areas and objectives.
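A specification table of this kind can be represented as a simple two-way structure. The sketch below is hypothetical: the content areas are taken from the example above, but the item counts per cell and the choice of three objectives are illustrative assumptions, not from the source.

```python
# Hypothetical table of specifications: content areas x behavioural objectives.
# The item counts are illustrative only.
spec = {
    "Algebra":      {"Knowledge": 4, "Understanding": 3, "Application": 3},
    "Geometry":     {"Knowledge": 3, "Understanding": 4, "Application": 3},
    "Trigonometry": {"Knowledge": 3, "Understanding": 3, "Application": 4},
}

# Total number of items and the weight (share) of each content area.
total_items = sum(sum(row.values()) for row in spec.values())
print(f"Total items: {total_items}")
for area, row in spec.items():
    n = sum(row.values())
    print(f"{area:<13} {n:>2} items ({n / total_items:.0%})")
```

Checking that the cell counts add up to the planned test length is exactly the "due weightage" check the paragraph above describes.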
c. Criterion-Related Validity
There are two different types of criterion validity: concurrent and predictive.
1. Concurrent Validity
The dictionary meaning of the term 'concurrent' is 'existing' or 'done at the same time'. Thus the term 'concurrent validity' is used to indicate the process of validating a new test by correlating its scores with some existing or available source of information (criterion), which might have been obtained shortly before or shortly after the new test is given. Concurrent validity occurs when criterion measures are obtained at the same time as test scores, indicating the ability of test scores to estimate an individual's current state.
Concurrent validity refers to the extent to which the test scores correspond to already established or accepted performance, known as the criterion. To know the validity of a newly constructed test, it is correlated or compared with some available information. For example, a test designed to measure the mathematics ability of students has concurrent validity if it correlates highly with a standardized mathematics achievement test (the external criterion).
2. Predictive Validity
Predictive validity is concerned with the predictive capacity of a test. It
indicates the effectiveness of a test in forecasting or predicting future outcomes
in a specific area. The test user wishes to forecast an individual's future
performance. Test scores can be used to predict future behaviour or
performance. Examples of tests with predictive validity are career or aptitude
tests, which are helpful in determining who is likely to succeed or fail in certain
subjects or occupations.
In order to find predictive validity, the tester correlates the test scores with the testee's subsequent performance, technically known as the "criterion". The criterion is an independent, external, and direct measure of that which the test is designed to predict or measure.
Predictive validity differs from concurrent validity in that for the former we must wait for the future to obtain the criterion measure, whereas in the case of concurrent validity we need not wait for a long gap.
Apart from the use of a correlation coefficient (obtained using a statistical tool) in measuring criterion-related validity, Gronlund suggested using a so-called "expectancy table". This table is easy to construct and consists of test (predictor) categories listed on the left-hand side and criterion categories listed horizontally along the top of the chart. For example, suppose that a mathematics achievement test is constructed and the scores are categorized as high, average, and low. The criterion measure used is the final average grades of the students in high school: Very Good, Good, and Needs Improvement. The two-way table lists the number of students falling under each of the possible (test category, grade) pairs.
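An expectancy table of this sort can be tallied directly from (test category, grade) pairs. The sketch below uses hypothetical data for ten students; the counts are illustrative assumptions, not results from the source.

```python
from collections import Counter

# Hypothetical (test category, final grade) pairs for 10 students.
pairs = [
    ("High", "Very Good"), ("High", "Very Good"), ("High", "Good"),
    ("Average", "Good"), ("Average", "Good"), ("Average", "Needs Improvement"),
    ("Low", "Good"), ("Low", "Needs Improvement"),
    ("Low", "Needs Improvement"), ("Low", "Needs Improvement"),
]

# Each cell of the expectancy table is a count of students in that pair.
table = Counter(pairs)

rows = ["High", "Average", "Low"]                 # predictor categories (left side)
cols = ["Very Good", "Good", "Needs Improvement"]  # criterion categories (top)
print(f"{'Test':<8}" + "".join(f"{c:>19}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{table[(r, c)]:>19}" for c in cols))
```

Reading across a row shows what grade a student with a given test category can "expect", which is the intuition behind Gronlund's table.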
5. Reliability
Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 – .90        Very good for a classroom test.
.70 – .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 – .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
Below .50        Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.