
ASSIGNMENT

Submitted to: Mrs. Asmat Ara

Submitted by: Asifa Iqbal

Roll No. 01

Class: MA English

Semester: IV

Session: 2013-2015

Subject: Testing & Evaluation

Topic: Characteristics of Tests

The Women University Multan


Characteristics of Tests

“Test is a systematic procedure for observing persons and describing them with either a numeric scale or a category system. Thus a test may give qualitative or quantitative information.” (Anthony J. Nitko)

Testing is now part and parcel of human life in general, and of the education system in particular. Tests should therefore have certain qualities so that they represent the true capabilities of the candidates. The following are the most important characteristics of tests.

• Reliability
• Validity
• Practicality
• Discrimination
• Backwash effect
• Objectivity
• Comprehensiveness
• Authenticity
• Simplicity
• Scorability

Reliability:
Reliability is one of the most important characteristics of all tests in general, and
language tests in particular. In fact, an unreliable test is worth nothing. In order
to understand the concept of reliability, an example may prove helpful. Suppose
a student took a test of grammar comprising one hundred items and received a
score of 90. Suppose further that the same student took the same test two days
later and got a score of 45. Finally suppose that the same student took the same
test for the third time and received a score of 70. What would you think of these
scores? What would you think of the student? Assuming that the student’s
knowledge of English cannot undergo drastic changes within this short period,
the best explanation would be that there must have been something wrong with
the test. How would you rely on a test that does not produce consistent scores?
How can you make a sound decision on the basis of such test scores? This is the
essence of the concept of reliability, i.e., producing consistent scores.
Reliability, then, can be technically defined as
“The extent to which a test produces consistent scores at different administrations to the same or similar group of examinees.”
The more consistent the scores, the more reliable the test.
There are four methods of estimating reliability:
1. Test-retest
2. Parallel forms
3. Split-half
4. KR-21

1. Test-Retest Method:
As the name implies, in this method a single test is administered to a single
group of examinees twice. The first administration is called “test” and the
second administration is referred to as “retest”. The correlation between the two
sets of scores obtained from testing and retesting would determine the
magnitude of reliability. Since there is a time interval (usually more than two
weeks) between the two administrations, this kind of reliability estimate is also
known as “stability of scores over time”.
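
The text does not specify the correlation statistic; a minimal sketch, assuming the Pearson correlation coefficient (the usual choice) and invented scores, is given below:

```python
# Minimal sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same test (scores are invented).

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

test = [90, 75, 60, 85, 70]    # first administration ("test")
retest = [88, 78, 58, 83, 72]  # second administration ("retest")

print(round(pearson(test, retest), 2))  # close to 1.0, i.e. stable scores
```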

Disadvantages of this method:


It has some practical disadvantages. First, it is not very easy to have the same
group of examinees available in two different administrations. Second, the time
interval creates two obstacles. On the one hand, if it is very short, there might be
a practice effect as well as a memorization effect carried over from the first
administration. On the other hand, if the interval is too long, there will be the
learning effect, i.e., the examinee’s state of knowledge will not be the same as it
was in the first administration. To avoid these problems, other methods of
estimating reliability have been developed.

2. Parallel Forms Method:

In order to remove some of the problems inherent in the test-retest method, experts have developed the parallel forms method. In this method, two parallel forms of a single test are given to one group of examinees. The correlation between the scores obtained from the two forms is computed to indicate the reliability of the scores.
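Computationally, this estimate is the same as in the test-retest sketch above: the retest scores are simply replaced by the same examinees' scores on the parallel form.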
Advantages of this method:
• There is no need for administering the test twice.
• Thus, the problem of examinees’ knowledge undergoing changes does not exist in this method.
Disadvantages of this method:
• Constructing two parallel forms of a test is not an easy task.
• There are certain logical and statistical criteria that a pair of parallel forms must meet.
• Therefore, due to the complexity of the task, most teachers and test developers avoid this method and prefer other methods of estimating reliability.

3. Split-Half Method:
In the test-retest method, one group of examinees was needed for two administrations. In the parallel forms method, on the other hand, two parallel forms of a single test were needed.
Each of these requirements is considered a disadvantage. To obviate these
shortcomings, the split-half method has been developed. In this method, a single
form of a test is given to a single group of examinees. Then each examinee’s
test is split (divided) into two halves. The correlation between the scores of the
examinees in the first half and the second half will determine the reliability of
the test scores.
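
A minimal sketch of this computation, with invented item scores (1 = correct, 0 = wrong) and using the correlation function from Python's statistics module (available in Python 3.10+):

```python
# Minimal sketch of the split-half method: each examinee's item scores
# are split into a first half and a second half, and the two half-test
# scores are correlated across examinees (all data invented).

from statistics import correlation  # Python 3.10+

items = [
    [1, 1, 0, 1, 1, 0, 1, 1],  # examinee 1
    [0, 1, 0, 0, 1, 0, 0, 1],  # examinee 2
    [1, 1, 1, 1, 1, 1, 0, 1],  # examinee 3
    [0, 0, 1, 0, 0, 1, 0, 0],  # examinee 4
    [1, 0, 1, 1, 0, 1, 1, 1],  # examinee 5
]
half = len(items[0]) // 2
first = [sum(row[:half]) for row in items]   # score on the first half
second = [sum(row[half:]) for row in items]  # score on the second half

print(round(correlation(first, second), 2))  # about 0.88 here
```

In practice, this half-test correlation is usually adjusted upward with the Spearman-Brown formula to estimate the reliability of the full-length test, a step the text does not cover.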
Advantages of this Method:
• It is more practical than the other methods.
• There is no need to administer the test twice.
• There is no need to develop two parallel forms of the test.

Disadvantages of this method:

• The main shortcoming is that the method requires a test with homogeneous items.
• Different subsections in a test, such as grammar, vocabulary, listening, or reading comprehension, will jeopardize test homogeneity and reduce score reliability.

4. The KR-21 Method:


The previously mentioned methods of estimating test score reliability require a statistical procedure called ‘correlation’. The majority of teachers and non-professional test developers, however, are not very familiar with statistics. Thus, they may have problems in using statistical formulas and interpreting the outcome of statistical analyses. To overcome these problems, two statisticians, Kuder and Richardson, developed a series of reliability formulas. One of these formulas can be used to estimate test score reliability through simple mathematical operations. The formula is called KR-21, in which K and R refer to the initials of the two statisticians and 21 refers to the number of the formula in the series. This formula is used to estimate the reliability of a single test given to one group of examinees in a single administration. This method requires only that testers and teachers be able to calculate two simple statistical parameters:
• The mean
• The variance
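
The text does not state the formula itself; for reference, the standard KR-21 formula, written in terms of the number of items k and the mean M and variance s² of the examinees’ total scores, is:

$$\text{KR-21} = \frac{k}{k-1}\left(1 - \frac{M\,(k - M)}{k\,s^{2}}\right)$$

A minimal sketch of the calculation (with invented scores):

```python
# Minimal sketch of KR-21 (invented data): k is the number of items,
# m the mean and v the variance of the examinees' total scores.

from statistics import mean, pvariance

scores = [10, 25, 14, 28, 20, 8, 26, 17]  # total scores on a 30-item test
k = 30
m = mean(scores)
v = pvariance(scores)  # population variance of the scores

kr21 = (k / (k - 1)) * (1 - (m * (k - m)) / (k * v))
print(round(kr21, 2))  # about 0.89 here
```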

Factors Influencing Reliability:

• Influence of testees
• Influence of scorers/raters
• Influence of test factors

Influence of Testees:
As human beings are dynamic, the performance of students may vary from place to place and from time to time, due to factors such as:
• Noise level
• Distractions
• Misreading of instructions
• Sickness
Heterogeneity of the group of students may also influence reliability.

Influence of test factors:

• Test length
• Test difficulty level
• Test duration

Influence of scoring factors:

• Objectively scored test: the likes and dislikes of the raters/scorers may not influence the results.
• Subjectively scored test: the likes and dislikes of the scorers may influence the results.

Sources of Errors:
• Intra-rater errors: errors that are due to fluctuations in a single rater’s scoring when the same test is scored twice.
• Inter-rater errors: errors that are due to fluctuations among different scorers scoring a single test.

INTER-RATER RELIABILITY:
Inter-rater reliability is the extent to which two or more individuals (coders or raters) agree. It assesses the consistency of how a measuring system is implemented, for example, when two or more teachers use a rating scale to rate students’ oral responses in an interview.
• Inter-rater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education, and monitoring can enhance inter-rater reliability.
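
The text does not prescribe an agreement statistic; a minimal sketch, using the Pearson correlation between two raters’ scores (one common choice among several agreement measures) and invented ratings:

```python
# Minimal sketch of inter-rater reliability: the consistency between
# two raters is estimated as the correlation between their scores for
# the same set of oral interviews (ratings are invented).

from statistics import correlation  # Python 3.10+

rater_a = [4, 3, 5, 2, 4, 3]  # Rater A's scores on a 1-5 scale
rater_b = [4, 2, 5, 2, 3, 3]  # Rater B's scores for the same examinees

print(round(correlation(rater_a, rater_b), 2))  # near 1.0 = high agreement
```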

INTRA-RATER RELIABILITY:
Intra-rater reliability is a type of reliability assessment in which the same assessment is completed by the same rater on two or more occasions. These ratings are then compared, generally by means of correlation. Since the same individual completes both assessments, there is a risk that the rater’s later ratings will be contaminated by knowledge of the earlier ones.

Validity:

Validity can be defined as
“The extent to which a test measures what it claims to measure.”
Validity is certainly the most important single characteristic of a test. If it is not valid, even a reliable test is not worth much. The reason is that a reliable test may not be valid; however, a valid test is to some extent reliable as well. Furthermore, whereas reliability is an independent statistical concept that has nothing to do with the content of the test, validity is directly related to the content and form of the test. This means that if a test is designed to measure examinees’ language ability, it should measure their language ability and nothing else. Otherwise, it will not be a valid test for the purposes intended.

For example, suppose a test of reading comprehension is given to a student and, on the basis of his test score, it is claimed that the student is very good at listening comprehension. This kind of interpretation is quite invalid. That particular score can be a valid indication of the student’s reading comprehension ability; however, the same score will be an invalid indication of the same student’s listening comprehension ability. Thus, a test can be valid for one purpose but not for another. In other words, a good test of grammar may be valid for measuring the grammatical ability of the examinees but not for measuring other kinds of abilities.
In order to guarantee that a test is valid, it should be evaluated from different
dimensions. Every dimension constitutes a different kind of validity and
contributes to the total validity of the test.
Types of Validity:

Following are the types of validity:
• Face validity
• Content validity
• Criterion-related validity
• Construct validity

1. Face Validity:
Face validity refers to
“The extent to which the physical appearance of the test corresponds to what it is claimed to measure.”
For instance, a test of grammar must contain grammatical items and not vocabulary items. Of course, a test of vocabulary may very well measure grammatical ability as well; however, that type of test will not show high face validity. It should be mentioned that face validity is not a very crucial or determinant type of validity. In most cases, a test that does not show high face validity has proven to be highly valid by other criteria. Therefore, teachers and administrators should not be overly concerned about the face validity of their tests. They should, however, be careful about other types of validity.
Example:
People may react negatively to an intelligence test that does not appear to them to be measuring their intelligence.
2. Content Validity:
Content validity refers to
“The correspondence between the contents of the test and the contents of the
materials to be tested.”
Content validity consists of:
• Instructional objectives
• Instructional activities
• The test
Of course, a test cannot include all the elements of the content to be tested.
Nevertheless, the content of the test should be a reasonable sample and
representative of the total content to be tested. In order to determine the content
validity of a test, a careful examination of the direct correspondence between
the content of the test and the materials to be tested is necessary.

3. Criterion-Related Validity:
Criterion-related validity refers to the correspondence between the results of
the test in question and the results obtained from an outside criterion. The
outside criterion is usually a measurement device for which the validity is
already established. In contrast to face validity and content validity, which are
determined subjectively, criterion-related validity is established quite
objectively.
That is why it is often referred to as empirical validity. Criterion-related validity
is determined by correlating the scores on a newly developed test with scores on
an already-established test.
Example:
Scores on an IQ test should correlate positively with school performance.

Types of Criterion-Related Validity:

Criterion-related validity is of two major kinds:
• Concurrent validity
• Predictive validity

Concurrent Validity:

If a newly developed test is given concurrently with a criterion test, the validity index is called concurrent validity.

Predictive Validity:

When the criterion measure is obtained some time after the new test is given, the correlation between the two sets of scores is called predictive validity.
4. Construct Validity:
It implies using the construct correctly (concepts, ideas, notions). Construct
validity seeks agreement between a theoretical concept and a specific measuring
device or procedure.
For example, a test of intelligence nowadays must include measures of
multiple intelligences, rather than just logical-mathematical and linguistic
ability measures.

Note:
Both concurrent and predictive validity indexes serve the purpose of prediction. The magnitude of the correlation indicates how well the examinees’ performance on the criterion measure can be predicted from their performance on the newly developed test, or vice versa.

Importance of Validity:
It is important for the test to be valid in order for the results to be accurately
applied and interpreted.

External Validity:
It is the extent to which the results of a study can be generalized to different
situations, different groups of people, different settings, and different
conditions.

Internal Validity:
It is the extent to which a study is free from flaws and that any differences in the
measurements are due to an independent variable and nothing else.

Population Validity:
It refers to the extent to which the results of a study can be generalized from the sample to the wider population of people.

Ecological Validity:
It refers to the extent to which the findings can be generalized beyond the
present situation.

Factors affecting Validity:

• Directions (should be simple and clear)
• Difficulty level of the test (neither too easy nor too difficult)
• Structure of items (poorly constructed or ambiguous items contribute to invalidity)
RELATIONSHIP BETWEEN VALIDITY & RELIABILITY:

• Validity and reliability are closely related.
• A test cannot be considered valid unless the measurements resulting from it are reliable.
• Likewise, results from a test can be reliable and not necessarily valid.

Practicality:
It refers to the facilities available to test developers regarding both the administration and scoring procedures of a test. It is defined as:
“Practicality refers to the ease of administration and scoring of a test.”
As far as administration is concerned, test developers should be attentive to the possibility of giving the test under reasonably acceptable conditions.

For example, suppose a team of experts decides to give a listening comprehension test to large groups of examinees. In this case, the test developers should make sure that facilities such as audio equipment and/or suitably acoustic rooms are available. Otherwise, no matter how reliable and valid the test may be, it will not be practical.

Ease of administration:
It depends on:
• Clarity, simplicity, and ease of reading the instructions
• A small number of subtests
• The time required for the test

Ease of scoring:
A test can be scored subjectively or objectively. Since scoring is difficult and time-consuming, the trend is toward objectivity, simplicity, and machine scoring.

Ease of Interpretation:
It refers to the meaningfulness of the scores obtained from the test. If test results are misinterpreted or misapplied, they will be of little value and may actually be harmful to some individuals or groups.
Main Points related to Practicality:

Practicality refers to the economy of time, effort, and money in testing. In other words, a test should be:
• Easy to design
• Easy to administer
• Easy to mark
• Easy to interpret (the results)

Discrimination:

Another important characteristic of a test is its ability to discriminate and compare. A test must be able to discriminate among different candidates and to reflect the differences in the performances of the candidates in a group.

For example, a score of 70% means nothing at all unless all the other scores obtained on the test are known. Furthermore, a test on which all the candidates score 70% fails to discriminate between the various students.
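
One standard way to quantify discrimination at the item level (a common classroom technique, not given in the text) is the discrimination index D: the proportion of high scorers who answer an item correctly minus the proportion of low scorers who do. A minimal sketch with invented data:

```python
# Minimal sketch of an item discrimination index D (data invented):
# D = proportion correct among the top scorers minus the proportion
# correct among the bottom scorers; taking the top and bottom 27%
# of examinees is a common convention.

def discrimination_index(item_scores, totals, fraction=0.27):
    n = max(1, round(len(totals) * fraction))
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    lower, upper = order[:n], order[-n:]
    p_upper = sum(item_scores[i] for i in upper) / n
    p_lower = sum(item_scores[i] for i in lower) / n
    return p_upper - p_lower  # +1 = perfect discrimination, 0 = none

totals = [28, 12, 22, 9, 25, 15, 19, 30, 11, 24]  # total test scores
item = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]             # one item (1 = correct)

print(discrimination_index(item, totals))  # 1.0 for this invented item
```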

COMPREHENSIVENESS:
A test should cover all the items that have been taught or studied. It should include items from the different areas of the material assigned for the test so as to check accurately the amount of students’ knowledge.

BALANCE:
A test should assess linguistic as well as communicative competence, and it should reflect the learner’s real command of the language. It should also test appropriateness and accuracy.

OBJECTIVITY:
If a test is marked by different teachers, the score should be the same. The marking process should not be affected by the teacher’s personality. Questions and answers should be so clear and definite that the marker gives each student the score he or she deserves.

ECONOMY:
A test should make the best use of the teacher’s limited time for preparing and grading, and of the pupils’ assigned time for answering all items. Thus, oral exams in classes of more than 30 students are not economical, since they require too much time and effort to conduct.

AUTHENTICITY:
The language of the test should reflect everyday discourse.

SIMPLICITY:
Simplicity means that the test should be written in clear, correct, and simple language. It is also important to keep the method of testing as simple as possible while still testing the skill you intend to test.

SCORABILITY:
Scorability means that each item in the test has its own mark, related to the distribution of marks given by the Ministry of Education.

Backwash Effect:

Backwash (also known as washback) is the influence of testing on teaching and learning. It is also the potential impact that the form and content of a test may have on learners’ conception of what is being assessed (language proficiency) and what it involves. Therefore, test designers, deliverers, and raters have a particular responsibility, considering that the testing process may have a substantial impact, either positive or negative.

LEVELS OF BACKWASH:

It is believed that backwash is a subset of a test’s impact on society, educational systems, and individuals. Thus, test impact operates at two levels:

• The micro level (the effect of the test on individual students and teachers)
• The macro level (the impact of the test on society and the educational system)

(Bachman and Palmer, 1996)

Conclusion:
These are the characteristics of tests that are necessary if test results are to be reliable, valid, objective, applicable, practical, and effective in the evaluation of candidates in every field of human activity, and especially in education.
