
RELIABILITY OF TEST ITEMS

Arranged to Fulfill Course Assignment of Language Testing

Arranged by Group 3

Hanifah 1911040346

Nanda Permata Aulia 1911040144

Shilla Ananda Putri 1911040485

Class 5 A

Lecturer: Prof. Dr. H. Idham Kholid, M.Ag

ENGLISH EDUCATION MAJOR

EDUCATION AND TEACHER TRAINING FACULTY

RADEN INTAN ISLAMIC STATE UNIVERSITY, LAMPUNG

2021
PREFACE

Praise be to Allah SWT for providing us convenience so that we could finish this paper on time.
Without His help, of course, we would not have finished this paper on time. Sholawat and Salam we send to
our Prophet Muhammad SAW, who brought us from the darkness into the brightness.

The authors give thanks to Allah SWT for the blessings of health, both physical and mental,
which have enabled the authors to complete the writing of this Language Testing assignment
entitled "Reliability of Test Items".

The authors certainly realize that this paper is far from perfect and that there are still many
mistakes and deficiencies. For this reason, the authors expect criticism and suggestions from
readers, so that this paper can later become better. The authors also apologize for any mistakes
in this paper. Thus, the authors hope this paper can be useful. Thank
you.

Bandar Lampung, September 2nd 2021

Group 3

TABLE OF CONTENTS

TITLE PAGE ............................................................................................................................. i

PREFACE ................................................................................................................................... ii

TABLE OF CONTENTS ............................................................................................................ iii

CHAPTER ONE : INTRODUCTION ........................................................................................ 1

A. Background ...................................................................................................................... 1
B. Formulation of the Problem ............................................................................................. 1
C. Writing Purpose ............................................................................................................... 2

CHAPTER TWO : DISCUSSION

A. Definition of Reliability.................................................................................................... 3
B. Types of Reliability .......................................................................................................... 5
C. Standard Error of Measurement ........................................................................................ 8
D. How to Calculate Reliability and Example....................................................................... 9

CHAPTER THREE : CONCLUSION

A. Conclusion .......................................................................................................................15

REFERENCES

CHAPTER I

INTRODUCTION

A. Background
Validity and reliability are the main indicators of a test. The word reliabilitas in
Indonesian is taken from the English word reliability, which comes from the word
reliable, meaning trustworthy. Errors that frequently occur in the use of the terms validity and
valid also occur in the use of the terms reliability and reliable. The term reliability is a
noun, while the word reliable is an adjective describing a condition.
A person can be trusted if that person is always consistent in what he or she talks about and
does not change the content of the conversation from time to time. A test likewise has
constancy: a test is declared to have constancy if it can provide the same
(similar) information even though it is administered on different occasions, and can measure the
ability of the test takers according to reality. The extent to which the information can be
trusted can be seen in the magnitude of the reliability value, estimated with various methods
that suit the needs and conditions of the test as well as the supporting factors of the
measurement.
Along with the development of science, technology, and the arts (IPTEKS), there are several
methods that can be used to find the value of test reliability, and it can be calculated with
various programs or software. Reliability is one component in the item analysis process.
Item analysis can be done using the classical test theory approach (Classical Test Theory,
or CTT) and the modern test theory known as item response theory (Item Response Theory,
or IRT). One of the programs based on the Classical Test Theory approach is Iteman.
Several programs based on the Item Response Theory approach include Quest, Ascal,
Rascal, Bilog, and Bigstep. Reliability can also be analyzed using the SPSS
(Statistical Package for the Social Sciences) program.
B. Formulation of Problems
1. What is the definition of reliability?
2. What are the types of reliability?
3. How is reliability measured?
4. How is reliability calculated?
C. Writing Purpose
1. To know the definition of reliability.
2. To know the types of reliability.
3. To know the measurement of reliability.
4. To know how to calculate reliability.

CHAPTER II

DISCUSSION

A. Definition of Reliability

According to Anastasi (1957), the reliability of a test refers to the consistency of scores
obtained by the same individuals on different occasions or with different sets of equivalent items.

According to Stodola and Stordahl (1972), the reliability of a test can be defined as the
correlation between two or more sets of scores on equivalent tests from the same group of
individuals.

According to Guilford (1954), reliability is the proportion of true variance in obtained test
scores.
Reliability is the consistency of your measurement, or the degree to which an
instrument measures the same way each time it is used under the same conditions with the
same subjects. In short, it is the repeatability of a measurement. A measure is considered
reliable if a person's score on the same test given twice is similar. It is important to
remember that reliability is not measured; it is estimated. For instance, if a test is constructed
to measure a particular trait, say neuroticism, then each time it is administered it should
yield the same result. A test is considered reliable if we get the same result repeatedly.

The reliability of a test is also defined from another angle. Whenever we measure
something, the measurement involves some degree of error. The error of measurement is generally
the difference between the true score and the observed score. However, in psychological terms, the
word error does not imply that a mistake has been made. In other words, error in psychological
testing implies that there is always some inaccuracy in measurement. Hence, the goal of
psychological measurement remains to find out the magnitude of such error and to develop ways to
minimize it.
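To make the idea of error of measurement concrete, the following short sketch (added for
illustration, with invented numbers) simulates the classical true-score model X = T + E and shows
that reliability can be read as the proportion of true-score variance in the observed-score
variance, in line with Guilford's definition above.

import numpy as np

rng = np.random.default_rng(0)

n = 10000                              # number of simulated test takers (hypothetical)
true_scores = rng.normal(50, 10, n)    # true scores T
errors = rng.normal(0, 5, n)           # random errors of measurement E
observed = true_scores + errors        # observed scores X = T + E

# Reliability read as the proportion of true-score variance in observed-score variance
print(round(true_scores.var() / observed.var(), 2))   # about 0.80 = 10**2 / (10**2 + 5**2)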

A test is seen as being reliable when it can be used by a number of different
researchers under stable conditions, with consistent results that do not vary.
Reliability reflects consistency and replicability over time. Furthermore, reliability is seen as
the degree to which a test is free from measurement errors, since the more measurement
errors occur, the less reliable the test (Fraenkel & Wallen, 2003; McMillan & Schumacher,
2001, 2006; Moss, 1994; Neuman, 2003). In the same way, Maree and Fraser (2004) ask how
far the test would produce the same results if it was administered to the same children under
the same conditions. This helps the researcher and educator to make comparisons that are
reliable. Reliability is a very important factor in assessment, and is presented as an aspect
contributing to validity and not opposed to validity.

Reliability is the extent to which test scores are not affected by chance factors, that is, by the
luck of the draw. It is the extent to which the test taker's score does not depend on:

• The specific day and time of the test (as compared with other possible days and
times of testing),
• The specific questions or problems that were on the edition of the test that the test
taker took (as compared with those on other editions), and
• The specific raters who rated the test taker's responses (if the scoring process
involved any judgment).

Another way to say this is…

Reliability is consistency

Test scores are reliable to the extent that they are consistent over:

• Different occasions of testing,
• Different editions of the test, containing different questions or problems designed
to measure the same general skills or types of knowledge, and
• Different scorings of the test taker's responses, by different raters.

How do we account for an individual who does not get exactly the same test score
every time he or she takes the test? Some possible reasons are the following:

 Test taker’s is temporary psychological or physical state. Test performance


can be influenced by a person’s psychological or physical state at the time of
testing.

4
• Environmental factors. Differences in the testing environment, such as room
temperature, lighting, noise, or even the test administrator, can influence an
individual’s test performance.
• Test form. Many tests have more than one version or form. Items differ on each
form, but each form is supposed to measure the same thing. Different forms of a
test are known as parallel forms or alternate forms. These forms are designed to
have similar measurement characteristics, but they contain different items.
Because the forms are not exactly the same, a test taker might do better on one
form than on another.
• Multiple raters. In certain tests, scoring is determined by a rater's judgments of
the test taker's performance or responses.

B. Types of reliability

Reliability is one of the most significant components of test quality. It is concerned
with the reproducibility, or consistency, of an examinee's performance on the test. Reliability is
the total consistency of a certain measure. A measure is considered to have a high reliability
when it yields the same results under consistent conditions (Neil, 2009). There are several
types of reliability, such as Test-Retest Reliability, Alternate-Forms Reliability, Split-Half
Reliability, and Internal Consistency Reliability.

1. Test-Retest Reliability. Test-retest reliability refers to the temporal stability of a test
from one measurement session to another. The procedure is to administer the test to a
group of respondents and then administer the same test to the same respondents at a
later date. The correlation between scores on the identical tests given at different times
operationally defines its test-retest reliability (a small computational sketch of this and the
other estimates is given at the end of this section).
Despite its appeal, the test-retest reliability technique has several limitations (Rosenthal
& Rosnow, 1991). For instance, when the interval between the first and second test is
too short, respondents might remember what was on the first test and their answers on
the second test could be affected by memory. Alternatively, when the interval between
the two tests is too long, maturation happens. Maturation refers to changes in the subject
factors or respondents (other than those associated with the independent variable) that
occur over time and cause a change from the initial measurements to the later
measurements. During the time between the two tests, the respondents could have been
exposed to things which changed their opinions, feelings or attitudes about the behavior
under study.
2. Alternate-Forms Reliability. Alternate-forms reliability is computed through correlation. "It
is the degree of relatedness of different forms of the same test" (Ralph & Robert, 2007). This
requires using differently worded questions to measure the same factor or construct; that is,
two tests that are identical in every respect except for the actual items included. The items
should focus on exactly the same aspect of behavior, with the same terminology and
difficulty level. The alternative-forms technique for estimating reliability is similar to the
test-retest method, except that different measures of a behavior (rather than the same
measure) are collected at different times (Bollen, 1989). If the correlation between the
alternative forms is low, it could indicate that considerable measurement error is
present, because two different scales were used. For example, when testing for general
spelling, one of the two independently composed tests might not test general spelling
but a more subject-specific type of spelling, such as business vocabulary. This type of
measurement error is then attributed to the sampling of items on the test. Several of the
limitations of the test-retest method also apply to the alternative-forms technique.
3. Split-Half Reliability. It can be difficult to administer a test twice in order to estimate its
reliability; practice and other changes between time 1 and time 2 might undermine
stability estimates of reliability. Another method is to divide the items into two sets,
compute each subject's score on each half, and correlate the two sets of scores (Karl,
2012) (see the sketch at the end of this section).
There are several aspects that make the split-halves approach more desirable than the
test-retest and alternative-forms methods. First, the effect of memory discussed
previously does not operate with this approach. Also, a practical advantage is that
split-half data are usually cheaper and more easily obtained than over-time data (Bollen,
1989). A disadvantage of the split-half method is that the two halves must be parallel
measures; that is, the correlation between the two halves will vary slightly depending
on how the items are divided. Nunnally (1978) suggests using the split-half method
when measuring the variability of behaviours over short periods of time when alternative
forms are not available. For example, the even items can first be given as a test and,
subsequently, on the second occasion, the odd items as the alternative form. The
corrected correlation coefficient between the even- and odd-item test scores will indicate
the relative stability of the behaviour over that period of time.
4. Internal Consistency Reliability. Internal consistency evaluates the consistency of
results across items within a test. Cronbach's alpha is the most widely used internal
consistency measure, and can be viewed, roughly, as the mean of all possible split-half
coefficients (Cortina, 1993). It is a generalization of an earlier procedure for estimating
internal consistency.
An internal consistency reliability test determines how all items on the test relate to all
other items. It is applied to sets of items intended to measure different features of the
same concept, on the assumption that each single item taps only one feature of the
concept. If many different items are employed to gain information about a specific
construct, then the data set is more reliable.
Cronbach's alpha indicates the degree of internal consistency. It is a function of the number
of items in the scale and the degree of their inter-correlations. It ranges from zero to one,
and it measures the proportion of variability that is shared among items
(in other words, the covariance among items). Moreover, if all items tend to
measure the same entity, then they are highly related and the value of alpha will be
high. On the other hand, if all items tend to measure different entities, then the
correlations among them are very low, and the value of alpha is low, too. Note that
the main source of measurement error here is the sampling of content.

The conceptual formula of (standardized) Cronbach's alpha is:

alpha = (K × r) / (1 + (K − 1) × r)

where K is the number of items and r is the average correlation among all items (the mean
of the K(K − 1)/2 non-redundant correlation coefficients, i.e., the mean of an upper-triangular, or
lower-triangular, correlation matrix).
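As a concrete illustration of the estimates discussed in this section, the sketch below (added for
illustration; the item scores and function names are invented and are not taken from the cited
sources) computes a test-retest correlation, a split-half coefficient with the Spearman-Brown
correction, and Cronbach's alpha using NumPy.

import numpy as np

def test_retest(time1, time2):
    # Pearson correlation between two administrations of the same test
    return np.corrcoef(time1, time2)[0, 1]

def split_half(items):
    # items: 2-D array, rows = test takers, columns = item scores
    odd = items[:, 0::2].sum(axis=1)    # total score on odd-numbered items
    even = items[:, 1::2].sum(axis=1)   # total score on even-numbered items
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)              # Spearman-Brown corrected coefficient

def cronbach_alpha(items):
    # Covariance form: alpha = K/(K-1) * (1 - sum of item variances / variance of total scores)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Hypothetical data: 6 test takers answering 4 dichotomously scored items
scores = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [1, 1, 1, 1]])

# Prints the split-half (Spearman-Brown) and alpha estimates for this tiny data set;
# test_retest() would be applied to total scores from two administrations of the same test.
print(split_half(scores), cronbach_alpha(scores))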

C. Standard Error of Measurement.
Test manuals report a statistic called the  standard error of measurement
(SEM). It gives the margin of error that you should expect in an individual test score
because of imperfect reliability of the test. The SEM represents the degree of confidence
that a person's "true" score lies within a particular range of scores. The SEM is a useful
measure of the accuracy of individual test scores. The smaller the SEM, the more
accurate the measurements.

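The SEM is commonly obtained from the standard deviation of the test scores and the reliability
coefficient through the formula SEM = SD × √(1 − reliability). The sketch below (with made-up
numbers, not taken from any particular test manual) illustrates the calculation.

import math

sd = 10.0            # standard deviation of the test scores (hypothetical)
reliability = 0.91   # reported reliability coefficient (hypothetical)

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 1))   # 3.0: roughly 68% of the time, the observed score lies
                       # within about +/- 3 points of the person's true score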

When evaluating the reliability coefficients of a test, it is important to review the
explanations provided in the manual for the following:

• Types of reliability used. The manual should indicate why a certain type of
reliability coefficient was reported. The manual should also discuss sources of
random measurement error that are relevant for the test.
• How reliability studies were conducted. The manual should indicate the conditions
under which the data were obtained, such as the length of time that passed between
administrations of a test in a test-retest reliability study. In general, reliabilities tend
to drop as the time between test administrations increases.
• The characteristics of the sample group. The manual should indicate the important
characteristics of the group used in gathering reliability information, such as
education level, occupation, etc. This will allow you to compare the characteristics of
the people you want to test with the sample group. If they are sufficiently similar,
then the reported reliability estimates will probably hold true for your population as
well.

For more information on reliability, consult the APA Standards, the SIOP Principles, or
any major textbook on psychometrics or employment testing. Appendix A lists some possible
sources.
D. How to Calculate Reliability and Examples
The following describes how to calculate reliability, with examples using the SPSS and
Quest programs.

1. How to Calculate Reliability, with an Example Using the SPSS Program

The scores obtained from the testees are arranged as follows:

In the SPSS window, click Analyze > Scale > Reliability Analysis, enter all items into the
Items box, select the desired method in the Model drop-down list, and click OK.

The Reliability Analysis window will appear as follows:

After that, the following output will appear:

Reliability Statistics

Cronbach's Alpha    N of Items
.248                30

The model chosen in this test is the Alpha model/method, so the output that appears is interpreted
by looking at the number in the Cronbach's Alpha column; in this test, the reliability value of the
data being tested is 0.248. Because the reliability value does not meet the reliability standard
(with 30 questions, the reliability standard is 0.55), the data/test package is declared less
reliable.
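For readers without SPSS, a comparable Alpha-model check can be approximated in a few lines of
Python. In the sketch below, the file name scores.csv, the data layout (one row per testee, one
numeric column per item), and the use of the 0.55 standard mentioned above are assumptions made
only for illustration.

import numpy as np

# scores.csv (hypothetical file) is assumed to hold one row per testee and one numeric column per item
scores = np.loadtxt("scores.csv", delimiter=",")

k = scores.shape[1]
alpha = k / (k - 1) * (1 - scores.var(axis=0, ddof=1).sum() / scores.sum(axis=1).var(ddof=1))

print(f"Cronbach's Alpha: {alpha:.3f}   N of Items: {k}")
print("reliable" if alpha >= 0.55 else "less reliable")   # 0.55 is the standard used above for 30 items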

2. How to Calculate Reliability, with an Example Using the Quest Program

The Quest program is one of the programs based on the Item Response Theory approach,
which also includes Ascal, Rascal, Bilog, Bigstep, and others. Before IRT was developed,
there was an earlier approach to analyzing items, namely the Classical Test Theory (CTT)
approach; one example of a CTT-based program is Iteman. However, Subali & Suyata
(2011a; 2011b) explained that there are limitations in the CTT-based item analysis process,
including:
a. CTT statistics depend on the subpopulation taking the test. Different groups of test takers
also have different mean scores on the measured variable or attribute. Thus, test developers
must be careful when selecting samples for item calibration. If the calibration sample
differs in characteristics from the operational sample (the actual population sample that is the
target), the psychometric properties of the measurement results will change dramatically.
b. The measurement accuracy of a test (the standard error, or standard error of measurement) is
implicitly averaged across all levels of ability being measured. Thus, the accuracy of
measurement at certain score levels is unknown.

Therefore, this paper shows how to find the value of reliability using the Quest
program, which is one of the IRT-based item analysis programs.
In the Quest program, reliability is estimated both on the basis of the items, called the item
separation index, and on the basis of the testees (cases/persons), called the person separation index.
The higher the estimated item separation index, the more accurate the overall set of items analyzed
according to the model used. The higher the person separation index, the more consistently each
item measures the testees concerned. The reliability estimate based on the testees (cases/persons)
is the same as reliability according to CTT, that is, reliability according to Cronbach's alpha for
polytomous data and reliability according to Kuder-Richardson-20 (KR-20) for dichotomous data. The
item separation index (RI) is referred to as "sample reliability", while the person separation
index is referred to as "test reliability".
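Because the person-based estimate reduces to Kuder-Richardson-20 for dichotomous data, the
following sketch (a minimal illustration with invented 0/1 responses, not output from the Quest
program) shows how KR-20 can be computed directly.

import numpy as np

def kr20(items):
    # items: 2-D array of 0/1 responses, rows = testees, columns = items
    k = items.shape[1]
    p = items.mean(axis=0)                        # proportion answering each item correctly
    q = 1 - p                                     # proportion answering each item incorrectly
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - (p * q).sum() / total_variance)

responses = np.array([[1, 1, 0, 1, 1],
                      [1, 0, 0, 1, 0],
                      [0, 0, 1, 0, 0],
                      [1, 1, 1, 1, 1],
                      [1, 0, 0, 0, 1]])
print(round(kr20(responses), 2))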
How to calculate reliability using the Quest program is described as follows:
a. Create the data to be analyzed in Notepad and save it with the extension .txt or .dat, as
shown below:
b. Write the syntax in the Quest window as shown below, with the following information:

Next, several output files will appear; the output selected to see the reliability value is the
one coded XXXXsh.out. The following is an example of the output used to interpret the test
reliability scores:

Based on the reliability of estimate, a reliability value of 0.00 was obtained, which means that
the data are less reliable. The reliability value based on the estimates of the cases or testees is
called test reliability. The higher the value, the more convincing it is that the measurement gives
consistent results. This result is also determined by the characteristics of the sample: the lower
the value, the more test takers in the trial failed to provide the expected information (for
example, not working on the test, or answering carelessly). The data here are the results of a
multiple-choice test on a dichotomous scale.

CHAPTER III
CLOSING

A. Conclusion

Based on the description of the material above, the contents of this paper can be
concluded as follows:

a. Reliability is the degree of consistency, trustworthiness, constancy, and stability of a test in
measuring what it is intended to measure.
b. There are several methods to determine the value of reliability, among others:
(1) the test-retest method
(2) the parallel-form method (alternate/parallel forms)
(3) the split-half method
(4) the Kuder-Richardson 20 and 21 method
(5) the Cronbach's alpha method.
c. The reliability value can be calculated using the SPSS program with several
models, according to the needs of the examiner.
d. The reliability value calculated using the Quest program can be seen in the
output file coded sh.out, by interpreting the numbers in the reliability of estimate.

REFERENCES

Anastasi, A. (1988). Psychological Testing (6th ed.). London: Macmillan.

Bollen, K. A. (1989). Structural Equations with Latent Variables (pp. 179-225). John Wiley &
Sons.

Guilford, J.P. (1953). Psychometric Methods. New Delhi: Tata McGraw Hill.

Wuensch, K. L. (2012). A Brief Introduction to Reliability, Validity, and Scaling. URL:
http://www.core.ecu.edu/psyc/.../Reliability-Validity-Scaling.docx

Stodola, Q. and Stordahl, K. (1972). Basic Educational Tests and Measurement. New Delhi:
Thomson (India).

Rosenthal, R. and Rosnow, R. L. (1991). Essentials of Behavioral Research: Methods and Data
Analysis (2nd ed.). McGraw-Hill Publishing Company, pp. 46-65.

Cortina, J.M., (1993). What Is Coefficient Alpha? An Examination of Theory and Applications.
Journal of Applied Psychology, 78(1), 98–104.

Rosnow, R. L. and Rosenthal, R. (2007). Essentials of Behavioral Research: Methods and Data
Analysis (3rd ed.). McGraw-Hill.

https://www.ets.org/Media/Research/pdf/RM-18-01.pdf

https://hr-guide.com/Testing_and_Assessment/Reliability_and_Validity.htm

https://www.scribd.com/doc/244346720/Makalah-Reliabilitas-Tes

