Professional Documents
Culture Documents
Arranged by Group 3
Hanifah 1911040346
Class 5 A
2021
PREFACE
Praise be to Allah SWT for the convenience He has given us, which allowed us to finish this paper on time. Without His help, we certainly could not have finished it on time. Sholawat and salam we send to our Prophet Muhammad SAW, who brought us from darkness into brightness.
The authors give thanks to Allah SWT for the blessing of good health, both of body and of mind, which enabled the authors to complete this Language Testing assignment, entitled "Reliability of Test Items".
The authors certainly realize that this paper is far from perfect and still contains many mistakes and deficiencies. For this reason, the authors welcome criticism and suggestions from readers so that this paper can become better, and apologize for any mistakes it contains. Finally, the authors hope this paper will be useful. Thank you.
Group 7
TABLE OF CONTENTS
PREFACE ................................................................................................................................... ii
A. Background ...................................................................................................................... 1
B. Formulation of the Problem ............................................................................................. 1
C. Writing Purpose ............................................................................................................... 2
A. Definition of Reliability.................................................................................................... 3
B. Types of Reliability .......................................................................................................... 5
C. Standard Error of Measurement......................................................................... 8
D. How to Calculate Reliability and Example....................................................................... 9
A. Conclusion .......................................................................................................................15
REFERENCES
CHAPTER I
INTRODUCTION
A. Background
Validity and reliability are the main indicators of a good test. The Indonesian term for reliability is taken from the English noun reliability, which derives from the adjective reliable, meaning trustworthy. The confusion that frequently arises between the terms validity and valid also occurs with reliability and reliable: reliability is a noun, while reliable is an adjective describing a condition.
A person can be trusted if, whenever that person talks about something, the content of the conversation does not change from time to time and is always consistent. A test likewise has constancy: a test is declared constant if it provides the same (similar) information even when administered on different occasions, and if it measures the test takers' ability according to reality. The extent to which the information can be trusted is seen in the magnitude of the reliability value, calculated with whichever method suits the needs and conditions of the testing and the supporting factors of the measurement.
Along with the development of science, technology, and the arts (IPTEKS), several methods are available for finding the value of test reliability, and the calculations can be carried out with various programs or software. Reliability is one component of the item analysis process. Item analysis can be done using the classical approach (Classical Test Theory, or CTT) or the modern approach known as Item Response Theory (IRT). One program based on the Classical Test Theory approach is Iteman; programs based on the Item Response Theory approach include Quest, Ascal, Rascal, and Bilog. Reliability can also be analyzed using the SPSS (Statistical Package for the Social Sciences) program.
B. Formulation of the Problem
1. What is the definition of reliability?
2. What are the types of reliability?
3. How is reliability measured?
4. How is reliability calculated?
C. Writing Purpose
1. To know the definition of reliability.
2. To know types of reliability.
3. To know the measurement of reliability.
4. To know how to calculate reliability.
CHAPTER II
DISCUSSION
A. Definition of Reliability
According to Anastasi (1957), the reliability of a test refers to the consistency of scores obtained by the same individuals on different occasions or with different sets of equivalent items. According to Stodola and Stordahl (1972), the reliability of a test can be defined as the correlation between two or more sets of scores on equivalent tests from the same group of individuals.
The reliability of a test can also be defined from another angle. Whenever we measure something, the measurement involves some amount of error. The error of measurement is, in general, the difference between the true score and the observed score. In psychological terms, however, the word error does not imply that a mistake has been made; rather, error in psychological testing implies that there is always some inaccuracy in measurement. Hence, the goal of psychological measurement is to find out the magnitude of such errors and to develop ways to minimize them.
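The relationship between true scores, observed scores, and error described above (the classical test theory model X = T + E) can be illustrated with a small simulation. This is only a sketch; the true score of 50 and error standard deviation of 3 are hypothetical values chosen for illustration:

```python
import random

random.seed(42)  # for a reproducible illustration

def simulate_observed_scores(true_score, error_sd, n_occasions):
    """Classical test theory model: observed score X = true score T + error E,
    where the error E is random with mean 0."""
    return [true_score + random.gauss(0, error_sd) for _ in range(n_occasions)]

# One person with a true score of 50, measured on 5 occasions with error SD 3
scores = simulate_observed_scores(true_score=50, error_sd=3, n_occasions=5)
print([round(s, 1) for s in scores])

# Over many occasions the errors cancel out, so the mean approaches the true score
many = simulate_observed_scores(true_score=50, error_sd=3, n_occasions=5000)
print(round(sum(many) / len(many), 1))
```

Because the error has mean zero, averaging over many hypothetical occasions recovers a value close to the true score, which is exactly why the error "implies inaccuracy" rather than a mistake.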
The more errors that occur, the less reliable the test (Fraenkel & Wallen, 2003; McMillan & Schumacher, 2001, 2006; Moss, 1994; Neuman, 2003). In the same way, Maree and Fraser (2004) ask how far the test would produce the same results if it were administered to the same children under the same conditions. This helps the researcher and educator to make comparisons that are reliable. Reliability is a very important factor in assessment, and it is presented here as an aspect contributing to validity, not opposed to it.
Reliability is the extent to which test scores are not affected by chance factors, that is, by the luck of the draw. It is the extent to which a test taker's score does not depend on:
- the specific day and time of the test (as compared with other possible days and times of testing);
- the specific questions or problems that were on the edition of the test that the test taker took (as compared with those on other editions); and
- the specific raters who rated the test taker's responses (if the scoring process involved any judgment).
Reliability is consistency: test scores are reliable to the extent that they are consistent across occasions, test forms, and raters.
How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the following:
- Environmental factors. Differences in the testing environment, such as room temperature, lighting, noise, or even the test administrator, can influence an individual's test performance.
- Test form. Many tests have more than one version or form. Items differ on each form, but each form is supposed to measure the same thing. Different forms of a test are known as parallel forms or alternate forms. These forms are designed to have similar measurement characteristics, but they contain different items. Because the forms are not exactly the same, a test taker might do better on one form than on another.
- Multiple raters. In certain tests, scoring is determined by a rater's judgments of the test taker's performance or responses.
B. Types of Reliability
1. Test-Retest reliability is estimated by administering the same test to the same group on two different occasions and correlating the two sets of scores. Its weakness is that changes can occur over time and cause a change from the initial measurements to the later measurements. During the time between the two tests, the respondents could have been exposed to things which changed their opinions, feelings, or attitudes about the behavior under study.
2. Alternate-Forms reliability is computed through correlation. "It is the degree of relatedness of different forms of the same test" (Ralph & Robert, 2007). This requires using differently worded questions to measure the same factor or construct: two tests that are identical in every respect except for the actual items included. The items should address exactly the same behavior, with the same terminology and difficulty levels. The alternate-forms technique for estimating reliability is similar to the test-retest method, except that different measures of a behavior (rather than the same measure) are collected at different times (Bollen, 1989). If the correlation between the alternate forms is low, it could indicate that considerable measurement error is present because two different scales were used. For example, when testing for general spelling, one of the two independently composed tests might not test general spelling but a more subject-specific type of spelling, such as business vocabulary. This type of measurement error is then attributed to the sampling of items on the test. Several of the limitations of the test-retest method also apply to the alternate-forms technique.
3. Split-Half reliability. It can be difficult to administer a test twice in order to estimate its reliability, and practice and other changes between time 1 and time 2 might undermine stability estimates of reliability. Another method is to divide the items into two sets, compute each subject's score on each half, and correlate the two sets of scores (Karl, 2012).
Several aspects make the split-half approach more desirable than the test-retest and alternate-forms methods. First, the memory effect discussed previously does not operate with this approach. A practical advantage is also that split-half data are usually cheaper and more easily obtained than over-time data (Bollen, 1989). A disadvantage of the split-half method is that the two halves must be parallel measures; the correlation between the two halves will vary slightly depending on how the items are divided. Nunnally (1978) suggests using the split-half method when measuring the variability of behaviours over short periods of time when alternative forms are not available. For example, the even items can first be given as a test and, subsequently, on the second occasion, the odd items as the alternative form. The corrected correlation coefficient between the even- and odd-item test scores will indicate the relative stability of the behaviour over that period of time.
4. Internal Consistency reliability. Internal consistency evaluates the consistency of results across the items within a single test. Cronbach's alpha is the most widely used internal consistency measure; it can be understood as the mean of all possible split-half coefficients (Cortina, 1993). It is a generalization of earlier procedures for estimating internal consistency.
An internal consistency reliability test determines how every item on the test relates to all the other items. It is applied to sets of items intended to measure different features of the same concept, where each single item taps only one feature of that concept. If many different items are employed to gain information about a specific construct, then the data set is more reliable.
Cronbach's alpha expresses the degree of internal consistency. It is a function of the number of items in the scale and the degree of their inter-correlations. It ranges from zero to one and measures the proportion of variability that is shared among the items (in other words, the covariance among the items). If all the items tend to measure the same entity, they are highly related and the value of alpha will be high. On the other hand, if the items tend to measure different entities, the correlations among them will be very low, and the value of alpha will be low as well. Note that here the main source of measurement error is the sampling of content.
The standardized form of Cronbach's alpha is
α = (K × r) / (1 + (K − 1) × r),
where K is the number of items and r is the average correlation among all items (the mean of the K(K − 1)/2 non-redundant correlation coefficients, i.e., the mean of an upper-triangular, or lower-triangular, correlation matrix).
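The formula above translates directly into code. This is a minimal sketch; the values K = 10 and r = 0.30 are hypothetical, chosen only to show how alpha behaves:

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha: alpha = K*r / (1 + (K - 1)*r),
    where K is the number of items and r the mean inter-item correlation."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# With 10 items and an average inter-item correlation of 0.30:
print(round(standardized_alpha(10, 0.30), 3))  # → 0.811

# Adding items with the same average inter-correlation raises alpha:
print(round(standardized_alpha(20, 0.30), 3))  # → 0.896
```

The second call illustrates the point made above: alpha depends on the number of items as well as on their inter-correlations, so a longer test with the same average correlation is more internally consistent.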
C. Standard Error of Measurement
Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error to expect in an individual test score because of the imperfect reliability of the test: the SEM defines the range of scores within which a person's "true" score is likely to lie, with a given degree of confidence. The SEM is a useful measure of the accuracy of individual test scores. The smaller the SEM, the more accurate the measurements.
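The usual computing formula, SEM = SD × √(1 − r), follows from this definition: the more reliable the test, the smaller the error band around each score. A minimal sketch with hypothetical values (test standard deviation 10, reliability 0.91):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): the expected spread of observed scores
    around a person's true score."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test with standard deviation 10 and reliability 0.91
sem = standard_error_of_measurement(sd=10, reliability=0.91)
print(round(sem, 2))  # → 3.0

# Roughly 68% of observed scores fall within one SEM of the true score,
# so an observed score of 50 suggests a true score between about 47 and 53
print((round(50 - sem, 1), round(50 + sem, 1)))
```

Note that a perfectly reliable test (r = 1.0) would have SEM = 0, while a completely unreliable one (r = 0) would have an SEM equal to the full standard deviation of the scores.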
When evaluating the reliability information reported for a test, pay attention to the following:
- Types of reliability used. The manual should indicate why a certain type of reliability coefficient was reported. The manual should also discuss sources of random measurement error that are relevant for the test.
- How reliability studies were conducted. The manual should indicate the conditions under which the data were obtained, such as the length of time that passed between administrations of a test in a test-retest reliability study. In general, reliabilities tend to drop as the time between test administrations increases.
- The characteristics of the sample group. The manual should indicate the important characteristics of the group used in gathering reliability information, such as education level, occupation, etc. This will allow you to compare the characteristics of the people you want to test with those of the sample group. If they are sufficiently similar, then the reported reliability estimates will probably hold true for your population as well.
For more information on reliability, consult the APA Standards, the SIOP Principles, or any major textbook on psychometrics or employment testing. Appendix A lists some possible sources.
D. How to Calculate Reliability and Examples
The following describes how to calculate reliability, with examples using the SPSS and Quest programs.
In the SPSS window, click Analyze > Scale > Reliability Analysis, enter all items into the Items box, select the desired method in the Model drop-down, and click OK.
After that, the following output will appear:
Reliability Statistics
Cronbach's Alpha    N of Items
.248                30
The model chosen in this test is the Alpha model/method, so the output is interpreted by looking at the number in the Cronbach's Alpha column. In this test, the reliability value of the data is 0.248. Because this value does not meet the reliability standard (for 30 questions, the standard is 0.55), the data/test package is declared insufficiently reliable.
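The value SPSS reports can also be reproduced by hand using the raw-score form of Cronbach's alpha, alpha = K/(K − 1) × (1 − sum of item variances / variance of total scores). The sketch below uses a small hypothetical Likert-type data set rather than the 30-item data analyzed above:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha from raw item data:
    alpha = K/(K-1) * (1 - sum of item variances / variance of total scores).
    Rows are respondents, columns are items."""
    k = len(item_scores[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)  # sample variance

    item_vars = [variance([person[j] for person in item_scores]) for j in range(k)]
    total_var = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical Likert-type responses (5 respondents, 4 items):
data = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
print(round(cronbach_alpha(data), 3))
```

Here the items move together across respondents, so the total-score variance is much larger than the sum of the item variances and alpha is high, unlike the 0.248 obtained for the test package above.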
a. The psychometric properties of a test depend on the sample tested: if the tryout sample differs in characteristics from the operational sample (the actual target population), the psychometric properties of the measurement results will change dramatically.
b. The measurement accuracy of a test (the standard error of measurement) is implicitly averaged over all levels of the ability being measured. Thus, the accuracy of measurement at particular score levels is unknown.
Therefore, this paper shows how to find the value of reliability using the Quest program, which is one of the IRT-based item analysis programs.
According to IRT, reliability is estimated both from the items, via the item separation index, and from the testees (cases/persons), via the person separation index. The higher the item separation index, the more accurately the analyzed items as a whole fit the model used. The higher the person separation index, the more consistently each item measures the testees concerned. The reliability estimate based on the testees (cases/persons) corresponds to reliability according to CTT, that is, Cronbach's alpha for polytomous data and Kuder-Richardson 20 (KR-20) for dichotomous data. The item separation index is referred to as "sample reliability", while the person separation index is referred to as "test reliability".
How to calculate reliability using the Quest program is described as follows:
a. Create the data to be analyzed in Notepad and save it with a .txt or .dat extension, as shown below:
b. Write the analysis syntax in the Quest window, as shown below:
With the following information:
Next, several output files will appear; the one to open for the reliability value is the output coded XXXXsh.out. The following is an example of the output used to interpret test reliability:
Based on the reliability estimate, a value of 0.00 was obtained, which means the data is not reliable. The reliability value based on the estimated cases or testees is called test reliability. The higher the value, the more convincing it is that the measurement gives consistent results. This result is also determined by the characteristics of the sample: the lower the value, the more trial participants failed to provide the expected information (did not engage with the test, or answered carelessly). The data here are the results of a multiple-choice test, i.e., data on a dichotomous scale.
CHAPTER III
CLOSING
A. Conclusion
Based on the description of the material above, the contents of this paper can be
concluded as follows:
REFERENCES
Bollen, K. A. (1989). Structural Equations with Latent Variables (pp. 179-225). New York: John Wiley & Sons.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.
Guilford, J. P. (1953). Psychometric Methods. New Delhi: Tata McGraw Hill.
Rosenthal, R., & Rosnow, R. L. (1991). Essentials of Behavioral Research: Methods and Data Analysis (2nd ed., pp. 46-65). McGraw-Hill.
Rosnow, R., & Rosenthal, R. (2007). Essentials of Behavioral Research: Methods and Data Analysis (3rd ed.). McGraw-Hill Humanities/Social Sciences/Languages.
Stodola, Q., & Stordahl, K. (1972). Basic Educational Tests and Measurement. New Delhi: Thomson (India).
Wuensch, K. L. (2012). A Brief Introduction to Reliability, Validity, and Scaling. URL: http://www.core.ecu.edu/psyc/.../Reliability-Validity-Scaling.docx
https://www.ets.org/Media/Research/pdf/RM-18-01.pdf
https://hr-guide.com/Testing_and_Assessment/Reliability_and_Validity.htm
https://www.scribd.com/doc/244346720/Makalah-Reliabilitas-Tes