
ESTIMATION OF THE EXAMINEE PARAMETER USING THE NEWTON-RAPHSON METHOD: AN APPLICATION FOR TEST CONSTRUCTION

Widiatmoko
e.: moko.geong@gmail.com
w.: http://mokogeong.multiply.com
Widyaiswara at PPPPTKB, Ditjen PMPTK, Depdiknas

Abstract
Classical test theory (CTT) is known to be a test theory with many weaknesses. CTT permits interdependence between items and examinees. In addition, its items may be multidimensional. These weaknesses are not found in item response theory (IRT), which has proved to offer solutions to them. IRT carries several characteristics, such as item unidimensionality, local independence between items and examinees, and invariance of item-examinee parameters. To establish the characteristics of items and of examinees in test construction, these requirements need to be satisfied. In many tests, the analysis method used is grounded in IRT, which means that items and examinees are bound up with unidimensionality, local independence, and parameter invariance. To estimate item-examinee parameters, IRT uses several methods, one of which is the Newton-Raphson (N-R) method. This method is used to estimate the examinee parameter when the item parameters are already known. Working from secondary data, the examinee parameter is estimated marginally with the N-R method. It is concluded that the item characteristic curve for the one-parameter logistic model cannot be satisfied, which may be due to the limited number of respondents, the method chosen, or other factors.

Keywords: examinee parameter, Newton-Raphson method.

Introduction
Education is held to bring about certain kinds of social change. Its origin lies in the movements for the scientific management of education and in the work of behavioral psychologists, who defined learning as a process of observable, measurable changes in behavior. This is reflected primarily in the curriculum, which concerns the aims and objectives of learning. Aims are statements of the general changes a program seeks to bring about in students; objectives are statements of the specific changes a program seeks to bring about, and they result from analyzing an aim into its different components (Richards, 2002: 8-11). In addition, curriculum focuses on determining what knowledge, skills, and values students learn in schools, what experiences should be provided to bring about intended learning outcomes, and how teaching and learning in schools or educational systems can be planned, measured, and evaluated (Richards, 2001: 2). An ELT curriculum clearly covers all aspects of the planning, implementation, and evaluation of an ELT program. Evaluation involves all students and teachers. Its primary purpose is to determine whether or not the curriculum goals have been met; another purpose is to determine the effectiveness of the curriculum and to evaluate the ELT program itself (Finney, 2002: 77). In essence, the students have priority for attention over everything else covered by the curriculum. At the micro level, this is implemented by regularly checking what students know and can do, through questionnaires, interviews, discussions, observations, tasks, journals or logs, and tests (Cotterall & Reinders, 2004: 29-30). The test, however, is of primary concern here.
A test, as it relates to educational and psychological concerns, is a set of questions designed to measure an examinee's trait in a given situation. It is also defined as a group of systematic questions, statements, or tasks that examinees must answer or respond to, for the sake of measuring their skills, knowledge, intelligence, ability, or interest in a particular subject (Widiatmoko, 2004: 5). Ying (2005: 7) defines a test as an instrument or procedural application for measuring a variable's quantity or quality. A test, then, is used to measure examinees' skills, knowledge, or psychological attributes. Up to this point, tests have been designed in a simple way: an examiner focuses only on what the examinees are like within their norm groups and pays no attention to each examinee's real trait relative to reference groups. In addition, the examiner designs test items on the basis of that judgment about whom the items are intended for. When this happens, the test items become dependent upon the examinees: when high achievers answer the items correctly, those items are deemed easy, and vice versa. In other words, when the test is easy, the examinees appear to have high ability; when the test is hard, they appear to have low ability (Hambleton, Swaminathan, & Rogers, 1991: 2; Naga, 1992: 158-159). For that reason, item statistics change, or are inconsistent, depending upon the traits of the examinee groups; this is a failure that is easy to overlook. A further shortcoming of such test designs is that item difficulty (i.e., the proportion of examinees passing the item) and item discrimination (i.e., the item-total biserial or point-biserial correlation) are group-dependent: the values of these statistics depend on the examinee group from which they are obtained (Magnusson, 1967: 209-212; Hambleton, 1989: 147). Therefore, if the examinee sample does not closely reflect the population for whom the test is intended, the item statistics obtained from the sample are of limited usefulness. Item response theory (IRT) has consequently become widely known; indeed, IRT is now used by nearly all of the largest test publishers, many state departments of education, and industrial and professional organizations (Hambleton & Murray, 1983: 71; Hambleton, 1989: 149).
IRT resolves these test design problems. First, the item is independent of the examinee, and the examinee is independent of the item; this is known as local independence. "Local" refers to a point on the continuum of the examinee trait parameter θ, which can be an interval containing a homogeneous subpopulation of examinees; "independence" is interpreted as the independence of all examinees from the items within that subpopulation. Local independence is defined as the condition in which the composite scores on items responded to by a homogeneous subpopulation of examinees are independent (Naga, 1992 cited in Widiatmoko, 2005: 76). It implies that responses to any two items are uncorrelated within a homogeneous subpopulation at a particular level of θ (Hulin, Drasgow, & Parsons, 1983: 43). Hambleton, Swaminathan, & Rogers (1991: 10) state that when the traits influencing test performance are held constant, examinees' responses to any pair of items are statistically independent; in other words, after taking examinees' traits into account, no relationship exists between examinees' responses to different items. This requires that the traits specified in the model be the only factors influencing examinees' responses to the test items. Lord & Novick (1968: 361) and McDonald (1999: 255) state that local independence means that within any group of examinees all characterized by the same values θ1, θ2, ..., θk, the (conditional) distributions of the item scores are all independent of each other. Second, parameter invariance is construed as a property of the single measure, or the item characteristic bi, which does not change across subpopulations whenever the subpopulation changes; likewise, an examinee's trait does not change whenever the items chosen change (Hulin, Drasgow, & Parsons, 1983; Naga, 1992 cited in Widiatmoko, 2005: 76). That is, the parameters characterizing an item do not depend on the trait distribution of the examinees, and the parameter characterizing an examinee does not depend on the set of test items (Hambleton, Swaminathan, & Rogers, 1991: 18). Invariance is thus an important property of IRT as well; when invariance does not hold, the item is considered to be drifting from its original parameter value (Wells, Subkoviak, & Serlin, 2002: 77). Third, unidimensionality is defined as the presence of a dominant component or factor that influences test performance; this dominant component or factor is referred to as the trait measured by the test (Hambleton, Swaminathan, & Rogers, 1991: 9-10). Unidimensionality is also interpreted as each item measuring one trait or characteristic across examinees (Traub, 1983: 58; Naga, 1992: 164). It implies that the probability of an item response is a function of a single latent characteristic of the examinee, θ (Hulin, Drasgow, & Parsons, 1983 cited in Widiatmoko, 2005: 77). Since every characteristic is determined by one measure, one type of measure can also be interpreted as the requirement to measure only one dimension of the examinee's latent trait across the subpopulation.
Hence, local independence, parameter invariance, and unidimensionality are the primary characteristics of IRT. They are not found in classical test theory (CTT); they are the solution to CTT's drawbacks, and they are of course the advantages of IRT. On these grounds, IRT is widely used for obtaining good-quality items for a test battery. In this case, the test constructor designs test items fairly, not targeting them at particular examinees; items are constructed for all examinees in the reference groups. Language tests have so far been designed using IRT concepts, a concern that dates from the emergence of the discrete-point testing paradigm. In this view, discrete-point tests yield data that are easily quantifiable (Weir, 1990: 2). A discrete-point test is one that attempts to focus on one point of grammar at a time: each test item is aimed at one and only one element of a particular component of a grammar, and it purports to assess only one skill at a time and only one aspect of that skill (Oller, 1979: 37). The psychometric-structuralist paradigm therefore initiated language test item analysis in accordance with IRT concepts. Mostly, the concepts have been applied to TOEFL items, which remain an object of psychometric research with implications for wider studies on language testing.
To date, there has been a great deal of language testing research. One strand concerns the estimation of item-examinee parameters, which may be carried out jointly or marginally. The available estimation methods include joint maximum likelihood, marginal maximum likelihood, conditional maximum likelihood, joint and marginal Bayesian estimation, nonlinear factor analysis, and heuristic approaches (Hambleton, 1989: 166; Naga, 1992: 250; Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 2003: 29). The joint estimation of item-examinee parameters employs the Prox method, whereas the marginal estimation employs the Newton-Raphson method. The question formulated in this study is therefore: Is the test characteristic curve generated by the Newton-Raphson method satisfied in the one-parameter logistic model?

Estimation of the Examinee Parameter Using the Newton-Raphson Method


Parameter estimation is an essential part of IRT. Its purposes include determining the value of an examinee's trait with adequate precision and classifying an examinee into trait categories with small probabilities of misclassification (Lord & Novick, 1968: 405). It covers both the item parameters and the examinee parameter. Estimating the examinee parameter is the ultimate goal of testing, and this goal cannot be achieved without determining the parameters that characterize the items, namely item difficulty and discrimination. Items whose parameters have been estimated are kept secure. Because such items are nonetheless administered to examinees repeatedly, their exposure rate increases, particularly for items in item pools; in other words, these items can compromise test security. Adding new items to the item pools requires trying them out on examinees, and estimating item parameters with a small sample of examinees raises a fresh discussion: on one hand it concerns new items, whose parameters are unknown, and on the other hand experimental items, whose parameters are well known. Estimation of item parameters is therefore interesting to investigate (Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 2003: 27-28).
Both item and examinee parameters involve IRT's basic estimation method, maximum likelihood estimation, whose central idea is that parameter estimates are chosen by selecting the values that make an observed data set appear most likely in light of a particular model (Hulin, Drasgow, & Parsons, 1983: 46). Item parameter estimation may include estimation of item difficulty and discrimination; examinee parameter estimation covers estimation of the examinee's trait. At this point, parameter estimation is tied to the intended model of response, or item characteristic. IRT encompasses a number of models, but not all of them are commonly applied; only a few are in current use. These include the one-parameter logistic (1PL), two-parameter logistic (2PL), three-parameter logistic (3PL), and four-parameter logistic (4PL) models. What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and examinee response patterns given the overall pattern of responses in a set of test data (Henning, 1987: 107-108); all of them are likewise appropriate for dichotomous item response data. The primary distinction among the models, by contrast, lies in the number of parameters used to describe the items. The first parameter places item difficulty on the same scale as the examinee's trait; the second is a continuous estimate of discriminability; the third is an index of the pseudo-chance level (guessing) (Henning, 1987: 108; Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 2003: 29), and this guessing may affect the score matrix, the total test variance, and the test reliability (Magnusson, 1967: 225); the fourth is an index of carelessness among high achievers (Hambleton, 1989: 157). Mostly, choosing a model involves assumptions about the data that can be verified by examining how well the model explains the observed test results (Hambleton, Swaminathan, & Rogers, 1991: 12); in addition, the model should fit the real conditions. Once the model is decided, all calculation proceeds on the basis of the intended model. To reduce the probability of error in model selection, the item requirements of unidimensionality, invariance, and local independence must first be satisfied. After that, data are collected, and from the collected data the item-examinee parameters can be estimated. The estimation is repeated until the results fit the model (Naga, 1992: 175-176).
The 1PL model is hitherto the most widely used IRT model. It is also known as the Rasch model. The Rasch model is probabilistic in nature: examinees and items are not only graded for trait and difficulty but also judged according to the probability or likelihood of their response patterns given the observed examinee trait and item difficulty (Henning, 1987: 117). The reasons for adopting this model are the assumption that all items are equally discriminating, the fact that the intended application mainly concerns relatively easy tests, and the model's convenience with small samples. Hulin et al., cited in Crocker & Algina (1986: 355), suggest that much smaller sample sizes are required if the main purpose is to estimate θ. Parameter estimation quite often takes place within the 1PL model, which employs the item difficulty bi and the examinee's latent trait θ. Moreover, the 1PL model is strongly recommended because, for other models, particularly the 3PL model, it is not known whether the item parameter estimates are consistent. A procedure yields consistent estimates if it can be shown that as the sample size gets larger and larger, the estimates tend to get closer and closer to the true parameter values (Crocker & Algina, 1986: 355).
It is well known that estimation employs various methods. Two main estimation situations arising in practice are considered here: estimation of the trait with the item parameters known, and joint estimation of the item and trait parameters (Hambleton, 1989: 166-167). Estimating the latent trait with known item parameters is the simpler case, and it employs the Newton-Raphson (N-R) method. This method estimates the examinee's latent trait θ marginally when the item difficulty bi is known beforehand. In essence, it finds zeros of the derivative of the function being maximized (Krass, 2005: 7). Furthermore, the method has produced promising results in which the drift of the parameter estimates is arrested and the parameters are estimated more accurately than with the joint maximum likelihood procedure (Swaminathan, Hambleton, Sireci, Xing, & Rizavi, 2003: 29). In this setting, the item difficulty bi is the only item characteristic that influences examinee performance. The item difficulty bi is the point on the ability scale at which the probability of a correct response is 0.5; it indicates the position of the item characteristic curve (ICC) along the trait scale. Item difficulty bi in IRT is on the same scale as the examinee's latent trait θ (Setiadi, 1999: 6); in other words, the item parameter locates the ICC relative to the trait scale. The ICC itself is an item response model specifying a relationship between observable examinee-item performance (correct or incorrect responses) and the unobservable traits assumed to underlie performance on the test (Hambleton, 1989: 149). The greater the value of the bi parameter, the greater the trait needed for an examinee to have a 50% chance of getting the item right; hence, such items are harder. When the trait values of a group are transformed so that the mean is 0 and the standard deviation is 1, the values of bi vary from -3 to +3 (Hulin, Drasgow, & Parsons, 1983: 101), from around -4 to +4 empirically (Naga, 1992: 224), or from -∞ to +∞ theoretically (Hambleton, 1989: 161; Naga, 1992: 224), running from the easiest to the most difficult items. Therefore, bi in the 1PL model plays a pivotal role in estimation with the N-R method.
The N-R method follows several steps to estimate the examinee parameter. It employs the equation (Naga, 2003: 7):

$$\theta_{s+1} = \theta_s + \frac{\sum_{i=1}^{N}\left[X_i - P_i(\theta)\right]}{D\sum_{i=1}^{N} P_i(\theta)\,Q_i(\theta)}$$

in which θs is the current estimate of the examinee's latent trait; θs+1 is the next estimate; N is the number of items in the test; Xi is the examinee's response to item i; Pi(θ) is the probability that an examinee with trait θ answers item i correctly; Qi(θ) is the probability that an examinee with trait θ answers item i incorrectly; and D is a scaling constant, namely 1.7.
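As a bridging step (a standard IRT result, not spelled out in the source), this update can be read as the Newton step applied to the log-likelihood of the response pattern. Writing

$$\ln L(\theta) = \sum_{i=1}^{N}\left[X_i \ln P_i(\theta) + (1 - X_i)\ln Q_i(\theta)\right]$$

and noting that for the 1PL model $dP_i/d\theta = D\,P_i(\theta)\,Q_i(\theta)$, the derivatives are

$$\frac{\partial \ln L}{\partial \theta} = D\sum_{i=1}^{N}\left[X_i - P_i(\theta)\right], \qquad \frac{\partial^2 \ln L}{\partial \theta^2} = -D^2\sum_{i=1}^{N} P_i(\theta)\,Q_i(\theta),$$

so the Newton step $\theta_{s+1} = \theta_s - \left(\partial \ln L/\partial \theta\right)\big/\left(\partial^2 \ln L/\partial \theta^2\right)$ reproduces the equation above.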
The estimation of the parameter θ proceeds as follows. First, the examinee's responses are lined up with the item numbers and the item difficulties; a correct response is coded X = 1 and an incorrect response X = 0. Second, the initial trait estimate θs is calculated from the natural logarithm ln of the ratio between the examinee's success and failure proportions. Then the success probability Pi(θ) is calculated for all items using the formula

$$P_i(\theta) = \frac{e^{D(\theta - b_i)}}{1 + e^{D(\theta - b_i)}},$$

where e is the base of the natural logarithm, approximately 2.718.
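As a quick check (numbers ours, not from the source): when θ = bi the exponent is zero, so Pi(θ) = e⁰/(1 + e⁰) = 0.5, matching the earlier definition of bi as the point of a 50% chance of success; and when θ - bi = 1 with D = 1.7, Pi(θ) = e^1.7/(1 + e^1.7) ≈ 5.474/6.474 ≈ 0.85.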
Next, the failure probability is calculated as Qi(θ) = 1 - Pi(θ). Then the success probability Pi(θ) is subtracted from the examinee's response Xi, and the sum of these differences Xi - Pi(θ) is computed. After that, the scaling constant D, the success probability Pi(θ), and the failure probability Qi(θ) are multiplied, and the sum of DPi(θ)Qi(θ) is computed as well. To obtain the next iterate of the examinee's trait, θ1, the N-R formula is applied. The distance between θ0 and θ1 then governs the decision on the next iteration: when the distance is equal to or less than 0.001, the likelihood is considered to have reached its maximum and the iteration to have converged. According to Krass (2005: 7), convergence amounts to finding zeros of the derivative of the function being maximized. Finally, the θ estimation is carried out for all examinees.
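The steps above translate directly into a short program. The following is a minimal Python sketch of the procedure, assuming the 1PL model and the convergence tolerance of 0.001 described above. The item difficulties and response pattern in the usage example are invented for illustration, and no safeguard is included for all-correct or all-incorrect response patterns, for which the maximum likelihood estimate is unbounded.

import math

D = 1.7  # scaling constant, as used throughout the paper

def p_correct(theta, b):
    # Success probability P_i(theta) under the 1PL model.
    return math.exp(D * (theta - b)) / (1.0 + math.exp(D * (theta - b)))

def estimate_theta(responses, difficulties, theta0, tol=0.001, max_iter=50):
    # N-R update: theta_{s+1} = theta_s + sum(X_i - P_i) / (D * sum(P_i * Q_i))
    theta = theta0
    for _ in range(max_iter):
        p = [p_correct(theta, b) for b in difficulties]
        numer = sum(x - pi for x, pi in zip(responses, p))
        denom = D * sum(pi * (1.0 - pi) for pi in p)
        theta_next = theta + numer / denom
        if abs(theta_next - theta) <= tol:  # convergence criterion from the paper
            return theta_next
        theta = theta_next
    return theta

# Hypothetical usage: five items with invented difficulties, one response pattern.
b = [-1.2, -0.4, 0.0, 0.7, 1.5]
x = [1, 1, 1, 0, 0]                  # X = 1 correct, X = 0 incorrect
r = sum(x)                           # number of correct responses
theta0 = math.log(r / (len(x) - r))  # initial value: ln(correct/incorrect)
print(round(estimate_theta(x, b, theta0), 3))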

Methodology and Analysis


This is a survey-based study employing secondary data, namely a subpopulation of item difficulties bi and a subpopulation of examinees responding to the items. The subpopulation was drawn by purposive random sampling in several steps. First, a population of 45 items, together with their item difficulties, and a population of 2,000 examinees responding to those items were targeted. After that, 40 item difficulties bi were taken at random as the item subpopulation. Then only examinees who responded to the items, whether correctly or incorrectly, were purposively selected; those with missing responses were excluded. Finally, with the 1PL model in mind, 70 examinees responding to the items were taken at random as the examinee subpopulation. The units of analysis are therefore 40 item difficulties and 70 examinees responding to the items, and from these units the values of the examinees' latent traits are analyzed.
The analysis proceeds as follows. The initial θ values of the examinees' latent traits extend from -1.735 to +3.664. Estimation was completed at the first iteration for examinees 10, 20, 30, and 50; at the second iteration for examinees 5, 9, and 39; at the third iteration for examinees 1, 2, 4, 7, 8, 11, 12, 13, 14, 15, 17, 18, 19, 21, 22, 24, 25, 27, 28, 31, 32, 34, 35, 37, 41, 42, 43, 44, 49, 51, 53, 54, 55, 57, 59, 61, 63, 64, 65, 67, 68, and 69; at the fourth iteration for examinees 3, 6, 16, 23, 26, 29, 36, 38, 45, 46, 47, 56, 58, and 66; at the fifth iteration for examinees 33, 40, 48, 52, 62, and 70; and at the sixth iteration for examinee 60. The estimation yields examinee traits θ varying from -1.735 to +2.912.

Conclusion
Based on the estimation carried out with the N-R method, it can be concluded that the 1PL model is not sufficiently satisfied. Hypothetically, this may be due to the number of examinees, the method employed, the model chosen, the test length, or other factors.
As often happens, the result of the study does not answer the question as formulated; this implies that the observed hypothesis is accepted. Further study is nevertheless needed. Concerning item-examinee parameter estimation in language testing, it is recommended that several estimation methods be employed on a widely ranged set of test items using the 2PL, 3PL, and other models. In addition, computer programs for item-examinee parameter estimation are strongly recommended for the sake of accurate and quick iteration.

References
Cotterall, S. & Reinders, H. (2004). Learner strategies: A guide for teachers. Singapore: SEAMEO Regional Language Centre.
Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. Florida: Holt, Rinehart and Winston.
Finney, D. (2002). The ELT curriculum: A flexible model for a changing world. In J.C. Richards & W.A. Renandya (Eds.), Methodology in language teaching: An anthology of current practice (pp. 69-79). Cambridge: CUP.
Hambleton, R.K. & Murray, L.N. (1983). Some goodness of fit investigations for item response models. In R.K. Hambleton (Ed.), Applications of item response theory (pp. 71-94). Vancouver, B.C.: Educational Research Institute of British Columbia.
Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (pp. 147-200). New York: American Council on Education and Macmillan Publishing.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, California: Sage Publications.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, Massachusetts: Newbury House Publishers.
Hulin, C.L., Drasgow, F., & Parsons, C.K. (1983). Item response theory: Application to psychological measurement. Homewood, Illinois: Dow Jones-Irwin.
Krass, I.A. (2005). Application of direct optimization for item calibration in computerised adaptive testing (pp. 1-45). Available: http://www.psych.umn.edu/psylabs/CATCentral/PDF Files/KR98-01.pdf.
Lord, F.M. & Novick, M.R. (1968). Statistical theories of mental test scores. Canada: Addison-Wesley Publishing Company.
Magnusson, D. (1967). Test theory. Don Mills, Ontario: Addison-Wesley.
McDonald, R.P. (1999). Test theory: A unified treatment. New Jersey: Lawrence Erlbaum Associates.
Naga, D.S. (1992). Pengantar teori sekor pada pengukuran pendidikan. Jakarta: Gunadarma.
Naga, D.S. (2003). Teori responsi butir: Estimasi parameter secara terpisah. Paper discussed in the lecture of psychometrics. Post Graduate Program of Educational Research and Evaluation, State University of Jakarta.
Oller, J.W. (1979). Language tests at school. London: Longman Group.
Richards, J.C. (2001). Curriculum development in language teaching. Cambridge: CUP.
Richards, J.C. (2002). Planning aims and objectives in language programs. Singapore: SEAMEO Regional Language Centre.
Setiadi, H. (1999). Kegunaan dan keunggulan mendesain perangkat tes dengan menggunakan konsep item response theory (IRT). Wawasan, January, 3-9.
Swaminathan, H., Hambleton, R.K., Sireci, S.G., Xing, D., & Rizavi, S.M. (2003). Small sample estimation in dichotomous item response models: Effect of priors based on judgmental information on the accuracy of item parameter estimates. Applied Psychological Measurement, 27(1), 27-51.
Traub, R.E. (1983). A priori considerations in choosing an item response model. In R.K. Hambleton (Ed.), Applications of item response theory (pp. 57-70). Vancouver, B.C.: Educational Research Institute of British Columbia.
Weir, C.J. (1990). Communicative language testing. Hertfordshire: Prentice Hall International (UK).
Wells, C.S., Subkoviak, M.J., & Serlin, R.C. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26(1), 77-87.
Widiatmoko. (2004). Language Assessment: Bahan ajar diklat tingkat dasar guru bahasa Inggris sekolah menengah atas. Jakarta: PPPG Bahasa.
Widiatmoko. (2005). Joint maximum likelihood estimates on items-examinees using the prox method: A study on the reading subtest of TOEFL. Indonesian JELT, 1(1), 73-90.
Ying, B.P. (2005). Testing and evaluation in second language teaching. Paper presented at the MTCP Course. Institut Perguruan Bahasa-bahasa Antarabangsa, Kuala Lumpur, 5-30 September.

***
