
Available online at www.sciencedirect.com

Procedia - Social and Behavioral Sciences 46 (2012) 135 – 140

WCES 2012

Comparison of unidimensional and multidimensional models based on item response theory in terms of both variables of test length and sample size
Ibrahim Alper Kose, Nukhet C. Demirtasli
Abstract

The purpose of this study is to compare unidimensional and multidimensional item response models under different test length and sample size conditions, based on Turkish language test data. Sample size and test length were manipulated as independent variables. Test data were collected from eighth grade primary school students (n = 1516). The findings suggest that item and ability parameters estimated under multidimensional IRT have smaller errors and yield more precise measurement, and that model-data fit favours the multidimensional model. Manipulating sample size and test length had no positive effect on model-data fit under unidimensional IRT, whereas longer tests and larger samples reduced estimation error and produced more sensitive measurement under multidimensional IRT.

Keywords: Item response theory; multidimensional IRT; model-data fit.

1. Problem

Drawing useful and valid inferences from examinee responses is a central concern for researchers in education and psychology. Moreover, correct modelling of measurement data is fundamental to many decisions (selection, placement, assessment of achievement). For these reasons, the purpose of this study is to analyse Turkish language test data, developed by the researchers, under unidimensional and multidimensional item response theory models and to identify the model that best fits the data.

Historically, the leading theory for explaining the latent trait underlying an examinee's test performance has been classical test theory (CTT). CTT is a simple model which states that the observed score on a test is the sum of the true score and measurement error. Its crucial shortcomings are that test and item characteristics are group dependent, that information about examinee performance comes only from the whole test, and that it provides no information about an examinee's performance on a single test item (Hambleton, Swaminathan and Rogers, 1991).
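In its standard notation (not written out in the original), the CTT decomposition for examinee j is

    X_j = T_j + E_j, \quad \text{with } E(E_j) = 0 \text{ and } \mathrm{Cor}(T_j, E_j) = 0,

where X_j is the observed score, T_j the true score and E_j the random measurement error; test reliability is then the ratio \mathrm{Var}(T)/\mathrm{Var}(X).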

One of the most important developments in psychological measurement in the last century is IRT. IRT is a modern test theory that explains an examinee's ability level through responses to test items using mathematical models, and it rests on strong assumptions where CTT's are weak (Babcock, 2009). One of the most important advantages of IRT is that it places the ability of the respondent and the difficulty of the item on the same measurement scale (Spencer, 2004). Additionally, the estimated item parameters are invariant with regard to which examinees are sampled from the population, the estimated proficiency level remains constant regardless of which items are administered, and IRT can estimate examinee ability with more precision and less measurement error (Lee, 2007).
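For orientation, a widely used unidimensional model is the two-parameter logistic (2PL) item response function; the paper does not state which unidimensional model was fitted, so this form is illustrative only:

    P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},

where \theta is the examinee's ability, a_i the discrimination and b_i the difficulty of item i. Because \theta and b_i lie on the same scale, the invariance properties described above follow naturally.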

1877-0428 © 2012 Published by Elsevier Ltd. Selection and/or peer review under responsibility of Prof. Dr. Hüseyin Uzunboylu
Open access under CC BY-NC-ND license. doi:10.1016/j.sbspro.2012.05.082

Multidimensional IRT (MIRT) models explain, through a mathematical model, the relationship between two or more unobservable variables, conceptualized as constructs or dimensions, and the probability that an examinee correctly answers a particular test item (Ackerman, Gierl and Walker, 2003). Like unidimensional models, multidimensional models carry assumptions, namely monotonicity and local independence. Monotonicity means that as the examinee's ability level increases, the probability of correctly answering any particular test item increases (Smith, 2009). Local independence means that the probability of solving any item i is independent of the outcome of any other item, controlling for person and item parameters (Embretson and Reise, 2000).
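Formally (standard notation, not spelled out in the original), local independence states that responses to the n items are conditionally independent given the latent trait vector \theta:

    P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \prod_{i=1}^{n} P(X_i = x_i \mid \theta),

and monotonicity requires each P(X_i = 1 \mid \theta) to be non-decreasing in every coordinate of \theta.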

There are two types of MIRT models, depending on whether high proficiency on one trait can compensate for low proficiency on another: compensatory and non-compensatory MIRT models (Sijtsma and Junker, 2006). In compensatory MIRT models, the latent abilities interact such that a deficiency in one ability can be offset by an increase in other abilities. By contrast, in non-compensatory MIRT models, a sufficient level of each measured ability is required, and a deficiency in one ability cannot be completely offset by an increase in the others. Compensatory models may be most appropriate for items having disjunctive component processes. For example, items with multiple solution strategies likely possess compensatory multidimensionality, since a deficiency in one ability is naturally compensated for by the other. By contrast, noncompensatory models may be appropriate for items that have conjunctive component processes. For example, a word problem on a mathematics test may require reading ability to interpret the question and then mathematics ability to solve it. For such an item, it is unlikely that either ability can compensate for a lack of the other (Bolt and Lall, 2003). Although both compensatory and non-compensatory models have been applied in educational research, applications of compensatory models dominate the literature (Drasgow and Parsons, 1983). This preference may be due to the absence of efficient algorithms for estimating item parameters in non-compensatory models (Knol and Berger, 1991).
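The distinction is easy to see in two common two-dimensional forms from the MIRT literature (given here for illustration; the paper does not write them out). In the compensatory model the abilities enter through a weighted sum, so a high \theta_2 can offset a low \theta_1, whereas in the noncompensatory model the component probabilities multiply, so both abilities must be adequate:

    Compensatory:      P(X_i = 1 \mid \theta) = \frac{1}{1 + \exp[-(a_{i1}\theta_1 + a_{i2}\theta_2 + d_i)]}

    Noncompensatory:   P(X_i = 1 \mid \theta) = \prod_{k=1}^{2} \frac{1}{1 + \exp[-a_{ik}(\theta_k - b_{ik})]}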

In multidimensional IRT, the item characteristic surface (ICS) represents the probability that an examinee with a given ability composite will correctly answer an item. This terminology applies to two-dimensional items; the term "item response hypersurface" should be used when more than two latent traits underlie an item. The ICS is limited to two dimensions because of the restrictions on graphically representing more than two dimensions of the multidimensional latent space (Ackerman, Gierl and Walker, 2003; Ackerman, 2005).

MIRT also generalizes some familiar unidimensional concepts. Reckase (1985, 1997) defined multidimensional item difficulty (MDIFF) and multidimensional item discrimination (MDISC). MDISC is an overall measure of multidimensional discrimination; in the graphical representation it is the length of the item vector, and the longer the vector, the more discriminating the item. MDIFF is the multidimensional counterpart of b_i in unidimensional IRT, but cannot be interpreted in the same manner; it has been described as the distance from the origin to the vector's point of steepest slope (Smith, 2009).
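In Reckase's formulation, for an item with discrimination vector a_i = (a_{i1}, \ldots, a_{im}) and intercept d_i, these quantities are

    \mathrm{MDISC}_i = \sqrt{\sum_{k=1}^{m} a_{ik}^2}, \qquad \mathrm{MDIFF}_i = \frac{-d_i}{\mathrm{MDISC}_i},

so MDISC_i is the length of the item vector and MDIFF_i the signed distance from the origin to the point of steepest slope in the direction of that vector.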

Related research comparing unidimensional and multidimensional models (Chang, 1992; Spencer, 2004; de la Torre and Patz, 2005; Seungho Yang, 2007; Köse, 2010) has shown that, as the number of latent traits underlying item performance increases, item and ability parameters estimated under multidimensional IRT show smaller errors and more precise measurement, and model-data fit favours multidimensional IRT.

2. Method

In this study, examinee data were analysed according to the research questions under unidimensional and multidimensional IRT models; item and ability parameters were estimated, and model-data fit and error rates are reported. The research group consisted of eighth grade primary school students (n = 1516). A 24-item Turkish test, developed by the researchers, was used as the research instrument. Every test item measures two latent traits, semantics in language and knowledge of punctuation, which compensate for each other. The sum of squared residuals (SSR), the root mean square of residuals (RMSR) and the Tanaka goodness-of-fit index were used to assess model-data fit for item parameter estimates. The root mean square standard deviation (RMSD) and empirical reliability were used to assess model-data fit for ability estimates. Two software packages were used to analyse the test data: NOHARM for item parameter estimation and TESTFACT for ability parameter estimation.
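The three item-level indices are all functions of the residuals between the observed and model-implied inter-item covariance matrices. A minimal sketch, assuming the common formulations used with NOHARM-style output (the exact formulas in the software may differ slightly, and the function name is hypothetical):

    import numpy as np

    def item_fit_indices(observed, implied):
        """SSR, RMSR and Tanaka index from observed and model-implied
        inter-item covariance matrices, using off-diagonal elements only."""
        obs = np.asarray(observed, dtype=float)
        imp = np.asarray(implied, dtype=float)
        mask = ~np.eye(obs.shape[0], dtype=bool)      # drop the diagonal
        resid = (obs - imp)[mask]
        ssr = np.sum(resid ** 2)                      # sum of squared residuals
        rmsr = np.sqrt(np.mean(resid ** 2))           # root mean square residual
        tanaka = 1.0 - ssr / np.sum(obs[mask] ** 2)   # Tanaka goodness-of-fit index
        return ssr, rmsr, tanaka

Smaller SSR and RMSR values and a Tanaka index closer to 1 indicate better fit, which is how Tables 1 and 3 below should be read.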

3. Findings and Results

In this section, findings and results are summarized by research question.

Has item parameter estimation been affected by test length and sample size under unidimensional and multidimensional IRT?

Model-data fit indices were compared to investigate the effects of test length and sample size on item parameter estimation under unidimensional and multidimensional IRT models. To this end, fit indices obtained from two conditions, sample size = 500 / test length = 12 items and sample size = 1500 / test length = 24 items, were compared. Results are given in Table 1.

Table 1. The effect of both sample size and test length on item parameter estimation under unidimensional and multidimensional IRT models

                 Unidimensional               Multidimensional
           N=500/k=12   N=1500/k=24     N=500/k=12   N=1500/k=24
SSR           0.007         0.003          0.003         0.001
RMSR          0.010         0.010          0.009         0.007
Tanaka        0.979         0.970          0.970         0.985

As seen in Table 1, under unidimensional IRT increasing both sample size and test length had no effect on item parameter estimation. In contrast, the fit indices obtained from the multidimensional IRT model were clearly affected by increasing both sample size and test length. In other words, under multidimensional models it is necessary to work with long tests and large samples to obtain more precise item parameters with smaller error rates. These results contradict Seungho-Yang's (2007) study, in which test length, sample size and theta correlation were chosen as variables and item parameter estimation was affected only by sample size and theta correlation jointly. The contradiction may stem from the different test lengths (15, 30, 45) and sample sizes (N = 100, 500 and 1000) used in that study; that is, that study worked with smaller samples and longer tests.

Has ability parameter estimation been affected by test length and sample size under unidimensional and multidimensional IRT?

To investigate the effects of test length and sample size under unidimensional and multidimensional IRT, the test data were randomly divided into three samples (500, 1000, 1500) and the whole test was divided into two forms (12 items and 24 items); construct validity was ensured for both forms. To assess the effect of the two variables on ability estimation, the RMSD and reliability index obtained from the smallest sample/shortest test and the largest sample/longest test were compared. Results are summarized in Table 2, after the design sketch below.
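As a minimal sketch of this crossed design, with hypothetical names (the paper does not describe its implementation; here the first 12 columns stand in for the 12-item form, whereas the actual short form was assembled to preserve construct validity):

    import numpy as np

    rng = np.random.default_rng(2012)

    def build_conditions(responses, sample_sizes=(500, 1000, 1500), short_len=12):
        """Draw random examinee samples of each size and pair each with the
        12-item short form and the full 24-item response matrix."""
        n_examinees, n_items = responses.shape
        conditions = {}
        for size in sample_sizes:
            rows = rng.choice(n_examinees, size=size, replace=False)
            conditions[(size, short_len)] = responses[rows, :short_len]
            conditions[(size, n_items)] = responses[rows, :]
        return conditions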

Table 2. Descriptive statistics and model-data fit indices of ability estimates obtained from different test lengths and sample sizes.

Table 2 shows that the model-data fit index, RMSD, and the reliability index were better for longer tests and larger samples than for shorter tests and smaller samples under both IRT models. In other words, with long tests and large samples, ability estimation involves less error and is more reliable under multidimensional IRT. However, a careful examination of Table 2 shows that test length is the main variable affecting ability estimation positively; according to these results, the joint effect of the two variables is not meaningful.

Has model-data fit been affected by test length and sample size under unidimensional and multidimensional IRT?

Model-data fit indices were calculated separately for the three sample sizes and two test lengths to investigate the joint effect of test length and sample size on model-data fit. To assess the effect of the two variables, the fit indices obtained from each condition were compared. Results are summarized in Table 3.

Table 3. The effect of both sample size and test length on model-data fit under unidimensional and multidimensional IRT models

                          Unidimensional                         Multidimensional
Sample size        500         1000         1500          500         1000         1500
Test length      12    24    12    24     12    24      12    24    12    24     12    24
SSR            0.007 0.003 0.007 0.003  0.005 0.003   0.006 0.003 0.003 0.018  0.002 0.001
RMSR           0.010 0.010 0.010 0.010  0.009 0.010   0.009 0.010 0.007 0.008  0.006 0.007
Tanaka index   0.975 0.960 0.979 0.970  0.986 0.970   0.982 0.970 0.992 0.982  0.994 0.985

When Table 3 is analysed, it can be observed that increasing test length and sample size positively affected only the SSR under unidimensional IRT; the RMSR and Tanaka index were not affected by that manipulation. On the basis of these findings, it can be said that manipulating sample size and test length has no positive effect on model-data fit under unidimensional IRT.

Under multidimensional IRT, the SSR fell from 0.006 to 0.001, the RMSR fell from 0.009 to 0.007, and the Tanaka index rose from 0.982 to 0.985. These findings indicate that increasing sample size and test length together has a positive effect on model-data fit indices under multidimensional IRT. In other words, working with longer tests and larger samples reduces estimation error and yields more sensitive measurement under multidimensional IRT.

4. Conclusion

Alan Greenspan, who served as chairman of the US Federal Reserve (1987-2006), states in his book The Age of Turbulence (2007) that more specific economic models with more parameters have yielded more valid inferences. Similarly, Walker and Beretvas (2003) compared unidimensional and multidimensional IRT models and concluded that working with more complex models gives smaller estimation errors, more reliable results and better model-data fit indices.

Although unidimensional IRT has long been used in education and psychology, it is very difficult to fully meet the unidimensionality assumption in achievement and ability tests (Hambleton, Swaminathan and Rogers, 1991; Reckase, 1997). Where the unidimensionality assumption cannot be met, applying unidimensional models to a multidimensional data matrix creates validity problems in ability and item parameter estimation and also causes model-data fit problems. The main purpose of researchers working in education is to reach valid and reliable results that characterize examinees on the basis of their item responses; improper analysis of examinee responses will lead to wrong decisions about individuals. The multidimensional extension of IRT, namely MIRT, is increasingly used by researchers, and its development will continue alongside researchers' interest and expectations and improving software technology.

Based on the findings of this research, the study could be extended using large-scale tests with various test lengths and sample sizes. The research could also be replicated with a focus on estimating outlier item and person parameters under unidimensional and multidimensional IRT models.

References

Ackerman, T.A. (1992). Assessing construct validity using multidimensional item response theory. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA, USA.


Ackerman, T.A. (1994a). Graphical representation of multidimensional item response theory analyses. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.
Ackerman, T.A. (1994b). Using multidimensional item response theory to understand what items and tests are
measuring. Applied Measurement in Education, 7(4), 255-278.
Ackerman, T.A., Gierl, M.J., Walker, C.M. (2003). Using multidimensional item response theory to evaluate educational and
psychological tests. Educational Measurement: Issues and Practice: MIRT Instructional Module.

Ackerman, T.A. (2005). Multidimensional item response theory modelling. In J.J. McArdle (Ed.), Contemporary Psychometrics (pp. 3-24).
Antal, T. (2007). On multidimensional item response theory: A coordinate-free approach. Electronic Journal of Statistics, 1, 290-306.

Babcock, B.G.E. (2009). Estimating a noncompensatory IRT model using a modified Metropolis algorithm. Unpublished Doctoral Dissertation. The University of Minnesota.
Bock, R.D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice. Winter 1997.

Bock, D.R., Thissen, D. and Zimowski, M.F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34(3), 197-211.
Bolt, D.M. and Lall, V.F. (2003). Estimation of compensatory and non-compensatory multidimensional item response models using
Markov Chain Monte Carlo. Applied Psychological Measurement, 27, 395.

Chang, Y.W. (1992). A comparison of unidimensional and multidimensional IRT approaches to test information in a test battery.
Unpublished Doctoral Dissertation. University of Minnesota.
de la Torre, J. and Patz, R.J. (2005). Making the most of what we have: A practical application of multidimensional item response theory in test scoring. Journal of Educational and Behavioral Statistics, 30(3), 295-311.

Drasgow, F. and Parsons, C.K. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189-199.
Embretson, S.E. and Reise, S.P. (2000). Item Response Theory for Psychologists. Lawrence Erlbaum Associates, Inc.
Greenspan, A. (2007). The Age of Turbulence. The Penguin Press, New York.
Hambleton, R.K. and Swaminathan, H. (1989). Item Response Theory. Principles And Applications. Kluwer-Nijhoff Publishing.
Boston- USA.

Hambleton, R.K. (1994). Item response theory: A broad psychometric framework for measurement advances. Psicothema, 6, 535-536.
Hambleton, R. K., Swaminathan, H. and Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park CA: Sage.
Kao, S.C. (2007). The new goodness of fit index for multidimensional item response models. Unpublished Doctoral Dissertation. Michigan State University.
Köse, İ.A. (2010). Comparison of unidimensional and multidimensional models based on item response theory in terms of test length
and sample size. Unpublished Doctoral Dissertation. University of Ankara.
Kreiter, C.D. (1993). An empirical investigation of compensatory and noncompensatory test items in simulated and real data. Unpublished Doctoral Dissertation. The University of Iowa.
Lee, S.H. (2007). Multidimensional item response theory: A SAS MDIRT macro and empirical study of the PIAT math test. Unpublished Doctoral Dissertation. The University of Oklahoma.

Li, Y.H. and Schafer, W.D. (2005). Trait parameter recovery using multidimensional computerized adaptive testing in reading and mathematics. Applied Psychological Measurement, 29, 3-25.
McDonald, R.P. (1982). Linear versus nonlinear models in item response theory. Applied Psychological Measurement, 6, 379-396.
McDonald, R.P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24, 99.
Pomplun, M.R. (1988). Effects of local dependence in achievement tests on IRT ability estimation. Unpublished Doctoral Dissertation. The Florida State University.

Reckase, M.D. (1997). Models for multidimensional tests and hierarchically structured training materials. Technical Report. The American College Testing Program, Iowa City, Iowa.

Seungho Yang, M.A. (2007). A comparison of unidimensional and multidimensional Rasch models using parameter estimates and fit indices when the assumption of unidimensionality is violated. Unpublished Doctoral Dissertation. The Ohio State University.
Sijtsma, K. and Junker, B.W. (2006). Item response theory: Past performance, present developments, and future expectations. Behaviormetrika, 1, 75-102.
Smith, J. (2009). Some issues in item response theory: Dimensionality assessment and models for guessing. Unpublished Doctoral Dissertation. University of Southern California.
Spencer, G.S. (2004). The strength of multidimensional item response theory in exploring construct space that is multidimensional and correlated. Unpublished Doctoral Dissertation. Brigham Young University.

Walker, C.M. and Beretvas, S.N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40(3), 255-275.
Zhang, B. (2008). Application of unidimensional item response models to tests with items sensitive to secondary dimensions. The Journal of Experimental Education, 77(2), 147-166.

Zhao, Y. (2008). Approaches for addressing the fit of item response theory models to educational test data. Unpublished Doctoral Dissertation.
