
Available online at www.sciencedirect.com

Procedia - Social and Behavioral Sciences 191 (2015) 121–125

WCES 2014

A Study of Quality Assessment of Science Instructional Management in Thailand: An Analysis of Differential Item Functioning and Test Functioning in Mixed Format Tests
Panat Chanpleng a, Nuttaporn Lawthong b,*, Sungwon Ngudgratoke c

a Ph.D. Student in the Department of Educational Research and Psychology, Chulalongkorn University, Bangkok 10330, Thailand
b Assistant Professor in the Department of Educational Research and Psychology, Chulalongkorn University, Bangkok 10330, Thailand
c Lecturer in the School of Educational Studies, Sukhothai Thammathirat Open University, Nonthaburi 11120, Thailand

Abstract

International education assessments are important for educational improvement when participating countries use them appropriately to identify weaknesses in their educational systems. However, their usefulness depends on the validity of the test itself. One aspect of validity worth investigating is fairness, which assesses whether assessment items and tests are fair to all subgroups of examinees. Differential item functioning (DIF) and differential test functioning (DTF) methods are usually used to assess fairness: test items that are not fair are flagged as DIF, and tests that are not fair are flagged as DTF. The objectives of this research were to apply differential item functioning and differential test functioning methods to analyze the extent of DIF and DTF in science assessment data. The data explored in this research were secondary data from the Programme for International Student Assessment (PISA) 2009 and were analyzed using the Differential Item Functioning Analysis System (DIFAS) version 5.0. It was found that the mixed format tests used by PISA favored some groups of examinees over others, which indicates a degree of unfairness across groups of students with different backgrounds. The recommendation for the assessment of student performance is that DIF items be removed before scores are calculated and reported, and that a value-added model be used to remove factors that are unfair to different groups of test takers.
© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Selection and peer-review under responsibility of the Organizing Committee of WCES 2014.
Keywords: differential item functioning; differential test functioning; mixed format tests

* Nuttaporn Lawthong. Tel.: +6-684-134-5933; fax: +6-622-182-801.
E-mail address: Nuttaporn.l@chula.ac.th

doi:10.1016/j.sbspro.2015.04.348

1. Introduction

The Programme for International Student Assessment (PISA) was established through the cooperation and expertise of OECD and non-OECD countries to create a common set of tests capable of indicating the quality of education in participating countries. The goal of PISA is to provide a quality index that member countries can use to shape future development, rather than merely to determine the quality of students at any particular level. The PISA assessment takes place every three years and adopts a literacy perspective, emphasizing the knowledge and skills that students learn and practice at school (OECD, 2012). According to the assessment results, Thai students participating in the programme in 2009 had scientific literacy scores below the OECD average. When these results are compared with Thailand's current target for scientific literacy, which focuses on cognitive thinking and creativity, there is a vital need to take them into consideration for better instructional management in science subjects (OECD, 2012). Nevertheless, scientific literacy assessment should not rest on test scores alone; it is also necessary to consider the fundamental cultural context of students in each country. Furthermore, test fairness should be given top priority. Test quality is important for educational assessment and evaluation: tests need reliability and validity that meet set standards, and consideration should also be given to potential bias during test administration. Differential item functioning (DIF) is one statistical method for investigating item bias and is very useful in educational testing. DIF occurs when examinees from different groups (e.g., gender, demographic, socioeconomic, or wealth groups) who have the same ability nevertheless differ in their probability of giving a particular item response (Zumbo, 1999; Millsap & Everson, 1993; Dodeen & Johnson, 2003; Kamata & Vaughn, 2004). When an item is detected as functioning differentially, it is removed, or revised and returned to the original test set, so that the test is equally fair to every group of examinees. If, after differential test functioning analysis, the entire test set is found to be unsuitable for use, a method for assessing differential functioning at the test level is needed; such information is useful for judging instrument effectiveness, for priority ranking, or for selection. For these reasons, the researchers were interested in studying DIF in dichotomously and polytomously scored items, together with differential test functioning (DTF), in the scientific literacy test of PISA, in order to present analysis results on bias due to gender and wealth that indicate the degree of unfairness across groups of students with different backgrounds. The findings can be used to support the administration of science instruction in the Thai context.

2. Literature review

The review of documents and research related to differential item functioning and differential test functioning can be summarized as follows:

2.1. Differential Item Functioning (DIF) Analysis

DIF detection forms part of the study of test content validity and fairness (AERA, APA, & NCME, 1999). A test set used to assess any particular field of ability must have content validity for students with differences and diversity. The group of examinees representing the main group in the population is called the reference group, and the group representing a subgroup of the population is called the focal group; the focal group is the group of interest when studying DIF. In each test administration, examinees in the subgroups may have different characteristics (Swaminathan & Rogers, 1990; Gallagher & Kaufman, 2005). A variety of DIF detection techniques have been developed for dichotomously and polytomously scored items (Millsap & Everson, 1993; Penfield & Algina, 2006). Some techniques are based on IRT (e.g., SIBTEST, Lord's χ² test), while others, such as the Mantel-Haenszel method, do not use IRT.
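
To make the reference group/focal group distinction concrete, the following minimal sketch (in Python; not part of the original study, with illustrative function and variable names and a simple five-stratum matching assumption) shows how examinees could be split into reference and focal groups, matched on total score, and tallied into the per-stratum 2×2 tables on which Mantel-Haenszel-type DIF statistics operate.

```python
import numpy as np

def mh_strata_tables(item, group, total, n_strata=5):
    """Build one 2x2 table per ability stratum for a dichotomous item.

    item  : 0/1 responses to the studied item
    group : 0 = reference group, 1 = focal group
    total : matching variable (e.g., total test score)
    Returns a list of (A, B, C, D) counts per stratum, where A/B are the
    reference group's correct/incorrect counts and C/D are the focal group's.
    """
    item, group, total = map(np.asarray, (item, group, total))
    # equal-width score bins for brevity; operational analyses often
    # match on each observed total score instead of coarse bins
    edges = np.linspace(total.min(), total.max(), n_strata + 1)[1:-1]
    strata = np.digitize(total, edges)
    tables = []
    for k in range(n_strata):
        ref = (strata == k) & (group == 0)
        foc = (strata == k) & (group == 1)
        A, C = item[ref].sum(), item[foc].sum()
        tables.append((A, ref.sum() - A, C, foc.sum() - C))
    return tables
```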

2.2. Differential Test Functioning (DTF) Analysis

DTF detection can provide valuable data for the content validation process in several ways: 1) it combines DIF information across all items in the test set; 2) it provides an index of the potential impact of DIF on classification; and 3) measuring the area where DTF occurs is a key component of differential bundle functioning (DBF), which can be used to identify the cause of DIF. Models for measuring DTF can be classified into three types: DTF with dichotomous items, with polytomous items, and with mixed format tests. DTF with dichotomous items can be examined with three approaches: 1) the SIBTEST procedure proposed by Shealy and Stout (1993); 2) the DFITS procedure; and 3) a random effects model. For polytomous items, SIBTEST and DFITS can assist in checking the test. For mixed format tests, the generalized DTF estimators can be used to check tests containing both dichotomous and polytomous items (Penfield & Algina, 2006).
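
The generalized DTF estimators mentioned above summarize how much the item-level DIF effects vary across a whole test. The sketch below (Python; a deliberately simplified moment-based illustration, not the actual Penfield and Algina (2006) estimator, which uses a different weighting and bias correction) conveys the core idea: the observed spread of the item DIF estimates is the sum of true DIF variance and sampling noise, so subtracting the average sampling variance leaves an estimate of true differential functioning.

```python
import numpy as np

def dif_effect_variance(effects, std_errors):
    """Moment-based estimate of between-item DIF effect variance.

    effects    : estimated DIF effect (e.g., a log-odds ratio) per item
    std_errors : sampling standard error of each estimate
    Observed spread = true DIF variance + sampling noise, so subtracting
    the mean sampling variance leaves an estimate of true DIF variance.
    """
    effects = np.asarray(effects, float)
    se = np.asarray(std_errors, float)
    observed = effects.var(ddof=1)     # spread of the item DIF estimates
    noise = np.mean(se ** 2)           # average sampling variance
    return max(observed - noise, 0.0)  # variance cannot be negative
```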

3. Research Methodology

In this research, the sample of Thai students was selected using a multi-stage random sampling method from the PISA 2009 database. The researcher divided the data analysis into three steps: 1) categorizing the data by the factors to be studied; 2) dividing the tests into the nine booklets actually administered, namely Booklets 2, 3, 4, 7, 8, 9, 10, 12 and 13; and 3) performing DIF and DTF analyses on items and tests for the variables "gender" and "wealth" using the DIFAS version 5.0 software program (Penfield, 2005). The statistic used in DIF detection for dichotomously scored items is the standardized Mantel-Haenszel log-odds ratio (LOR Z; values greater than 2.0 or less than -2.0 may indicate the presence of DIF) (Penfield, 2007). For polytomously scored items, the Liu-Agresti cumulative common log-odds ratio is used (L-A LOR; values greater than 2.0 or less than -2.0 may indicate the presence of DIF) (Penfield & Algina, 2003). For DTF detection in mixed format tests, a generalized DIF effect variance estimator is used (Q²; small for Q² < .07, medium for .07 ≤ Q² ≤ .14, and large for Q² > .14) (Penfield & Algina, 2006).
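
To make the flagging rules above concrete, the following sketch (Python; an illustrative reimplementation under stated assumptions, not the DIFAS source code) computes a standardized Mantel-Haenszel log-odds ratio from the per-stratum tables of Section 2.1 using the Robins-Breslow-Greenland standard error, flags |LOR Z| > 2.0, and applies the Q² cut-offs quoted in the text. Zero-cell corrections are omitted for clarity.

```python
import numpy as np

def lor_z(tables):
    """Standardized Mantel-Haenszel log-odds ratio for one dichotomous item.

    tables: iterable of (A, B, C, D) per stratum, where A/B are the
    reference group's correct/incorrect counts and C/D the focal group's.
    """
    A, B, C, D = map(np.asarray, zip(*tables))
    N = (A + B + C + D).astype(float)
    R, S = A * D / N, B * C / N
    lor = np.log(R.sum() / S.sum())   # MH common log-odds ratio
    # Robins-Breslow-Greenland variance estimate of the log-odds ratio
    P, Q = (A + D) / N, (B + C) / N
    var = ((P * R).sum() / (2 * R.sum() ** 2)
           + (P * S + Q * R).sum() / (2 * R.sum() * S.sum())
           + (Q * S).sum() / (2 * S.sum() ** 2))
    return lor / np.sqrt(var)

def flag_dif(z, cut=2.0):
    """Flag an item as DIF when |LOR Z| exceeds the cut-off used in the study."""
    return abs(z) > cut

def classify_dtf(q2):
    """Apply the Q² cut-offs reported in the text (Penfield & Algina, 2006)."""
    return "small" if q2 < 0.07 else ("medium" if q2 <= 0.14 else "large")
```

For example, classify_dtf(0.09) returns "medium", the band used to label several of the booklets in Tables 1 and 2.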

4. Research Findings

The tables below display DIF frequencies, DIF percentages, and DTF detection results for the nine booklets; note that each booklet was examined for both DIF and DTF. The results are divided by gender and wealth as follows:

4.1. DIF and DTF analyses: gender

The DIF and DTF analyses by gender are shown in Table 1. Booklets 7, 12, and 9 were found to have the three highest DIF percentages (27.78%, 25.00% and 22.22%, respectively), while Booklet 4 had the lowest DIF percentage (5.56%). For the DTF analysis, all nine booklets were classified as small to medium DTF. Booklets 2, 3, 4, 10 and 13 were classified as small DTF according to the DTF variance estimator, which suggests negligible DIF, while Booklets 7, 8, 9 and 12 were classified as medium DTF. Comparing the DTF detection results with the DIF percentages for the variable "gender", seven booklets had DIF percentages between 11.11% and 27.78%. In particular, Booklets 3, 10 and 13 had DIF percentages over 10%, but their DTF detection results were small, indicating negligible DIF.

Table 1. The results of DIF and DTF analyses by gender for the nine booklets

Booklet   Number of items   DIF frequency   DIF percentage (%)   DTF detection
2         17                1                5.88                small
3         35                4               11.43                small
4         18                1                5.56                small
7         18                5               27.78                medium
8         17                3               17.65                medium
9         18                4               22.22                medium
10        35                4               11.43                small
12        36                9               25.00                medium
13        18                2               11.11                small
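
As a check on the percentage column, a booklet's DIF percentage is simply its DIF frequency divided by its number of items: for Booklet 7, 5/18 × 100 ≈ 27.78%.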

4.2. DIF and DTF analyses: wealth

For the DIF analysis by wealth, Booklets 3, 2 and 10 had the three highest DIF percentages (14.29%, 11.76% and 11.43%, respectively), while Booklets 7 and 13 had the lowest DIF percentage (5.56%). As shown in Table 2 below, for the DTF analysis all nine booklets were classified as small to medium DTF. Booklets 2, 4, 7, 8, 9, 10 and 13 were classified as small DTF, which suggests negligible DIF, while Booklets 3 and 12 were classified as medium DTF. Comparing the DTF detection results with the DIF percentages for the variable "wealth", six booklets had DIF percentages between 11.11% and 14.29%. In particular, Booklets 2, 4, 9 and 10 had DIF percentages over 10%, but their DTF detection results were small, indicating negligible DIF.

Table 2. The results of DIF and DTF analyses by wealth for the nine booklets

Booklet   Number of items   DIF frequency   DIF percentage (%)   DTF detection
2         17                2               11.76                small
3         35                5               14.29                medium
4         18                2               11.11                small
7         18                1                5.56                small
8         17                1                5.88                small
9         18                2               11.11                small
10        35                4               11.43                small
12        36                4               11.11                medium
13        18                1                5.56                small

5. Conclusion and Discussion

Recall that the purpose of this study was to illustrate the use of DIF and DTF analyses in PISA. DIF and DTF analyses can reveal the impact of bias on item difficulty associated with examinee background variables, and this study showed the potential for bias in the tests. The researcher studied the variables "gender" and "wealth", identified from the review of literature related to DIF. Previous studies have generally focused on DIF detection; very few studies have examined DTF detection. The topics of test and item bias therefore have important implications for educational improvement. In this research, the researcher considered the actual test condition, in which the test set is composed of both dichotomously and polytomously scored items. The present study therefore covered actual test conditions, providing the information needed to apply the findings in science assessment and learning administration, even though these topics have already been the focus of much discussion in educational testing. The study of item bias through DIF and DTF provides empirical data for identifying and eliminating exam items that are more difficult for one group of test-takers than another (Zumbo, 1999). Of the nine PISA 2009 booklets analyzed by gender, seven booklets had high DIF percentages and four booklets displayed medium DTF across groups. For wealth, six booklets had high DIF percentages and two booklets displayed medium DTF across groups. Finally, the recommendation for the assessment of student performance is that DIF items be removed before scores are calculated and reported. Subsequent research can also help identify the factors that contribute to DIF; future studies might focus on using a value-added model to remove factors that are unfair to different groups of test takers.

Acknowledgements

The researchers would like to thank the 90th Anniversary of Chulalongkorn University Fund (Ratchadaphiseksomphot Endowment Fund) for funding this research.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Dodeen, H., & Johnson, G. A. (2003). An analysis of sex-related differential item functioning in attitude assessment. Assessment & Evaluation in Higher Education, 28(2), 129-134.
Gallagher, A. M., & Kaufman, J. C. (2005). Gender differences in mathematics. Cambridge, United Kingdom: Cambridge University Press.
Kamata, A., & Vaughn, B. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2(2), 49-69.
Li, H. H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61(4), 647-677.
Millsap, R. E., & Everson, H. T. (1993). Methodology review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334.
Organisation for Economic Co-operation and Development. (2012). PISA 2009 technical report. OECD Publishing.
Penfield, R. D. (2005). DIFAS: Differential item functioning analysis system. Applied Psychological Measurement, 29, 150-151.
Penfield, R. D. (2007). DIFAS 4.0 user's manual. Retrieved January 1, 2013, from http://www.education.miami.edu/facultysites/penfield/index.html
Penfield, R. D., & Algina, J. (2003). Applying the Liu-Agresti estimator of the cumulative common odds ratio to DIF detection in polytomous items. Journal of Educational Measurement, 40, 353-370.
Penfield, R. D., & Algina, J. (2006). A generalized DIF effect variance estimator for measuring unsigned differential test functioning in mixed format tests. Journal of Educational Measurement, 43, 295-312.
Shealy, R. T., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 197-239.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-like (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation.
