
Developing a reading comprehension test for cognitive diagnostic assessment: A RUM analysis

Studies in Educational Evaluation, December 2017. DOI: 10.1016/j.stueduc.2017.10.007

Fatemeh Ranjbaran*, Seyyed Mohammed Alavi
Department of Foreign Languages and Literature, University of Tehran, Iran
* Corresponding author. E-mail address: franjbaran@ut.ac.ir (F. Ranjbaran).

ARTICLE INFO

Keywords: Attributes; CDA; Formative assessment; CDM; Fusion model; Q-matrix; Reading comprehension test; RUM; Second language

ABSTRACT

A critical issue in cognitive diagnostic assessment (CDA) lies in the dearth of research on developing diagnostic tests for cognitive diagnostic purposes. Most research thus far has been carried out mainly on large-scale tests, e.g., the Test of English as a Foreign Language (TOEFL), the Michigan English Language Assessment Battery (MELAB), the International English Language Testing System (IELTS), etc. In particular, CDA of formative language assessment that aims to inform instruction and to discover strengths and weaknesses of students to guide instruction has not been conducted in a foreign (i.e., second) language-learning context. This study explored how developing a reading comprehension test based on a cognitive framework could be used for such diagnostic purposes. To achieve this, initially, a list of 9 reading attributes was prepared by experts based on the literature, and the targeted attributes were then used to construct a 20-item reading comprehension test. Second, a tentative Q-matrix that specified the relationships between test items and the target attributes required by each item was developed. Third, the test was administered to seven language-testing experts who were asked to identify which of the 9 attributes were required by each item of the test. Fourth, on the basis of the overall agreement of the experts' judgments concerning the choices of attributes, a review of the literature, and the results of student think-aloud protocols, the tentative Q-matrix was refined and used for statistical analyses. Finally, the test was administered to 1986 students of a General English Language course at the University of Tehran, Iran. To examine the CDA of the test, the Reparameterized Unified Model (RUM), also known as the Fusion Model, a type of cognitive diagnostic model (CDM), was used for further refining the Q-matrix for future data analyses and, most importantly, for diagnosing the participants' strengths and weaknesses. Data analysis results confirmed that the nine proposed reading attributes are involved in the reading comprehension test items. Such diagnostic information could help teachers and practitioners prepare instructional materials that target specific weaknesses and could inform them of the more problematic areas that need to be emphasized in class in order to plan better L2 reading instruction. Further, such information could inform individualized student instruction and produce improved diagnostic tests for future use.

1. Introduction

Over recent decades, testing and assessment in general have been employed to obtain overall, average, or individual scores of achievement for examining students' acquired knowledge. However, assessment that can function formatively to inform instruction has recently become the focus of attention for diagnostic purposes and is intended to assess the strengths and weaknesses of students to guide instruction. From another stance, teacher-made tests and quizzes have recently gained attention because of their formative assessment role in aiding students' learning (Huang & Wu, 2013). To make the process of learning easier and more effective, teachers are expected to be competent in test construction and learning diagnostics in class. As these tests are expected to detect student errors, incomplete understandings, and misconceptions during the learning process, the use of diagnostic tests to improve students' conceptual understanding has become highly valued and recognized in many fields (Hartman, 2001). Hence, researchers and practitioners are increasingly focusing on the integration of cognitive psychology, educational pedagogy, and educational measurement for the improvement of learning and instruction (Leighton, Gierl, & Hunka, 2004; Mislevy, 2006; Tatsuoka, 1995; Snow & Lohman, 1989).

Reading is a fundamental skill for gaining knowledge in all academic fields. Therefore, it is necessary to determine the various components of reading ability for a better understanding of reading and to find language learners' problems when teaching second language (L2) reading. If problematic areas of reading proficiency are diagnosed
during the instructional term, sufficient and timely feedback can be provided to students in order to improve learning and eliminate weaknesses during the learning process.

Criticism arises because the main goal of educational tests is usually to provide a quantitative assessment of a student's general, overall, often unidimensional ability and proficiency as compared to other students in a normative group. This type of norm-referenced testing has been used extensively for ranking and selecting students for various educational decisions. In addition to merely providing general, summarizing, and usually unidimensional information about students' skills and their ability to perform on a test, these assessments are invariably incapable of providing the necessary detailed information about students' strengths and weaknesses that could possibly help them in improving their skills or that might also assist the teacher in instructional planning. Recently, scholars have suggested that cognitive diagnostic assessment has a key role in improving the informational value of assessment (Alderson, 2010; de la Torre, 2009; Jang, 2005; Leighton & Gierl, 2007; Rupp, Templin, & Henson, 2012). In his commentary on "Cognitive Diagnosis and Q-Matrices in Language Assessment," Alderson (2010) expresses his disappointment that there are very few truly diagnostic tests in existence. In fact, nearly all studies carried out thus far have been on existing large-scale assessments and proficiency tests, not on tests developed for low-stakes formative assessment. He argues that far more studies focus on developing and researching high-stakes proficiency tests aimed at placement, achievement, or aptitude than on tests specifically constructed for cognitive diagnosis in the form of classroom-based formative assessments. The most desirable cognitive diagnostic assessment is one that is diagnostically designed, constructed, and scored from the initial phase. In such an approach, cognitive attributes are explicitly defined to be targeted in the test construction phase. These predetermined attributes should be in line with the instructional goals. When the attributes are set, the data are to be analyzed with an appropriate CDM. Afterwards, the scores are to be reported in a fine-grained diagnostic system. While fine-grained cognitive diagnostic assessment is intended to inform instructional settings in this way, diagnostically constructed designs have hardly been discussed in the literature. A few tests, however, have been designed to fulfill the needs of diagnostic analysis (e.g., DIALANG by Alderson, 2005; Alderson & Huhta, 2005; DELNA (www.delna.auckland.ac.nz/uoa); and DELTA by Urmston, Raquel, & Tsang, 2013), yet none have provided individualized score reports to enhance learning and teaching at the classroom level. This study responds to the call for cognitive diagnostic assessment using a specially constructed diagnostic test, one that will attempt to provide detailed information about students' strengths and weaknesses in L2 reading comprehension, and perhaps in reading comprehension in general.

2. Literature review

2.1. L2 reading ability

In CDA, the different components of a specific domain (in this case, L2 reading) are referred to as attributes. Attributes are the divided components of a general cognitive ability, which can be defined as "procedures, skills, or knowledge a student must possess in order to successfully complete the target task" (Birenbaum, Kelly, & Tatsuoka, 1993, p. 443). Therefore, L2 reading attributes are composed of different types of language knowledge and reading strategies, which are required in comprehending texts (Birenbaum et al., 1993; Templin, 2004). CDA has been introduced as a new method in educational measurement that can provide fine-grained diagnostic information about test-takers' degree of mastery of domain sub-skills (Lee & Sawaki, 2009). Sub-skills are defined as domain-specific knowledge and skills that are required to indicate mastery in a specific cognitive domain (Leighton & Gierl, 2007). Taking reading skill as a cognitive domain, it is necessary to have knowledge of vocabulary and grammar and to make inferences in order to fully comprehend a text. These are considered the sub-skills of the reading domain; they are also called attributes, and the two terms are used interchangeably throughout this paper. The most distinct characteristic of this approach is that it is the point where cognitive psychology and psychometric modeling meet within a single framework; therefore, it aims to assess test-takers' knowledge and underlying cognitive processing sub-skills (DiBello, Roussos, & Stout, 2006; Leighton & Gierl, 2007).

In the assessment of reading comprehension in a second or foreign language, the many underlying cognitive attributes required for reading ability mastery have made it a complex process. Reading ability is a fundamental tool for gaining knowledge and improving learning in everyday academic settings and everyday life in general. Therefore, it comes as no surprise that the nature of reading ability has been the focus of research in applied linguistics, education, and psychology for quite some time (Cohen & Upton, 2006). Regardless of the extensive research on reading ability, there is still some debate as to how second language reading ability is defined and how its performance should be evaluated and reported. It seems that teachers, students, and practitioners have not been given diagnostic feedback tools that could be used for improvements in reading ability, specifically for classroom-based profile score reporting. These are issues that most need consideration in the context of L2 reading assessment in Iran.

At times, there is such emphasis on reading strategies that other important elements of reading competence, such as language knowledge, including pragmatic knowledge and grammatical knowledge, have been given less attention. One aspect of second language reading ability is the use of language to understand written text. Therefore, the knowledge of language components and strategic reading competence should both be considered in order to master the written text. While the difficulty of defining the construct of reading ability is clear, other problems have been seen with regard to how L2 reading performance has been analyzed and reported. L2 reading test scores are often reported using a general test score without any detailed information (Goodman & Hambleton, 2004). When an exam provides only one total score, it can serve the test's immediate summative purpose; however, it cannot be easily used to improve reading performance (Stiggins, Alter, & Chappius, 2004). Only providing a total score does not provide information regarding each student's specific strengths and weaknesses (Sheehan & Mislevy, 1990). On the other hand, a detailed score report for each individual, including their level on each reading attribute, can be used both to improve individual student reading ability and to guide teacher instruction (Snow & Lohman, 1989).

2.2. Frameworks for developing cognitive diagnostic tests

There are two measurement-driven approaches that have been widely used for diagnostic test development. One is Embretson's Cognitive Design System (CDS) (Embretson & Gorin, 2001) and the other is Mislevy's Evidence-Centered Design (ECD) (Mislevy, 1994; Mislevy, Steinberg, & Almond, 2002). These two approaches focus on the use of cognition in the process of item and test development, considering the issues of construct definition while item writing, and concluding with validation procedures (Leighton & Gierl, 2007).

CDS and ECD may differ in their emphasis on the various parts of assessment design and their details, but both share the three principles of the assessment triangle. The assessment triangle includes three related elements, which are cognition (theories of learning), observation (test data), and interpretation (the probabilistic model that relates a student's multidimensional latent cognitive learning state to his/her test response pattern) (Pellegrino, Chudowsky, & Glaser, 2001). This NRC panel of researchers states that cognition is related to a cognitive model about how students represent knowledge and how they develop competence in a certain subject (p. 44). A cognitive model provides a description of what should be assessed, but it is different from cognition
in a number of ways. According to Leighton and Gierl (2007), a cognitive model specifies the cognitive components and processes that constitute the construct (such as reading comprehension) being tested. This leads to more detailed specifications that are more applicable for instructional feedback. It should be noted that cognitive theory explains these specifications, meaning that a model about specific cognitive processes related to the construct being tested is empirically supported.

In this study, the theory of learning underlying the assessment triangle includes the reading theories of second language learning, including information processing theory, the constructivist theory of reading, and Bloom's Revised Taxonomy. It is clear that when reading, engagement in processing at the phonological, morphological, syntactic, semantic, and discourse levels, as well as in goal setting, text-summary building, interpretive elaborating from knowledge resources, monitoring and assessment of goal achievement, making various adjustments to enhance comprehension, and making repairs to comprehension processing, is necessary (Carrell & Grabe, 2002, p. 234). Even though a great portion of the reading process is somewhat natural, it is thought to exceed our conscious control. It is believed that readers do employ a high level of active control over their reading process through the use of strategies that are deliberate and purposeful conscious procedures (Urquhart & Weir, 1998). While processes are considered more automatic than intentional, and could even be considered subconscious, strategies are more controllable and are used to act upon the processes (Cohen & Upton, 2006). Insights into reading strategies explain how readers interact with the text and to what extent the strategies used can influence their reading comprehension rate. Following the use of learning theories for the cognition element, which are used to construct test items, the observation element includes gathering data through administering the test. The interpretation element makes use of the CDM (in the case of this study) to relate an examinee's multidimensional latent cognitive learning state to his/her test response pattern. These circulating elements of the assessment triangle intermingle and eventually provide teachers with the means for an imperfect attempt at translating the student's responses into beneficial information.

The core purpose of diagnostic assessment development from an ECD framework perspective is the development of coherent evidentiary arguments about students that can serve as assessment of and assessment for learning, depending on the desired primary purpose of a particular assessment. The structure of the evidentiary arguments that are used in the assessment narrative can be described with the aid of terminology first introduced by Toulmin (1958). An evidentiary argument is constructed through a series of logically connected claims or propositions that are supported by data through warrants and backing. Based on Toulmin's schema for arguments, a claim about an examinee's knowledge and skill is situated at the top. At the bottom is the observation of the examinee acting in some situation. An assessor's reasoning moves up from what is important in the examinee's actions (which the examinee produces) and the features of the situation that are important in eliciting those actions (partly determined by the assessment designer, but also partly determined by the examinees as they interact with the task). These data support the claim through a warrant, or rationale, for why examinees with particular capabilities are likely to act in certain ways in the situation at hand.

In concrete terms, the ECD framework allows one to distinguish the different structural elements and the required pieces of evidence in narratives similar to the following: Ali has most likely mastered basic reading comprehension skills (claim), because he has answered correctly the items of a reading passage about the theory of education (data). It is most likely that he did this because he applied all of the reading skills and strategies correctly (backing) and the task was designed to force him to do that (backing). He may have used his background knowledge to understand parts of the text (alternative explanation), but that is unlikely since he was new to the topic (refusal) (Rupp et al., 2012).

While a theory-driven process of analysing and modeling is led by
the main elements of expertise in a domain, the core elements in the ECD framework include (1) the student models, which formalize the postulated proficiency structures for different tasks, (2) the task models, which formalize which aspects of task performance are coded in what manner, and (3) the evidence models, which are the psychometric models linking those two elements. These three core components are complemented by (4) the assembly model, which formalizes how these three elements are linked in the assessment, and (5) the presentation model, which formalizes how the assessment tasks are presented (Rupp et al., 2012). Specifically, the student model is motivated by the learning theory that underlies the diagnostic assessment system. It specifies the relevant variables or aspects of learning that we want to assess at a grain size that suits the purpose of the diagnostic assessment. As many of the characteristics of learning that we want to assess are not directly observable, the student model provides a probabilistic model for making claims about the state, structure, and development of a more complex underlying system.

2.3. Cognitive diagnostic models

Cognitive Diagnostic Models (CDMs) are data analysis probability models, and often associated techniques, that are designed to link cognitive theory with the test items' psychometric (probabilistic) properties (Leighton & Gierl, 2007). According to Rupp and Templin (2008), CDMs are:

"probabilistic, confirmatory multidimensional latent-variable models with a simple or complex loading structure. They are suitable for modeling observable categorical response variables and contain unobservable (i.e., latent) categorical predictor variables. The predictor variables are combined in compensatory and non-compensatory ways to generate latent classes." (Rupp & Templin, 2008, p. 226)

Examples of CDMs include Tatsuoka's Rule Space Model (Tatsuoka, 1995), the Attribute Hierarchy Method (AHM) (Leighton, Gierl, & Hunka, 2004), an enhancement of the Fusion Model termed the RUM (Hartz, 2002; Roussos et al., 2007), and the original RUM (DiBello, Stout, & Roussos, 1995). Most CDMs are IRT-based latent-class models in which the discrete multidimensionality of the latent space is the most important characteristic. In previous unidimensional IRT models, examinee ability was modelled by a single general continuous ability parameter. By contrast, the discrete multidimensionality of CDMs makes it possible to investigate the mental processes underlying the student's test response by breaking the overall targeted ability, e.g., reading competency, down into its component parts. The number of dimensions depends on the number of skill components involved in the assessment. Here, α is a vector of dichotomous variables. These probabilities can then be used to estimate each αk, that is, to classify an examinee as either a master (αk = 1) or a non-master (αk = 0). Another feature of CDMs is being compensatory or non-compensatory (DiBello et al., 1995; Roussos et al., 2007). In non-compensatory models, success on one attribute does not compensate for deficiency in another attribute in order to correctly respond to an item, as is the case with the RUM in this study. According to Roussos et al. (2007), a non-compensatory interaction of skills occurs when application of the required skills is necessary in order to successfully complete the task; that is, a lack of competency in any one of the skills for the task will result in a serious hindrance to successfully completing that task. Non-compensatory models have been preferred for cognitive diagnostic analysis, as they can generate more fine-grained diagnostic information.

The latent variables of CDMs consist of dichotomous levels, such as mastery or non-mastery, or polytomous levels, such as a rating variable with values such as excellent, good, fair, poor, etc. The design structure of a CDM is the Q-matrix, which maps the skills necessary to successfully answer each item on the test (Li & Suen, 2013). Most of the CDM-based diagnostic studies carried out thus far have been limited to the analysis of existing summative tests, not the development of new, cognitive diagnostically focused formative assessments. By contrast, this study focuses on a cognitive diagnostic assessment of a test developed on the basis of a cognitive diagnostic modeling framework.

2.4. Fusion model

The Fusion Model, also known as the Reparameterized Unified Model (RUM), is a cognitive diagnostic model that is used to make inferences about the mastery level of each latent space attribute for each examinee, based on the examinee's item responses (Stout, 2008a, 2008b). In other words, the RUM is an IRT discrete multidimensional model that expresses the stochastic relationship between item responses and underlying skills as follows (DiBello et al., 1995):

$$P(X_{ij} = 1 \mid \boldsymbol{\alpha}_j, \theta_j) = \pi_i^{*} \prod_{k=1}^{K} \left(r_{ik}^{*}\right)^{(1-\alpha_{jk})\,q_{ik}} \, P_{c_i}(\theta_j) \qquad (1)$$

in which P(Xij = 1 | αj, θj; πi*, rik*, ci) is the probability of person j scoring a correct response (X = 1) rather than an incorrect one (X = 0) on item i, given person abilities αj and θj and item parameters πi*, rik*, and ci. The RUM takes into account the degree of incompleteness of a Q-matrix resulting from incomplete knowledge representations, probably due to the rather complex cognitive strategies being assessed (for example, reading abilities), via two modeling components for examinee ability parameters, αj and θj. The former refers to the vector of indicators of attribute mastery versus non-mastery, while the latter refers to the continuous ability that summarizes attributes not specified by the Q-matrix. Regarding the item parameters, πi* refers to the probability of correctly applying all required Q-matrix skills for item i given αjk = 1, so it is similar to an item difficulty based on the Q-matrix. It is expected that if examinees have mastered all required skills for item i, they will probably be able to apply them correctly in solving item i. Another important parameter is rik*, which compares the correct item response probabilities between mastering skill k and lacking skill k. In the last part of the equation, Pci(θj), ci refers to the amount that correct item performance depends on α, the required Q-matrix based skills. Therefore, it is indicative of the degree to which item performance depends on θj, which is an additional residual ability not specified in the Q-matrix. The lower ci is, the more the item depends on θj.

3. Developing the Q-matrix

First, as described below and based on a review of the second language reading comprehension literature, the latent space of the important attributes that combine to constitute reading comprehension mastery is chosen. Next, and interactively, as also described below, a set of test items is constructed to assess the chosen attributes. The initial step in generating the student learning model that links attributes to item responses is to map each test item onto an item-by-skill table known as the Q-matrix. It consists of an I × K matrix of binary information in ones and zeros, where I is the number of items and K is the number of attributes. A Q-matrix is a representation of a working hypothesis regarding which skills are necessary to answer each item on the test (Li & Suen, 2013). Each item will most likely require more than one skill to be answered correctly. According to Buck and Tatsuoka (1998), developing a Q-matrix requires following a certain procedure: first a list of skills is developed, and then each item is coded as a row in the Q-matrix based on what skills are required for the item. This Q-matrix construction uses the previously referred-to attribute development literature, content experts' judgment, and student verbal reports on the underlying skills required for each item, which will be discussed further in the procedure section. The next step is to analyse the test data produced by an administration of the test using a cognitive diagnostic model, in this case the RUM, with the developed Q-matrix. Finally, the Q-matrix can be modified based on statistical information about the strength of association of each skill/item combination. For example, an expert-proposed skill required for an item could be dropped because statistical evidence indicates the hypothesized association is lacking. This inferred Q-matrix is then available for future statistical analyses.

In developing a Q-matrix, a review of the literature indicates that a number of alternatives exist. One of the less costly but less efficient approaches is to use existing test specifications, such as those provided by the test's developer; however, the attributes indicated via the test specifications are usually too general for diagnostic purposes. According to Leighton and Gierl (2007), relying on existing test specifications for Q-matrices is usually unwarranted, so in this study an attempt was made to develop a reading comprehension test based on a cognitive diagnostic framework, with the help of a rigorous Q-matrix construction. Jang (2009) suggests using data from a small number of student verbal reports of how they answered each item to construct the Q-matrix. Even though there are doubts as to the validity of such verbal reports, they are considered fairly reliable and hence useful for reading research (Leighton & Gierl, 2007). Another approach is to use a group of experts (cognitive, instructional, curricular, etc.) to describe the underlying cognitive skills needed to answer each question, based on their previous experience in this realm (Sawaki, Kim, & Gentile, 2009). According to Leighton and Gierl (2007), an underlying problem with this approach is the higher ability level of the experts compared with the students, resulting in a gap between the skills and processes truly used by the students and those proposed by the experts. For example, a physics expert might cast an inclined plane problem more abstractly than a beginning physics student. Nevertheless, studies on Q-matrix construction have indicated that using content expert judgment in Q-matrix construction does increase its reliability (Jang, 2005, 2009; Kim, 2015; Li & Suen, 2013; Li, 2011; Svetina, Gorin, & Tatsuoka, 2011; Sawaki et al., 2009).

After developing the initial Q-matrix, large-scale data analysis based on an administration of the test to a large sample of students can be used to empirically validate the Q-matrix on the basis of the initial results of cognitive diagnostic modeling. Often, attributes that are similar can thus be combined to reduce the number of attributes. For example, Jang (2009) refined her initial Q-matrix of the LanguEdge reading comprehension test by reducing the number of attributes based on RUM analysis results. Another instance is the Q-matrix construction done by Kim (2015). This study followed Hartz (2002), in which attributes that were measured by fewer than three items were either merged with similar attributes or deleted from the Q-matrix. Thus the empirical analysis can result in modifying the latent space of attributes and modifying which attributes are hypothesized to influence each item. This is then useful for future analyses and for drawing cognitive conclusions.

4. Research questions

The purpose of this study is to develop a reliable and useful second language reading comprehension test based on an expert-based selection of the important attributes embedded in a cognitive diagnostic framework. Then, to link the attribute selection and test construction, a Q-matrix of the underlying skills necessary to respond correctly to each of the test items is constructed. The questions are as follows:

1. What skills/attributes are required in second language reading comprehension mastery, and how do they theoretically and empirically interact in CDA?
2. Do the results of the RUM and ECD based CDM approach in this study produce reliable and useful diagnostic information about student mastery/nonmastery of the attributes and about the test's measurement characteristics, with an eye towards future applications, both in L2 reading and other contexts?
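To make the RUM item response function in Eq. (1) of Section 2.4 concrete, the short sketch below evaluates it for one hypothetical item and examinee. The function name, all parameter values, and the Rasch-type logistic form assumed here for Pci(θj) (with difficulty equal to −ci) are illustrative assumptions for this sketch only, not estimates or output from this study.

```python
import math

def rum_probability(pi_star, r_star, q, alpha, c, theta):
    """Evaluate the RUM item response function of Eq. (1) for a single item.

    pi_star : probability of correctly applying all required skills, given full mastery
    r_star  : list of penalty terms r*_ik, one per attribute (ignored where q[k] == 0)
    q       : Q-matrix row for the item (1 = attribute required, 0 = not required)
    alpha   : examinee mastery vector (1 = master, 0 = non-master)
    c       : completeness parameter c_i
    theta   : residual ability not captured by the Q-matrix
    """
    # Product over attributes: each required but non-mastered skill multiplies
    # the success probability by r*_ik (a value below 1), so skills cannot
    # compensate for one another.
    penalty = 1.0
    for k, q_ik in enumerate(q):
        penalty *= r_star[k] ** ((1 - alpha[k]) * q_ik)
    # Assumed Rasch-type form for P_{c_i}(theta_j) with difficulty -c_i.
    p_c = 1.0 / (1.0 + math.exp(-(theta + c)))
    return pi_star * penalty * p_c

# Hypothetical item requiring A1, A3, and A5 (cf. item 1 of the Q-matrix in Table 4),
# with illustrative parameter values rather than values estimated in this study.
q_row = [1, 0, 1, 0, 1, 0, 0, 0, 0]
r_star = [0.9, 1.0, 0.8, 1.0, 0.6, 1.0, 1.0, 1.0, 1.0]
full_master = [1] * 9
lacks_a5 = [1, 1, 1, 1, 0, 1, 1, 1, 1]

print(rum_probability(0.88, r_star, q_row, full_master, c=2.0, theta=0.0))  # about 0.78
print(rum_probability(0.88, r_star, q_row, lacks_a5, c=2.0, theta=0.0))     # about 0.47
```

The drop in the success probability when a single required attribute is missing (from roughly 0.78 to roughly 0.47 in this sketch) illustrates the non-compensatory behaviour of the model described in Section 2.3.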



Table 1
Think-aloud verbal protocol participants.

Student  Gender  Language background         TOEFL/IELTS score  Duration of stay abroad
NE       Female  8.5 years of English        IELTS 8.5          7 years (London)
SN       Female  11 years                    –                  11 years (Dubai)
SJ       Female  5 years                     –                  –
BA       Male    Self-studied                –                  –
KP       Female  6 years, institute          –                  –
MK       Female  7 years, institute          –                  –
MH       Male    A few terms, institute      –                  –
SM       Male    7 years, private tutor      –                  –
PR       Female  Self-studied                –                  –
SK       Female  11 years, institute         –                  –
AP       Female  8 years, institute          –                  –
AT       Female  8 years, institute          –                  –
MP       Male    Institutes since childhood  –                  –

5. Methodology

5.1. Participants

1986 students from the University of Tehran took part in the reading comprehension test. They were bachelor's students of various majors taking part in the General English course, a requirement of the bachelor's program at the University of Tehran. Response data from the 1986 examinees to each of the 20 questions of the developed test were used for empirical validation through RUM analysis.

Thirteen students were recruited in order to understand their use of the reading skills on test items through a think-aloud verbal protocol. The participants included nine female and four male B.A. students, as shown in Table 1. Several factors that might affect participants' verbal reporting were considered for the recording session. Based on observations made and previous research studies (e.g., Jang, 2005), an important point to be considered for think-aloud verbal protocol activities was that not all students would have sufficient English speaking ability to verbalize their thoughts during the think-aloud activity. According to Gass and Mackey (2000), forcing students to report in English might result in great differences between their actual thoughts and their verbalization and hence the inability to obtain rich verbal data. Since all participants are in an EFL environment, it was anticipated that they would have difficulties in expressing ideas in the L2, so production rates of verbal comments would be hampered a great deal because they might tend to refrain from commenting to avoid the difficulty. Thus, participants' oral ability is a crucial factor in ensuring sufficient production rates of verbal reports. An attempt was made to minimize this dilemma by allowing participants to report in their native language, Persian, the language they feel most comfortable with. Seven participants verbalized in their native language, while the others decided to report in English. Each of the students had previously taken the reading comprehension test and was asked to verbalize their thought processes when answering the test items. This included their reporting of what skills and strategies first came to mind when answering the items. The reading passages and items were given again as reference, and retrospective verbal reports were recorded. The retrospective verbal reports were conducted individually in a classroom secure from external noise and disruptions. A think-aloud verbal protocol instrument was prepared and used by the interviewer in each session. Each participant was asked to talk about what he or she was thinking while reading the passage and responding to the items. After a silence of about 10 s, the researcher asked questions such as, "What are you thinking now?" After answering the questions related to the specific passage, the researcher asked the participant to recount the process he/she had used in order to clarify the verbal reports and had a chance to ask related questions. The duration of time spent with each student was between 25 and 30 min for the three passages. On three of the occasions, the participant declined to give verbal reports for all three passages, so I consented to only one passage for these specific cases. Then the recordings were transcribed, and think-aloud reports in Persian were translated to English for analysis purposes.

The procedure for analysing the think-aloud verbal protocols of this study is the text-focused process based on the study by Jang (2005). For the text-focused process, the verbal report transcripts were analyzed to search for distinct thought processes that involved text-focused processing. The existing text-focused categories were identified. Two example categories of the text-focused processing strategies are provided for further illustration of the analysis procedure.

Example 1. Deducing word meaning from context.

Referring to key words or phrases in the questions in order to find the answer in the passage according to those clues. Then the examinee refers to explicitly stated facts and statements that can guide them to the correct response. This might also include summarizing and scanning strategies. The following verbal report excerpts indicate this processing category.

Nazanin: Context clues, anything, it's up to many things, but I go to the sentence after or the one before the word to better understand it. I find a key word in the question and then seek the word in the passage. I have to understand the entire passage to answer the questions. Because one of these levels is connected to the other level, it's part of the level, it's not a particular level.

Mojtaba: At first I focus on the whole structure of the reading very very swiftly and very smooth and after that I focus on each paragraph and I try to pull the main idea out, I think. I summarize it, and I hypothetically highlight the important sentences, like history, main details. Vocabulary, I put them in the context. If I'm suspicious I go to the passage. At first I read the question and I figured out it's a general question, includes all the reading, so at first I should read the whole thing, but since I'm short of time, I just scanned and guessed one is correct. The question says chimpanzee, so I look for the key word. I think A, because the passage described that chimpanzees can, at first they put the chimpanzee in a paragraph that explains about how crows use sticks to pry peanuts out of cracks. This is an example of intelligence by showing what a stick can do. So the props are important in this area, so chimpanzee is an instance of this subject.

Ava: The sentence was explicitly stated in the passage, instinct is…

Arezoo: Context clues help me with finding the meaning of words. For those questions which require reading the paragraph, I found key words in the question to search the paragraph. Using tools; means using instruments; for ex. I find a word
which is similar to instinct. Again finding key words; animal communication in the passage.

The connection between text ideas and key words is evaluated in terms of success in solving the items. In this example, Arezoo, Nazanin, Mojtaba and Ava all make use of context clues to help them deduce the meaning of the text. Sometimes looking for certain words requires reading the entire passage, or seeking key words in parts of the passage.

Example 2. Inferencing by referring to background knowledge.

With this process, readers make inferences about the author's intentions or the lexical meaning of words in the text based on their background knowledge and personal experience. Inferential comprehension is demonstrated by the reader through the process of referring to personal knowledge, his/her own intuition, or personal experiences. The following are some excerpts of this process from the transcribed and translated verbal reports.

Saeideh: I read something and the concept is highlighted in my mind. I remember it from an incident that happened to me. I didn't know the meaning of the word, but since I knew what the passage was talking about and I had heard it before, I guessed the meaning. Because uhh… I know what is happening, I used my own knowledge.

Behnam: This one is really easy for me because I learned it before.

Moein: It happened to one of my relatives once and I remembered that day. It was very bad. Many words I have seen before but I don't know their exact meaning, but it helps me to guess from the context.

Saman: I read this word exactly yesterday, and I remembered it.

The possibility that an examinee is acquainted with a topic or has had previous experience with it will make the task easier for him/her, and it will be a benefit to those who have background knowledge of a certain topic. Sometimes there is a bias on reading comprehension tests because some students are knowledgeable about a topic while others fall behind in responding because they are clueless and have not read up on the topic, despite the fact that they might be knowledgeable about other topics. This was the case with those examinees who had previous experience of heat stroke and heat exhaustion and were easily able to respond to the items, while those who had no background knowledge fell behind. Nevertheless, this is the nature of any reading comprehension exam.

In the next phase, a group of content raters served as developers of the attributes that reflect the main language skills necessary for successful performance on each item. This group included 6 PhD students at the University of Tehran studying Teaching English as a Foreign Language, 3 females and 3 males, who all had experience in applied linguistics research and teaching reading comprehension courses. They reviewed each test item and the previously selected attributes and then decided whether or not each attribute was necessary for answering the item correctly. Each expert read the passages and performed the rating task independently. They identified the skills for each item and also made annotations about the evidence on which they based their assessments.

They examined the extent to which the specified attributes are distinguishable from each other and whether the raters agree with each other upon the attributes associated with each test item (a type of inter-rater reliability effort).

With reference to Li (2011) and Jang (2005), a coding scheme was created based on the cognitive framework and the think-aloud data. An initial Q-matrix was constructed based on evidence from the think-aloud verbal reports, the expert ratings, and the coding scheme extracted from the literature. However, a frequently encountered problem here is that students' verbal reports may not agree with expert ratings (Jang, 2005; Leighton & Gierl, 2007). When this discrepancy occurred in the present study, the think-aloud verbal reports were regarded as the primary evidence, because the verbal reports captured the real-time reading process to a certain extent and were thus considered more reliable and authentic. Nevertheless, the value of expert rating should not be underestimated, because it provides important evidence from a different perspective. Furthermore, when it was difficult to determine whether a certain skill should be retained for an item, the skill was usually retained. This is because the resulting RUM calibration provides evidence concerning the importance of the skill for the item; that is, if the calibration showed the skill to be insignificant, it could be dropped at the later stage. Via RUM analyses, the test takers' (N = 1986) test performance data were used to detect relevant item characteristics and then to refine the Q-matrix through statistical analysis with the RUM.

5.2. L2 reading attributes

The list of reading attributes (as shown in Table 2) was developed based on the previous literature (Birch, 2002; Cohen & Upton, 2006; Fletcher, 2006; Francis et al., 2006; Jang, 2009; Rupp, Ferne, & Choi, 2006), content expert judgment (6 content raters), and examinees' think-aloud verbal protocols (13 students). This list of attributes largely consisted of language knowledge and strategies, required when a learner's language ability interacts with the written text. Language knowledge was further categorized into grammatical knowledge (i.e., grammatical form and semantic meaning) and pragmatic knowledge, while strategic competence was composed of metacognitive strategies (i.e., assessing the situation, monitoring, evaluating) and cognitive strategies (i.e., comprehension, memory, retrieval). Each of these categories consisted of a number of L2 reading attributes (e.g., knowledge, strategies). In addition, language knowledge and strategic competence were presumed to interact with each other (Bachman & Palmer, 1996). Considering that strategic competence manages all language use, which includes using language knowledge (Bachman & Palmer, 1996), test-takers needed to utilize strategies to invoke their language knowledge. In this study, the list included those L2 reading attributes that were considered to be involved in the reading process, comprising four knowledge-based attributes (attributes 1–4 in Table 2) and five reading strategies (attributes 5–9 in Table 2).

Table 2
Attributes of L2 reading ability.

A1  determining word meaning from context
A2  determining word meaning out of context
A3  comprehending text-explicit info
A4  comprehending text-implicit info
A5  skimming
A6  summarizing
A7  inferencing
A8  applying background knowledge
A9  inferring major ideas or writer's purpose
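As a small illustration of the expert-rating step described above, the sketch below turns six raters' yes/no judgments for one item into a Q-matrix row, using the rule reported in Section 9.1 that an attribute is retained when three or more of the 6 raters agree it is required. The rater votes shown are invented for illustration; they are not the actual ratings collected in this study.

```python
# Build one Q-matrix row from six raters' judgments on the nine attributes.
# An attribute is coded 1 when at least 3 of the 6 raters judged it necessary.
ATTRIBUTES = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9"]

def q_row_from_ratings(ratings, threshold=3):
    """ratings: list of 6 lists, each holding nine 0/1 judgments (one per attribute)."""
    votes = [sum(r[k] for r in ratings) for k in range(len(ATTRIBUTES))]
    return [1 if v >= threshold else 0 for v in votes]

# Hypothetical judgments by six raters for a single item (illustrative only).
item_ratings = [
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 0, 0],
]

print(q_row_from_ratings(item_ratings))  # [1, 0, 1, 0, 1, 0, 0, 0, 0] -> requires A1, A3, A5
```

Rows built this way were then cross-checked against the think-aloud evidence, and, as described later, attributes measured by too few items could be merged with similar attributes or dropped after the RUM calibration.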



6. Procedure

The study was carried out in three stages: 1) developing the reading comprehension test based on a cognitive assessment framework, which includes constructing items that target the specified attributes; 2) constructing and validating a Q-matrix of I test items by K reading attributes; and 3) RUM-based statistical analysis of data resulting from a large-scale test administration. The first stage of the study was to carry out an extensive study of the literature pertaining to cognitive diagnostic assessment and test development for the purpose of developing an L2 reading comprehension test based on a cognitive diagnostic framework. From the review of the literature, an initial conceptualization of test specifications based on the Evidence-Centered Design put forth by Mislevy (1996) was used in developing the 20 MC items for the test. The author spent one term teaching the general English course to gather data to create a list of suggested attributes, which were then used to develop the test items. Focusing on test specifications for developing the test items required an extensive literature survey to better understand the ECD framework, which aims to gather evidence to support claims about an individual. As Mislevy (1994) puts forth, this framework is composed of three models or components: the student model, the evidence model, and the task model. The student model includes the types of inferences that are made about a student. In this study, after working on reading comprehension skills and strategies in class, reading passages were assigned to students in groups and they were asked to respond to reading comprehension questions. Observations were made about the students' responses and the skills and strategies that were more frequently used in their responses. Evidence was collected through group work exercises and feedback from students. This supports the claim that the evidence model gives evidence to support the inferences made, and the task model provides tasks that elicit usable pieces of evidence (Mislevy, 1994).

It should be emphasized that the process of assessment design is not necessarily linear (Roussos et al., 2007); hence, although defining attributes typically precedes task construction, the feasibility of constructing tasks that measure a set of attributes may necessitate going back to how the attributes have been defined. In this process, task construction informs attribute definition, which will then inform the next phase of task construction.

In this study, data gathered from observations of student responses during the course and evidentiary reasoning, specifically the task model, by using tasks to elicit usable pieces of evidence during the course, were all used in constructing the reading comprehension test items. After the test was developed, it was administered to 1986 students in general English courses at the University of Tehran. Data from this test administration were to be used for RUM-based statistical analysis. The second stage was constructing a Q-matrix based on the chosen L2 reading attributes and the 20 constructed items. In order to construct the Q-matrix, initially a list of L2 reading attributes was specified based on the previous literature, the 13 participants' think-aloud verbal reports, and the 6 content experts' judgment. The third and final phase consisted of the reading test data being analyzed using the Q-matrix and the Arpeggio Suite software (Stout, 2008a, 2008b), which implements the RUM. This also resulted in an empirical validation of the proposed Q-matrix.

7. Instruments

The test used was the reading comprehension test developed for this study. The test was developed based on a cognitive diagnostic framework and includes three different passages along with 20 multiple-choice items. Table 3 shows the structure of the developed reading comprehension test. The passages varied content-wise (descriptive/informative, causation/process, and comparison/contrast), topic-wise (animal intelligence, communication satellites, and body temperature), and by length (323, 318 and 257 words). In addition, among the 20 items, fourteen of the items were related to identifying semantic meaning, and six items required identifying the pragmatic meaning of the text. In this study, reading for identifying semantic meaning refers to reading for literal and intended meaning. Reading for literal meaning refers to identifying information conveyed in the text through paraphrasing or translating, and reading for intended meaning refers to obtaining the meaning of sentences by making connections between them. Pragmatic meaning refers to contextualized implied meaning, such as contextual, sociocultural or psychological meanings. Therefore, reading for pragmatic meaning refers to deriving a deeper understanding of the text by combining the information with readers' specific prior knowledge and experiences (Purpura, 2004).

Table 3
Structure of the reading test.

Passage    Topic                      Content structure        Length     Number of items
Passage 1  Animal Intelligence        Description/Informative  323 words  7
Passage 2  Communication Satellites   Causation/Process        318 words  7
Passage 3  Body Temperature           Compare/Contrast         257 words  6

8. Data analyses

Both qualitative and quantitative analyses were carried out in the process of test development and Q-matrix construction. Initially, qualitative analyses were carried out to specify the reading skills assessed by the reading test. For this, various classifications of reading skills and strategies in the literature were studied. Then, think-aloud verbal protocols were analyzed qualitatively to help understand the characteristics of the cognitive processes and skills used by the students and to identify primary reading skills. Six content raters' judgments were also used to examine to what extent the specified skills are necessary to correctly answer the test items. Reading test data were analyzed together with the Q-matrix using the Arpeggio Suite software (DiBello & Stout, 2008b). The first step in the RUM analysis is the analysis of Markov Chain Monte Carlo (MCMC) convergence to guarantee that model parameters had a stable value (Roussos et al., 2007). The Arpeggio software uses a Bayesian modeling approach with a Markov Chain Monte Carlo (MCMC) algorithm. "The MCMC estimation provides a jointly estimated posterior distribution of both the item parameters and the examinee parameters (given the test data), which may provide a better understanding of the true (estimation's) standard errors involved" (Patz & Junker, 1999). MCMC convergence is mainly evaluated by visually examining the time-series chain plots and estimated post-burn-in posterior probability density plots, which, if convergence has occurred, should be roughly unimodal and roughly bell-shaped. Using the results from the Arpeggio software, the R statistical package (The R Foundation for Statistical Computing, 2010) produces both the chain plot (time series plot) and the density plot for each parameter. The chain plot graphically indicates the degree to which the chain has converged to a desired value after searching the space. Ideally, a chain should sample values over the parameter space, but also around a certain location. That is, a chain should converge to a specific value across time and reach a stationary state (Johnson, 2006). Usually the first portion of the chain is used as a burn-in and is discarded. For example, in a chain of 5000, the first 1000 steps will be used as the burn-in.

In this study, since a total of nine L2 reading attributes were examined, nine pk values were produced from the RUM analysis, and a chain length of 100,000 was initially used to reach MCMC convergence. It is generally stated that the more complex the model (as determined by the number of item parameters, which for the RUM is related to the
number of skills and the average number of skills per item), the longer Table 4
the required chain length (Roussos, Templin, & Henson, 2007, p. 299). Q-matrix of Attributes.
Chain plots and density plots of the nine reading attributes indicated
I/A A1 A2 A3 A4 A5 A6 A7 A8 A9
convergence of the parameter estimates.
Among the different parameters, first the convergence of examinees’ 1 1 0 1 0 1 0 0 0 0
probability of mastery for each attribute (pk) was evaluated overall. 2 1 0 1 0 1 0 0 0 1
3 1 0 1 0 1 0 0 0 0
This estimates for each attribute the proportion of students mastering
4 0 0 1 0 1 0 0 1 0
the attribute, information useful for classroom instructional purposes. 5 0 0 0 1 1 0 1 0 0
Also, three item-based parameters that indicate the item difficulty for 6 1 0 1 0 1 0 0 0 0
item masters (πi*), item attribute k specific discrimination power (rik*), 7 1 0 0 0 1 0 1 0 0
and item completeness (ci) were evaluated. In fact, ci is a measure that is inversely proportional to the amount of model-based examinee responding not captured by the π* and r* parameters.
Here an "item master" is defined to be an examinee that possesses all the attributes (skills) required for the item in question. In this section, examinees' L2 reading performance on the reading comprehension test was evaluated in terms of their mastery and non-mastery of L2 reading attributes.
Fit statistics were calculated to evaluate the fit of the model to the data. The two types of fit statistics measured are named FUSIONStats and IMStats, or item mastery statistics, both explained below. The first compares the difference between the observed proportion of examinees getting the item correct and the predicted proportion of examinees getting the item correct; a low difference between the two p-values indicates a good fit of the data. IMStats compare the observed performance of item masters and item non-masters (defined as examinees lacking at least one of the item's required attributes) at the item level. Sizable differences between the performance of item masters and item non-masters on most items indicate a good fit of the model and the specified Q-matrix to the data. Importantly, a sizable difference for a specific item is an indication of the high quality of that item. Also, the reliability of the RUM was examined by analysing the Correct Classification Rate (CCR), which is the consistency of classification of examinees into masters versus non-masters of attributes. That is, CCR(k) measures the proportion of masters of attribute k classified as masters and the proportion of non-masters classified as non-masters, at the attribute level for each k; CCR denotes the average of the CCR(k)s. These are calculated in practice via a simulation process; for details see Roussos et al. (2007) and DiBello and Stout (2008a). In the final step, the population of examinees' strengths and weaknesses in L2 reading ability at the attribute level was evaluated through the probability of mastery for each attribute (pk).

9. Results

9.1. Q-Matrix development

Results from think-aloud verbal protocols and content raters' judgments, in conjunction with consultation of the L2 reading literature, were analyzed to develop the list of reading attributes. This list of reading attributes was then used to develop the Q-matrix. In the Q-matrix, the rows represent items and the columns correspond to the attributes: 1 means that the particular attribute is required for the completion of the item, whereas 0 means that it is unnecessary. Since the test contained 20 items, the Q-matrix for the current study represented the relationship between the 20 items and their corresponding attributes.
As shown in Table 4, the rows of the Q-matrix indicate the 20 items from the reading comprehension test, and the columns indicate the nine reading attributes. Attributes that three or more of the 6 raters agreed upon were considered essential for the item and were included in the Q-matrix. According to Hartz (2002), attributes that are measured by fewer than three items do not provide statistically meaningful information and can therefore be merged with similar attributes or deleted from the Q-matrix. The nine attributes obtained are listed in Table 2. Due to the complicated nature of reading, numerous reading attributes are involved in completing each item (Alderson, 2000; Urquhart & Weir, 1998), as was the case with this study. Also, it is not claimed that the choice of the 9 attributes is uniquely better than some other choice. However, it seems clear that these 9 have been well chosen, and further that they provide good reliability and, as such, cognitively and instructionally useful information.

Table 4
Q-matrix for the reading comprehension test (rows for items 8–20; columns A1–A9 are the nine reading attributes; 1 = attribute required by the item).

Item  A1  A2  A3  A4  A5  A6  A7  A8  A9
8     1   1   0   0   0   0   0   0   1
9     0   1   1   0   1   0   0   0   0
10    1   1   0   0   0   0   0   0   0
11    0   0   1   0   1   0   0   0   0
12    1   1   0   0   0   0   0   1   0
13    0   0   0   1   1   0   1   0   0
14    0   0   1   1   0   1   1   0   1
15    0   0   1   0   1   0   0   0   0
16    1   1   0   0   0   0   0   1   0
17    0   0   1   0   1   0   0   0   0
18    0   0   0   1   1   0   1   0   0
19    1   0   0   0   1   1   0   0   0
20    0   0   0   1   0   1   1   0   1

9.2. Fusion model analysis

To statistically assess the 9 identified attributes in the initial Q-matrix and evaluate the 20 written items, RUM analysis was conducted using the Arpeggio software. The item parameter estimates were analyzed to evaluate the quality of the reading test items. As discussed above, three types of item parameter estimates were examined in the current study: item difficulty (π*), item attribute discrimination (r*ik), and the completeness index (ci). These Fusion Model parameters have important roles because they not only provide diagnostic information about each test taker and item, but also highlight the properties of the Q-matrix and possible Q-matrix misspecifications (Aryadoust, 2011).
The πi* parameter is the probability that an examinee who has mastered all the Q-matrix-required skills for item i will correctly apply all these skills in solving item i. A high value above 0.6 indicates that examinees have a good chance of responding correctly to the item if they have mastered all the necessary attributes (Johnson, 2006). The item discrimination index (r*ik) shows how well the item discriminates masters from non-masters. A reasonably well-defined Q-matrix and a reasonably well-written item combine to produce r*ik estimates below the 0.8 cutpoint generally accepted by CDM experts, indicating sufficient discrimination of masters from non-masters, while values below 0.5 indicate skills that are highly necessary for answering the question correctly. A low item discrimination value below 0.3 indicates that the item discriminates very well between masters and non-masters of the attribute, and hence that the item has a strong dependence on the corresponding attribute (Johnson, 2006).
In this study, the average π* obtained was 0.833, indicating that the set of identified L2 reading comprehension skills for the items was generally adequate and reasonable. Results indicate that three items (i.e., items 2, 5, and 13) had π* values lower than 0.6, which means that they were too difficult and may need some modification.
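To make the roles of these three parameter types concrete, the sketch below evaluates the RUM item response function for a single item and examinee. It follows the parameterization described by Hartz (2002) and Roussos et al. (2007) rather than quoting the Arpeggio implementation; the item values are taken from item 14 in Table 5, while the examinee mastery pattern and residual ability θ are invented for illustration.

```python
import numpy as np

def rum_probability(pi_star, r_star, q_row, alpha, c, theta):
    """P(correct) under the RUM/Fusion Model (parameterization of Hartz, 2002).

    pi_star : probability that an item master applies all required skills correctly
    r_star  : penalty in (0, 1) applied for each required attribute the examinee lacks
    q_row   : Q-matrix row for the item (1 = attribute required)
    alpha   : examinee mastery pattern (1 = mastered)
    c       : completeness index; the residual ability term matters less as c grows
    theta   : residual ability not captured by the specified attributes
    """
    # Multiply in r*_ik for every required attribute that is NOT mastered.
    penalty = np.prod(np.where((q_row == 1) & (alpha == 0), r_star, 1.0))
    # Rasch-type term for skills left out of the Q-matrix: 1 / (1 + exp(-1.7(theta + c))).
    residual = 1.0 / (1.0 + np.exp(-1.7 * (theta + c)))
    return pi_star * penalty * residual

# Item 14 from Table 5: the Q-matrix requires attributes 3, 4, 6, 7 and 9.
q_row  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1])
r_star = np.array([0, 0, 0.544, 0.863, 0, 0.650, 0.822, 0, 0.786])

# Hypothetical examinee who has mastered every required attribute except attribute 6.
alpha = np.array([1, 1, 1, 1, 1, 0, 1, 1, 1])

p = rum_probability(pi_star=0.894, r_star=r_star, q_row=q_row,
                    alpha=alpha, c=2.54, theta=0.0)
print(round(p, 3))  # roughly 0.894 * 0.650 * 0.99, i.e. about 0.57
```

Lacking one required attribute thus pulls the success probability down from π* by the corresponding r*, while a high c keeps the residual-ability term close to one.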



Table 5
Item Parameter Estimates.

Item  π*     r*1    r*2    r*3    r*4    r*5    r*6    r*7    r*8    r*9    ci
1     0.879  0.895  0      0.827  0      0.568  0      0      0      0      1.97
2     0.583  0.870  0.847  0      0      0      0      0      0      0.678  2.12
3     0.963  0.923  0      0.899  0      0.726  0      0      0      0      2.73
4     0.786  0      0      0.835  0      0.631  0      0      0.877  0      2.43
5     0.376  0      0      0      0.843  0.926  0      0.722  0      0      2.76
6     0.804  0.694  0      0.709  0      0.490  0      0      0      0      2.24
7     0.993  0.860  0      0      0      0.809  0      0.944  0      0      2.61
8     0.950  0.423  0.443  0      0      0      0      0      0      0.811  2.68
9     0.962  0      0.694  0.536  0      0.853  0      0      0      0      2.70
10    0.769  0.531  0.580  0      0      0      0      0      0      0      2.02
11    0.974  0      0      0.594  0      0.759  0      0      0      0      1.60
12    0.789  0.828  0.521  0      0      0      0      0      0.757  0      2.47
13    0.557  0      0      0      0.704  0.883  0      0.730  0      0      2.04
14    0.894  0      0      0.544  0.863  0      0.650  0.822  0      0.786  2.54
15    0.979  0      0      0.812  0      0.713  0      0      0      0      0.87
16    0.940  0.939  0.941  0      0      0      0      0      0.878  0      1.05
17    0.661  0      0      0.887  0      0.821  0      0      0      0      0.65
18    0.934  0      0      0      0.909  0.860  0      0.884  0      0      0.74
19    0.997  0.981  0      0      0      0.938  0.980  0      0      0      1.72
20    0.863  0      0      0      0.738  0      0.804  0.826  0      0.719  0.80
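Given estimates such as those in Table 5, the cutpoints used in this study (π* below 0.6 for an overly difficult item, at least one r* below 0.8 for adequate discrimination, and ci of roughly 1.25 or less for an incomplete attribute specification, as discussed above and below) can be applied mechanically. A minimal sketch over a few rows of Table 5:

```python
# Each entry: item number, pi*, dict of non-zero r* values by attribute, completeness index c.
items = [
    (5,  0.376, {4: 0.843, 5: 0.926, 7: 0.722}, 2.76),
    (14, 0.894, {3: 0.544, 4: 0.863, 6: 0.650, 7: 0.822, 9: 0.786}, 2.54),
    (16, 0.940, {1: 0.939, 2: 0.941, 8: 0.878}, 1.05),
    (19, 0.997, {1: 0.981, 5: 0.938, 6: 0.980}, 1.72),
]

for number, pi_star, r_star, c in items:
    flags = []
    if pi_star < 0.6:
        flags.append("too difficult even for item masters")
    if min(r_star.values()) >= 0.8:
        flags.append("discriminates no attribute sufficiently")
    if c <= 1.25:
        flags.append("attribute specification likely incomplete")
    print(f"Item {number}: {', '.join(flags) if flags else 'no flags'}")

# Expected flags: item 5 (difficulty), item 16 (discrimination and completeness),
# item 19 (discrimination only); item 14 raises no flags.
```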

However, items 2 and 13, with π* values of 0.58 and 0.56, respectively, were less of a concern than item 5, with a π* value of 0.38, which should be replaced. This illustrates that estimated item parameters can drive decisions to replace one or more items that, when carried out, will improve the diagnostic power of the test. The item parameter estimates produced from the current study are presented in Table 5.
Examining the table, it is observed that most r* values were below 0.9, indicating that the items were discriminating at least somewhat between masters and non-masters of the attributes. Most items, all except 7, 16, 17, 18 and 19, have at least one r* below 0.8 and hence discriminate at least one attribute sufficiently. For the remaining five items, responses of both masters and non-masters were only slightly influenced by the specified attributes. When such a result is obtained, revising these items, or replacing them with items that better discriminate between masters and non-masters of the specified attributes, will create a better performing diagnostic test. Indeed, one fifth of the test items essentially provide no diagnostic power, so replacing them with effective items would make a major difference in the discrimination power of the test. Therefore, the r* parameter can be used to evaluate the quality of an item in terms of its discriminatory power (Leighton & Gierl, 2007).
The completeness index, ci, shows the degree to which the attributes specified in the Q-matrix are complete in describing the skills needed to respond successfully to the ith item, and it ranges from 0 to 3. Values close or even fairly close to zero, say ≤ 1.25 (a somewhat arbitrary cutoff), indicate that the item has low completeness, that is, that other unspecified skills are important for answering item i correctly. ci values somewhat close to 3 indicate that the specified attributes in the Q-matrix suffice to explain examinee responding to that item (von Davier, DiBello, & Yamamoto, 2006). From a brief glance at the table it is observed that items 1–14 and 19 have an overall high completeness index of ≥ 1.5, and in all but two cases ≥ 2, while items 15–18 and 20 show a low completeness index of 0.65 to 1.05. This indicates that items 15–18 and 20 depend on attributes besides those specified, attributes that seriously impact examinee responding.
Interestingly, items 16 and 18 are virtually useless in discriminating among the specified attributes. Item 19 is also useless in discriminating among the specified attributes, but it is seemingly not influenced by attributes lying outside the specified latent space, as its c = 1.72 suggests. The result is that the item is both useless for our purposes and also very easy. Item 7 has almost the same story as item 19. Interestingly, items 16–18 and 20 do suggest the possibility that the attribute space is incomplete and that one or more attributes could be added to the 9 attributes. The above discussion demonstrates that item parameter estimates do provide substantially meaningful information regarding the quality of the items. Thus the next iteration of the test can be considerably improved, based on the RUM analysis.

9.3. Analysis of model fit

In order to examine the fit of the RUM to the data, two types of goodness-of-fit measures were used: (1) FUSIONStats and (2) item mastery statistics (IMStats). FUSIONStats compares the difference between the observed item p-value (the proportion of examinees getting the item correct) and the estimated item p-value (the model-estimated proportion of examinees that should get the item correct) for each item. A small difference between the two for most, or ideally all, of the items suggests a good fit of the data. Table 6 shows the observed and estimated item p-values for each item.

Table 6
Observed and estimated item p-values.

Item  Observed p-value  Estimated p-value  Absolute observed−estimated p-value
1     0.660             0.670              0.010
2     0.461             0.498              0.037
3     0.825             0.847              0.022
4     0.615             0.649              0.034
5     0.314             0.332              0.018
6     0.554             0.607              0.053
7     0.872             0.898              0.026
8     0.629             0.722              0.093
9     0.713             0.730              0.017
10    0.537             0.602              0.065
11    0.700             0.737              0.037
12    0.580             0.625              0.045
13    0.433             0.456              0.023
14    0.603             0.671              0.068
15    0.630             0.673              0.043
16    0.678             0.719              0.041
17    0.413             0.439              0.026
18    0.595             0.642              0.047
19    0.863             0.887              0.024
20    0.482             0.570              0.088
Mean absolute difference: 0.040


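Before turning to the graphical comparison in Fig. 1, the FUSIONStats screening summarized in Table 6 can be reproduced with a few lines. A minimal sketch using a subset of the Table 6 values and the 0.05 criterion discussed below (Roussos et al., 2007):

```python
# Observed and model-estimated proportions correct for a subset of items (from Table 6).
p_values = {
    2:  (0.461, 0.498),
    8:  (0.629, 0.722),
    14: (0.603, 0.671),
    19: (0.863, 0.887),
    20: (0.482, 0.570),
}

diffs = {item: abs(obs - est) for item, (obs, est) in p_values.items()}
mean_abs_diff = sum(diffs.values()) / len(diffs)

# Items whose observed and estimated p-values differ by more than 0.05
# are candidates for a lack-of-fit review (Roussos et al., 2007).
flagged = [item for item, d in diffs.items() if d > 0.05]

print(f"mean absolute difference over this subset: {mean_abs_diff:.3f}")
print(f"flagged items: {flagged}")  # items 8, 14 and 20 exceed 0.05 in this subset
```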

Fig. 1. Observed vs. Estimated Item p-value.


The absolute difference between each observed and estimated item p-value should be below the suggested value of 0.05 for all items, as put forth in Roussos et al. (2007). However, in this case the value was higher than 0.05 for four items out of twenty: item 8 at 0.09, item 10 at 0.06, item 14 at 0.06 and item 20 at 0.08. This suggests that these items may have moderate lack-of-fit problems. However, the mean absolute difference between the p-values was low at 0.04, suggesting good overall fit. Fig. 1 graphically depicts the observed versus estimated item p-values, which suggests that the RUM fits the data well.
In addition, item mastery statistics (IMStats) were used to compare the observed performance of item masters and item non-masters at the item level. Three values are considered to evaluate IMStats: phat(m), the observed proportion responding correctly to an item among item masters; phat(nm), the observed proportion responding correctly to an item among item non-masters; and pdiff, the average difference between phat(m) and phat(nm) across items. In this study, as shown in Table 7, the average phat(m) across all items was 0.774, which indicates that the average observed probability of a correct response by item masters was relatively high at 77.4%. On the other hand, the average phat(nm) was 0.322, indicating that the average observed probability of a correct response by item non-masters was much lower at 32.2%. Thus, the pdiff was 0.452, indicating that item masters outperformed item non-masters of attributes on average by 45.2%. This high value implies a good fit between the estimated model and the observed data, and moreover indicates the strong diagnostic power of this L2 reading test. Further, it indicates the usefulness of the Fusion modeling approach for carrying out cognitive diagnostic studies such as the one reported here.
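The three IMStats quantities are simple summaries of the response matrix once each examinee has been labelled a master or non-master of each item's required attributes. A minimal sketch, assuming a 0/1 response matrix and a 0/1 item-mastery indicator; the arrays here are randomly generated toy data rather than the study data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 examinees x 20 items.
responses   = rng.integers(0, 2, size=(200, 20))   # 1 = item answered correctly
item_master = rng.integers(0, 2, size=(200, 20))   # 1 = examinee masters all required attributes

def imstats(responses, item_master):
    """Return phat(m), phat(nm) and pdiff, each averaged over items."""
    phat_m, phat_nm = [], []
    for i in range(responses.shape[1]):
        masters = item_master[:, i] == 1
        phat_m.append(responses[masters, i].mean())    # proportion correct among item masters
        phat_nm.append(responses[~masters, i].mean())  # proportion correct among item non-masters
    phat_m, phat_nm = np.array(phat_m), np.array(phat_nm)
    return phat_m.mean(), phat_nm.mean(), (phat_m - phat_nm).mean()

avg_m, avg_nm, pdiff = imstats(responses, item_master)
print(f"phat(m) = {avg_m:.3f}, phat(nm) = {avg_nm:.3f}, pdiff = {pdiff:.3f}")
# In the study these averages were 0.774, 0.322 and 0.452 (Table 7).
```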
Table 7
Probability of Correctly Responding to an Item.

Item     phat(m)  phat(nm)  phat(m) − phat(nm)
Item 1   0.841    0.398     0.443
Item 2   0.543    0.398     0.214
Item 3   0.957    0.609     0.348
Item 4   0.772    0.380     0.392
Item 5   0.358    0.197     0.161
Item 6   0.788    0.278     0.510
Item 7   0.982    0.565     0.417
Item 8   0.958    0.181     0.777
Item 9   0.960    0.312     0.648
Item 10  0.718    0.209     0.509
Item 11  0.919    0.334     0.585
Item 12  0.780    0.128     0.652
Item 13  0.552    0.169     0.383
Item 14  0.864    0.091     0.773
Item 15  0.812    0.368     0.444
Item 16  0.790    0.426     0.364
Item 17  0.513    0.301     0.212
Item 18  0.751    0.393     0.358
Item 19  0.938    0.753     0.185
Item 20  0.684    0.012     0.672
Average  0.774    0.322     pdiff = 0.452

In addition, the reliability of the RUM was examined by evaluating the Correct Classification Rate (CCR) index, which, as discussed above, is the power to classify examinees correctly. The output file from the Arpeggio Tabulator, called classfile.csv, reports the estimated correct classification rate CCR(k) for each skill k, with CCR the average over all k. The CCR theoretically ranges between zero and one. In this data set, the CCR was high at 0.826, indicating a high reliability of the RUM and, in particular, the capacity to classify examinees correctly over 80% of the time.
The population mastery probability of each of the reading attributes was also analyzed for the students' L2 reading performance. For the overall test-taking group, the population's probability of mastery for each attribute (pk) was investigated. Table 8 shows the L2 reading attributes and their pk values, which range from a high of 0.769 (deducing word meaning from context) to a low of 0.627 (comprehending text-implicit information). This indicates that 76.9% of the students had mastery of deducing word meaning from context, making it the easiest attribute, while 62.7% of the students had mastery of comprehending text-implicit information, making it the most difficult attribute. The overall average level of mastery probability for the nine L2 reading attributes was 0.701. These population-level statistics should prove useful for instructional purposes.

Table 8
Attribute mastery probability for overall group.

L2 Reading Attribute                          Mastery probability (pk)
deducing word meaning from context            0.769
determining word meaning out of context       0.676
comprehending text-explicit info              0.662
comprehending text-implicit info              0.627
skimming                                      0.722
summarizing                                   0.698
inferencing                                   0.754
applying background knowledge                 0.666
inferring major ideas or writer's purpose     0.737

Among the four knowledge-related attributes (the first four in Table 8), there were small differences in the overall group's attribute mastery probabilities, ranging from 0.627 for comprehending text-implicit information (pragmatic meaning) to 0.769 for deducing word meaning from context (lexical meaning). This signified that "pragmatic" word meaning was the most difficult of these attributes to master, whereas "lexical" word meaning was the easiest to master on this test.
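Both group-level quantities reported in this section, the attribute mastery probabilities of Table 8 and the correct classification rates behind the reliability claim, reduce to column-wise proportions over examinee-by-attribute matrices. The sketch below uses invented matrices; in the study itself the classifications come from the Arpeggio output and the CCR is estimated through the simulation procedure of Roussos et al. (2007), so this is only a simplified illustration of the two computations:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 9                                                  # nine reading attributes
true_alpha      = rng.integers(0, 2, size=(500, K))    # simulated "true" mastery patterns
estimated_alpha = true_alpha.copy()
# Flip 15% of the entries to mimic imperfect classification.
flip = rng.random(true_alpha.shape) < 0.15
estimated_alpha[flip] = 1 - estimated_alpha[flip]

# Group-level mastery proportion per attribute (the p_k reported in Table 8).
p_k = estimated_alpha.mean(axis=0)

# Per-attribute agreement between true and estimated classifications
# (a simple version of CCR(k)), and its average over attributes.
ccr_k = (estimated_alpha == true_alpha).mean(axis=0)
ccr = ccr_k.mean()

print("p_k  :", np.round(p_k, 3))
print("CCR_k:", np.round(ccr_k, 3), "average CCR:", round(ccr, 3))  # near 0.85 with a 15% flip rate
```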



Among the five reading strategies, the attribute mastery probability ranged from 0.666 (applying background knowledge) to 0.737 (inferring major ideas). In other words, about 67% of examinees had mastery of applying background knowledge, making it the most difficult strategy, while about 74% of the examinees had mastery of inferring major ideas, making it the easiest strategy.

10. Discussion and conclusions

Among the four skills of English language proficiency, reading ability might well be considered the most essential skill for success in the academic world. Hence, it is crucial to assess learners' L2 reading ability accurately in order to gauge their overall progress and, in particular, to help enhance specific reading skills where lack of mastery is indicated. The main goal of the current study was to develop an L2 reading test using the ECD framework and based on a cognitive diagnostic model (in our case the RUM). This was done in order to diagnose learners' strengths and weaknesses in L2 reading ability, with the ultimate goal of providing detailed information that can assist teachers and administrators for instructional purposes and thus improve student performance.
In fact, two research questions were addressed in the course of this study. The first concerned the L2 reading attributes necessary for successfully completing every item on the reading test, noting that the 20-item test was then carefully designed to assess mastery of each of the 9 attributes judged to capture the reading comprehension latent space. Raters identified the various attributes, consisting of both knowledge and strategy attributes, by referring to the L2 reading literature and to students' think-aloud verbal reports. This list of nine reading attributes and the developed test were then organized into an item-by-attribute Q-matrix.
Second, test-takers' performance on the reading test was examined for individual-student and group diagnostic purposes. In particular, the test scores were analyzed in conjunction with the initial Q-matrix (see Table 4) using the Fusion model analysis.
Findings of the study suggest that a majority of the items on the test can successfully discriminate between masters and non-masters and are therefore appropriate for CDA. In addition, the list of L2 reading attributes can be used as a framework for CDA research in the future. Regarding the frequency of reading attributes for each item on the test, the Q-matrix was examined to identify any recurring patterns among the L2 reading attributes. Since language knowledge and strategic competence are believed to interact with each other in constituting the mastery of reading ability (Bachman & Palmer, 1996), it was expected that each item on the reading test would measure at least one knowledge-related attribute and one reading strategy. It was indeed found that almost all items measure one knowledge-related attribute and a minimum of one reading strategy. Since strategies, by definition, manage the use of language knowledge (Bachman & Palmer, 1996), they were assumed necessary to reveal mastered knowledge-related attributes.
Examinees' performances on the attributes underlying the reading comprehension items were evaluated. As Roussos et al. (2007, p. 293) put forth, "A key issue for mastery/nonmastery of diagnostic models is whether the proportion of examinees estimated as masters on each skill is relatively congruent with the user's expectations." RUM analysis was carried out to obtain the relationships between the participants' performances and the test items. The Arpeggio suite software provides a number of output files that give specific information regarding examinees' performance on each item of the test. Two of the output files that help us respond to research question 2 (essentially studying examinee/test attribute relationships) are the classification file (classfile.csv) and the fit report file (fitreports.csv). The fit report file provides fit statistics for how well the model fits the test data, noting that a good fit is required for the usefulness of the RUM-based statistical analyses. The classification file indicates the consistency of classifying the examinees in terms of their mastery or non-mastery of each attribute (i.e., the correct classification rate, CCR).
First of all, the model fit the data well for several reasons. First, the mean absolute difference between the observed and estimated item p-values was low at 0.04 (see Table 6). In addition, the average phat(m) across all items was 0.774, indicating that the average probability of a correct response to an item by masters of its attributes was relatively high at 77.4%. This suggests that the 9 attributes include much of what is required for answering the test's items correctly. The average phat(nm) was 0.322, so the average probability of a correct response by non-masters of attributes was much lower, at about 32.2%. This indicates that the items, on average, are "hard" for non-masters. In addition, the pdiff was 0.452, indicating that masters of an item's required attributes outperformed non-masters on average by 45.2% across all items. This high value indicated a good fit between the estimated model and the observed data and further suggests a strong diagnostic power of the test as modelled by the RUM.
According to Lumley (1993), identifying implicit information (equivalent to inferencing) and synthesizing to draw a conclusion (equivalent to summarizing) were difficult reading attributes compared to vocabulary (similar to identifying word meaning) and identifying explicit information (similar to finding information and skimming). This could be attributed to the fact that inferencing and summarizing are higher-level strategies involving more complex cognitive processing than the other three strategies, which involve lower-level processing, as was the case in this study. As an example, summarizing requires readers first to comprehend the overall text and then to extract the gist of its meaning. Understanding the gist involves numerous components, such as knowledge of grammar, vocabulary, discourse structure, and various cognitive processes (Birch, 2002). Thus, the nature of summarizing seems quite complex. In a similar vein, the strategy of inferencing has long been believed to be a challenging one (Fletcher, 2006). In order to make inferences, readers should already have the ability to understand the literal meaning of the text, which makes inferencing more difficult to master. However, the results of this study run contrary to this belief: for the overall group, the attribute of inferencing stands second at 0.754, with less difficulty. On the other hand, between skimming and summarizing, the latter was more difficult at 0.698 as compared with skimming at 0.722. In fact, as put forth by Urquhart and Weir (1998), skimming involves quickly understanding the surface-level propositional meaning of the text and is considered a less challenging strategy.
Furthermore, comprehending text-implicit information, comprehending text-explicit information and applying background knowledge were the more difficult attributes at 0.627, 0.662 and 0.666, respectively. Similarly, deducing meaning from context was the easiest attribute, which is in accordance with the belief that word recognition, which is similar to identifying word meaning, involves lower-level processing (Alderson, 2000). Overall, examinees performed comparatively well on these three attributes (i.e., deducing meaning from context, inferring major ideas and skimming) due to the nature of the attributes, which require relatively less cognitive processing. The detailed student-level classification reports made possible by the RUM analysis of the L2 reading assessment in this study strongly indicate that L2 reading assessments using this test should be highly beneficial in facilitating learning on the part of individual students, in teacher instructional preparation as the course proceeds, and finally in curriculum development on the part of the teacher. With such detailed reports of test results available, teachers can become aware of students' problematic areas and focus on them in lesson planning and in providing learning materials. Since the reading test developed here was based on a cognitive framework followed by RUM analysis, the problematic items were identified and could be further modified or even replaced. Hence, an item bank of cognitive diagnostic items can be developed for L2 reading comprehension testing, for use during or after a reading comprehension unit, and thus for future studies.



11. Limitations, strengths and future research

The current study provides a number of implications for teachers and practitioners, both pedagogically and theoretically. First and foremost are the positive pedagogical implications for assessment purposes implied by the successful attempt at constructing a cognitive diagnostic test to gauge L2 reading proficiency. The type of diagnostic feedback that can be provided to teachers includes attribute mastery probabilities for overall groups and for individual students. Teachers can refer to this information to refine their lesson plans and to provide the L2 reading material necessary to meet the needs of various mastery levels.
Developing a test based on a cognitive diagnostic framework requires much focus on the diagnostic components of test design. By using a RUM statistical analysis, the quality of the test items was evaluated. The item mastery probability difficulty values indicated which items were too difficult given mastery of the attributes involved at the item level, while the item discrimination values indicated which items discriminated between masters and non-masters of a particular attribute, showing that some items were deficient in this regard. The item completeness values indicated that for some items the Q-matrix lacks certain skills that are important for answering those items correctly, and that the Q-matrix should therefore be refined, either by modifying entries or perhaps by adding a skill or two.
Generally, item parameter estimates can be used for improving the quality of reading test items. Indeed, in this case, four of the items need either serious modification or replacement. By providing teachers with item mastery probability difficulty estimates, they can use the detailed information to modify or replace the problematic items and so enhance the quality of the reading comprehension test items. Also, since item discrimination estimates indicate which items discriminated better between masters and non-masters of certain attributes, items similar to those with high discriminatory power can be added to the test to improve it, perhaps considerably.
As all research faces some limitations, the present study might also suffer from some. While determining the attributes that constitute the latent knowledge space of a CDM necessitates a deep understanding of the nature of these cognitive skills and of the resulting test items, the complexity of the reading comprehension space does not allow a full understanding of its cognitive processes (Lee & Sawaki, 2009). In addition, there is a lack of consensus on the skill components of reading comprehension (Alderson, 2000). Further, although a great number of attributes may be identified as related to reading comprehension mastery, not all such attributes can, or should, be kept in the Q-matrix for RUM analysis. As a result, the purpose is not to identify all the attributes that could be involved in correctly or incorrectly responding to the reading comprehension items, but to find the major attributes required to complete each item successfully. Therefore, the reading attributes found are not exhaustive of reading comprehension but are specifically related to this reading comprehension test, which indicates that there may be a need for more attributes, or simply that an item might be measuring inappropriate attributes. Of course, the better the test, the closer the specified test-dependent attributes come to capturing most of the reading comprehension latent space. Meanwhile, the attributes identified for this study, together with their Q-matrix, were not effectively discriminated by all of the non-easy items, specifically not by items 15–18 and 20. Thus, exploration of additional attributes for Q-matrix construction is recommended for future studies of L2 reading test construction for formative assessment purposes.
In addition, the reading comprehension test devised here on the basis of a cognitive diagnostic framework is only at a relatively early stage of experimentation. Further research should be carried out to devise more cognitively based L2 language tests, not only for the reading skill but also for writing and listening assessments. Devising cognitively-based assessments requires numerous pilot studies, and carrying out further research for one content area such as L2 reading, or more generally for other L2 areas, can only improve the accuracy of the items developed and thus lead to more accurate attribute-level classification in formative assessment.
One of the limitations of the study was the format of the reading comprehension test, which consisted strictly of multiple-choice items. In future studies, other item types such as fill-in-the-blank, short essay and open-ended questions should be considered, and corresponding modifications in the Fusion modeling approach are then called for. Another suggestion for future research is to focus on item distractors to enhance the diagnostic potential of the test. The proportion of diagnostic information that is obtained from a diagnostic test greatly depends on the test design and item construction. Specifically, in a retrofitting approach, many or even most test items will fail to differentiate examinees diagnostically on the basis of their underlying skill competencies as well as items specifically designed to do so, as is the case for the test in this study. However, test and item diagnostic discrimination, for either a retrofitted test or a test specifically designed for cognitive diagnosis, can be enhanced by observing examinees' incorrect response patterns. This necessitates developing diagnostically sensitive distractors in MC items (DiBello et al., 2015; DiBello, Roussos, & Stout, 2015).
Another limitation of the study is that the Q-matrix, which is essential in the RUM analysis, could be developed by a greater number of content experts, including their undergoing training sessions prior to rating. In the current study, six content experts chose the L2 reading attributes and coded them into the Q-matrix, which is sufficient for the purpose of the study, but having a greater number of experts could possibly make the Q-matrix more effective and the latent space more complete. Since the quality of the Q-matrix determines the quality of the RUM analysis, it is important to take extreme precautions in developing it (Jang, 2005).
While this study was an attempt at diagnostic assessment of L2 reading attributes, it also demonstrates a need for continued research in the area of cognitive diagnostic assessment in the Iranian context, particularly with regard to constructing diagnostic tests for different L2 language skills, including writing, speaking and listening.
In conclusion, now is the time to focus attention on designing and developing educational assessments that are based on a CDM framework, perhaps using the Fusion modeling approach, for the Iranian context, the L2 language acquisition context, and other learning contexts, such as the STEM disciplines. Carrying through such an endeavour necessitates the cooperation of experts from various fields (i.e., subject matter, learning sciences, measurement, pedagogy). By succeeding in such an effort, educational assessments will become more instructionally oriented and more relevant to the needs of present-day classrooms. This paper establishes that such efforts should be highly successful, probably in a great many subject areas, not merely L2 English reading comprehension for first-language Farsi college students.

References

Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. A & C Black.
Alderson, J. C., & Huhta, A. (2005). The development of a suite of computer-based diagnostic tests based on the Common European Framework. Language Testing, 22(3), 301–320.
Alderson, J. C. (2010). Cognitive diagnosis and Q-Matrices in language assessment: A commentary. 96–103.
Aryadoust, V. (2011). Cognitive diagnostic assessment as an alternative measurement model. SHIKEN: JALT Testing & Evaluation SIG Newsletter, 15(1), 2–6.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. London: Oxford.
Birch, B. M. (2002). English L2 reading: Getting to the bottom. Routledge.
Birenbaum, M., Kelly, A. E., & Tatsuoka, K. K. (1993). Diagnosing knowledge states in algebra using the rule-space model. Journal of Research in Mathematics Education, 24, 442–459.
reading skill, but also for writing and listening assessments. Devising



Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language Li, H. (2011). Evaluating language group differences in the subskills of reading using a cognitive
testing: Examining attributes of a free response listening test. Language Testing, 15(2), diagnostic modeling and differential skill functioning approach. The Pennsylvania State
119–157. University [Doctoral dissertation].
Carrell, P. L., & Grabe, W. (2002). Reading. In N. Schmitt (Ed.). An introduction to applied Lumley, T. (1993). The notion of sub-skills in reading comprehension test: An EAP ex-
linguistics (pp. 233–250). London: Arnold. ample. Language Testing, 10(3), 211–234.
Cohen, A. D., & Upton, T. A. (2006). Strategies in responding to the new TOEFL reading tasks. Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based
Princeton, NJ: ETS [TOEFL Monograph No. MS-33]. language assessment. Language Testing. Special Issue: Interpretations, Intended Uses, and
DiBello, L. V., Stout, W. F., & Roussos, L. (1995). Unified cognitive psychometric as- Designs in Task-based Language, 19(4), 477–496.
sessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika,
R. L. Brennan (Eds.). Cognitively diagnostic assessment (pp. 361–390). Hillsdale, NJ: 59, 439–483.
Erlbaum. Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4),
DiBello, L. V., Roussos, L. A., & Stout, W. (2006). 31a review of cognitively diagnostic 379–416.
assessment and a summary of psychometric models. Handbook of statistics, 26, Mislevy, R. J. (2006). Cognitive psychology and educational assessment. Educational
979–1030. Measurement, 4, 257–305.
DiBello, L. V., Henson, R. A., & Stout, W. F. (2015). A family of generalized diagnostic Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple
classification models for multiple choice option-based scoring. Applied Psychological item types, missing data: And rated responses. Journal of Educational and Behavioral
Measurement, 39(1), 62–79. Statistics, 24, 342–366.
DiBello, L., & Stout, W. (2008a). Arpeggio documentation and analyst manual. Chicago: Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know:
Applied informative assessment research enterprises (AIARE)–LLC. The science and design of educational assessment (National research council’s committee
DiBello, L., & Stout, W. (2008b). Arpeggio suite, version 3.1. 001 [Computer program]. on the foundations of assessment. Washington, DC: National Academy Press.
Chicago: Applied informative assessment research enterprises (AIARE)–LLC. Purpura, J. E. (2004). Assessing grammar. John Wiley Sons, Inc.
Embretson, S. E., & Gorin, J. (2001). Improving construct validity with cognitive psy- R Development CORE TEAM (2010). R: A language and environment for statistical com-
chology principles. Journal of Educational Measurement, 38(4), 343–368. puting. Vienna, Austria: R Foundation for Statistical Computing ISBN
Fletcher, J. M. (2006). Measuring reading comprehension. Scientific Studies of Reading, 3–900051–07–0, URL: http://www. R-project.org.
10(3), 323–330. Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. H.
Francis, D. J., Snow, C. E., August, D., Carlson, C. D., Miller, J., & Iglesias, A. (2006). (2007). The fusion model skills diagnostic system. In J. Leighton, & M. Gierl (Eds.).
Measures of reading comprehension: A latent variable analysis of the diagnostic as- Cognitive diagnostic assessment for education: Theory and applications (pp. 275–318).
sessment of reading comprehension. Scientific Studies of Reading, 10(3), 301–322. New York, NY: Cambridge University Press.
Gass, S. M., & Mackey, A. (2000). Stimulated recall methodology in second language research. Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification
Routledge. models: A comprehensive review of the current state-of-the-art. Measurement, 6(4),
Goodman, D. P., & Hambleton, R. K. (2004). Student test score reports and interpretive 219–262.
guides: Review of current practices and suggestions for future research. Applied Rupp, A. A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with
Measurement in Education, 7(2), 145–220. multiple-choice questions shapes the construct: A cognitive processing perspective.
Gu, Z. (2011). Maximizing the potential of multiple-choice items for cognitive diagnostic as- Language Testing, 23(4), 441–474.
sessment. University of Toronto [Doctoral dissertation]. Rupp, A. A., Templin, J., & Henson, R. A. (2012). Diagnostic measurement: Theory, methods,
Hartman, H. J. (2001). Developing students’ metacognitive knowledge and skills. In H. J. and applications. Guilford Press.
Hartman (Ed.). Metacognition in learning and instruction: Theory, Research and Practice Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-matrix construction: Defining the link
(pp. 33–68). Dordrecht, the Netherlands: Kluwer. between constructs and test items in large-scale reading and listening comprehension
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive assessments. Language Assessment Quarterly, 6(3), 190–209.
abilities: Blending theory with practicality. Dissertation Abstracts International: Section Sheehan, K., & Mislevy, R. (1990). Integrating cognitive and psychometric models to
B: The Sciences and Engineering, 63(2-B), 864. measure document literacy. Journal of Educational Measurement, 27, 255–272.
Huang, T. W., & Wu, P. C. (2013). Classroom-based cognitive diagnostic model for a Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational
teacher-made fraction- decimal test. Educational Technology & Society, 16(3), measurement. American Council on Education.
347–361. Stiggins, R., Arter, J., & Chappuis, S. (2004). Classroom assessment for student learning:
Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching and Doing it right–using it well. Dover, NH: Assessment Training Institute.
learning in the context of NG TOEFL. [Available from ProQuest Dissertations and Svetina, D., Gorin, J. S., & Tatsuoka, K. K. (2011). Defining and comparing the reading
Theses database. (AAT 3182288)]. comprehension construct: A cognitive-Psychometric modelling approach.
Jang, E. E. (2009). Demystifying a Q-matrix for making diagnostic inferences about L2 International Journal of Testing, 11(1), 1–23.
reading skills. Language Assessment Quarterly, 6(3), 210–238. Tatsuoka, K. K. (1995). Architecture of knowledge structure and cognitive diagnosis: A
Johnson, J. F. (2006). Diagnosing skill mastery in the national assessment of educational statistical pattern recognition and classification approach. In P. D. Nichols, S. F.
progress: Applications of the Fusion Model. [Available from ProQuest Dissertations and Chipman, & R. L. Brennan (Eds.). Cognitively diagnostic assessment (pp. 327–361).
Theses database. (AAT 3223309)]. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kim, A. Y. A. (2015). Exploring ways to provide diagnostic feedback with an ESL pla- Templin, L. (1958). Generalized linear mixed proficiency models for cognitive diagnosis.
cement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing [Available from ProQuest Dissertations and Theses database. (AAT 3160960)].
[0265532214558457]. Toulmin, S. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.
Lee, Y. W., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL Urmston, A., Raquel, M., & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiary
reading and listening assessments. Language Assessment Quarterly, 6(3), 239–263. students’ English language proficiency: The development and validation of DELTA.
Leighton, J. P., & Gierl, M. J. (2007). Cognitive diagnostic assessment for education: Theory Hong Kong Journal of Applied Linguistics, 14(2), 60–82.
and applications. Cambridge University Press. Urquhart, S., & Weir, C. J. (1998). Reading in a second language: Process, product and
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for practice. New York: Longman.
cognitive assessment: A variation on tatsuoka's rule-Space approach. Journal of de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice
Educational Measurement, 41(3), 205–237. options. Applied Psychological Measurement, 33(3), 163–183.
Li, H., & Suen, H. K. (2013). Constructing and Validating a Q-Matrix for Cognitive von Davier, M., DiBello, L., & Yamamoto, K. Y. (2006). Reporting test outcomes with models
Diagnostic Analyses of a Reading TestConstructing and validating a Q-Matrix for for cognitive diagnosis (ETS Research Rep. No. RR-06-28)Princeton, NJ: ETS.
cognitive diagnostic analyses of a reading test. Educational Assessment, 18(1), 1–25.


