ARTICLE INFO

Keywords:
Attributes
CDA
Formative assessment
CDM
Fusion model
Q-matrix
Reading comprehension test
RUM
Second language

ABSTRACT

A critical issue in cognitive diagnostic assessment (CDA) lies in the dearth of research on developing tests for cognitive diagnostic purposes. Most research thus far has been carried out on large-scale tests, e.g., the Test of English as a Foreign Language (TOEFL), the Michigan English Language Assessment Battery (MELAB), the International English Language Testing System (IELTS), etc. In particular, CDA of formative language assessment, which aims to inform instruction and to discover strengths and weaknesses of students, has not been conducted in a foreign (i.e., second) language-learning context. This study explored how a reading comprehension test developed on the basis of a cognitive framework could be used for such diagnostic purposes. To achieve this, initially, a list of 9 reading attributes was prepared by experts based on the literature, and the targeted attributes were then used to construct a 20-item reading comprehension test. Second, a tentative Q-matrix that specified the relationships between test items and the target attributes required by each item was developed. Third, the test was administered to seven language-testing experts who were asked to identify which of the 9 attributes were required by each item of the test. Fourth, on the basis of the overall agreement of the experts' judgments concerning the choices of attributes, a review of the literature, and the results of student think-aloud protocols, the tentative Q-matrix was refined and used for statistical analyses. Finally, the test was administered to 1986 students of a General English Language Course at the University of Tehran, Iran. To examine the CDA of the test, the Reparameterized Unified Model (RUM), also known as the Fusion Model, a type of cognitive diagnostic model (CDM), was used for further refining the Q-matrix for future data analyses and, most importantly, for diagnosing the participants' strengths and weaknesses. Data analysis results confirmed that the nine proposed reading attributes are involved in the reading comprehension test items. Such diagnostic information could help teachers and practitioners prepare instructional materials that target specific weaknesses, inform them of the more problematic areas that need to be emphasized in class, and thus support better-planned L2 reading instruction. Further, such information could inform individualized student instruction and produce improved diagnostic tests for future use.
⁎ Corresponding author.
E-mail address: franjbaran@ut.ac.ir (F. Ranjbaran).
http://dx.doi.org/10.1016/j.stueduc.2017.10.007
Received 17 December 2016; Received in revised form 8 October 2017; Accepted 25 October 2017
© 2017 Elsevier Ltd. All rights reserved.
F. Ranjbaran, S.M. Alavi / Studies in Educational Evaluation
during the instructional term, sufficient and timely feedback can be provided to students in order to improve learning and eliminate weaknesses during the learning process.

Criticism occurs because the main goal of educational tests is usually to provide a quantitative assessment of a student's general, overall, often unidimensional ability and proficiency as compared to other students in a normative group. This type of norm-referenced testing has been used extensively for ranking and selecting students for various educational decisions. Besides providing only general, summarizing, and usually unidimensional information about students' skills and their ability to perform on a test, these assessments are invariably incapable of providing the detailed information about students' strengths and weaknesses that could help students improve their skills or assist the teacher in instructional planning. Recently, scholars have suggested that cognitive diagnostic assessment has a key role in improving the informational value of assessment (Alderson, 2010; de la Torre, 2009; Jang, 2005; Leighton & Gierl, 2007; Rupp, Templin, & Henson, 2012). In his commentary on "Cognitive Diagnosis and Q-Matrices in Language Assessment," Alderson (2010) expresses his disappointment that very few truly diagnostic tests exist. In fact, nearly all studies carried out thus far have been on existing large-scale assessments and proficiency tests, not on tests developed for low-stakes formative assessment. He argues that far more studies focus on developing and researching high-stakes proficiency tests aimed at placement, achievement, or aptitude than on tests specifically constructed for cognitive diagnosis in the form of classroom-based formative assessments. The most desirable cognitive diagnostic assessment is one that is diagnostically designed, constructed, and scored from the initial phase. In such an approach, cognitive attributes are explicitly defined and targeted in the test construction phase. These predetermined attributes should be in line with the instructional goals. Once the attributes are set, the data are analyzed with an appropriate CDM. Afterwards, the scores are reported in a fine-grained diagnostic system. While fine-grained cognitive diagnostic assessment is intended to inform instructional settings in this way, diagnostically constructed designs have hardly been discussed in the literature. A few tests have been designed to fulfill the needs of diagnostic analysis (e.g., DIALANG by Alderson, 2005; Alderson & Huhta, 2005; DELNA (www.delna.auckland.ac.nz/uoa); and DELTA by Urmston, Raquel, & Tsang, 2013), yet none have provided individualized score reports to enhance learning and teaching at the classroom level. This study responds to the call for cognitive diagnostic assessment using a specially constructed diagnostic test, one that attempts to provide detailed information about students' strengths and weaknesses in L2 reading comprehension, and perhaps in reading comprehension in general.

2. Literature review

2.1. L2 reading ability

In CDA, the different components of a specific domain (in this case, L2 reading) are referred to as attributes. Attributes are the divided components of a general cognitive ability, which can be defined as "procedures, skills, or knowledge a student must possess in order to successfully complete the target task" (Birenbaum, Kelly, & Tatsuoka, 1993, p. 443). Therefore, L2 reading attributes comprise different types of language knowledge and reading strategies, which are required in comprehending texts (Birenbaum et al., 1993; Templin, 2004). CDA has been introduced as a new method in educational measurement that can provide fine-grained diagnostic information about test-takers' degree of mastery of domain sub-skills (Lee & Sawaki, 2009). Sub-skills are defined as domain-specific knowledge and skills that are required to indicate mastery in a specific cognitive domain (Leighton & Gierl, 2007). Taking reading as a cognitive domain, it is necessary to have knowledge of vocabulary and grammar and to make inferences in order to fully comprehend a text. These are considered the sub-skills of the reading domain; they are also called attributes, and the two terms are used interchangeably throughout this paper. The most distinct characteristic of this approach is that it is the point where cognitive psychology and psychometric modeling meet within a single framework; it therefore aims to assess test-takers' knowledge and underlying cognitive processing sub-skills (DiBello, Roussos, & Stout, 2006; Leighton & Gierl, 2007).

In the assessment of reading comprehension in a second or foreign language, the many underlying cognitive attributes required for reading ability mastery make it a complex process. Reading ability is a fundamental tool for gaining knowledge and improving learning in academic settings and in everyday life in general. Therefore, it comes as no surprise that the nature of reading ability has been the focus of research in applied linguistics, education, and psychology for quite some time (Cohen & Upton, 2006). Despite the extensive research on reading ability, there is still some debate as to how second language reading ability is defined and how its performance should be evaluated and reported. It seems that teachers, students, and practitioners have not been given diagnostic feedback tools that could be used to improve reading ability, specifically for classroom-based profile score reporting. These are issues that particularly need consideration in the context of L2 reading assessment in Iran.

At times, there is such emphasis on reading strategies that other important elements of reading competence, such as language knowledge (including pragmatic knowledge and grammatical knowledge), have been given less attention. One aspect of second language reading ability is the use of language to understand written text. Therefore, knowledge of language components and strategic reading competence should both be considered in order to master written text. While the difficulty of defining the construct of reading ability is clear, other problems have been observed with regard to how L2 reading performance is analyzed and reported. L2 reading test scores are often reported as a general test score without any detailed information (Goodman & Hambleton, 2004). When an exam provides only one total score, it can serve the test's immediate summative purpose; however, it cannot be easily used to improve reading performance (Stiggins, Alter, & Chappius, 2004). Providing only a total score does not supply information regarding each student's specific strengths and weaknesses (Sheehan & Mislevy, 1990). On the other hand, a detailed score report for each individual, including his or her level on each reading attribute, can be used both to improve individual student reading ability and to guide teacher instruction (Snow & Lohman, 1989).

2.2. Frameworks for developing cognitive diagnostic tests

There are two measurement-driven approaches that have been widely used for diagnostic test development. One is Embretson's Cognitive Design System (CDS) (Embretson & Gorin, 2001) and the other is Mislevy's Evidence-Centered Design (ECD) (Mislevy, 1994; Mislevy, Steinberg, & Almond, 2002). These two approaches focus on the use of cognition in the process of item and test development, considering issues of construct definition during item writing and concluding with validation procedures (Leighton & Gierl, 2007).

CDS and ECD may differ in their emphasis on the various parts of assessment design and in their details, but both share the three principles of the assessment triangle. The assessment triangle includes three related elements: cognition (theories of learning), observation (test data), and interpretation (the probabilistic model that relates a student's multidimensional latent cognitive learning state to his/her test response pattern) (Pellegrino, Chudowsky, & Glaser, 2001). This NRC panel of researchers states that cognition is related to a cognitive model of how students represent knowledge and how they develop competence in a certain subject (p. 44). A cognitive model provides a description of what should be assessed, but it is different from cognition
in a number of ways: According to Leighton and Gierl (2007), a cognitive model specifies the cognitive components and processes that constitute the construct (such as reading comprehension) being tested. This leads to more detailed specifications that are more applicable for instructional feedback. It should be noted that cognitive theory explains these specifications, meaning that a model of the specific cognitive processes related to the construct being tested is empirically supported.

In this study, the theory of learning underlying the assessment triangle includes reading theories of second language learning, including information processing theory, the constructivist theory of reading, and Bloom's Revised Taxonomy. It is clear that reading requires engagement in processing at the phonological, morphological, syntactic, semantic, and discourse levels, as well as in goal setting, text-summary building, interpretive elaborating from knowledge resources, monitoring and assessment of goal achievement, making various adjustments to enhance comprehension, and making repairs to comprehension processing (Carrell & Grabe, 2002, p. 234). Even though a great portion of the reading process is somewhat natural and thought to lie beyond our conscious control, it is believed that readers do employ a high level of active control over their reading process through the use of strategies, which are deliberate and purposeful conscious procedures (Urquhart & Weir, 1998). While processes are considered more automatic than intentional, and could even be considered subconscious, strategies are more controllable and are used to act upon the processes (Cohen & Upton, 2006). Insights into reading strategies explain how readers interact with the text and to what extent the strategies used can influence their reading comprehension rate. Following the use of learning theories for the cognition element, which are used to construct test items, the observation element includes gathering data through administering the test. The interpretation element makes use of the CDM (in the case of this study) to relate an examinee's multidimensional latent cognitive learning state to his/her test response pattern. These circulating elements of the assessment triangle intermingle and eventually provide teachers with the means for an admittedly imperfect attempt at translating the student's responses into beneficial information.

The core purpose of diagnostic assessment development from an ECD framework perspective is the development of coherent evidentiary arguments about students that can serve as assessment of and assessment for learning, depending on the desired primary purpose of a particular assessment. The structure of the evidentiary arguments used in the assessment narrative can be described with the aid of terminology first introduced by Toulmin (1958). An evidentiary argument is constructed through a series of logically connected claims or propositions that are supported by data through warrants and backing. Based on Toulmin's schema for arguments, a claim about an examinee's knowledge and skill is situated at the top. At the bottom is the observation of the examinee acting in some situation. An assessor's reasoning moves up from what is important in the examinee's actions (which the examinee produces) and the features of the situation that are important in eliciting those actions (partly determined by the assessment designer, but also partly determined by the examinees as they interact with the task). These data support the claim through a warrant, or rationale, for why examinees with particular capabilities are likely to act in certain ways in the situation at hand.

In concrete terms, the ECD framework allows one to distinguish the different structural elements and the required pieces of evidence in narratives similar to the following: Ali has most likely mastered basic reading comprehension skills (claim), because he has answered correctly the items of a reading passage about the theory of education (data). It is most likely that he did this because he applied all of the reading skills and strategies correctly (backing) and the task was designed to force him to do that (backing). He may have used his background knowledge to understand parts of the text (alternative explanation), but that is unlikely since he was new to the topic (refusal) (Rupp et al., 2012).

While a theory-driven process of analysing and modeling is led by the main elements of expertise in a domain, the core elements in the ECD framework include (1) the student models, which formalize the postulated proficiency structures for different tasks; (2) the task models, which formalize which aspects of task performance are coded in what manner; and (3) the evidence models, which are the psychometric models linking those two elements. These three core components are complemented by (4) the assembly model, which formalizes how these three elements are linked in the assessment, and (5) the presentation model, which formalizes how the assessment tasks are presented (Rupp et al., 2012). Specifically, the student model is motivated by the learning theory that underlies the diagnostic assessment system. It specifies the relevant variables or aspects of learning that we want to assess, at a grain size that suits the purpose of the diagnostic assessment. As many of the characteristics of learning that we want to assess are not directly observable, the student model provides a probabilistic model for making claims about the state, structure, and development of a more complex underlying system.

2.3. Cognitive diagnostic models

Cognitive Diagnostic Models (CDMs) are probability models for data analysis, and often associated techniques, that are designed to link cognitive theory with the test items' psychometric (probabilistic) properties (Leighton & Gierl, 2007). According to Rupp and Templin (2008), CDMs are:

"probabilistic, confirmatory multidimensional latent-variable models with a simple or complex loading structure. They are suitable for modeling observable categorical response variables and contain unobservable (i.e., latent) categorical predictor variables. The predictor variables are combined in compensatory and non-compensatory ways to generate latent classes." (Rupp & Templin, 2008, p. 226)

Examples of CDMs include Tatsuoka's Rule Space Model (Tatsuoka, 1995), the Attribute Hierarchy Method (AHM) (Leighton, Gierl, & Hunka, 2004), an enhancement of the Fusion Model termed the RUM (Hartz, 2002; Roussos et al., 2007), and the original RUM (DiBello, Stout, & Roussos, 1995). Most CDMs are IRT-based latent-class models whose most important characteristic is the discrete multidimensionality of the latent space. In earlier unidimensional IRT models, examinee ability was modelled by a single general continuous ability parameter. By contrast, the discrete multidimensionality of CDMs makes it possible to investigate the mental processes underlying the student's test response by breaking the overall targeted ability, e.g., reading competency, down into its component parts. The number of dimensions depends on the number of skill components involved in the assessment. Here, α is a vector of dichotomous variables, and the model estimates, for each examinee, the probability of mastery of each attribute. These probabilities can then be used to estimate each αk, that is, to classify an examinee as either a master (αk = 1) or a non-master (αk = 0) of attribute k. Another feature of CDMs is being compensatory or non-compensatory (DiBello et al., 1995; Roussos et al., 2007). In non-compensatory models, success in one attribute does not compensate for deficiency in another attribute when responding to an item, as is the case with the RUM in this study. According to Roussos et al. (2007), a non-compensatory interaction of skills occurs when application of all the required skills is necessary in order to successfully complete the task; that is, a lack of competency in any one of the skills for the task will seriously hinder successful completion of that task. Non-compensatory models have been preferred for cognitive diagnostic analysis, as they can generate more fine-grained diagnostic information.

The latent variables of CDMs consist of dichotomous levels, such as mastery or non-mastery, or polytomous levels, such as a rating variable with values such as excellent, good, fair, poor, etc. The design structure of a CDM is the Q-matrix, which maps the skills necessary to successfully answer each item on the test (Li & Suen, 2013).
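The Q-matrix and mastery-vector machinery described above can be made concrete with a short Python sketch. As a deliberately simpler stand-in for the RUM used in this study, it uses DINA-style conjunctive logic with hypothetical slip and guess parameters and an invented 3-item, 4-attribute Q-matrix; none of these values come from the article.

```python
import numpy as np

# Hypothetical Q-matrix: 3 items x 4 attributes (1 = item requires attribute).
Q = np.array([
    [1, 0, 1, 0],   # item 1 requires attributes 1 and 3
    [0, 1, 0, 0],   # item 2 requires attribute 2
    [1, 1, 0, 1],   # item 3 requires attributes 1, 2, and 4
])

# Per-item slip (master answers wrong) and guess (non-master answers right).
slip = np.array([0.10, 0.10, 0.20])
guess = np.array([0.20, 0.25, 0.10])

def p_correct(alpha, Q, slip, guess):
    """DINA-style non-compensatory model: answering correctly (beyond
    guessing) requires ALL attributes in the item's Q-matrix row; strength
    in one attribute does not compensate for a missing one."""
    eta = np.all(alpha >= Q, axis=1)       # has every required attribute?
    return np.where(eta, 1 - slip, guess)  # otherwise only a guess remains

# An examinee who has mastered attributes 1, 3, and 4 but not attribute 2.
alpha = np.array([1, 0, 1, 1])
print(p_correct(alpha, Q, slip, guess))
```

Note how the missing second attribute drags items 2 and 3 down to the guessing probability, exactly the non-compensatory behavior described in the text.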
Most of the CDM-based diagnostic studies carried out thus far have been limited to analysis of existing summative tests, not the development of new, cognitive-diagnostically focused formative assessments. By contrast, this study focuses on the cognitive diagnostic assessment of a test developed on the basis of a cognitive diagnostic modeling framework.

2.4. Fusion model

The Fusion Model, also known as the Reparameterized Unified Model (RUM), is a cognitive diagnostic model that is used to make inferences about the mastery level of each latent-space attribute for each examinee, based on the examinee's item responses (Stout, 2008a, 2008b). In other words, the RUM is a discrete multidimensional IRT model that expresses the stochastic relationship between item responses and underlying skills as follows (DiBello et al., 1995):

strength of association of each skill/item combination. For example, an expert-proposed skill required for an item could be dropped because statistical evidence indicates the hypothesized association is lacking. This inferred Q-matrix is then available for future statistical analyses.

In developing a Q-matrix, a review of the literature indicates that a number of alternatives exist. One of the less costly but less efficient approaches is to use existing test specifications, such as those provided by the test's developer; however, the attributes indicated via test specifications are usually too general for diagnostic purposes. According to Leighton and Gierl (2007), relying on existing test specifications for Q-matrices is usually unwarranted, so in this study an attempt was made to develop a reading comprehension test based on a cognitive diagnostic framework, with the help of a rigorous Q-matrix construction. Jang (2009) suggests using data from a small number of student verbal reports of how they answered each item to construct the Q-matrix.
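As given in the CDM literature (e.g., Hartz, 2002; Roussos et al., 2007) rather than reproduced from this article, the RUM item response function referenced in Section 2.4 is standardly written as:

```latex
P(X_{ij}=1 \mid \boldsymbol{\alpha}_j, \theta_j)
  = \pi_i^{*}
    \prod_{k=1}^{K} \bigl(r_{ik}^{*}\bigr)^{(1-\alpha_{jk})\, q_{ik}}
    \, P_{c_i}(\theta_j)
```

where π*_i is the probability that an examinee who has mastered all attributes required by item i applies them correctly, r*_ik (0 < r*_ik < 1) is the penalty applied when required attribute k is not mastered, q_ik is the Q-matrix entry for item i and attribute k, α_jk is examinee j's mastery of attribute k, and P_{c_i}(θ_j) is a Rasch-type term absorbing skills not specified in the Q-matrix.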
which is similar to instinct. I didn't find instinct. Again finding key words; animal communication in the passage.

The connection between text ideas and key words is evaluated in terms of success in solving the items. In this example, Arezoo, Nazanin, Mojtaba, and Ava all make use of context clues to help them deduce the meaning of the text. Sometimes looking for certain words requires reading the entire passage, or seeking key words in parts of the passage.

Example 2. Inferencing by referring to background knowledge. With this process readers make inferences about the author's intentions or the lexical meaning of words in the text based on their background knowledge and personal experience. Inferential comprehension is demonstrated by the reader through the process of referring to personal knowledge, his/her own intuition, or personal experiences. The following are some excerpts of this process from the transcribed and translated verbal reports.

Saeideh: I read something and the concept is highlighted in my mind. I remember it from an incident that happened to me.
    I didn't know the meaning of the word, but since I knew what the passage was talking about and I had heard it before, I guessed the meaning.
    Because uhh… I know what is happening, I used my own knowledge.

Behnam: This one is really easy for me because I learned it before.

Moein: It happened for one of my relatives once and I remembered that day. It was very bad.
    Many words I have seen before but I don't know their exact meaning, but it helps me to guess from the context.

Saman: I read this word exactly yesterday, and I remembered it.

The possibility that an examinee is acquainted with a topic or has had previous experience with it will make the task easier for him/her, and it will benefit those who have background knowledge of a certain topic. Sometimes there is a bias in reading comprehension tests because some students are knowledgeable about a topic while others fall behind in responding because they are clueless and have not read up on the topic, despite the fact that they might be knowledgeable about other topics. This was the case with those examinees who had previous experience of heat stroke and heat exhaustion and were easily able to respond to the items, while those without any background knowledge fell behind. Nevertheless, this is the nature of any reading comprehension exam.

In the next phase, a group of content raters served as developers of the attributes that reflect the main language skills necessary for successful performance on each item. This group included 6 PhD students at the University of Tehran studying Teaching English as a Foreign Language, 3 females and 3 males, who all had experience in applied linguistics research and in teaching reading comprehension courses. They reviewed each test item and the previously selected attributes and then decided whether or not each attribute was necessary for answering the item correctly. Each expert read the passages and performed the rating task independently. They identified the skills for each item and also made annotations about the evidence on which they based their assessments. They examined the extent to which the specified attributes are distinguishable from each other and whether the raters agree with each other on the attributes associated with each test item (a type of inter-rater reliability effort).

With reference to Li (2011) and Jang (2005), a coding scheme was created based on the cognitive framework and think-aloud data. An initial Q-matrix was constructed based on evidence from the think-aloud verbal reports, the expert ratings, and the coding scheme extracted from the literature. However, a frequently encountered problem here is that students' verbal reports may not agree with expert ratings (Jang, 2005; Leighton & Gierl, 2007). When this discrepancy occurred in the present study, the think-aloud verbal reports were regarded as the primary evidence, because the verbal reports captured the real-time reading process to a certain extent and were thus considered more reliable and authentic. Nevertheless, the value of expert rating should not be underestimated, because it provides important evidence from a different perspective. Furthermore, when it was difficult to determine whether a certain skill should be retained for an item, the skill was usually retained. This is because the resulting RUM calibration provides evidence concerning the importance of the skill for the item; that is, if the calibration showed the skill to be insignificant, it could be dropped at the later stage. Via RUM analyses, the test-takers' (N = 1986) test performance data were used to detect relevant item characteristics and then to refine the Q-matrix through statistical analysis.

The list of reading attributes (as shown in Table 2) was developed based on previous literature (Birch, 2002; Cohen & Upton, 2006; Fletcher, 2006; Francis et al., 2006; Jang, 2009; Rupp, Ferne, & Choi, 2006), content expert judgment (6 content raters), and examinees' think-aloud verbal protocols (13 students). This list of attributes largely consisted of language knowledge and strategies required when a learner's language ability interacts with the written text. Language knowledge was further categorized into grammatical knowledge (i.e., grammatical form and semantic meaning) and pragmatic knowledge, while strategic competence was composed of metacognitive strategies (i.e., assessing the situation, monitoring, evaluating) and cognitive strategies (i.e., comprehension, memory, retrieval). Each of these categories consisted of a number of L2 reading attributes (e.g., knowledge, strategies). In addition, language knowledge and strategic competence were presumed to interact with each other (Bachman & Palmer, 1996). Considering that strategic competence manages all language use, which includes using language knowledge (Bachman & Palmer, 1996), test-takers needed to utilize strategies to invoke their language knowledge. In this study, the list included those L2 reading attributes considered to be involved in the reading process, comprising four knowledge-based attributes (attributes 1–4 in Table 2) and five reading strategies (attributes 5–9 in Table 2).

Table 2
Attributes of L2 reading ability.

A1  determining word meaning from context
A2  determining word meaning out of context
A3  comprehending text-explicit info
A4  comprehending text-implicit info
A5  skimming
A6  summarizing
A7  inferencing
A8  applying background knowledge
A9  inferring major ideas or writer's purpose
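The step in which independent rater judgments become a tentative Q-matrix row can be sketched in Python. The judgments below are hypothetical, not the study's actual ratings, and the majority threshold of 4 out of 6 raters is an illustrative aggregation rule rather than the article's documented procedure.

```python
import numpy as np

# Hypothetical judgments of 6 content raters for one test item:
# 1 = rater judged the attribute (A1..A9 from Table 2) necessary.
ratings = np.array([
    [1, 0, 1, 0, 0, 0, 1, 0, 0],   # rater 1
    [1, 0, 1, 0, 0, 0, 1, 0, 0],   # rater 2
    [1, 0, 1, 0, 1, 0, 1, 0, 0],   # rater 3
    [1, 0, 0, 0, 0, 0, 1, 0, 0],   # rater 4
    [1, 0, 1, 0, 0, 0, 0, 0, 1],   # rater 5
    [1, 0, 1, 0, 0, 0, 1, 0, 0],   # rater 6
])

# Keep an attribute in the item's Q-matrix row when a majority
# (here, at least 4 of 6 raters) judged it necessary.
q_row = (ratings.sum(axis=0) >= 4).astype(int)
print(q_row)   # A1, A3, and A7 survive for this item
```

In the study itself, such tentative rows were further checked against think-aloud evidence and refined statistically through the RUM calibration, so a rater-majority row is only a starting point.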
6. Procedure

The study was carried out in three stages: (1) developing the reading comprehension test based on a cognitive assessment framework, which includes constructing items that target the specified attributes; (2) constructing and validating a Q-matrix of i test items by K reading attributes; and (3) RUM-based statistical analysis of data resulting from a large-scale test administration. The first stage of the study was to carry out an extensive study of the literature pertaining to cognitive diagnostic assessment and test development for the purpose of developing an L2 reading comprehension test based on a cognitive diagnostic framework. From the review of the literature, an initial conceptualization of test specifications based on Evidence-Centered Design as put forth by Mislevy (1996) was used in developing the 20 MC items for the test. The author spent one term teaching the general English course to gather data to create a list of suggested attributes, which were then used to develop the test items. Focusing on test specification for developing the test items required an extensive literature survey to better understand the ECD framework, which aims to gather evidence to support claims about an individual. As Mislevy (1994) puts forth, this framework is composed of three models or components: the student model, the evidence model, and the task model. The student model includes the types of inferences that are made about a student. In this study, after working on reading comprehension skills and strategies in class, reading passages were assigned to students in groups and they were asked to respond to reading comprehension questions. Observations were made about the students' responses and the skills and strategies that were more frequently used in their responses. Evidence was collected through group work exercises and feedback from students. This supports the claim that the evidence model gives evidence to support the inferences made, and the task model provides tasks that elicit usable pieces of evidence (Mislevy, 1994).

It should be emphasized that the process of assessment design is not necessarily linear (Roussos et al., 2007); hence, although defining attributes typically precedes task construction, the feasibility of constructing tasks that measure a set of attributes may necessitate going back to how the attributes have been defined. In this process, task construction informs attribute definition, which will then inform the next phase of task construction.

In this study, data gathered from observations of student responses during the course, evidentiary reasoning, specifically the task model by

Table 3
Structure of the reading test.

Passage    Topic                      Content structure          Length     Number of items
Passage 1  Animal Intelligence        Description/Informative    323 words  7
Passage 2  Communication Satellites   Causation/Process          318 words  7
Passage 3  Body Temperature           Compare/Contrast           257 words  6

items, fourteen of the items were related to identifying semantic meaning, and six items required identifying the pragmatic meaning of the text. In this study, reading for identifying semantic meaning refers to reading for literal and intended meaning. Reading for literal meaning refers to identifying information conveyed in the text through paraphrasing or translating, and reading for intended meaning refers to obtaining the meaning of sentences by making connections between them. Pragmatic meaning refers to contextualized implied meaning, such as contextual, sociocultural, or psychological meanings. Therefore, reading for pragmatic meaning refers to deriving a deeper understanding of the text by combining the information with readers' specific prior knowledge and experiences (Purpura, 2004).

8. Data analyses

Both qualitative and quantitative analyses were carried out in the process of test development and Q-matrix construction. Initially, qualitative analyses were carried out to specify the reading skills assessed by the reading test. For this, various classifications of reading skills and strategies in the literature were studied. Then, think-aloud verbal protocols were analyzed qualitatively to help understand the characteristics of the cognitive processes and skills used by the students and to identify primary reading skills. Six content raters' judgments were also used to examine to what extent the specified skills are necessary to correctly answer the test items. Reading test data were analyzed together with the Q-matrix using the Arpeggio Suite software (DiBello & Stout, 2008b). The first step in the RUM analysis is the analysis of Markov Chain Monte Carlo (MCMC) convergence to guarantee that model parameters had a stable value (Roussos et al.,
using tasks to elicit usable pieces of evidence during the course, were all 2007). Arpeggio software uses a Bayesian modeling approach with a
used in constructing the reading comprehension test items. After the Markov Chain Monte Carlo (MCMC) algorithm. “The MCMC estimation
test was developed, it was administered to 1986 students in general provides a jointly estimated posterior distribution of both the item
English courses at the University of Tehran. Data from this test ad- parameters and the examinee parameters (given the test data), which
ministration was to be used for RUM based statistical analysis. The may provide a better understanding of the true (estimation’s) standard
second stage was constructing a Q-matrix based on the chosen L2 errors involved” (Patz & Junker, 1999). MCMC convergence is mainly
reading attributes and the 20 constructed items. In order to construct evaluated by visually examining the time–series chain plots and esti-
the Q-matrix, initially a list of L2 reading attributes was specified based mated post-burn in posterior probability density plots, which, if con-
on the previous literature, the 13 participants’ think-aloud verbal re- vergence has occurred, should be roughly unimodal and roughly bell-
ports, and the 6 content experts’ judgment. The third and final phase shaped. By using the results from the Arpeggio software, the R statis-
consisted of the reading test data being analyzed using the Q-matrix and tical package (The R Foundation for Statistical Computing, 2010) pro-
the Arpeggio suite software (Stout, 2008a, 2008b; Stout, 2008a, duces both the chain plot (time series plots) and the density plot for
2008b), which implements the RUM. This also resulted in an empirical each parameter. The chain plot graphically indicates the degree to
validation of the proposed Q-matrix. which the chain has converged to a desired value after searching the
space. Ideally, a chain should sample values over the parameter space,
7. Instruments but also around a certain location. That is, a chain should converge to a
specific value across time and reach a stationary state (Johnson, 2006).
The test used was the reading comprehension test developed for this Usually the first portion of the chain is used as a burn-in and is dis-
study. The test was developed based on a cognitive diagnostic frame- carded. For example, in a chain of 5000, the first 1000 steps will be
work, including three different passages along with 20 multiple-choice used as the burn-in.
items. Table 3 shows the structure of the developed reading compre- In this study, since a total of nine L2 reading attributes were ex-
hension test. The passages varied content-wise (descriptive/in- amined, nine pk values were produced from the RUM analysis and a
formative, causation/process, and comparison/contrast), topic-wise chain length of 100,000 was initially used to reach MCMC convergence.
(animal intelligence, communication satellites, and body temperature), It is generally stated that the more complex the model (as determined
and by length (323, 318 and 257 words). In addition, among the 20 by the number of item parameters—which, for RUM, is related to the
F. Ranjbaran, S.M. Alavi 6WXGLHVLQ(GXFDWLRQDO(YDOXDWLRQ²
number of skills and the average number of skills per item), the longer Table 4
the required chain length (Roussos, Templin, & Henson, 2007, p. 299). Q-matrix of Attributes.
Chain plots and density plots of the nine reading attributes indicated
I/A A1 A2 A3 A4 A5 A6 A7 A8 A9
convergence of the parameter estimates.
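The convergence screening just described is carried out inside Arpeggio and visualized in R; the fragment below is only an illustrative Python sketch of the same idea, using a simulated chain and arbitrary tolerances (it is not the authors' code or data). A transient early phase is discarded as burn-in, and the retained draws are checked for stationarity by comparing the means of their two halves.

```python
import numpy as np

# Illustrative sketch only: a fake MCMC chain for one parameter that
# wanders at first and then samples stationarily around its target (0.7).
rng = np.random.default_rng(0)
n_steps = 5000
chain = 0.7 + 0.05 * rng.standard_normal(n_steps)
chain[:500] += np.linspace(0.3, 0.0, 500)      # early "search" transient

# Discard the first portion as burn-in (mirroring the 1000-of-5000 example).
burn_in = 1000
retained = chain[burn_in:]

# Crude stationarity check: after convergence, the first and second halves
# of the retained draws should have nearly identical means.
half = len(retained) // 2
drift = abs(retained[:half].mean() - retained[half:].mean())
print(f"posterior mean ~ {retained.mean():.3f}, half-mean drift = {drift:.4f}")
```

In practice one would also inspect the chain and density plots themselves, as the text describes, rather than rely on a single summary number.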
Among the different parameters, the convergence of examinees' probability of mastery for each attribute (pk) was evaluated first. For each attribute, this parameter estimates the proportion of students mastering the attribute, information useful for classroom instructional purposes. Also evaluated were three item-based parameters: the item difficulty for item masters (πi*), the item-attribute-specific discrimination power (rik*), and the item completeness index (ci). In fact, ci is inversely proportional to the amount of model-based examinee responding not captured by the π* and r* parameters.

Here an "item master" is defined as an examinee who possesses all the attributes (skills) required by the item in question. In this section, examinees' L2 reading performance on the reading comprehension test was evaluated in terms of their mastery and non-mastery of the L2 reading attributes.

Fit statistics were calculated to evaluate the fit of the model to the data. The two types of fit statistics measured are FUSIONStats and IMStats (item mastery statistics), both explained below. The first compares the observed proportion of examinees getting an item correct with the predicted proportion; a low difference between the two p-values indicates a good fit to the data. IMStats compare the observed performance of item masters and item non-masters (defined as examinees lacking at least one of the item's required attributes) at the item level. Sizable differences between the performance of item masters versus item non-masters on most items indicate a good fit of the model and the specified Q-matrix to the data. Importantly, a sizable difference for a specific item is an indication of that item's high quality. Also, the reliability of the RUM was examined by analysing the Correct Classification Rate (CCR), the consistency of classification of examinees into masters versus non-masters of attributes. That is, CCR(k) measures the proportion of masters of attribute k classified as masters and the proportion of non-masters classified as non-masters, at the attribute level for each k; CCR denotes the average of the CCR(k)s. These are calculated in practice via a simulation process; for details see Roussos et al. (2007) and DiBello and Stout (2008a). In the final step, the examinee population's strengths and weaknesses in L2 reading ability at the attribute level were evaluated through the probability of mastery for each attribute (pk).

9. Results

9.1. Q-matrix development

Results from think-aloud verbal protocols and content raters' judgments, in conjunction with consultation of the L2 reading literature, were analyzed to develop the list of reading attributes. This list was then used to develop the Q-matrix. In the Q-matrix, the rows represent items and the columns correspond to the attributes: a 1 means that the particular attribute is required for the completion of the item, whereas a 0 means it is unnecessary. Since the test contained 20 items, the Q-matrix for the current study represented the relationship between the 20 items and their corresponding attributes.

As shown in Table 4, the rows of the Q-matrix indicate the 20 items from the reading comprehension test, and the columns indicate the nine reading attributes.

Table 4
Q-matrix of Attributes.

Item  A1  A2  A3  A4  A5  A6  A7  A8  A9
1     1   0   1   0   1   0   0   0   0
2     1   0   1   0   1   0   0   0   1
3     1   0   1   0   1   0   0   0   0
4     0   0   1   0   1   0   0   1   0
5     0   0   0   1   1   0   1   0   0
6     1   0   1   0   1   0   0   0   0
7     1   0   0   0   1   0   1   0   0
8     1   1   0   0   0   0   0   0   1
9     0   1   1   0   1   0   0   0   0
10    1   1   0   0   0   0   0   0   0
11    0   0   1   0   1   0   0   0   0
12    1   1   0   0   0   0   0   1   0
13    0   0   0   1   1   0   1   0   0
14    0   0   1   1   0   1   1   0   1
15    0   0   1   0   1   0   0   0   0
16    1   1   0   0   0   0   0   1   0
17    0   0   1   0   1   0   0   0   0
18    0   0   0   1   1   0   1   0   0
19    1   0   0   0   1   1   0   0   0
20    0   0   0   1   0   1   1   0   1

Attributes that three or more of the 6 raters agreed upon were considered essential for the item and were included in the Q-matrix. According to Hartz (2002), attributes measured by fewer than three items do not provide statistically meaningful information and can therefore be merged with similar attributes or deleted from the Q-matrix. The nine attributes obtained are listed in Table 2. Due to the complicated nature of reading, numerous reading attributes are involved in completing each item (Alderson, 2000; Urquhart & Weir, 1998), as was the case in this study. Also, it is not claimed that this choice of 9 attributes is uniquely better than some other choice. However, it seems clear that these 9 have been well chosen and, further, provide good reliability and, as such, cognitively and instructionally useful information.

9.2. Fusion model analysis

To statistically assess the 9 identified attributes in the initial Q-matrix and evaluate the 20 written items, RUM analysis was conducted using the Arpeggio software. The item parameter estimates were analyzed to evaluate the quality of the reading test items. As discussed above, three types of item parameter estimates were examined in the current study: item difficulty (πi*), item-attribute discrimination (rik*), and the completeness index (ci). These Fusion Model parameters have important roles because they not only provide diagnostic information about each test taker and item, but also highlight the properties of the Q-matrix and possible Q-matrix misspecifications (Aryadoust, 2011).

The πi* parameter is the probability that an examinee who has mastered all the Q-matrix-required skills for item i will correctly apply all these skills in solving item i. A high value, above 0.6, indicates that examinees who have mastered all the necessary attributes have a good chance of responding correctly to the item (Johnson, 2006). The item discrimination index rik* shows how well the item discriminates masters from non-masters. A reasonably well-defined Q-matrix and a reasonably well-written item combine to produce rik* estimates below the 0.8 cutpoint, indicating sufficient discrimination of masters from non-masters, as generally held by CDM experts, while values below 0.5 indicate skills highly necessary for answering the question correctly. A low value, below 0.3, indicates that the item discriminates very well between masters and non-masters of the attribute and hence that the item depends strongly on the corresponding attribute (Johnson, 2006).

In this study, the average πi* obtained was 0.833, indicating that the set of identified L2 reading comprehension skills for the items was generally adequate and reasonable. Results indicate that three items (i.e., items 2, 5, and 13) had πi* values lower than 0.6, which means that they were too difficult and may need some modification. However,
items 2 and 13, with πi* values of 0.58 and 0.56, respectively, were less of a concern than item 5, with a πi* value of 0.38, which should be replaced. This illustrates that estimated item parameters can drive decisions to replace one or more items, which, when carried out, will improve the diagnostic power of the test. The item parameter estimates produced in the current study are presented in Table 5.

Table 5
Item Parameter Estimates. (Columns: Item, π*, r*1 … r*9, ci.)

Examining the table, it is observed that all r* values were below 0.9, indicating that all the items discriminated at least somewhat between masters and non-masters of the attributes. Most items, all except 7, 16, 17, 18 and 19, have at least one r* < 0.8, hence discriminating sufficiently on at least one attribute. For the five excepted items, test-takers' responses, whether from masters or non-masters of the attributes, were only slightly influenced by the specified attributes. When such a result is obtained, revising these items, or replacing them with items that better discriminate between masters and non-masters of the specified attributes, will create a better-performing diagnostic test. Indeed, one fifth of the test items provide essentially no diagnostic power, so replacing them with effective items would make a major difference in the discrimination power of the test. The r* parameter can therefore be used to evaluate the quality of an item in terms of its discriminatory power (Leighton & Gierl, 2007).

The completeness index ci shows the degree to which the attributes specified in the Q-matrix are complete in describing the skills needed to respond successfully to the ith item; it ranges from 0 to 3. Values close, or even fairly close, to zero, say ≤ 1.25 (a somewhat arbitrary cutoff), indicate that the item has low completeness; that is, other, unspecified skills are important for answering item i correctly. ci values somewhat close to 3 indicate that the attributes specified in the Q-matrix suffice to explain examinee responses to that item (von Davier, DiBello, & Yamamoto, 2006). From a brief glance at the table, it is observed that items 1–14 and 19 have an overall high completeness index of ≥ 1.5 (in all but two cases ≥ 2), while items 15–18 and 20 show a low completeness index of 0.65 to 1.05. This indicates that for items 15–18 and 20, skills besides the specified attributes seriously impact examinee responding. Interestingly, items 16 and 18 are virtually useless in discriminating among the specified attributes. Item 19 is likewise useless in discriminating among the specified attributes, but it is seemingly not influenced by attributes lying outside the specified latent space, as its ci = 1.72 suggests; the result is that the item is both useless for our purposes and also very easy. Item 7 has almost the same story as item 19. Interestingly, items 16–18 and 20 do suggest the possibility that the attribute space is incomplete and that one or more attributes could be added to the 9. The above discussion demonstrates that item parameter estimates provide substantially meaningful information regarding the quality of the items; thus the next iteration of the test can be considerably improved, based on the RUM analysis.

9.3. Analysis of model fit

In order to examine the fit of the RUM to the data, two types of goodness-of-fit measures were used: (1) FUSIONStats and (2) item mastery statistics (IMStats). FUSIONStats compares, for each item, the observed item p-value (the proportion of examinees getting the item correct) with the estimated item p-value (the model-estimated proportion of examinees that should get the item correct). A small difference between the two for most, and ideally all, of the items suggests a good fit to the data. Table 6 shows the observed and estimated item p-values for each item.

Table 6
Observed and estimated item p-values.

Item  Observed p-value  Estimated p-value  Absolute difference
1     0.660             0.670              0.010
2     0.461             0.498              0.037
3     0.825             0.847              0.022
4     0.615             0.649              0.034
5     0.314             0.332              0.018
6     0.554             0.607              0.053
7     0.872             0.898              0.026
8     0.629             0.722              0.093
9     0.713             0.730              0.017
10    0.537             0.602              0.065
11    0.700             0.737              0.037
12    0.580             0.625              0.045
13    0.433             0.456              0.023
14    0.603             0.671              0.068
15    0.630             0.673              0.043
16    0.678             0.719              0.041
17    0.413             0.439              0.026
18    0.595             0.642              0.047
19    0.863             0.887              0.024
20    0.482             0.570              0.088
Mean absolute difference: 0.040
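As a minimal sketch of the FUSIONStats screening just described (using a handful of rows copied from Table 6, with the 0.05 flagging threshold suggested by Roussos et al., 2007; this is illustrative, not the authors' code):

```python
# Observed vs. model-estimated item p-values for a few items from Table 6.
observed  = {1: 0.660, 2: 0.461, 8: 0.629, 14: 0.603, 20: 0.482}
estimated = {1: 0.670, 2: 0.498, 8: 0.722, 14: 0.671, 20: 0.570}

# An absolute difference above 0.05 flags an item for possible misfit.
abs_diff = {i: abs(observed[i] - estimated[i]) for i in observed}
flagged = [i for i, d in abs_diff.items() if d > 0.05]
mean_abs_diff = sum(abs_diff.values()) / len(abs_diff)

print("flagged items:", flagged)
print(f"mean absolute difference over this subset: {mean_abs_diff:.3f}")
```

Run over all 20 rows of Table 6, the same computation approximately reproduces the reported mean absolute difference of 0.040; the subset above, which deliberately includes the worst-fitting items, yields a higher mean.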
The absolute difference between each observed and estimated item p-value should be below the suggested value of 0.05 for all items (Roussos et al., 2007). In this case, however, the value was higher than 0.05 for four items out of twenty: item 8 (0.093), item 10 (0.065), item 14 (0.068), and item 20 (0.088). This suggests these items may have moderate lack-of-fit problems. However, the mean absolute difference between the p-values was low, at 0.04, suggesting good overall fit. Fig. 1 graphically depicts the observed versus estimated item p-values and suggests that the RUM fits the data well.

Fig. 1. Observed versus estimated item p-values for items 1–20.

In addition, item mastery statistics (IMStats) were used to compare the observed performance of item masters and item non-masters at the item level. Three values are considered in evaluating IMStats: phat(m), the observed proportion of item masters responding correctly to an item; phat(nm), the observed proportion of item non-masters responding correctly; and pdiff, the average difference between phat(m) and phat(nm) across items. In this study, as shown in Table 7, the average phat(m) across all items was 0.774, indicating that the average observed probability of a correct response by item masters was relatively high, at 77.4%. By contrast, the average phat(nm) was 0.322, indicating that the average observed probability of a correct response by item non-masters was much lower, at 32.2%. Thus the pdiff was 0.452, indicating that item masters outperformed item non-masters of attributes by 45.2% on average. This high value implies a good fit between the estimated model and the observed data and, moreover, indicates the strong diagnostic power of this L2 reading test. Further, it indicates the usefulness of the Fusion modeling approach for carrying out cognitive diagnostic studies such as the one reported here.

Table 7
Probability of Correctly Responding to an Item.

Item     phat(m)  phat(nm)  phat(m) − phat(nm)
1        0.841    0.398     0.443
2        0.543    0.329     0.214
3        0.957    0.609     0.348
4        0.772    0.380     0.392
5        0.358    0.197     0.161
6        0.788    0.278     0.510
7        0.982    0.565     0.417
8        0.958    0.181     0.777
9        0.960    0.312     0.648
10       0.718    0.209     0.509
11       0.919    0.334     0.585
12       0.780    0.128     0.652
13       0.552    0.169     0.383
14       0.864    0.091     0.773
15       0.812    0.368     0.444
16       0.790    0.426     0.364
17       0.513    0.301     0.212
18       0.751    0.393     0.358
19       0.938    0.753     0.185
20       0.684    0.012     0.672
Average  0.774    0.322     pdiff = 0.452

In addition, the reliability of the RUM was examined by evaluating the Correct Classification Rate (CCR) index, which, as discussed above, is the power to correctly classify examinees. The output file from the Arpeggio Tabulator, classfile.csv, reports the estimated correct classification rate CCR(k) for each skill k, with CCR the average over all k. The CCR theoretically ranges between zero and one. In this data set, the CCR was high, at 0.826, indicating high reliability of the RUM and, in particular, the capacity to classify examinees correctly over 80% of the time.

The population mastery probability of each of the reading attributes was also analyzed for the students' L2 reading performance. For the overall test-taking group, the population's probability of mastery for each attribute (pk) was investigated. Table 8 shows the L2 reading attributes and their pk values, which range from a high of 0.769 (deducing word meaning from context) to a low of 0.627 (comprehending text-implicit information). This indicates that 76.9% of the students had mastery of deducing word meaning from context, making it the easiest attribute, while 62.7% of the students had mastery of comprehending text-implicit information, making it the most difficult attribute. The overall average level of mastery probability for the nine L2 reading attributes was 0.701. These population-level statistics should prove useful for instructional purposes.

Among the four knowledge-related attributes (the first four in Table 8), there were small differences in the overall group's attribute mastery probabilities, ranging from 0.627 for comprehending text-implicit information (pragmatic meaning) to 0.769 for deducing word meaning from context (lexical meaning). This signified that "pragmatic" meaning was the most difficult attribute to master, whereas "lexical" word meaning was the easiest to master on this test.

Table 8
Attribute mastery probability for overall group.

L2 reading attribute                        Mastery probability (pk)
Deducing word meaning from context          0.769
Determining word meaning out of context     0.676
Comprehending text-explicit info            0.662
Comprehending text-implicit info            0.627
Skimming                                    0.722
Summarizing                                 0.698
Inferencing                                 0.754
Applying background knowledge               0.666
Inferring major ideas or writer's purpose   0.737
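The IMStats quantities above can be illustrated with a toy computation; the Q-matrix row, mastery profiles, and item scores below are hypothetical, not the study's data. An examinee is an item master if their profile covers every attribute the Q-matrix requires for the item, and phat(m)/phat(nm) are the observed proportions correct within each group.

```python
import numpy as np

# Hypothetical item over three attributes: requires A1 and A3.
q_row = np.array([1, 0, 1])

# Hypothetical examinee mastery profiles and observed item scores.
profiles = np.array([
    [1, 0, 1], [1, 1, 1], [1, 0, 1],   # possess both required attributes
    [0, 0, 1], [1, 0, 0], [0, 1, 0],   # lack at least one required attribute
])
scores = np.array([1, 1, 0, 0, 1, 0])  # 1 = correct response to the item

# Item masters: profile covers every attribute the Q-matrix row requires.
is_master = (profiles >= q_row).all(axis=1)
phat_m  = scores[is_master].mean()     # proportion correct among masters
phat_nm = scores[~is_master].mean()    # proportion correct among non-masters
print(f"phat(m)={phat_m:.3f}, phat(nm)={phat_nm:.3f}, diff={phat_m - phat_nm:.3f}")
```

Averaging the per-item difference over all items gives the pdiff statistic (0.452 in Table 7).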
Among the five reading strategies, attribute mastery probability ranged from 0.666 (applying background knowledge) to 0.737 (inferring major ideas). In other words, about 67% of examinees had mastery of applying background knowledge, making it the most difficult strategy, while about 74% of the examinees had mastery of inferring major ideas, making it the easiest strategy.

10. Discussion and conclusions

Among the four skills of English language proficiency, reading ability might well be considered the most essential skill for success in the academic world. Hence, it is crucial to assess learners' L2 reading ability accurately, to gauge their overall progress and, in particular, to help enhance specific reading skills where lack of mastery is indicated. The main goal of the current study was to develop an L2 reading test using the ECD framework and based on a cognitive diagnostic model (in our case the RUM). This was done in order to diagnose learners' strengths and weaknesses in L2 reading ability, with the ultimate goal of providing detailed information that can assist teachers and administrators for instructional purposes and thus improve student performance.

Two research questions were pursued in the course of this study. The first concerned the L2 reading attributes necessary for successfully completing each item on the reading test, noting that the 20-item test was carefully designed to assess mastery of each of the 9 attributes judged to capture the reading comprehension latent space. Raters identified the various attributes, consisting of both knowledge and strategy attributes, by referring to the L2 reading literature and students' think-aloud verbal reports. This list of nine reading attributes and the developed test were then organized into an item-by-attribute Q-matrix.

Second, test-takers' performance on the reading test was examined for individual student and group diagnostic purposes. In particular, the test scores were analyzed in conjunction with the initial Q-matrix (see Table 4) using the Fusion model analysis.

Findings of the study suggest that a majority of the items on the test can successfully discriminate between masters and non-masters and are therefore appropriate for CDA. In addition, the list of L2 reading attributes can be used as a framework for future CDA research. Regarding the frequency of reading attributes for each item on the test, the Q-matrix was examined to identify any recurring patterns among the L2 reading attributes. Since language knowledge and strategic competence are believed to interact with each other in constituting the mastery of reading ability (Bachman & Palmer, 1996), it was expected that each item on the reading test would measure at least one knowledge-related attribute and one reading strategy. It was indeed found that almost all items measure one knowledge-related attribute and a minimum of one reading strategy. Since strategies, by definition, manage the use of language knowledge (Bachman & Palmer, 1996), they were assumed necessary to reveal mastered knowledge-related attributes.

Examinees' performances on the attributes underlying the reading comprehension items were evaluated. As Roussos et al. (2007, p. 293) put forth, "A key issue for mastery/nonmastery of diagnostic models is whether the proportion of examinees estimated as masters on each skill is relatively congruent with the user's expectations." RUM analysis was carried out to obtain relationships among the participants' performances on the test items. The Arpeggio suite software provides a number of output files that give specific information regarding examinees' performance on each item of the test. Two of the output files that help answer research question 2 (essentially studying examinee/test attribute relationships) are the classification file (classfile.csv) and the fit report file (fitreports.csv). The fit report file provides fit statistics for how well the model fits the test data, noting that a good fit is required for the usefulness of the RUM-based statistical analyses. The classification file indicates the consistency of classifying the examinees in terms of their mastery or nonmastery of each attribute (i.e., the correct classification rate: CCR). First of all, the model fit the data well for several reasons. First, the mean absolute difference between the observed and estimated item p-values was low, at 0.04 (see Table 6). In addition, the average phat(m) across all items was 0.774, indicating that the average probability of a correct response to an item by item masters was relatively high, at 77.4%; this suggests that the 9 attributes include much of what is required for answering the test's items correctly. The average phat(nm) was 0.322, so the average probability of a correct response by non-masters was much lower, at about 32.2%; this indicates that the items, on average, are "hard" for non-masters. In addition, the pdiff was 0.452, indicating that masters of an item's required attributes outperformed non-masters by 45.2% on average across all items. This high value indicated a good fit between the estimated model and the observed data and, further, suggests strong diagnostic power of the test as modelled by the RUM.

According to Lumley (1993), identifying implicit information (equivalent to inferencing) and synthesizing to draw a conclusion (equivalent to summarizing) were difficult reading attributes compared to vocabulary (similar to identifying word meaning) and identifying explicit information (similar to finding information and skimming). This could be attributed to the fact that inferencing and summarizing are higher-level strategies involving more complex cognitive processing than the other three, which require lower-level processing, as was the case in this study. As an example, summarizing requires readers to first comprehend the overall text and then extract the gist of its meaning. Understanding the gist involves numerous components, such as knowledge of grammar, vocabulary, discourse structure, and various cognitive processes (Birch, 2002); thus the nature of summarizing seems quite complex. In a similar vein, the strategy of inferencing has long been believed to be a challenging one (Fletcher, 2006). In order to make inferences, readers should already have the ability to understand the literal meaning of the text, which makes inferencing more difficult to master. However, the results of this study run contrary to this belief: for the overall group, inferencing stands second at 0.754, with less difficulty. On the other hand, between skimming and summarizing, the latter was more difficult, at 0.698 compared with skimming at 0.722. In fact, as put forth by Urquhart and Weir (1998), skimming involves quickly understanding the surface-level propositional meaning of the text and is considered a less challenging strategy.

Furthermore, comprehending text-implicit information, comprehending text-explicit information, and applying background knowledge were the more difficult attributes, at 0.627, 0.662 and 0.666, respectively. Similarly, deducing meaning from context was the easiest attribute, which is in accordance with the belief that word recognition, which is similar to identifying word meaning, involves lower-level processing (Alderson, 2000). Overall, examinees performed comparatively well on three attributes (i.e., deducing meaning from context, inferring major ideas, and skimming) owing to the nature of these attributes, which require relatively less cognitive processing. The detailed classification student-level reports made possible by the RUM analysis of the L2 reading assessment in this study strongly indicate that L2 reading assessments using this test should be highly beneficial in facilitating learning on the part of individual students, in teacher instructional preparation as the course proceeds, and finally in curriculum development on the part of the teacher. With such detailed reports of test results available, teachers can become aware of students' problematic areas and focus on them in lesson planning and in providing learning materials. Since the reading test developed here was based on a cognitive framework followed by RUM analysis, the problematic items were identified and could be further modified or even replaced. Hence, an item bank can be developed for cognitive diagnostic development items in L2 reading comprehension testing for use during or after a reading comprehension unit is carried out, and thus for future
Buck, G., & Tatsuoka, K. (1998). Application of the rule-space procedure to language testing: Examining attributes of a free response listening test. Language Testing, 15(2), 119–157.
Carrell, P. L., & Grabe, W. (2002). Reading. In N. Schmitt (Ed.), An introduction to applied linguistics (pp. 233–250). London: Arnold.
Cohen, A. D., & Upton, T. A. (2006). Strategies in responding to the new TOEFL reading tasks (TOEFL Monograph No. MS-33). Princeton, NJ: ETS.
DiBello, L. V., Stout, W. F., & Roussos, L. (1995). Unified cognitive psychometric assessment likelihood-based classification techniques. In P. D. Nichols, S. F. Chipman, & R. L. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–390). Hillsdale, NJ: Erlbaum.
DiBello, L. V., Roussos, L. A., & Stout, W. (2006). A review of cognitively diagnostic assessment and a summary of psychometric models. Handbook of Statistics, 26, 979–1030.
DiBello, L. V., Henson, R. A., & Stout, W. F. (2015). A family of generalized diagnostic classification models for multiple choice option-based scoring. Applied Psychological Measurement, 39(1), 62–79.
DiBello, L., & Stout, W. (2008a). Arpeggio documentation and analyst manual. Chicago: Applied Informative Assessment Research Enterprises (AIARE), LLC.
DiBello, L., & Stout, W. (2008b). Arpeggio suite, version 3.1.001 [Computer program]. Chicago: Applied Informative Assessment Research Enterprises (AIARE), LLC.
Embretson, S. E., & Gorin, J. (2001). Improving construct validity with cognitive psy-
Li, H. (2011). Evaluating language group differences in the subskills of reading using a cognitive diagnostic modeling and differential skill functioning approach (Doctoral dissertation). The Pennsylvania State University.
Lumley, T. (1993). The notion of sub-skills in reading comprehension test: An EAP example. Language Testing, 10(3), 211–234.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). Design and analysis in task-based language assessment. Language Testing, 19(4), 477–496 (Special issue: Interpretations, intended uses, and designs in task-based language assessment).
Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439–483.
Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379–416.
Mislevy, R. J. (2006). Cognitive psychology and educational assessment. Educational Measurement, 4, 257–305.
Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
Pellegrino, J. W., Chudowsky, N., & Glaser, R. (Eds.). (2001). Knowing what students know: The science and design of educational assessment (National Research Council's Committee on the Foundations of Assessment). Washington, DC: National Academy Press.
Purpura, J. E. (2004). Assessing grammar. John Wiley & Sons, Inc.
R Development Core Team (2010). R: A language and environment for statistical com-
chology principles. Journal of Educational Measurement, 38(4), 343–368. puting. Vienna, Austria: R Foundation for Statistical Computing ISBN
Fletcher, J. M. (2006). Measuring reading comprehension. Scientific Studies of Reading, 3–900051–07–0, URL: http://www. R-project.org.
10(3), 323–330. Roussos, L. A., DiBello, L. V., Stout, W. F., Hartz, S. M., Henson, R. A., & Templin, J. H.
Francis, D. J., Snow, C. E., August, D., Carlson, C. D., Miller, J., & Iglesias, A. (2006). (2007). The fusion model skills diagnostic system. In J. Leighton, & M. Gierl (Eds.).
Measures of reading comprehension: A latent variable analysis of the diagnostic as- Cognitive diagnostic assessment for education: Theory and applications (pp. 275–318).
sessment of reading comprehension. Scientific Studies of Reading, 10(3), 301–322. New York, NY: Cambridge University Press.
Gass, S. M., & Mackey, A. (2000). Stimulated recall methodology in second language research. Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification
Routledge. models: A comprehensive review of the current state-of-the-art. Measurement, 6(4),
Goodman, D. P., & Hambleton, R. K. (2004). Student test score reports and interpretive 219–262.
guides: Review of current practices and suggestions for future research. Applied Rupp, A. A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with
Measurement in Education, 7(2), 145–220. multiple-choice questions shapes the construct: A cognitive processing perspective.
Gu, Z. (2011). Maximizing the potential of multiple-choice items for cognitive diagnostic as- Language Testing, 23(4), 441–474.
sessment. University of Toronto [Doctoral dissertation]. Rupp, A. A., Templin, J., & Henson, R. A. (2012). Diagnostic measurement: Theory, methods,
Hartman, H. J. (2001). Developing students’ metacognitive knowledge and skills. In H. J. and applications. Guilford Press.
Hartman (Ed.). Metacognition in learning and instruction: Theory, Research and Practice Sawaki, Y., Kim, H. J., & Gentile, C. (2009). Q-matrix construction: Defining the link
(pp. 33–68). Dordrecht, the Netherlands: Kluwer. between constructs and test items in large-scale reading and listening comprehension
Hartz, S. M. (2002). A Bayesian framework for the unified model for assessing cognitive assessments. Language Assessment Quarterly, 6(3), 190–209.
abilities: Blending theory with practicality. Dissertation Abstracts International: Section Sheehan, K., & Mislevy, R. (1990). Integrating cognitive and psychometric models to
B: The Sciences and Engineering, 63(2-B), 864. measure document literacy. Journal of Educational Measurement, 27, 255–272.
Huang, T. W., & Wu, P. C. (2013). Classroom-based cognitive diagnostic model for a Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational
teacher-made fraction- decimal test. Educational Technology & Society, 16(3), measurement. American Council on Education.
347–361. Stiggins, R., Arter, J., & Chappuis, S. (2004). Classroom assessment for student learning:
Jang, E. E. (2005). A validity narrative: Effects of reading skills diagnosis on teaching and Doing it right–using it well. Dover, NH: Assessment Training Institute.
learning in the context of NG TOEFL. [Available from ProQuest Dissertations and Svetina, D., Gorin, J. S., & Tatsuoka, K. K. (2011). Defining and comparing the reading
Theses database. (AAT 3182288)]. comprehension construct: A cognitive-Psychometric modelling approach.
Jang, E. E. (2009). Demystifying a Q-matrix for making diagnostic inferences about L2 International Journal of Testing, 11(1), 1–23.
reading skills. Language Assessment Quarterly, 6(3), 210–238. Tatsuoka, K. K. (1995). Architecture of knowledge structure and cognitive diagnosis: A
Johnson, J. F. (2006). Diagnosing skill mastery in the national assessment of educational statistical pattern recognition and classification approach. In P. D. Nichols, S. F.
progress: Applications of the Fusion Model. [Available from ProQuest Dissertations and Chipman, & R. L. Brennan (Eds.). Cognitively diagnostic assessment (pp. 327–361).
Theses database. (AAT 3223309)]. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kim, A. Y. A. (2015). Exploring ways to provide diagnostic feedback with an ESL pla- Templin, L. (1958). Generalized linear mixed proficiency models for cognitive diagnosis.
cement test: Cognitive diagnostic assessment of L2 reading ability. Language Testing [Available from ProQuest Dissertations and Theses database. (AAT 3160960)].
[0265532214558457]. Toulmin, S. (1958). The uses of argument. Cambridge, UK: Cambridge University Press.
Lee, Y. W., & Sawaki, Y. (2009). Application of three cognitive diagnosis models to ESL Urmston, A., Raquel, M., & Tsang, C. (2013). Diagnostic testing of Hong Kong tertiary
reading and listening assessments. Language Assessment Quarterly, 6(3), 239–263. students’ English language proficiency: The development and validation of DELTA.
Leighton, J. P., & Gierl, M. J. (2007). Cognitive diagnostic assessment for education: Theory Hong Kong Journal of Applied Linguistics, 14(2), 60–82.
and applications. Cambridge University Press. Urquhart, S., & Weir, C. J. (1998). Reading in a second language: Process, product and
Leighton, J. P., Gierl, M. J., & Hunka, S. M. (2004). The attribute hierarchy method for practice. New York: Longman.
cognitive assessment: A variation on tatsuoka's rule-Space approach. Journal of de la Torre, J. (2009). A cognitive diagnosis model for cognitively based multiple-choice
Educational Measurement, 41(3), 205–237. options. Applied Psychological Measurement, 33(3), 163–183.
Li, H., & Suen, H. K. (2013). Constructing and Validating a Q-Matrix for Cognitive von Davier, M., DiBello, L., & Yamamoto, K. Y. (2006). Reporting test outcomes with models
Diagnostic Analyses of a Reading TestConstructing and validating a Q-Matrix for for cognitive diagnosis (ETS Research Rep. No. RR-06-28)Princeton, NJ: ETS.
cognitive diagnostic analyses of a reading test. Educational Assessment, 18(1), 1–25.