Creating a Brief Rating Scale for the Assessment of Learning Disabilities
Abstract
The purpose of the present studies was to provide the means to create brief versions of instruments that can aid the
diagnosis and classification of students with learning disabilities and comorbid disorders (e.g., attention-deficit/hyperactivity
disorder). A sample of 1,108 students with and without a diagnosis of learning disabilities took part in Study 1. Using
information from modern theory methods (i.e., the Rasch model), a scale was created that included fewer than one third of
the original battery items designed to assess reading skills. This best item synthesis was then evaluated for its predictive and
criterion validity with a valid external reading battery (Study 2). Using a sample of 232 students with and without learning
disabilities, results indicated that the brief version of the scale was as effective as the original scale in predicting reading
achievement. Analysis of the content of the brief scale indicated that the best item synthesis involved items from cognition,
motivation, strategy use, and advanced reading skills. It is suggested that multiple psychometric criteria be employed
in evaluating the psychometric adequacy of scales used for the assessment and identification of learning disabilities and
comorbid disorders.
Keywords
Rasch model, psychometrics, learning disabilities, rating scale, item analysis, classical test theory, modern theory methods,
item response theory
Assessing students with learning disabilities (LD) is an important challenge for the field of LD, particularly more so given the fact that recent identification criteria (i.e., the responsiveness-to-intervention model; Fuchs & Deshler, 2007; Swanson, 2008; Vaughn & Fuchs, 2003) rely heavily on the use of curriculum-based measures (Deno, 1989) compared to normative assessments (as the severe discrepancy model posited; see Stanovich, 1991; Swanson, 1991; see Note 1). That is, during implementation of the model that depended on a significant discrepancy between students' potential and achievement, the gold standard came from standardized normative assessments. With the newly formed response to intervention (RTI), however, teachers need to construct frequent tests based on the curriculum to monitor responsiveness to a valid treatment (Compton, 2006). In fact, the whole RTI concept is based on frequent and valid assessments (Burns, Dean, & Klar, 2004). Thus, the need to have reliable and valid measurements of students' competencies at various skills is imperative, whereas their absence will contribute confusion, false positive and negative cases, psychological distress, and despair. As Ysseldyke, Burns, Scholin, and Parker (in press) stated, "Dramatic reforms in assessment practices within special education have occurred over the past 30 years, but the practice has not yet caught up to research." One purpose of the present studies is to provide a psychometric technology that will lead to valid assessment in LD.

1University of Crete, Rethymno, Greece
2University of Thessaly, Volos, Greece

Corresponding Author:
Georgios Sideridis, University of Crete, Galou, Rethymno 74100, Greece
Email: sideridis@psy.soc.uoc.gr

116 Journal of Learning Disabilities 46(2)

Importance of Valid Assessments in LD

Teacher ratings are often used for the assessment and identification of LD (Goodman & Webb, 2006; Ritter, 1989). Teachers have proven to be good and consistent judges of
students' academic skills and competencies (Podell & Soodak, 1993; Rescorla et al., 2007). Nevertheless, teachers are as good judges as the instruments they use. That is, use of invalid instruments would nonetheless lead to false positive or false negative classifications or diagnoses because of the potential error contributed by the testing conditions, the person's preconditions, and biases on the part of the teachers. Particularly with regard to the latter, Cassidy (2008) argued that teacher subjectivity can potentially contribute to valid assessments and differentiated it from invalid subjectivity. Thus, as much as the field relies on teacher ratings for valid identification of LD, more so do rating scales need to be accurate and reflective of true student abilities, rather than measurement error. In the present studies, a rating scale that provides a screening for LD is evaluated for its validity using a small portion of its full potential. The importance of this endeavor is discussed below.

Importance of Creating Brief Scales for Assessing Skills and Competencies

There are several reasons that necessitate brief assessments in LD. First, fatigue is an issue that should be taken into consideration, as students with LD may allocate unnecessary resources (oftentimes inappropriate) when putting forth effort relating to a skill. In fact, studies have pointed to the fact that students with LD tire easily (Kripke, Lynn, Madsen, & Gay, 1982; Morgan, 1977). Furthermore, fatigue during a task will likely be associated with effort withdrawal and poor achievement outcomes.

A second reason for the assessment of brief scales lies with the fact that students with LD oftentimes possess comorbid characteristics of attention deficit and/or attention-deficit/hyperactivity disorder (ADHD; Mayes, Calhoun, & Crowell, 2000). In fact, the comorbidity of the two disorders ranges between 20% and 30% based on U.S. estimates (National Institute of Mental Health, 2003) or between 35% and 46% based on epidemiological studies (Karakas, Turgut, & Bakar, 2008). Thus, we hypothesize that prolonged attention required by lengthy tests will likely lead to failure, regardless of levels of ability, because of the moderating role of poor attention. Brief tests will certainly "correct" achievement levels for the parameter "fatigue."

A third consideration relates to the above two in that fatigue and lack of attention may relate to a lack of motivation and consequently effort withdrawal (either at the antecedent level or as a consequence). For example, Kane (2008) pointed to the fact that the effort provided by students with LD and ADHD during the assessment process is of very poor quality. Furthermore, several studies have associated effort withdrawal with the employment of maladaptive goals by students with LD (Sideridis, 2005) or ADHD (Barron, Evans, Baranik, Serpell, & Buvinger, 2006) and consequently poor achievement outcomes (Barron et al., 2007; Sideridis, 2007). Furthermore, Morgan, Fuchs, Compton, Cordray, and Fuchs (2008) linked early failure in reading to motivational deficits, a fact that has been evidenced in recent meta-analyses of reading (Morgan & Sideridis, 2006).

A last important reason is the role of valid assessments within the RTI framework. RTI requires constant use of assessments to monitor and evaluate growth. This evaluation will form the basis for subsequent decision making. Thus, students will be placed in certain environments or will be given (or withdrawn from) interventions based on these assessments. To make things worse, research that assesses the reliability of those assessments has produced worrisome findings. For example, Burns, Scholin, Kosciolek, and Livingston (2010) reported that the reliability of two similar decision-making models that were based on frequent assessments within RTI was only .29, and that estimate reflected only consistency. It is possible that besides being inconsistent, neither one of these models represented valid assessments of students' potential and actual performance. As things are today in the field of LD, the success of our current practices (RTI) relies on valid inferences of students' abilities. Fortunately, advances in statistics and measurement provide us with the hope that valid assessments are possible at minimum effort and cost. This is particularly more so with the use of contemporary test theory methods, such as the Rasch model (Rasch, 1980).

The Contribution of the Rasch Model

The Rasch model (Rasch, 1980) has been particularly useful in evaluating the psychometric adequacy of scales (Smith & Smith, 2004; Wright & Stone, 1979). Recently, Dimitrov (2003) provided a series of useful criteria for evaluating scales using information about reliability and population estimates of variance. The present work relies heavily on the psychometric criteria developed by Dimitrov (2003) and extends them by providing a 12-index taxonomy that can be used for evaluating the psychometric adequacy of scales used for research or practice.

Thus, the purpose of the present studies was to provide the means to create brief instruments for the assessment of LD based on information from classical test theory and modern theory analytic methods. Initially, a brief reading rating scale was created based on psychometric standards and contributions from both classical test theory and modern methods (i.e., the Rasch model). Then, the brief reading scale was evaluated for its validity against the full version of the reading rating scale (Study 1) and an objective "gold standard" of reading ability (Study 2).
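For readers unfamiliar with the model, the dichotomous Rasch model expresses the probability of a correct (or endorsed) response as a logistic function of the difference between person ability and item difficulty, both on the logit scale. A minimal sketch, for illustration only (this is the textbook form of the model, not the authors' estimation code):

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct (or endorsed) response under the
    dichotomous Rasch model, given person ability theta and item
    difficulty b, both expressed in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Defining property of the Rasch item characteristic curve:
# when ability equals difficulty, the probability is exactly .50.
p_at_match = rasch_probability(0.0, 0.0)  # -> 0.5
```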
Sideridis and Padeliadu 117
Item of Reading Scale | Infit-ms | Outfit-ms | Point Biserial | DIF | Discrim. | LEV | HER | LETV | Test–Retest^a
1. Reads text out loud very slowly | 1.32 | 1.74 | .61 | No | 0.34 | .1670 | .1485 | .0291 | .931
2. Reads text silently very slowly | 1.36 | 1.97 | .58 | No | 0.23 | .1844 | .1603 | .0352 | .934
3. Substitutes, reverses, adds, or omits letters when reading | 1.06 | 1.09 | .63 | No | 0.90 | .1901 | .1644 | .0374 | .928
4. Substitutes, reverses, adds, or omits syllables when reading | 1.02 | 0.92 | .61 | No | 0.99 | .1989 | .1707 | .0409 | .908
5. Has difficulty decoding double digit words when reading | 0.94 | 0.86 | .68 | Yes | 1.13 | .2007 | .1721 | .0417 | .908
6. Substitutes phonologically similar words when reading | 1.06 | 0.92 | .60 | No | 0.94 | .2010 | .1723 | .0418 | .912
7. Commits errors when reading unknown words | 1.06 | 0.97 | .74 | No | 0.93 | .2027 | .1736 | .0426 | .901
8. Has difficulties to understand abstract meanings when reading text | 0.97 | 0.96 | .70 | No | 1.04 | .2048 | .1751 | .0435 | .910
9. Has difficulties in understanding text when reading compared to when listening to it | 1.20 | 1.34 | .69 | No | 0.65 | .2053 | .1755 | .0437 | .922
10. Has difficulties in identifying the main idea of a text | 0.92 | 0.85 | .74 | No | 1.15 | .2059 | .1759 | .0440 | .908
11. Has difficulties in answering questions that require deep processing of information | 0.91 | 0.80 | .70 | No | 1.19 | .2057 | .1758 | .0439 | .915
12. Has difficulties in predicting the content/plot of a text | 0.89 | 0.78 | .71 | No | 1.22 | .2051 | .1753 | .0436 | .918
13. Has difficulties to distinguish salient and important features of a text (from unimportant ones) | 0.85 | 0.71 | .75 | Yes | 1.29 | .2047 | .1750 | .0434 | .927
14. Has difficulties in processing information that will aid text comprehension | 0.83 | 0.71 | .78 | No | 1.28 | .2041 | .1746 | .0432 | .934
15. Has difficulties to memorize information from a text | 0.86 | 0.79 | .75 | No | 1.25 | .2031 | .1739 | .0427 | .928
16. Gives up easily his/her efforts when reading text | 1.02 | 0.99 | .71 | No | 0.96 | .2025 | .1734 | .0425 | .931
17. Has difficulties in implementing strategies that would aid understanding of the text | 1.01 | 1.01 | .77 | No | 0.99 | .1986 | .1705 | .0408 | .923
18. Uses poor and inefficient strategies that do not belong to his/her age group | 0.89 | 0.80 | .73 | No | 1.22 | .1918 | .1655 | .0380 | .927
19. Uses appropriate strategies in inefficient ways | 0.93 | 0.92 | .69 | No | 1.12 | .1743 | .1534 | .0316 | .900
20. Has difficulties in implementing a plan of action when reading | 0.89 | 0.83 | .75 | No | 1.20 | .1670 | .1485 | .0291 | .926

Note: Shaded items are those selected in the brief reading rating form (i.e., Items 7, 8, 3, 15, 16 and 17). DIF = differential item functioning for two groups of students, those with a diagnosis of learning disabilities and typical students; LEV = low expected variance; HER = high expected reliability; LETV = low expected true variance.
a. Test–retest estimates of difficulty parameters across items are shown in Figure 1.
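The additive, multi-criteria selection described in the text (each item credited for every criterion it satisfies, with the highest-scoring items entering the brief form) can be sketched as follows. The thresholds and field names here are hypothetical, chosen only to illustrate the logic, not the authors' actual cutoffs:

```python
# Hypothetical sketch of the additive selection model: one point per
# psychometric criterion satisfied; thresholds are illustrative only.
def criteria_met(item):
    checks = [
        abs(item["infit"] - 1.0) <= 0.5,            # acceptable infit mean square
        abs(item["outfit"] - 1.0) <= 0.5,           # acceptable outfit mean square
        item["point_biserial"] >= 0.60,             # high item-total correlation
        not item["dif"],                            # no differential item functioning
        abs(item["discrimination"] - 1.0) <= 0.25,  # discrimination near 1
        item["test_retest"] >= 0.90,                # stable across administrations
    ]
    return sum(checks)

def select_brief_form(items, n_items=6):
    """Rank items by the number of criteria met; keep the top n_items."""
    ranked = sorted(items, key=criteria_met, reverse=True)
    return ranked[:n_items]
```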
1. Acceptable infit mean square statistic. The formula used for that assessment came from Wright and Masters (1981),

InfitMS_i = \sum_{n=1}^{N} Y_{in}^{2} \Big/ \sum_{n=1}^{N} W_{in}   (2)

which is the ratio between the observed residuals \sum_{n=1}^{N} Y_{in}^{2} and the average expected residuals \sum_{n=1}^{N} W_{in} / N (also see Wilson, 2005). The expected value is around unity, and deviations of 0.5 from that value are indicative of distorted items (i.e., items that did not perform at the level of difficulty that was expected of them). Furthermore, values that deviate more than one unit from 1 suggest severe distortion, or in other words items that do not provide constructive information to the scale of interest.

2. Acceptable outfit mean square statistic based on z score estimates. The following formula was used for that estimation,

OutfitMS_i = \sum_{n=1}^{N} Z_{ni}^{2} \Big/ N   (3)

which represents the average of the standardized residual variance across individuals and items (i.e., Z_ni). This unweighted estimate gives more impact to responses that are away from a person's (or item's) measures (Bond & Fox, 2001; Linacre & Wright, 1994). The expected value of this parameter is also around unity.

3. High biserial correlation between an item and the scale's total score. As Krus and Ney (1978) pointed out, when point biserial correlations are positive and high, this is indicative of high convergent validity for that specific item. The formula to transform point biserial correlations into biserial correlations (because of the presence of normality in the sample) was the following,

R_{Biserial} = PB_s \sqrt{p(1 - p)}   (4)

with PB_s being the point biserial correlation and p being the proportion of data with Y = 1.

4. Presence of differential item functioning (DIF) between children with a diagnosis of LD versus typical students. As Linacre (1999) pointed out, the difference in item difficulties between two groups must exceed .50 in logit units to be meaningful. Paek (2002) added that difficulty difference estimates less than .42 are negligible, those between .43 and .64 are of medium size, and those exceeding .64 are large. Thus, only DIF estimates that met the criterion of medium effect counted in favor of specific items (see Note 2). The formula applied in Winsteps is the following,

\chi^2 = \sum_{j=1}^{L} \frac{D_j^2}{SE_j^2} - \left( \sum_{j=1}^{L} \frac{D_j}{SE_j^2} \right)^{2} \Big/ \sum_{j=1}^{L} \frac{1}{SE_j^2}   (5)

which tests the hypothesis that all difficulty parameters D with standard errors SE are equivalent across items L.

5. Presence of nonuniform DIF. As Linacre (1999) pointed out, the presence of nonuniform DIF is indicative of discriminant validity. Nonuniform DIF was run with the two ability groups (typical and LD) to test the hypothesis that the slopes of the items are equivalent in shape (i.e., parallel) across ability groups.

6. Discrimination parameter close to the expected value of 1. As Linacre (1999) pointed out, when a discrimination parameter value exceeds unity, then that specific item discriminates high and low ability groups more than expected (for the item's difficulty level). On the contrary, when a discrimination value is less than 1, then the item discriminates high from low ability groups less than expected. Thus, the closer the discrimination values were to 1, the more likely was the item to behave according to expectations (i.e., the Rasch model's expectations).

7. Low expected error variance. Based on Dimitrov (2003), the lower the error of measurement, the higher the probability that a test is valid. The formula for the estimation of amounts of error variance is as follows,

\sigma_e^2 = \sum_{i=1}^{n} \sigma^2(e_i)   (6)

which represents the accuracy of correct versus incorrect scorings.

8. High expected reliability. As error variance goes down, reliability goes up. Based on the work of Dimitrov (2003) the following was assessed,

\rho_{xx'} = \sigma_\tau^2 / \sigma_x^2 = \sigma_\tau^2 / (\sigma_\tau^2 + \sigma_e^2)   (7)

which represents the ratio of true variance to total (i.e., observed) variance.

9. Low expected true variance. Again following the pioneering work of Dimitrov (2003), expected true variance was estimated using his conventions,

\sigma_\tau^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \sqrt{[\pi_i(1 - \pi_i) - \sigma^2(e_i)][\pi_j(1 - \pi_j) - \sigma^2(e_j)]}   (8)

which relates to the items' means \pi_i and their error variances \sigma^2(e_i).

10. Test–retest reliability of person and difficulty estimates. The repeated measurement of a scale is the best index of its reliability. For this purpose, all items were subjected to test–retest reliability using a subsample of 185 participants. Two types of reliability were estimated: correlations between participants' scores across two time points and correlations between difficulty estimates (b) across the two time points (Luppescu, 1991), a fact that is particularly important for item calibration (Lunz & Bergstrom, 1995).

11. LZ statistic that is indicative of person misfit. When comparing two scales, the proportion of participants in one sample (one version of the scale) is contrasted to the
proportion of participants (in the other version of the scale) who misfit the Rasch model. The estimation of misfitting participants was conducted using the LZ statistic (Drasgow, Levine, & Williams, 1985). This statistic is distributed normally, it is standardized, and values in the software reflect the probability that this person fit the Rasch model. Thus, estimates below 5% reflect misfitted participants. The LZ statistic was computed as follows,

LZ_i = \frac{L_0 - E(L_0)}{[\mathrm{Var}(L_0)]^{1/2}}   (9)

with L_0 being the log peak of the likelihood function, E(L_0) being the expectation of L_0, and Var(L_0) being the variance of L_0. The analysis ran with software developed by Liang, Han, and Hambleton (2008). Further comparisons between full and brief forms implemented the general linear model.

12. Content balance. As Chang (2004) has nicely pointed out, "The set of items selected for each examinee must satisfy certain non-statistical constraints such as content balance" (p. 122). This was the final criterion in determining inclusion of an item in the brief form of the scale in terms of adequately capturing the latent construct of reading. In other words, items that represented various aspects of reading (i.e., fluency, comprehension, etc.) were better candidates for inclusion compared to items representing one area only (e.g., only comprehension). Appendix B displays an SPSS syntax file for running some of the commands.

Results of Study 1

Using the criteria described above, this section provides comparative information with regard to the psychometric adequacy of the full scale compared to the brief scale that was composed using fewer than one third of the total items.

Construction of Brief Rating Scale. Table 1 provides item information regarding the evaluative criteria of good items. Based on that evaluation, a best synthesis of six items was selected that met high standards of psychometric quality and content balance. Given the exploratory nature of the present studies, no attempt was made to differentially weigh the criteria. Instead, an additive model was used in which the items that met most of the above criteria were included in the brief form (in an effort to involve as few items as possible, without losing content validity). As shown in Table 1, the six-item synthesis included items with low measurement error and high internal consistency, discrimination, item–total correlation, and test–retest reliability (e.g., of parameters' difficulty levels; see Figure 1). Below, the full and brief versions of the reading rating scale are compared across various estimates of psychometric adequacy.

Comparisons Between Full and Brief Scales

Dimensionality indices between full and brief rating scales. Using a principal components analysis of the residuals, results indicated that one dimension accounted for 35% of the total variance (explained by items and participants). The respective amount for the brief scale was 24.2%. Certainly the heterogeneity of the items that composed the brief scale accounted for the fact that the items contained much more information compared to a single reading dimension. Nevertheless, the amount of variance explained was significant.

Internal consistency estimates. Using the Kuder–Richardson formula for dichotomous items, we assessed the internal consistency of both forms of the scale. Estimates for the full scale were equal to .949 and for the brief, six-item form equal to .879. Item–total correlations ranged between .52 and .74 for the full scale and between .61 and .74 for the brief scale.

Test–retest reliability estimates. Using an interval of 1 week, the rating scale was subjected to another testing for 185 participating teachers. Results indicated that the test–retest correlation coefficient was .974 for the full reading scale and .960 for the brief rating scale.

Test characteristic curves (TCC). As Crocker and Algina (1986) have stated, TCCs reflect regression curves for predicting observed scores (X) from latent trait scores (θ). Figure 2 shows the TCCs of the full and brief scales. At the medium reading trait level (θ), it is obvious that the two tests are equivalent (having equal thresholds and location on the latent trait). However, the slope of the full scale is steeper compared to that of the brief scale, suggesting that higher ability levels are required to achieve a high probability of success (different levels of discrimination, i.e., alpha parameter). That is, the difference between the two tests is on discrimination but not in location, suggesting that the full scale is better at discriminating individuals in the latent trait range around the point of inflection (Morizot et al., 2007).

Test information function (TIF). TIF is extremely important in test construction because it provides information regarding the "precision" of a scale across levels of the latent trait (Morizot et al., 2007). Ideally a researcher would want to develop a scale that would have equal precision across the trait range, unless the purpose of the test is to identify subgroups of individuals with specific levels of performance on the latent trait. A normative test, however, should always aim to describe individuals at all levels of ability to be potentially unbiased with regard to any level of ability. Furthermore, the concept of "information" in Rasch modeling is very important because it relates to the standard error of measurement (SEM), which represents the inverse square root of information at each and every point along the trait continuum (Robins, Fraley, & Krueger, 2007; Simms & Watson, 2007),

SE(\theta) = \frac{1}{\sqrt{I(\theta)}}   (10)

with I(θ) representing TIF and SE(θ) representing the SEM. Conversely, the information function of an item is estimated using the formula by Wilson (2005),
Figure 1. Regression of difficulty parameters between Time 1 and Time 2 measurements for full scale.
Figure 2. Test characteristic curves between brief and full scales using Rasch model estimates.
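A test characteristic curve such as those in Figure 2 is simply the expected raw score at each trait level: under the Rasch model, the sum of the item response probabilities. A brief sketch, with item difficulties invented purely for illustration:

```python
import math

def irf(theta, b):
    """Rasch item response function for one dichotomous item."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Test characteristic curve: expected observed score X at trait level theta,
    i.e., the sum of item probabilities across the scale."""
    return sum(irf(theta, b) for b in difficulties)

# Illustrative difficulties (logits); the expected score rises monotonically
# with theta, tracing the S-shaped curve of a TCC.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
```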
Figure 3. Test information function between brief and full scales using Rasch model estimates.
Figure 4. Standard error curves for the two scales (full with 20 items and brief with 6 items).
Scales were placed on the same metric to be comparable.
Figure 5. Frequency distributions between typical students and students with a learning disability on the full reading rating scale.
Indices of sensitivity and specificity are shown for each level of the rating scale.
discriminate individuals at most of the latent trait range except the tails (which is expected). Obviously, for both scales the greatest precision is evident in theta values that range between –2 and +2 (see Figure 3) or between raw scores of approximately +4 and +18 (see Figure 4). Although low discrimination is expected in the tails of the distribution, Sykes and Hou (2003) suggested that there may be some guessing involved at that range of ability, which explains that fact (see the work of Karabatsos, 2003, for types of aberrant responding).

Discriminant validity. It was evaluated by estimating the ability of the brief and full forms of the scale to discriminate students with and without identified LD by use of receiver operating characteristic (ROC) curves (Grilo, Becker, Anez, & McGlashan, 2004; Hanley & McNeil, 1982; Hsu, 2002). For this purpose the distributions of the two groups of students are plotted for the full scale (Figure 5) and the brief scale (Figure 6), and indices of sensitivity and specificity are displayed for each score level (dark and light lines). Then ROC curves were constructed to test the ability of each scale (full and brief forms) to identify students having identified LD (Figure 7). The discriminant validity of each scale was evaluated first, and then their areas under the curve (AUCs) were compared using z difference tests (Hanley & McNeil, 1983).

When looking at the full scale (Figure 5), it is apparent that most of the typical student group had scores around zero (dark bars above the X line), whereas students with LD
Figure 6. Frequency distributions between typical students and students with a learning disability on the brief reading rating scale.
Indices of sensitivity and specificity are shown for each level of the rating scale.
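The ROC machinery used here can be sketched as follows: the AUC equals the Mann–Whitney probability that a randomly chosen student with LD outscores a randomly chosen typical student, and Hanley and McNeil (1982) give a standard error for it. The z test below treats the two AUCs as independent, which is a simplification; the correlated-samples version (Hanley & McNeil, 1983) would subtract a covariance term:

```python
import math

def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen positive case outscores a randomly
    chosen negative case, with ties counting one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def hanley_mcneil_se(a, n_pos, n_neg):
    """Standard error of an AUC (Hanley & McNeil, 1982) from the area a
    and the two group sizes."""
    q1 = a / (2.0 - a)
    q2 = 2.0 * a * a / (1.0 + a)
    var = (a * (1 - a) + (n_pos - 1) * (q1 - a * a)
           + (n_neg - 1) * (q2 - a * a)) / (n_pos * n_neg)
    return math.sqrt(var)

def z_difference(a1, se1, a2, se2):
    """z test for the difference between two AUCs, treated as independent
    (a simplification of the correlated-samples test)."""
    return (a1 - a2) / math.sqrt(se1 ** 2 + se2 ** 2)
```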
Figure 8. Probability of misfit for each participant in the full scale versus brief scale.
Values were estimated using the LZ statistic.
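A textbook construction of the LZ person-fit statistic of Equation 9 (Drasgow, Levine, & Williams, 1985) is sketched below, using the standard moments of the log-likelihood under the model; this is a generic implementation, not the Liang, Han, and Hambleton (2008) software the authors used:

```python
import math

def lz_statistic(responses, probabilities):
    """Standardized log-likelihood person-fit statistic (Equation 9).
    responses: 0/1 item scores for one person; probabilities: model-implied
    success probabilities for that person on the same items."""
    # Observed log-likelihood of the response pattern (L0).
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
             for u, p in zip(responses, probabilities))
    # Expectation and variance of L0 under the model.
    expectation = sum(p * math.log(p) + (1 - p) * math.log(1 - p)
                      for p in probabilities)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2
                   for p in probabilities)
    return (l0 - expectation) / math.sqrt(variance)

# Large negative values flag aberrant (misfitting) response patterns;
# values near zero are consistent with the model.
```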
Person fitting comparison. Using the LZ test described above, participants were evaluated for their personal contribution to model fit. Figure 8 displays the probability values for each participant (small values represent misfitted participants, and large values substantial contribution of each participant to the Rasch model's theses). As shown in Figure 8, more participants in the full scale appear to be on the right of the distribution compared to the brief scale. This finding was confirmed by use of a means analysis, for which the mean probability of misfit was equal to .531 for the full scale participants and .423 for the participants in the brief scale. When this difference was tested using Student's t test, results supported the alternative hypothesis that there was a significantly smaller amount of misfitted participants in the brief scale compared to the full scale, t(1960) = 9.830, p < .001.

Item fitting comparison. Items were evaluated for model fit by estimating the discrepancy between their empirical estimate and that posited by the Rasch model (theory). Discrepancies were assessed using chi-square tests. These findings are presented in Figure 9 for both the brief form of the scale and the full scale. As shown in Figure 9, the full scale seems to have items that have large discrepancies between empirical estimations and expectations of the model. Visually speaking, the discrepancies appear to be smaller with regard to the brief scale items. The ICCs for the final brief six-item scale are shown in Figure 10.

Brief Discussion of Study 1

Study 1 was designed to provide a taxonomy with which one can construct brief forms of a scale, with an application in LD. This taxonomy was provided using information from classical and modern theory methods and is primarily proposed for LD, for which brief scales may be particularly more suitable, appropriate, and necessary. Results indicated that the brief form possessed highly desirable psychometric properties and was psychometrically comparable to the full version of the reading rating scale (at times even more so). Study 2 sought to test the validity of the brief scale further (compared to the full form) using external evaluative criteria as suggested by Simms and Watson (2007). Thus, using a normative reading scale, both full and brief forms were tested for the presence of external criterion validity and discriminant validity.

Study 2

Method

Participants and Procedures. There were 232 students, with 22 of them having a diagnosis of LD from state multidisciplinary teams. The remaining 210 students were typical
Figure 9. Chi-square values for each item reflecting discrepancies between empirical estimates and theoretical postulates.
Large values are indicative of item misfit.
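One common way to obtain item-fit chi-squares like those plotted in Figure 9 is to stratify examinees by ability and compare observed with model-expected proportions correct in each stratum. The sketch below follows that generic Pearson-type construction and is not necessarily the authors' exact computation:

```python
# Generic Pearson-type item-fit statistic: for each ability stratum,
# compare the observed proportion correct with the model-expected one.
# groups: list of (n_g, observed_prop, expected_prop) tuples, one per stratum.
def item_fit_chi_square(groups):
    chi2 = 0.0
    for n, obs, exp in groups:
        # Squared standardized discrepancy, weighted by stratum size;
        # exp * (1 - exp) is the binomial variance of a single response.
        chi2 += n * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return chi2
```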
and were educated in general education settings (122 boys, 108 girls; data on gender were missing for 2 students). Students were identified as having LD as in Study 1 (i.e., using multidisciplinary team criteria). The distribution of students per grade was as follows: Grade 3 = 33, Grade 4 = 39, Grade 5 = 35, Grade 6 = 36, Grade 7 = 35, Grade 8 = 28, and Grade 9 = 17; data on grade were missing for 4 students. Schools were selected for inclusion as in Study 1 using stratified random sampling, with region being the stratum.

Learning Disabilities Screening Scale for Teachers. The same scale used in Study 1 was also implemented in Study 2. The internal consistency estimate of the Reading subscale was .929.

Learning Disabilities Reading Inventory. This normative scale was developed by Padeliadu and Antoniou (2008) and assesses four aspects of reading (i.e., decoding, fluency, morphology and syntax, and comprehension). Internal consistency estimates ranged between .605 and .853 using the present data.

Data Analyses. Latent variable modeling was implemented using EQS 6.1 (Bentler, 2006). The purpose of this modeling was twofold: (a) to correlate the latent "reading" trait with objective measurements of reading skills and (b) to compare groups of individuals (with and without LD) on the latent reading trait (latent means model). Evidence in favor of (a) would be manifested with significant correlations between the latent trait and objective reading measures, and evidence for (b) would be manifested with a significant coefficient linking a dummy (grouping) variable to the latent reading trait. For the purposes of (a) and (b), model-fit indices (e.g., chi-square values and fit indices such as the comparative fit index, etc.) were not relevant, as overall fit was not an objective of either the (a) or (b) goal. Instead, the structural coefficients and their respective significance were of interest.

Results and Discussion of Study 2

External criterion validity of full and brief rating scale. As mentioned above, Model 1 tested the hypothesis that the rating scales would correlate significantly with objective indices of reading achievement (i.e., the Learning Disabilities Reading Inventory). As shown in Figure 11, there were significant correlations between the latent teacher rating scale of reading and decoding (r = .21, p < .05), fluency (r = −.28, p < .05), and comprehension (r = .12, p < .05, on a one-tailed test), but not morphology and syntax (r = .03, ns)
Figure 10. Item characteristic curves for the brief six-item rating scale.
for the brief scale. These coefficients were very similar to those for the full scale (estimates in parentheses of Figure 11). Furthermore, between-correlation-coefficient tests failed to reveal differences across scales (full form and brief) on their relation with the external criterion. Interestingly, fluency correlated negatively with the latent rating of reading,
Figure 11. Correlations between reading latent construct (rating scale) and the subscales of the gold standard.
The sign of the estimates was reversed for clarity (because learning disability ratings were reversely coded).
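Between-correlation-coefficient tests of the kind mentioned in the text can be approximated with Fisher's r-to-z transformation; the independent-samples version below is a simplification, since the exact test the authors used is not specified in this excerpt:

```python
import math

def fisher_z(r):
    """Fisher's variance-stabilizing transformation of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations,
    each based on its own sample size."""
    return (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
```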
most likely because of an emphasis on reading comprehension and the fact that reading fast may be associated with a lack of deep processing of information from text.

Discriminant validity of full and brief rating scales: Latent means model. A latent means model was fit to the data to test the hypothesis that the latent reading rating would be different across students with and without LD. Results from that modeling confirmed this hypothesis (see Figure 12). The effect was significant, suggesting that typical students had, on average, a .69 higher reading score compared to students with LD. The respective estimate for the full scale was .64, very similar to that for the brief scale.

ROC analyses. When comparing the AUCs between the two forms of the scale, results indicated that both scales provided exceptional discrimination (AUC(Full Scale) = .913, AUC(Brief Scale) = .919). Furthermore, there were no significant differences between the two areas (z = 0.240, p = .810), a finding that supports the null hypothesis that both forms of the scale were equally effective at discriminating students with LD from typical students.

General Discussion

A brief rating scale composed of fewer than one third of the items of the total scale was created in Study 1 based on multiple and rigorous psychometric criteria. Study 2 attempted to validate the findings of Study 1 using an objective measurement of reading achievement. Results indicated that the brief rating scale significantly discriminated between students with and without LD. In fact, the discriminant ability of the brief rating scale was even better, compared to the full item scale, across various psychometric indices.

The most important finding of the present study was that the use of multiple criteria proved to be extremely useful for identifying effective items and creating a short form of the rating scale. Using multiple criteria, results showed that the brief scale, composed of fewer than one third of the total scale's items, possessed no inferior psychometric properties compared to the full scale. At first, this is very encouraging given the fact that brief scales are needed for assessing skills and competencies in LD. The second important finding was that the use of an external gold standard further substantiated the predictive and criterion validity of the brief scale, which is very important for scale evaluation (Messe, Crano, Messe, & Rice, 1979). This fact was evidenced more so even when comparing the brief scale to the full
The purpose of the present studies was to provide the means scale.
to create brief instruments for the assessment and screening Interestingly, the brief reading scale involved a combi-
of LD based on information from classical test theory and nation of items related to reading. For example, Item 7 was
modern theory analytic methods. Using a large sample of about fluency, Item 8 was about comprehension, Item 13
students with and without LD, a subscale that comprised was about deep versus surface processing (cognitive strategy
Sideridis and Padeliadu 129
Figure 12. Latent means model between students with and without learning disabilities on latent reading construct (rating scale).
Group was coded as 0 = learning disabilities, 1 = typical.
use), Item 15 was about memorization (cognitive ability, criteria has been a crude approximation of the actual
Item 16 was about motivation and the ability to not give up “quality” of each criterion. However, given the exploratory
easily, and Item 17 was about strategy implementation. In nature of the present work, our knowledge did not allow us
other words, the brief form contained a mixture of factors to differentially weigh the respective criteria. Also, the large
that are accelerators of reading, without belonging to one sample of identified students with LD provides strong sup-
unified aspect of reading (e.g., strategy use only or reading port with regard to the viability and usefulness of the iden-
comprehension only, motivation, etc.). The unified theme tifying criteria for creating a brief rating scale. Certainly,
was reading but the content of the items belonged to con- attempting to replicate the present criteria in the future will
ceptually different variables. aid the generalizability of the present psychometric attempt.
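The article does not state which between-correlation-coefficient test was used to compare the full and brief scales against the external criterion. As a minimal sketch of one common choice, a Fisher r-to-z comparison (shown here in its independent-samples form; the full-scale coefficient of .24 and the use of n = 232 for both coefficients are hypothetical illustrations, not values from the article):

```python
import math

def fisher_z(r: float) -> float:
    """Fisher r-to-z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> tuple[float, float]:
    """Two-tailed z test for the difference between two correlations,
    independent-samples approximation: z = (z1 - z2) / sqrt(1/(n1-3) + 1/(n2-3))."""
    z1, z2 = fisher_z(r1), fisher_z(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p from the normal distribution
    return z, p

# Hypothetical example: brief-scale decoding r = .21 vs. a full-scale r of .24
z, p = compare_correlations(0.21, 232, 0.24, 232)
```

Because both coefficients here come from the same sample, a dependent-correlations test (e.g., Steiger's test) would strictly be more appropriate; the sketch above only illustrates the general logic of testing a difference between correlations.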
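The AUC comparison reported in the ROC analyses (z = 0.240, p = .810) follows the approach of Hanley and McNeil (1982, 1983), cited in the reference list. A sketch of that logic, computing the AUC as a Mann-Whitney statistic and comparing two AUCs with a z test; the group sizes (116/116) and the between-scale correlation r = .5 below are hypothetical, since the article does not report the LD/typical split or the correlation between the two forms:

```python
import math

def auc_mann_whitney(pos, neg):
    """AUC as the probability that a positive case outranks a negative one
    (the Mann-Whitney / Wilcoxon interpretation; Hanley & McNeil, 1982)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_se(auc, n_pos, n_neg):
    """Hanley & McNeil (1982) standard error of a single AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc) + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

def compare_aucs(auc1, auc2, n_pos, n_neg, r=0.0):
    """z test for the difference between two AUCs from the same cases;
    r is the correlation induced by the paired measures (Hanley & McNeil, 1983).
    r = 0 reduces to the independent-samples case."""
    se1, se2 = auc_se(auc1, n_pos, n_neg), auc_se(auc2, n_pos, n_neg)
    z = (auc1 - auc2) / math.sqrt(se1**2 + se2**2 - 2 * r * se1 * se2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed
    return z, p

# Hypothetical split of the Study 2 sample (n = 232) into 116 LD / 116 typical
z, p = compare_aucs(0.913, 0.919, 116, 116, r=0.5)
```

With AUCs this close and any plausible sample split, the test is nonsignificant, which is consistent with the article's conclusion that the two forms discriminate equally well.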
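The item characteristic curves in Figure 10 come from the Rasch model used throughout Study 1. As a minimal illustration (not the authors' estimation code), the probability of endorsing a dichotomous item depends only on the difference between person ability theta and item difficulty b, both in logits:

```python
import math

def rasch_icc(theta: float, b: float) -> float:
    """Rasch model: probability of a correct/endorsed response for a person
    of ability theta on an item of difficulty b (logistic in theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Tracing this function over a range of theta for each of the six retained
# items reproduces the general shape of the curves in Figure 10.
curve = [rasch_icc(t / 10, 0.0) for t in range(-40, 41)]
```

Each item's curve is a parallel logistic ogive shifted by its difficulty, which is why Rasch-based item selection can compare items directly on a common logit scale.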
References

Bond, T., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.

Burns, M., Dean, V., & Klar, S. (2004). Using curriculum-based assessment in the responsiveness to intervention diagnostic model for learning disabilities. Assessment for Effective Intervention, 29, 47–56.

Burns, M., Scholin, S., Kosciolek, S., & Livingston, J. (2010). Reliability of decision-making frameworks for response to intervention for reading. Journal of Psychoeducational Assessment, 28, 102–114.

Cassidy, S. (2008). Subjectivity and the valid assessment of pre-registration student nurse clinical learning outcomes: Implications for mentors. Nurse Education Today, 29, 33–39.

Chang, H. H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 117–131). Thousand Oaks, CA: Sage.

Compton, D. L. (2006). How should "unresponsiveness" to secondary intervention be operationalized? It is all about the nudge. Journal of Learning Disabilities, 39, 170–173.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Philadelphia, PA: Harcourt Brace Jovanovich.

Deno, S. L. (1989). Curriculum-based measurement and special education services: A fundamental and direct relationship. New York, NY: Guilford.

Dimitrov, M. D. (2003). Reliability and true-score measures of binary items as a function of their Rasch difficulty parameters. Journal of Applied Measurement, 4, 222–233.

Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polytomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67–86.

Fletcher, J., Francis, D., Morris, R., & Lyon, R. (2005). Evidence-based assessment of learning disabilities in children and adolescents. Journal of Clinical Child and Adolescent Psychology, 34, 506–522.

Fletcher, J., Francis, D., Shaywitz, S., Lyon, R., Foorman, B., Stuebing, K., & Shaywitz, B. (1998). Intelligence testing and the discrepancy model for children with learning disabilities. Learning Disabilities Research & Practice, 13, 186–203.

Francis, D., Fletcher, J., Stuebing, K., Lyon, R., Shaywitz, B., & Shaywitz, S. (2005). Psychometric approaches to the identification of LD: IQ and achievement scores are not sufficient. Journal of Learning Disabilities, 38, 98–108.

Fuchs, D., & Deshler, D. (2007). What we need to know about responsiveness to intervention (and shouldn't be afraid to ask). Learning Disabilities Research & Practice, 22, 129–136.

Goodman, G., & Webb, M. M. (2006). Reading disability referrals: Teacher bias and other factors that impact response to intervention. Learning Disabilities: A Contemporary Journal, 4, 59–70.

Grilo, C. M., Becker, D. F., Anez, L. M., & McGlashan, T. H. (2004). Diagnostic efficiency of DSM-IV criteria for borderline personality disorder: An evaluation in Hispanic men and women with substance use disorders. Journal of Consulting and Clinical Psychology, 72, 126–131.

Hammill, D. D. (1995). The Learning Disability Diagnostic Inventory (LDDI). Austin, TX: Pro-Ed.

Hammill, D. D., & Bryant, B. R. (1998). Learning Disabilities Diagnostic Inventory examiner's manual. Austin, TX: Pro-Ed.

Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 39–46.

Hanley, J. A., & McNeil, B. J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148, 839–843.

Hsu, L. M. (2002). Diagnostic validity statistics and the MCMI-III. Psychological Assessment, 14, 410–422.

Kane, S. T. (2008). Minimizing malingering and poor effort in the LD/ADHD evaluation process. ADHD Report, 16, 5–9.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of 36 person fit statistics. Applied Measurement in Education, 16, 277–298.

Karakas, S., Turgut, S., & Bakar, E. (2008). Neuropsychometric comparison of children with pure learning disabilities, pure ADHD, comorbid ADHD with learning disability, and normal controls using the Mangina test. International Journal of Psychophysiology, 69, 147–148.

Keogh, B. K. (2005). Revisiting classification and identification. Learning Disability Quarterly, 28, 100–102.

Kripke, B., Lynn, R., Madsen, J., & Gay, P. (1982). Familial learning disability, easy fatigue, and maladroitness: Preliminary trial of monosodium glutamate in adults. Developmental Medicine & Child Neurology, 24, 745–751.

Krus, D. J., & Ney, R. G. (1978). Convergent and discriminant validity in item analysis. Educational and Psychological Measurement, 38, 135–137.

Liang, T., Han, K. T., & Hambleton, R. K. (2008). User's guide for ResidPlots-2: Computer software for IRT graphical residual analyses, Version 2.0 (Center for Educational Assessment Research Rep. No. 688). Amherst: University of Massachusetts, Center for Educational Assessment.

Linacre, J. M. (1999). A user's guide and manual to Winsteps. Chicago, IL: Mesa Press.

Linacre, J. M., & Wright, B. D. (1994). Chi-square fit statistics. Rasch Measurement Transactions, 8, 360–361.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Lunz, M. E., & Bergstrom, B. A. (1995). Item recalibration and ability estimate stability. Rasch Measurement Transactions, 8, 396–397.

Luppescu, S. (1991). Graphical diagnosis. Rasch Measurement Transactions, 5, 136.

Mashburn, A. J., & Henry, G. T. (2004). Assessing school readiness: Validity and bias in preschool and kindergarten teachers' ratings. Educational Measurement: Issues and Practice, 23, 16–30.

Mayes, S. D., Calhoun, S. L., & Crowell, E. W. (2000). Learning disabilities and ADHD: Overlapping spectrum disorders. Journal of Learning Disabilities, 33, 417–424.

Messe, L. A., Crano, W. D., Messe, S. R., & Rice, W. (1979). Evaluation of the predictive validity of tests of mental ability for classroom performance in elementary grades. Journal of Educational Psychology, 71, 233–241.

Morgan, P. L. (1977). The differential effects of visual background and fatigue on automatized task performance in learning disabled and normal children. Dissertation Abstracts International, 37, 4695–4696.

Morgan, P. L., Fuchs, D., Compton, D., Cordray, D., & Fuchs, L. (2008). Does early reading failure decrease children's reading motivation? Journal of Learning Disabilities, 41, 387–404.

Morgan, P. L., & Sideridis, G. D. (2006). Contrasting the effectiveness of fluency interventions for students with or at risk for learning disabilities: A multilevel random coefficient modeling meta-analysis. Learning Disabilities Research & Practice, 21, 191–210.

Morizot, J., Ainsworth, A. T., & Reise, S. P. (2007). Toward modern psychometrics: Application of item response theory models in personality research. In R. W. Robins, C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 407–423). New York, NY: Guilford.

National Institute of Mental Health. (2003). Attention deficit hyperactivity disorder. Bethesda, MD: Department of Health and Human Services, National Institutes of Health.

Padeliadu, S., & Antoniou, F. (2008). Learning Disabilities Reading Inventory. Athens: YPEPTH, EPEAEK.

Padeliadu, S., & Sideridis, G. D. (2008). Learning disabilities screening for teachers. Athens: YPEPTH, EPEAEK.

Paek, I. (2002). Investigations of differential item functioning: Comparisons among approaches, and extension to a multidimensional context. Unpublished doctoral dissertation, University of California, Berkeley.

Podell, D., & Soodak, L. (1993). Teacher efficacy and bias in special education referrals. Journal of Educational Research, 86, 247–253.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press.

Rescorla, L. A., Achenbach, T. M., Ginzburg, S., Ivanova, M., Dumenci, L., Almqvist, F., . . . Verhulst, F. (2007). Consistency of teacher-reported problems for students in 21 countries. School Psychology Review, 36, 91–110.

Ritter, D. (1989). Teachers' perceptions of problem behavior in general and special education. Exceptional Children, 55, 559–564.

Robins, R. W., Fraley, C. R., & Krueger, R. F. (2007). Handbook of research methods in personality psychology. New York, NY: Guilford.

Sideridis, G. D. (2005). Performance approach-avoidance motivation and planned behavior theory: Model stability with Greek students with and without learning disabilities. Reading and Writing Quarterly, 21, 331–359.

Sideridis, G. D. (2007). Why are students with learning disabilities depressed? A goal orientation model of depression vulnerability. Journal of Learning Disabilities, 40, 526–539.

Simms, L. J., & Watson, D. (2007). The construct validation approach to personality scale construction. In R. W. Robins, C. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 240–258). New York, NY: Guilford.

Smith, E. V., & Smith, R. M. (2004). Introduction to Rasch measurement: Theory, models and applications. Maple Grove, MN: JAM Press.

Stanovich, K. E. (1991). Discrepancy definition of reading disability: Has intelligence led us astray? Reading Research Quarterly, 26, 7–29.

Swanson, L. (1991). Operational definitions and learning disabilities: An overview. Learning Disability Quarterly, 14, 242–254.

Swanson, L. (2008). Neuroscience and RTI: A complementary role. In E. Fletcher-Janzen & C. R. Reynolds (Eds.), Neuropsychological perspectives on learning disabilities in the era of RTI: Recommendations for diagnosis and intervention (pp. 28–53). Hoboken, NJ: John Wiley.

Sykes, R. C., & Hou, L. (2003). Weighting constructed-response items in IRT-based exams. Applied Measurement in Education, 16, 257–275.

Vaughn, S., & Fuchs, L. (2003). Redefining learning disabilities as inadequate response to instruction: The promise and potential problems. Learning Disabilities Research & Practice, 18, 137–146.

Wilson, M. (2005). Constructing measures: An item response modelling approach. Mahwah, NJ: Lawrence Erlbaum.

Wright, B. D., & Masters, G. N. (1981). Rating scale analysis. Chicago, IL: Mesa Press.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: Mesa Press.

Ysseldyke, J., Burns, M., Scholin, S., & Parker, D. (2010). Instructionally valid assessment within response to intervention. Teaching Exceptional Children, 42(4), 54–62.