The performance gap between ELL and non-ELL students was smaller in the
lower grades. For example, there was a 17.7-point difference between ELL and
non-ELL students in Grade 2 reading mean scores as compared with a 22-point
difference for students in Grade 9.
The performance gap between ELL and non-ELL students decreased when
the level of language demand of test items decreased. For example, for Grade 9 stu-
dents, the performance gap between ELL and non-ELL students in reading was 22
points, as compared to 15.4 points in math.
The results of analyses also show that other background variables affect test
performance. Background variables such as family income (as measured by
participation in the free/reduced price lunch program) and parent education may
not be directly related to students' ELL status, but are confounded with it.
238 ABEDI
²Since we are working with the population of students in this site, no statistical comparison is needed.
Even a minor difference would be real. However, following tradition, we conducted some statistical
significance testing. To control for multiple comparisons, we used the Benjamini-Hochberg False
Discovery Rate procedure. For a description of this procedure, see Benjamini and Hochberg (1994).
TABLE 3
Site 2 Grade 2 Stanford 9 Subsection Scores
Grade 2 Grade 9
Subgroup/Grade Reading Math Science Reading Math Science
ELL status
ELL
M 31.6 37.7 NA 24.0 38.1 34.9
SD 15.9 19.7 NA 12.5 15.2 12.8
N 97,862 114,519 NA 48,801 50,666 50,179
Non-ELL
M 49.3 50.4 NA 46.0 53.5 49.2
SD 19.7 21.9 NA 18.0 19.4 16.1
N 252,696 267,397 NA 224,215 226,393 225,457
School lunch
Free/reduced price
M 35.4 38.8 NA 32.0 42.5 39.4
SD 17.5 20.1 NA 16.2 16.4 14.3
N 106,999 121,461 NA 56,499 57,961 57,553
No free/reduced price
M 47.0 48.5 NA 42.6 50.7 47.0
SD 20.6 22.4 NA 19.7 20.1 17.0
N 304,092 327,409 NA 338,285 343,480 341,663
Parent education
Not high school grad
M 30.1 34.7 NA 29.2 39.6 37.3
SD 15.3 19.1 NA 15.0 15.1 13.5
N 54,855 63,960 NA 69,934 71,697 71,183
High school graduate
M 40.5 42.6 NA 35.6 44.1 41.7
SD 18.1 20.3 NA 17.0 17.1 14.9
N 93,031 101,276 NA 71,986 73,187 72,810
Some college
M 48.8 50.3 NA 44.6 51.6 48.2
SD 18.6 20.6 NA 17.2 18.1 15.4
N 66,530 70,381 NA 70,364 70,971 70,687
College graduate
M 56.5 58.4 NA 48.1 56.3 51.5
SD 18.5 20.6 NA 18.5 19.6 16.4
N 54,391 56,451 NA 87,654 88,241 87,956
Post graduate studies
M 62.1 64.1 NA 57.6 65.8 58.8
SD 18.7 20.4 NA 19.6 20.7 17.1
N 25,571 26,367 NA 34,987 35,087 35,022
Note. ELL = English language learner.
Students who did not participate in the free/reduced price lunch program had
higher mean scores in all subject areas than those who did participate in the
program. For example, the average NCE score for reading for Grade 2 students who
participated in the free/reduced price lunch program was 35.4 (SD = 17.5, N =
106,999), as compared with an average score of 47.0 (SD = 20.6, N = 304,092) for
those who did not participate in the program. The difference was statistically
significant (t = 177.8, df = 411,089, p < .001). For Grade 9 students participating
in the free/reduced price lunch program, the average NCE score for reading was
32.0 (SD = 16.2, N = 56,499), as compared with an average of 42.6 (SD = 19.7,
N = 338,285) for those who did not participate in the program. The difference
between the performances of the two groups was statistically significant
(t = 139.2, df = 394,755, p < .001).
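The t statistics in the preceding paragraph can be approximated from the reported means, standard deviations, and group sizes alone. The sketch below is illustrative only: the text does not say which form of the t test was used, so the unpooled standard error is an assumption, and the function name `t_from_summary` is hypothetical. With the rounded Grade 2 reading values it gives a result close to the reported t = 177.8.

```python
from math import sqrt


def t_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for a difference in means from summary statistics.

    Assumption: the standard error is computed without pooling the two
    group variances. The article does not state which variant was used.
    """
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Grade 2 reading, values as reported in the text:
# nonparticipants (M = 47.0, SD = 20.6, N = 304,092) versus
# participants (M = 35.4, SD = 17.5, N = 106,999).
t_grade2 = t_from_summary(47.0, 20.6, 304092, 35.4, 17.5, 106999)

# Degrees of freedom as reported in the text: N1 + N2 - 2.
df_grade2 = 304092 + 106999 - 2

print(f"t = {t_grade2:.1f}, df = {df_grade2}")
```

Because the inputs are rounded to one decimal place, small discrepancies from the published statistics are to be expected.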
The results also indicate that parent education has a substantial impact on the
Stanford 9 test scores. For example, the average NCE score for reading for Grade 2
students of parents with low education (not high school graduate) was 30.1 (SD =
15.3, N = 54,855), as compared with an average of 62.1 (SD = 18.7, N = 25,571) for
students of parents with high education (post graduate education). This difference
was statistically significant (t = 238.8, df = 80,424, p < .001). For Grade 9 students,
the average NCE score for reading for the low parent education category was 29.2
(SD = 15.0, N = 69,934). For students with parents in the high education category,
the average was 57.6 (SD = 19.6, N = 34,987). This difference was statistically
significant (t = 238.4, df = 104,919, p < .001). The results of our analyses also suggest
that family income and parent education are confounded with students' ELL
status. Table 4 presents frequencies and percentages of family income (free/reduced
price lunch program) and parent education by ELL status.
TABLE 4
Site 2 Free/Reduced Price Lunch Status and Parent Education by ELL Status

            No Free/Reduced Price Lunch          Free/Reduced Price Lunch
            (Parent Education)                   (Parent Education)
            Not HS Grad   Post Grad   Total      Not HS Grad   Post Grad   Total     Grand Total
Non-ELL     20,738        22,410      43,148     9,763         980         10,743    53,891
            26.2%         28.2%       54.4%      12.3%         1.3%        13.6%     68.0%
ELL         15,384        976         16,360     8,648         358         9,006     25,366
            19.4%         1.2%        20.6%      10.9%         0.5%        11.4%     32.0%
Total       36,122        23,386      59,508     18,411        1,338       19,749    79,257
            45.6%         29.5%       75.1%      23.2%         1.7%        24.9%     100.0%

Note. Percentages reported are based on the total number of students. Not HS Grad = not high school
graduate; Post Grad = post graduate education; ELL = English language learner.
A chi-square of 12,096.72, which shows confounding of these variables, was
significant beyond the .01 nominal level ($\chi^2$ = 12,096.72, p < .001). A squared
contingency coefficient of .132 presents a rough estimate of the proportion of common
variance (or confounding) among the three variables. These results suggest that a
greater percentage of ELL students are from families with lower income and lower
education. For example, 95% of ELL students had parents with low education,
whereas only 57% of non-ELL students had parents with low education. Thirty-six
percent of all ELL students participated in the free/reduced price lunch program as
compared with only 20% of non-ELL students.
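The figures in this paragraph can be checked against Table 4 with a few lines of arithmetic. The sketch below assumes the coefficient is Pearson's contingency coefficient, whose square is chi-square divided by (chi-square + N); the text does not name the formula, so that is an assumption, as are the variable and function names. The counts are copied from Table 4.

```python
# Values reported in the text and in Table 4.
chi_square = 12096.72
n_students = 79257  # grand total in Table 4

# Assumed formula: squared Pearson contingency coefficient,
# C^2 = chi^2 / (chi^2 + N).
c_squared = chi_square / (chi_square + n_students)

# Table 4 counts as (not HS grad, post grad) within each lunch category.
counts = {
    "ELL": {"no_lunch": (15384, 976), "lunch": (8648, 358)},
    "Non-ELL": {"no_lunch": (20738, 22410), "lunch": (9763, 980)},
}


def shares(group):
    """Return (% with low parent education, % in the lunch program)."""
    cells = counts[group]
    total = sum(sum(pair) for pair in cells.values())
    low_education = cells["no_lunch"][0] + cells["lunch"][0]
    in_lunch_program = sum(cells["lunch"])
    return 100 * low_education / total, 100 * in_lunch_program / total


print(f"C^2 = {c_squared:.3f}")
for group in counts:
    low_ed, lunch = shares(group)
    print(f"{group}: low education {low_ed:.0f}%, lunch program {lunch:.0f}%")
```

Under these assumptions the computed values agree with the .132, 95%, 57%, 36%, and 20% quoted in the text.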
However, the results of analyses in this study suggest that among these back-
ground variables, language factors show a greater impact on assessment, much
greater than family income or parent education.
To make a clearer comparison between the performance of subgroups of
students (e.g., by ELL status, family income, and parent education) in different
content areas, a Disparity Index (DI) was computed. For example, to compute DI
by students' ELL status, the mean score for ELL students was subtracted from the
mean for non-ELL students. The difference was then divided by the mean for ELL
students, and the result was multiplied by 100. Table 5 shows the DI by student
ELL status, as well as by school lunch program and parent education, for Grades 2
and 7, for Site 2, in four content areas.³ Similar results were obtained for other
grades (see Abedi & Leon, 1999).
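The DI computation described above can be sketched as a one-line function. The function name `disparity_index` is hypothetical, and the inputs below are the rounded Grade 2 NCE means from Table 3, used purely for illustration. The resulting values need not match the DIs in Table 5 exactly, since those were presumably computed from unrounded means.

```python
def disparity_index(reference_mean, comparison_mean):
    """Percentage by which the reference group's mean exceeds the
    comparison group's mean: (reference - comparison) / comparison * 100."""
    return (reference_mean - comparison_mean) / comparison_mean * 100


# Rounded Grade 2 means from Table 3 (non-ELL as reference, ELL as comparison).
di_reading = disparity_index(49.3, 31.6)
di_math = disparity_index(50.4, 37.7)

print(f"reading DI = {di_reading:.1f}, math DI = {di_math:.1f}")
```

Dividing by the comparison group's mean makes the index a relative gap, so the same raw point difference yields a larger DI when the comparison group's mean is lower.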
As the data in Table 5 show, the average DI for ELL status over reading, math,
language, and spelling for Grade 2 was 48.1 (i.e., over all four subject areas,
non-ELL students outperformed ELL students by 48.1%). For Grade 7, the DI was
74.8. We also computed DI by school lunch program and parent education. The DI
for school lunch program for Grade 2 students was 29.6. That is, students who did
not participate in the school lunch program outperformed students who
participated in the program by 29.6%. For Grade 7, the DI was 35.2. We also compared
the performance of students with the lowest level of parent education with that of
students with the highest level of parent education. The DI for parent education for
Grade 2 was 99.3; that is, children of parents with the highest level of education
(post graduate education) outperformed children of parents with the lowest level of
education (no education or elementary level education) by 99.3%. The DI for
Grade 7 by parent education was 83.5.
By comparing the math DI with the DIs of the language-related subscales
(reading, language, and spelling), we can see the impact of language on students'
performance. The DIs for all categories (ELL status, school lunch, and parent education)
were smaller for math and larger for reading. For example, for Grade 2 students,
the DI (non-ELL vs. ELL) was 55.8 in reading (non-ELL students outperformed
ELL students by 55.8%), 60.2 in language, and 42.8 in spelling, as compared with
a DI of 33.5 in math. For Grade 7 students, the DIs (non-ELL vs. ELL) were 96.9
for reading, 70.7 for language, and 81.1 for spelling, compared to 50.4 for math.
The DIs for school lunch program (nonparticipant vs. participant in free/reduced
price lunch) for Grade 2 students were 32.7 for reading, 35.2 for language, and
25.3 for spelling, as compared with 25.1 for math.
STANDARDIZED ACHIEVEMENT TESTS FOR ELLS 241
³We have presented the results for Grade 7 rather than Grade 9 to cover a larger range of students in
different grades.
However, the difference between DIs for math and language-related subscales
was largest across the ELL categories. In Table 5, we included these DI differences
under the Difference column. The DI difference by ELL status⁴ was 19.4 for Grade
2 and 32.5 for Grade 7, as compared with the school lunch program DI differences of
6.0 and 7.7, respectively, and the parent-education DI differences of 15.8 and 9.8,
respectively. Once again, these data suggest that language factors may have a more
profound impact on the assessment outcome than other background variables, such
as family income and parent education, particularly for ELL students.
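The Difference column of Table 5 can be reproduced for the ELL rows with the short sketch below. The function name `di_difference` is hypothetical. The direction of the subtraction (language-related average minus math DI) is an assumption, chosen because it yields the positive values printed in the table.

```python
def di_difference(reading, language, spelling, math):
    """Average of the three language-related DIs minus the math DI.

    Assumption: the sign convention is inferred from the positive
    values in the Difference column of Table 5.
    """
    language_related_average = (reading + language + spelling) / 3
    return language_related_average - math


# ELL/non-ELL rows of Table 5.
grade2_ell = di_difference(reading=55.8, language=60.2, spelling=42.8, math=33.5)
grade7_ell = di_difference(reading=96.9, language=70.7, spelling=81.1, math=50.4)

print(f"Grade 2: {grade2_ell:.1f}, Grade 7: {grade7_ell:.1f}")
```

A larger value means the gap on language-heavy subscales exceeds the gap on math by a wider margin.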
To shed light on the impact of language factors on assessment, analyses by math
subscales were conducted and will be presented. Standardized achievement tests
such as the Stanford 9 and ITBS include in their tests different math subscales that
have varying degrees of language demand. These subscales range from testing
math analytical skills, concepts and estimation, and problem solving with a rela-
tively higher level of language demand to testing math computation with a mini-
mal level of language demand. If the hypothesis concerning the impact of language
on content-based performance is tenable, then the performance difference between
ELL and non-ELL students should be at the minimum level in content-based tests
with a minimal level of language demand, such as math computation. This was ex-
actly what the results of our analyses showed.
TABLE 5
Site 2 Grades 2 and 7 Disparity Indexes (DI) by ELL Status,
Free/Reduced Price Lunch, and Parent Education
DI Reading Math Language Spelling Average Difference
Grade 2
ELL/Non-ELL 55.8 33.5 60.2 42.8 48.1 19.4
Free/reduced lunch 32.7 25.1 35.2 25.3 29.6 6.0
Parent education 106.3 84.9 118.5 87.5 99.3 15.8
Grade 7
ELL/Non-ELL 96.9 50.4 70.7 81.1 74.8 32.5
Free/reduced lunch 47.2 29.5 32.9 31.1 35.2 7.7
Parent education 98.4 76.2 79.0 80.5 83.5 9.8
Note. ELL = English language learner.
⁴This DI difference was computed as follows: The three language-related DIs (reading, language,
and spelling) were averaged. The DI for math was then subtracted from this average.
Data obtained from some of the sites in this study included different subscale
scores including math computation. Table 6 presents the DIs for bilingual students
compared with nonbilingual⁵ students by level and grade for math concepts and
estimation, math problem solving, math computation, and reading in Site 1.
The results of the DI analyses shown in Table 6 present several interesting patterns:
1. The DIs indicated that the nonbilingual students generally outperformed the
bilingual students. However, the magnitude of the DIs depends, to a great extent,
on the level of language demand of the test items. The DI for test items with less
language demand was smaller than for other items. For example, in Grade 3,
bilingual students performed relatively better on math computation (DI = 3.1), which
has the lowest level of language demand, than on the other subscales.
2. Major differences between bilingual and nonbilingual students were found
for students in Grades 3 and above. There seemed to be a positive relationship be-
tween the mean score differences and grade level, in that the difference increased
as the grade level increased, up to Grade 5. Starting with Grade 6, the DI was still
positive, but the rate of increase was not as systematic as before. For example, in
Grade 3, nonbilingual over bilingual students had DIs of 5.3 in math concepts and
estimation, 11.1 in math problem solving and data interpretation, 3.1 in math
computation, and 23.4 in reading. In Grade 4, these indexes increased to 26.9 for
math concepts and estimation, 19.3 for math problem solving and data interpreta-
tion, 6.9 for math computation, and 30.1 for reading. The indexes further increased
in Grade 5 to 36.5 for math concepts and estimation, 32.7 for math problem solv-
ing and data interpretation, 12.6 for math computation, and 41.1 for reading.
⁵This site did not provide information on students' ELL status. Instead, we used students' bilingual
status as a proxy for ELL status.
TABLE 6
Site 1 Disparity Indexes of Nonbilingual Over Bilingual Students on Math and Reading

Test    Primary   Math Concepts    Math Problem Solving      Math
Level   Grade     and Estimation   and Data Interpretation   Computation   Reading
9       3         5.3              11.1                      3.1           23.4
10      4         26.9             19.3                      6.9           30.1
11      5         36.5             32.7                      12.6          41.1
12      6         27.5             30.9                      11.8          43.7
13      7         39.4             32.7                      12.9          39.6
14      8         30.5             31.7                      12.9          42.7
Average of all
levels/grades     27.7             26.4                      9.0           36.8
3. The largest gap between bilingual and nonbilingual students was in reading.
The next largest gaps were in the content areas that appear to have more language
demand. For example, the math concepts and estimation and the math problem
solving and data interpretation subsections seem to have more language demand
than the math computation subsection. Correspondingly, the DIs were higher for
those subsections. The average DI for Grades 3 through 8 was 27.7 for math con-
cepts and estimation. That is, the mean of the nonbilingual group in math concepts
and estimation was 27.7% higher than the bilingual group mean. A similar trend
was observed in math problem solving and data interpretation; the average DI for
this subsection was 26.4. The average DI for math computation, however, was 9.0,
which was substantially lower than the corresponding DIs for the other two math
subsections. These results were consistent across the different data sites.
Table 7 reports the DIs, non-ELL versus ELL students, for reading, math total,
and the math calculation and math analytical subscales for Grades 3, 6, and 8 at
Site 4. Once again, the results of analyses clearly suggest the impact of language
factors on students' performance, particularly in areas with more language
demand. For example, in reading, ELL students had the largest performance gap with
non-ELL students. The average DI for reading across the three grades was 86.7, as
compared with the average performance gap of 33.4 for math total. Among the
math subscale scores, those with less language demand showed a smaller
performance gap. The average DI was 41.0 for math analytical and 20.1 for math
calculation. The math calculation DI was substantially less than the DI for reading (86.7)
and for math analytical (41.0). However, it should be noted that language demand
and cognitive complexity of test items may also be confounded.
That is, items in the math calculation subscale may not only have less language
demand, but they may also be less cognitively demanding than other math subscales,
such as math problem solving. This is a caveat in our discussion on the impact of
language on content-based assessments.
TABLE 7
Site 4 Disparity Indexes of Non-ELL Versus ELL Students in Reading
and Subscales of Math
Disparity Index
Grade Reading Math Total Math Calculation Math Analytical
3 53.4 25.8 12.9 32.8
6 81.6 37.6 22.2 46.1
8 125.2 36.9 25.2 44.0
Average over the three grades 86.7 33.4 20.1 41.0
Note. ELL = English language learner.
Possible Impact of Language Factors
on Reliability of Assessments
In classical test theory, reliability is defined as the ratio of the true-score variance
($\sigma_T^2$) to observed-score variance ($\sigma_X^2$) (Allen & Yen, 1979). This
observed-score variance ($\sigma_X^2$) is the sum of two components, the true-score
variance ($\sigma_T^2$) and the error variance ($\sigma_E^2$). In a perfectly reliable
test, the error variance ($\sigma_E^2$) would be zero; therefore, the true-score
variance ($\sigma_T^2$) would be equal to the observed-score variance.
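Written out, the definitions in the preceding paragraph amount to the identities below. The symbol $\rho_{XX'}$ for the reliability coefficient is conventional notation assumed here; it does not appear in the text.

```latex
\[
\sigma_X^2 = \sigma_T^2 + \sigma_E^2,
\qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
           = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}.
\]
```

Setting $\sigma_E^2 = 0$ gives $\rho_{XX'} = 1$, the perfectly reliable case, and any increase in $\sigma_E^2$ with $\sigma_T^2$ held fixed lowers the ratio.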
However, in measurement with human participants there is always an error
component, whether large or small, which is referred to in classical test theory as
the measurement error (see Allen & Yen, 1979; Linn & Gronlund, 1995; Salvia &
Ysseldyke, 1998). Appropriate evaluation of the measurement error is important in
any type of assessment, whether in a traditional, multiple-choice approach or in
performance-based assessments (Linn, 1995; see also AERA, APA, & NCME,
1999). Many different sources (e.g., occasion, task, test administration conditions)
may contribute to measurement error in traditional, closed-ended assessment in-
struments. In addition to these sources, the reliability of performance assessment
measures suffers from yet another source of measurement error, variation in scor-
ing of open-ended items. More important, in the assessment of ELL students, lan-
guage factors may be another serious source of measurement error, due to unneces-
sary linguistic complexity in content-based areas. In the classical approach to
estimating reliability of assessment tools, the level of contribution of different
sources to measurement error may be indeterminable. Through the generaliz-
ability approach, one would be able to determine the extent of the variance each
individual source contributes (such as occasion, tasks, items, scorer, and language
factors) to the overall measurement error (see Cronbach, Gleser, Nanda, &
Rajaratnam, 1972; Shavelson & Webb, 1991).
To estimate reliability of the standardized achievement tests used in this study
and to investigate their measurement error, we considered different approaches.
Since parallel forms or test-retest data were not available, we decided to use an
internal consistency approach. The main limitation with the internal consistency
approach, however, is the assumption of unidimensionality. For example, the
literature has indicated that the alpha coefficient, which is a measure of internal
consistency, is extremely sensitive to multidimensionality of test items (see, e.g.,
Abedi, 1996; Cortina, 1993). However, because the test items within each
content area are assumed to measure the same construct, we believe this approach
may be appropriate for estimating reliability of the achievement tests used in this
study.
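For readers unfamiliar with the alpha coefficient, the sketch below shows the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores), applied to a small made-up response matrix. The data, the function name `cronbach_alpha`, and the use of sample variances are illustrative assumptions; the article does not describe the software or exact computation used.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows, each holding k item scores."""
    k = len(item_scores[0])
    # Transpose rows (examinees) into columns (items).
    columns = list(zip(*item_scores))
    item_variances = [variance(column) for column in columns]
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical right/wrong (1/0) responses: 5 examinees, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha_example = cronbach_alpha(responses)
print(f"alpha = {alpha_example:.2f}")
```

In the study, the same kind of coefficient would be computed separately for each subgroup (for example, ELL and non-ELL students) so that the resulting values can be compared.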
Because different data sites used different tests, and because within the
individual sites, different test forms were used in different grades, these analyses were
performed separately for each site and each grade. Within each grade, we
conducted the internal consistency analyses separately for ELL and non-ELL
students. The results obtained from analyses at different sites were consistent. Due to
space limitations, only the results from Site 2, the site with the largest number of
students, are presented. A complete report of the results of analyses can be found in
Abedi et al. (2001).
Language (and perhaps other variables, such as socioeconomic status and op-
portunity to learn) may cause a restriction of range in the score distribution that
may result in lower internal consistency.
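The link between restriction of range and lower reliability follows from the classical definition of reliability as true-score variance over observed-score variance. The sketch below uses made-up variances, not values from the study, and a hypothetical function name, to show that shrinking the spread of true scores while holding error variance fixed lowers the ratio.

```python
def reliability(true_variance, error_variance):
    """Classical reliability: true-score variance / observed-score variance."""
    return true_variance / (true_variance + error_variance)


# Hypothetical variances: identical error variance in both groups,
# but a narrower spread of true scores in the second group.
wide_range = reliability(true_variance=9.0, error_variance=1.0)
narrow_range = reliability(true_variance=4.0, error_variance=1.0)

print(f"wide range: {wide_range:.2f}, restricted range: {narrow_range:.2f}")
```

This is only a stylized illustration; in real data the error variance may also differ between groups.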
Table 8 presents reliability (internal consistency) coefficients for the Stanford 9
data for Grade 2 students in Site 2. As the data in Table 8 show, non-ELL students
had higher coefficients than the ELL students. There was also a slight difference
between the alpha coefficients across the free/reduced price lunch categories.
Nonparticipants in the free/reduced price lunch program had slightly higher alphas
than the participating students. For example, the average reliability for the reading
subscale for the nonparticipant group was .913, as compared with an average
reliability of .893 for the participant group (a difference of .021), and for ELL students
the average reliability was .856, as compared with an average reliability of .914 for
non-ELL students, a difference of .058 (non-ELLs refers to English only). The
results of our analyses, which are consistent across the different sites, indicate that
the difference in internal consistency coefficients between ELL and non-ELL
students is significantly larger than the difference between these coefficients across
the free/reduced price lunch and parent education categories.
Table 9 presents the reliability (internal consistency) coefficients for Grade 9
students. Comparing the internal consistency coefficients for Grade 9 students
with those for Grade 2 students (reported in Table 8) once again revealed that
reliability coefficients for ELL students were lower than the coefficients for
non-ELL students. This was particularly true for students in higher grades,
where language has more impact on performance. In both Grade 2 and Grade 9,
reliabilities were lower for ELL students. However, in Grade 9, the difference
between reliability coefficients for ELL and non-ELL students was larger. For
example, for Grade 2, the difference between reliability coefficients for ELL and
non-ELL students was .058 in reading, .013 in math, and .062 in language, as
compared with the ELL/non-ELL reliability difference of .109 for reading, .096
for math, and .120 for language in Grade 9. The difference between the overall
reliability coefficient of ELL students and English-only students for Grade 9 was
.167, which was substantially higher than the respective difference of .043 in
Grade 2. Thus, the reliability gap between ELL and non-ELL students widens
as grade level increases. This may be due to the use of more complex
language structures in higher grades.
The results of these analyses strongly suggest that students' language
background factors have a profound effect on their assessment outcomes, above and
beyond other background characteristics such as family income and parent education.
TABLE 8
Site 2 Grade 2 Stanford 9 Subscale Reliabilities
Non-ELL Students: Free Lunch
Participation
Subscale (No. of Items) Yes No English Only FEP RFEP ELL
Reading N = 209,262 N = 58,485 N = 34,505 N = 29,771 N = 3,471 N = 101,399
Word study (48) .917 .895 .916 .915 .920 .865
Vocabulary (30) .913 .897 .915 .906 .907 .857
Reading comp. (30) .908 .888 .910 .900 .899 .846
Average reliability .913 .893 .914 .907 .909 .856
Math N = 220,971 N = 63,146 N = 249,000 N = 31,444 N = 3,673 N = 118,740
Problem solving (45) .893 .881 .896 .886 .890 .871
Procedures (28) .892 .892 .891 .887 .895 .890
Average reliability .893 .887 .894 .887 .893 .881
Language N = 218,003 N = 62,028 N = 245,384 N = 31,035 N = 3,612 N = 111,752
Total (44) .890 .866 .891 .883 .892 .829
Note. ELL = English language learner; FEP = fluent English proficient; RFEP = redesignated fluent English proficient.
Validity
Research has indicated that complex language in content-based assessments for
nonnative speakers of English may reduce the validity and reliability of
inferences drawn about students' content-based knowledge. For example, results
from earlier CRESST language background studies (Abedi & Lord, 2001;
Abedi, Lord, & Hofstetter, 1998; Abedi et al., 2000; Abedi, Lord, & Plummer,
1997) provided support for a strong link between language factors and con-
tent-based performance. The linguistic factors in content-based assessments
(such as math and science) may be considered a source of construct-irrelevant
variance because they are not conceptually related to the content being assessed
(Messick, 1994):
With respect to distortion of task performance, some aspects of the task may require
skills or other attributes having nothing to do with the focal constructs in question, so
that deficiencies in the construct-irrelevant skills might prevent some students from
demonstrating the focal competencies. (p. 14)
To examine the impact of students' language background on the validity of
standardized achievement tests, analyses were performed to compare criterion validity
coefficients for ELL and non-ELL students and to examine differences between
the structural relationship of ELL and non-ELL groups.
Linguistic complexity of test items, as a possible source of construct-irrelevant
variance, may be a threat to the validity of achievement tests, because it could be a
source of measurement error in estimating the reliability of the tests.
Intercorrelation between individual test items, the correlation between items and total
test score (the internal validity coefficient), and the correlation between item score
and total test score with the external criteria (the students' other achievement data)
were computed. A significant difference across the ELL categories in the
relationships between test items, between individual items and total test scores (internal
validity), and between overall test scores and external criteria may be indicative of
the impact of language on the validity of tests. Since language factors should not
influence the performance of non-ELL students, these relationships may be
stronger for non-ELL students.
To examine the hypothesis regarding differences between ELL and non-ELL
students on the structural relationship of the test items, a series of structural equa-
tion models were created for Site 2 and Site 3 data. Fit indexes were compared
across ELL and non-ELL groups. The results generally indicated that the relation-
ships between individual items, items with the total test score, and items with the
external criteria were higher for non-ELL students than for ELL students.
In creating the structural models, test items in each content area (e.g.,
reading, science, and math) were grouped as parcels. Figure 1 presents item
parcels and latent variables for reading, math, and science for Site 2. As Figure 1
shows, the 54 reading items were grouped into four parcels. Each parcel was
constructed to systematically contain items with three degrees of item difficulty:
easy, moderately difficult, and difficult (for a description of the item
parcels and ways to create them, see Cattell & Burdsal, 1975). A reading latent
variable was constructed based on these four parcels.
Similarly, item parcels and latent variables for math and science were created
from the 48 math items and 40 science items by the same process. The correlations
between the reading, math, and science latent variables were estimated. Models
were tested on randomly selected subsamples to demonstrate the cross-validation
of the results.
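One way to build parcels that each span the range of item difficulty is to sort the items by difficulty and deal them out in rotation. The sketch below illustrates that idea only: the article does not state the exact assignment rule, so the round-robin scheme, the function name `build_parcels`, and the difficulty values are all hypothetical.

```python
def build_parcels(difficulties, n_parcels=4):
    """Deal items, ordered by difficulty, round-robin into n_parcels parcels.

    Each parcel ends up with a mix of easy, moderate, and difficult items.
    Returns a list of parcels, each a list of item indexes.
    """
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    parcels = [[] for _ in range(n_parcels)]
    for rank, item in enumerate(order):
        parcels[rank % n_parcels].append(item)
    return parcels


def parcel_scores(item_responses, parcels):
    """Sum one examinee's item scores within each parcel."""
    return [sum(item_responses[i] for i in parcel) for parcel in parcels]


# Hypothetical difficulty values for 54 reading items.
difficulties = [0.20 + 0.01 * i for i in range(54)]
parcels = build_parcels(difficulties)
parcel_sizes = [len(parcel) for parcel in parcels]

# One hypothetical examinee who answered every item correctly.
scores = parcel_scores([1] * 54, parcels)
print(parcel_sizes, scores)
```

The parcel scores, rather than the individual item scores, would then serve as the observed indicators of each latent variable.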
Table 10 shows the results of the structural models for Grade 9 at Site 2.
Correlations of item parcels with the latent factors were consistently lower for ELL
students than they were for non-ELL students. This finding was true for all parcels
regardless of which grade or which sample of the population was tested. For
example, for Grade 9 ELL students, the correlations for the four reading parcels
ranged from a low of .719 to a high of .779 across the two samples (see Table 10).
In comparison, for non-ELL students, the correlations for the four reading parcels
ranged from a low of .832 to a high of .858 across the two samples. The item parcel
correlations were also larger for non-ELL students than for ELL students in math
and science. Again, these results were consistent across the different samples.
FIGURE 1 Latent variable model for reading, science, and math.
The correlations between the latent factors were also larger for non-ELL
students than they were for ELL students. This gap in latent factor correlations
between non-ELL and ELL students was especially large when there was more
language demand. For example, in Sample 1 for Grade 9, the correlation between
latent factors for math and reading for non-ELL students was .782 compared to just
.645 for ELL students. When comparing the latent factor correlations between
reading and science from the same population, the correlation was still larger for
non-ELL students (.837) than for ELL students (.806), but the gap between the
correlations was smaller. This was likely due to language demand differences.
Multiple group structural models were used to test whether the differences between
TABLE 10
Site 2 Grade 9 Stanford 9 Reading, Math,
and Science Structural Modeling Results (df = 51)
Non-ELL (N = 22,782) ELL (N = 4,872)
Sample 1 Sample 2 Sample 1 Sample 2
Factor loadings
Reading comprehension
Parcel 1 .852 .853 .723 .719
Parcel 2 .841 .844 .734 .739
Parcel 3 .835 .832 .766 .779
Parcel 4 .858 .858 .763 .760
Math factor
Parcel 1 .818 .821 .704 .699
Parcel 2 .862 .860 .770 .789
Parcel 3 .843 .843 .713 .733
Parcel 4 .797 .796 .657 .674
Science factor
Parcel 1 .678 .681 .468 .477
Parcel 2 .679 .676 .534 .531
Parcel 3 .739 .733 .544 .532
Parcel 4 .734 .736 .617 .614
Factor correlation
Reading vs. Math .782 .779 .645 .674
Reading vs. Science .837 .839 .806 .802
Science vs. Math .870 .864 .796 .789
Goodness of fit
Chi-square 488 446 152 158
NFI .997 .998 .992 .992
NNFI .997 .997 .993 .993
CFI .998 .998 .995 .995
Note. There was significant invariance for all constraints tested with the multiple group model
(Non-ELL/ELL). ELL = English language learner; NFI = Normed Fit Index; NNFI = Nonnormed Fit
Index; CFI = Comparative Fit Index.
non-ELL and ELL students mentioned previously were significant. There were
significant differences for all constraints tested at the p < .05 level.
The results of simple structure confirmatory factor analyses also showed differ-
ences on factor loadings and factor correlations between the ELL and non-ELL
groups for the Site 3 data. The hypotheses of invariance of factor loadings and fac-
tor correlations between the ELL and non-ELL groups were tested. Specifically,
we tested the following null hypotheses:
Correlations between parcel scores and a reading latent variable are the same
for the ELL and non-ELL groups.
Correlations between parcel scores and a science latent variable are the same
for the ELL and non-ELL groups.
Correlations between parcel scores and a math latent variable are the same
for the ELL and non-ELL groups.
Correlations between content-based latent variables are the same for the ELL
and non-ELL groups.
Table 11 summarizes the results of structural models for reading and math tests
for Site 3 students in Grade 10. Table 11 includes fit indexes for the ELL and
non-ELL groups, correlations between parcel scores and content-based latent vari-
ables (factor loadings), and correlations between latent variables. Hypotheses re-
garding the invariance of factor loadings and factor correlations between ELL and
non-ELL groups were tested. Significant differences between the ELL and
non-ELL groups at or below .05 nominal levels were identified. These differences
are indicated by an asterisk next to each of the constraints. There were several sig-
nificant differences between the ELL and non-ELL groups on the correlations be-
tween parcel scores and latent variables. For example, on the math subscale, differ-
ences in factor loadings between the ELL and non-ELL groups on Parcels 2 and 3
were significant. Table 11 also shows a significant difference between the ELL and
non-ELL groups on the correlation between reading and math latent variables.
These results indicate that:
1. Findings from the two cross-validation samples are very similar and pro-
vide evidence on the consistency of the results.
2. Structural models show a better fit for non-ELL than for ELL students.
3. Correlations between parcel scores and the content-based latent variables
are generally lower for ELL students.
4. Correlations between the content-based latent variables are lower for ELL
students.
The results suggest that language factors may be a source of construct-irrele-
vant variance in the assessment of ELL students.
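As a rough illustration of how two group correlations can be compared, the sketch below applies a Fisher r-to-z test to the first pair of Reading vs. Math correlations reported in Table 11 (.719 for non-ELL, N = 8,947; .624 for ELL, N = 303). This is not the procedure used in this study, which tested equality constraints within the structural models. The Fisher test assumes ordinary sample correlations from independent groups, so applying it to latent-variable correlations is only an approximation. The function name is hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided z test for the difference between two independent correlations.

    Illustrative stand-in only; the study tested invariance constraints
    inside multi-group structural models, not with this test.
    """
    # Fisher r-to-z transform of each correlation
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Standard error of the difference between the transformed values
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p value under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Reading vs. Math correlations from Table 11 (first pair of columns)
z, p = fisher_z_difference(0.719, 8947, 0.624, 303)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With the ELL group much smaller than the non-ELL group, the standard error is driven almost entirely by the ELL sample size.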
DISCUSSION
The purpose of this study was to examine the impact of students' language background
on the outcome of their assessments. Three major research questions
guided the analyses and reporting and will be the basis for discussion of the results
of this study:
1. Could the performance difference between ELL and non-ELL students be
partly explained by language factors in the assessment?
2. Could the linguistic complexity of test items as a possible source of mea-
surement error influence the reliability of the assessment?
3. Could the linguistic complexity of test items as a possible source of con-
struct-irrelevant variance influence the validity of the assessment?
TABLE 11
Site 3 Grade 10 Stanford 9 Reading and Math Structural Modeling Results
(Parcels Ordered by Item Number)

                          Model 1 (df = 75)          Model 2 (df = 74)
Goodness of Fit
  Chi-square                    2938                       2019
  NFI                           .916                       .943
  NNFI                          .902                       .933
  CFI                           .918                       .945

                          Non-ELL       ELL          Non-ELL       ELL
Factor Loadings         (N = 8,947)  (N = 303)     (N = 8,947)  (N = 303)
Reading
  Parcel 1                 .677        .683           .679        .685
  Parcel 2                 .683        .612           .684        .613
  Parcel 3                 .738        .695           .739        .696
  Parcel 4                 .826        .816           .824        .812
  Parcel 5                 .693        .723           .690        .720
Math
  Parcel 1                 .735        .763           .752        .788
  Parcel 2                 .659        .702*          .667        .716*
  Parcel 3                 .623        .730*          .592        .685*
  Parcel 4                 .724        .774           .722        .774
  Parcel 5                 .389        .471           .330        .391
Factor correlation
  Reading vs. Math         .719        .624*          .723        .622*

Note. NFI = Normed Fit Index; NNFI = Nonnormed Fit Index; CFI = Comparative Fit Index; ELL
= English language learner.
*Significant at or below the .05 level.
In response to Question 1, results from the analyses of data from several locations
nationwide indicated that students' assessment results might be confounded
with language background variables. Descriptive statistics comparing ELL and
non-ELL student performance by subgroup and across different content areas re-
vealed major differences between the performance of the two groups. Included in
the descriptive statistics section was a DI (the disparity of performance of
non-ELL students over that of ELL students). This index showed major differ-
ences in performance between students with different language backgrounds. The
higher the level of English language complexity in the assessment tool, the greater
the DI (the performance gap between ELL and non-ELL students).
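To make the DI concrete, the sketch below computes it for the Grade 9 reading and math means reported in Table 3. The formula is an assumption here: the DI is taken to be the percentage by which the non-ELL mean exceeds the ELL mean, which is one reading of "the disparity of performance of non-ELL students over that of ELL students." The function name is hypothetical.

```python
def disparity_index(non_ell_mean, ell_mean):
    """Percentage by which the non-ELL mean exceeds the ELL mean.

    Assumed formula: 100 * (non-ELL mean - ELL mean) / ELL mean.
    """
    return 100.0 * (non_ell_mean - ell_mean) / ell_mean


# Grade 9 means from Table 3 (Site 2): ELL vs. non-ELL
di_reading = disparity_index(46.0, 24.0)
di_math = disparity_index(53.5, 38.1)

print(f"Reading DI: {di_reading:.1f}")
print(f"Math DI:    {di_math:.1f}")
```

Under this definition the reading DI is more than twice the math DI, which matches the pattern described above: the gap widens as the language demand of the subtest increases.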
Access to student-level and item-level data from the sites provided an opportunity
to conduct analyses on student subgroups that were formed based on their background
variables, including language background. The exceptionally large numbers
of students in some subgroups enabled us to conduct cross-validation studies to
demonstrate consistency of results over different sites and grade levels. The high degree
of consistency assured us of the validity and interpretability of the results.
Descriptive analyses revealed that ELL students generally perform lower than
non-ELL students on reading, science, and math subtests. The level of impact of
language proficiency on the assessment of ELL students is greater in content areas
with a higher level of language demand, a strong indication of the impact of English
language proficiency on assessment. For example, analyses show that ELL
and non-ELL students had the greatest performance differences in reading, and the
least performance differences in math, where language has less of an impact on the
assessment.
In response to Question 2, the results of our analyses indicated that test items for
ELL students, particularly ELL students at the lower end of the English profi-
ciency spectrum, suffer from lower internal consistency. That is, the language
background of students may add another dimension to the assessment in con-
tent-based areas. Thus, we speculate that language might act as a source of mea-
surement error in such areas. It is therefore imperative that test publishers examine
the impact of language factors on test reliability and publish reliability indexes
separately for the ELL subpopulation.
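Reporting reliability separately by subgroup amounts to computing an internal consistency coefficient within each group rather than once for the pooled sample. The sketch below does this with coefficient alpha. The item responses are made-up toy values, not data from this study, and the function and group names are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of students, each a list of item scores."""
    k = len(item_scores[0])
    # Variance of each item across students
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the total score across students
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)


# Toy 0/1 item responses for two hypothetical subgroups (not study data)
responses_by_group = {
    "group_a": [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0],
                [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1]],
    "group_b": [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 0, 1],
                [1, 1, 0, 0], [0, 0, 0, 0], [0, 1, 1, 1]],
}

for group, rows in responses_by_group.items():
    print(f"{group}: alpha = {cronbach_alpha(rows):.3f}")
```

A pooled coefficient can mask a low value in a small subgroup, which is why the separate computation matters when one group is much larger than the other.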
To shed light on the issues concerning the impact of language factors on validity
(Question 3), concurrent validity of standardized achievement tests (Stanford 9 and
ITBS) was examined using a latent-variable modeling approach. Standardized
achievement latent variables were correlated with the external-criterion latent vari-
ables. The results suggest that (a) there is a strong correlation between the standard-
ized achievement and external-criterion latent variables, (b) this relationship is
stronger when latent variables rather than measured variables are used, and (c) the
correlation between standardized achievement and external-criterion latent variables
is significantly larger for the non-ELL population than for the ELL population.
We speculate that the low correlation between the two latent variables for the ELL group
stems from language factors. That is, language factors act as construct-irrelevant
sources (Messick, 1994).
Analyses of the structural relationships between individual items and between
items with the total test scores revealed a major difference between ELL and
non-ELL students. Structural models for ELL students demonstrated lower sta-
tistical fit. Further, the factor loadings were generally lower for ELL students,
and the correlations between the latent content-based variables were weaker for
ELL students.
The results of this study suggest that ELL test performance may be explained
partly by language factors. That is, linguistic complexity of test items unrelated to
the content being assessed may at least be partly responsible for the performance
gap between ELL and non-ELL students. Based on the findings of this study, we
recommend that (a) the issues concerning the impact of language factors on the assessment
of ELL students should be examined further; (b) psychometric characteristics
of assessment tools should be carefully reviewed for use with ELL students;
and (c) in assessing ELL students, student language background variables should
always be included, and efforts should be made to reduce confounding effects of
language background on the assessment outcome.
ACKNOWLEDGMENTS
This research was supported in part by the Office of Bilingual Education and Mi-
nority Languages Affairs under Contract R305B960002 as administered by the
U.S. Department of Education. The findings and opinions expressed in this report
do not reflect the position or policies of the Office of Bilingual Education and Mi-
nority Languages Affairs or the U.S. Department of Education.
I acknowledge the valuable contribution of colleagues in preparation of this ar-
ticle. Seth Leon and Jim Mirocha provided assistance with the data analyses.
Kathryn Morrison provided technical assistance in preparation of this article. Joan
Herman and Mary Courtney contributed to this article with their helpful comments
and suggestions. I am grateful to Eva Baker and Joan Herman for their support of
this work.
REFERENCES
Abedi, J. (1996). The interrater/test reliability system (ITRS). Multivariate Behavioral Research, 31,
409–417.
Abedi, J., & Leon, S. (1999). Impact of students' language background on content-based performance:
Analyses of extant data. Los Angeles: University of California, National Center for Research on
Evaluation, Standards, and Student Testing.
Abedi, J., Leon, S., & Mirocha, J. (2001). Examining ELL and non-ELL student performance differ-
ences and their relationship to background factors: Continued analyses of extant data. Los Angeles:
University of California, National Center for Research on Evaluation, Standards, and Student
Testing.
Abedi, J., & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in
Education, 14, 219–234.
Abedi, J., Lord, C., & Hofstetter, C. (1998). Impact of selected background variables on students'
NAEP math performance. Los Angeles: University of California, National Center for Research on
Evaluation, Standards, and Student Testing.
Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strategies on English
language learners' test performance. Educational Measurement: Issues and Practice, 19(3),
16–26.
Abedi, J., Lord, C., & Plummer, J. R. (1997). Final report of language background as a variable in
NAEP mathematics performance (CSE Tech. Rep. No. 429). Los Angeles: University of Califor-
nia, National Center for Research on Evaluation, Standards, and Student Testing.
Allen, M. J., & Yen, W. M. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
American Educational Research Association, American Psychological Association, & National Coun-
cil on Measurement in Education. (1999). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
Benjamini, Y., & Hochberg, Y. (1994). Controlling the false discovery rate: A practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Cattell, B. R., & Burdsal, A. C. (1975). The radial parcel double factoring design: A solution to the
item-vs.-parcel controversy. Multivariate Behavioral Research, 10, 165–179.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of
Applied Psychology, 78, 98–104.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral
measurements: Theory of generalizability of scores and profiles. New York: Wiley.
Duran, R. P. (1989). Assessment and instruction of at-risk Hispanic students. Exceptional Children, 56,
154–158.
Garcia, G. E. (1991). Factors influencing the English reading test performance of Spanish-speaking
Hispanic children. Reading Research Quarterly, 26, 371–391.
Hakuta, K., & Beatty, A. (Eds.). (2000). Testing English language learners in U.S. schools. Washing-
ton, DC: National Academy Press.
Linn, R. L. (1995). Assessment-based reform: Challenges to educational measurement. Princeton, NJ:
Educational Testing Service.
Linn, R. L., & Gronlund, N. E. (1995). Measurement and assessment in teaching (7th ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Mazzeo, J., Carlson, J. E., Voelkl, K. E., & Lutkus, A. D. (2000). Increasing the participation of special
needs students in NAEP: A report on 1996 NAEP research activities. Washington, DC: National Center
for Education Statistics.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance as-
sessments. Educational Researcher, 23(2), 13–23.
Mestre, J. P. (1988). The role of language comprehension in mathematics and problem solving. In R. R.
Cocking & J. P. Mestre (Eds.), Linguistic and cultural influences on learning mathematics (pp.
201–220). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
National Clearinghouse for English Language Acquisition and Language Instruction Educational Programs.
(2002). Survey of the states' limited English proficient students and available educational
programs and services. Washington, DC: Author.
Navarrette, C., & Gustke, C. (1996). A guide to performance assessment for linguistically diverse stu-
dents. Albuquerque: New Mexico Highlands University.
No Child Left Behind Act of 2001, Pub. L. No. 107–110, 115 Stat. 1425 (2002).
Salvia, J., & Ysseldyke, J. E. (1998). Assessment (7th ed.). Boston: Houghton Mifflin.
Shavelson, R., & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Zehler, A. M., Hopstock, P. J., Fleischman, H. L., & Greniuk, C. (1994). An examination of assessment
of limited English proficient students (Special Issues Analysis Center Task Order D070 Report).
Arlington, VA: Development Associates.