Structural Equation Modeling: A Multidisciplinary Journal

ISSN: 1070-5511 (Print) 1532-8007 (Online) Journal homepage: http://www.tandfonline.com/loi/hsem20

Using Structural Equation Modeling to Test for
Differential Reliability and Validity: An Empirical
Demonstration

Ruth Raines-Eudy

To cite this article: Ruth Raines-Eudy (2000) Using Structural Equation Modeling to Test for
Differential Reliability and Validity: An Empirical Demonstration, Structural Equation Modeling: A
Multidisciplinary Journal, 7:1, 124-141, DOI: 10.1207/S15328007SEM0701_07

To link to this article: http://dx.doi.org/10.1207/S15328007SEM0701_07

Published online: 19 Nov 2009.

STRUCTURAL EQUATION MODELING, 7(1), 124–141
Copyright © 2000, Lawrence Erlbaum Associates, Inc.

TEACHER’S CORNER

Using Structural Equation Modeling to
Test for Differential Reliability and
Validity: An Empirical Demonstration
Ruth Raines-Eudy
Tulane University School of Social Work

Structural equation modeling (SEM) techniques provide us with excellent tools for
conducting preliminary evaluation of differential validity and reliability of measure-
ment instruments among a comprehensive selection of population groups. This article
demonstrates empirically an SEM technique for group comparison of reliability and
validity. Data are from a study of 495 mothers’ attitudes toward pregnancy. Propor-
tions of African American and White, married and unmarried, and Medicaid and
non-Medicaid mothers provided sample sizes large enough for group comparisons.
Four hypotheses are tested: that factor structures are invariant between subgroups,
that factor loadings are invariant between subgroups, that measurement error is in-
variant between subgroups, and that means of the latent variable are invariant be-
tween subgroups. Discussion of item distributions, sample size issues, and appropri-
ate estimation techniques is included.

The interrelated dilemmas of differential reliability and validity are inherent in
many fields of social research due to the differing ethnic, socioeconomic, and cultural
groups that comprise our populations of interest. Whether the goal of our research
is theory testing or practical application, as social scientists we wish to develop
measurement instruments that are simultaneously valid, reliable, and
generalizable to populations as large and inclusive as possible. However, as noted
by Blalock (1982),

Requests for reprints should be sent to Ruth Raines-Eudy, Department of Health Services Adminis-
tration, University of Arkansas at Little Rock, 2801 S. University Avenue, Little Rock, AR 72204.
E-mail: rleudy@ualr.edu

Whenever measurement comparability is in doubt, so is the issue of the
generalizability of the corresponding theory. Although one may state a theory in a sufficiently
general form that it may be applied in diverse settings, tests of this theory will
require an assessment of measurement comparability. If the theory succeeds in one
setting but fails in another, and if measurement comparability is in doubt, one will be
in the unfortunate position of not knowing whether the theory needs to be modified,
whether the reason for the differences lies in the measurement-conceptualization pro-
cess, or both. (p. 30)

What Blalock’s comment implies, and what is often not explicitly addressed in
published reports of research results, is that testing for differential validity and re-
liability among a comprehensive selection of population groups should be part of
the preliminary evaluation of instruments used for social research. Structural equa-
tion modeling (SEM) techniques provide us with excellent tools for conducting
this preliminary evaluation.
This article presents an empirical demonstration of a technique for examining
the reliability and validity of measurement instruments or scales both within and
between pertinent subpopulations. For clarity, the technique will be illustrated
with a one-factor measurement model and three sets of subpopulation compari-
sons; however it is equally applicable for use with multifactor models or with
structural models.

DEVELOPMENT OF THE MEASUREMENT MODEL

Mueller’s (1997) guidelines and Bollen’s (1989) recommendations for specification
of measurement models and their application to the current example will be reviewed
here because they are crucial preliminary steps for estimation of between-group
differences in reliability and validity.
Mueller (1997) stressed the importance of theory and understanding of the sub-
stantive area for meaningful model construction. Bollen (1989) stated that the first
steps in developing a measurement model should include a theoretical definition
that guides selection of measures, identification of the latent variable or variables,
formation of measures, and specification of their relationship to the latent variable
(p. 180). The measurement model chosen for this demonstration of testing differ-
ential reliability and validity originated from a larger, theoretically driven study in-
vestigating women’s health beliefs related to pregnancy and prenatal care. The
theoretical approach to specification of measurement models was the Health Be-
lief Model (HBM) developed by Rosenstock (1974) and his associates (Becker &
Maiman, 1983) for use with preventive health measures.
The HBM is an applied value-expectancy theory with roots in social cognitive
theory and social learning theory (Rosenstock, 1974). It is the most commonly ap-
plied theoretical approach in studies of prenatal care use, an area in which most
studies are atheoretical. In its most parsimonious form, the HBM contains two
main clusters of latent constructs: health threat and expected outcome. These con-
structs, along with exogenous demographic factors and cues to action such as me-
dia campaigns, predict the likelihood of obtaining preventive health care. The
perceived susceptibility of individuals to health problems along with the perceived
seriousness of those problems if left untreated constitute the health threat (Becker
& Maiman, 1983). The individual then undertakes a cost–benefit analysis of the
ratio of benefits of care to barriers to care, which comprises the expected outcome
(Rosenstock, 1991).
In this study the HBM was applied to women’s decisions to seek timely and ade-
quate prenatal care. Latent variables measured included maternal and infant suscep-
tibility and seriousness, two types of benefits (general and specific benefits), and
three types of barriers (internal psychological, convenience, and health system bar-
riers). The latent variable chosen for this demonstration of differential reliability and
validity was one of the barrier constructs proposed by Bluestein and Rutledge (1993) in
their conceptualization of the HBM as it applies to prenatal care seeking. They
described the concept as internal psychological barriers to care. This concept,
which incorporates the mother’s early feelings about her pregnancy, had not been
previously tested in studies using the HBM. However, fairly consistent support had
emerged in the literature for the importance of individual items related to the concept
in determining the level of care obtained (Bluestein & Rutledge, 1992; Fisher et al.,
1991; Poland, Ager, & Olson, 1987; Sable, Stockbauer, Schramm, & Land, 1990).
These items were therefore selected as measures or indicators representing the latent
variable, “Feelings about Pregnancy.” They were specified as effect indicators of the
construct, with the causal direction going from the latent variable to the indicators,
rather than the reverse (Bollen & Lennox, 1991). Figure 1 contains the originally hy-
pothesized measurement model.

RELIABILITY AND VALIDITY

Because this was a new instrument, there were no preexisting measures of reliabil-
ity. However, preliminary item–scale correlations and Cronbach’s alphas provided
evidence of acceptable levels of reliability. Overall reliability extracted, calculated
using the appropriate formula,¹ was 0.91, indicating low overall error variance
in the model. The reliability extracted measure ranges from 0 to 1, with values over
0.50 considered acceptable.

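¹ Reliability extracted (Dillon & Goldstein, 1984, p. 480), with the sums taken over the p indicators:

$$\rho_{\xi} = \frac{\left(\sum_{i=1}^{p} \lambda_{x_i}\right)^{2} \operatorname{var}(\xi)}{\left(\sum_{i=1}^{p} \lambda_{x_i}\right)^{2} \operatorname{var}(\xi) + \sum_{i=1}^{p} \operatorname{var}(\delta_i)}$$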
FIGURE 1 Hypothesized one-factor measurement model for Feelings about Pregnancy.

The survey instrument was pilot tested twice, using samples of 28 and 25
women to establish face validity and appropriateness of question wording for the
population. Content validity was established by a thorough review of the literature
for items measuring attitudes toward pregnancy. Discriminant validity was established
by estimating two-factor measurement models with Feelings about Pregnancy
as one of the factors and each of two other barrier latent variables, one for
convenience barriers and the other for health systems barriers. According to the
literature, these are all discrete constructs that are not expected to be positively
correlated, although they all present barriers to care. The models for
intercorrelation were not supported, χ²(19, N = 495) = 52.85, p = 0.02 for convenience
barriers, χ²(19, N = 495) = 34.62, p = 0.04 for health systems barriers. This
indicates that Feelings about Pregnancy is indeed not correlated with measures of
latent variables measuring different constructs within the same model (Bollen,
1989). Convergent validity was not possible to ascertain in the data set because
there were no other latent variables that were hypothesized to be highly correlated
with Feelings about Pregnancy.
Criterion validity has been difficult to establish in this area of research, as there
are no generally accepted criteria for measures of maternal health beliefs (Bates,
Fitzgerald, & Wolinsky, 1994). However, two measures were available in the data
set that provided support for the criterion validity of the construct. These were the
mothers’ response to the item “I started prenatal care late with this pregnancy” and
the actual number of visits recorded by the physician on the official American Col-
lege of Obstetricians and Gynecologists form that was available for many of the
mothers’ charts. The results of a structural model testing the relation between the
construct Feelings about Pregnancy and an endogenous latent variable
(“Poorcare”) comprising these two measures provided strong evidence for criterion
validity, χ²(7, N = 460) = 8.78, p = 0.36 (adjusted goodness of fit index
[AGFI] = 0.98; comparative fit index [CFI] = 1.00; root mean square
error of approximation [RMSEA] = 0.02; root mean square residual [RMSR] = 0.04). The squared multiple
correlation for the structural model was 0.32; the standardized gamma coefficient
for the path from Feelings to Poorcare was 0.54 (t = 4.81).
Finally, Dillon and Goldstein’s (1984) formula² for calculating the shared variance
of the indicators in a construct provided support for construct validity. The
shared variance is called the variance extracted. It varies from 0 to 1, and it represents
the proportion of the total variance that is due to the latent variable. According to
Dillon and Goldstein (1984) and Bagozzi (1991), a variance extracted of greater than 0.50
indicates that the validity of both the construct and the individual variables is high.
The shared variance for the Feelings about Pregnancy latent variable was 0.72.
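
Both summary indices can be computed directly from a set of loadings and error variances. The sketch below is illustrative only: the function names are hypothetical, the factor variance is assumed to be fixed at 1, and the four equal loadings of 0.85 are made-up standardized values rather than the article's estimates.

```python
def reliability_extracted(loadings, error_variances, factor_variance=1.0):
    """(Sum of loadings)^2 * var(xi), divided by that quantity plus the summed error variances."""
    true_part = sum(loadings) ** 2 * factor_variance
    return true_part / (true_part + sum(error_variances))


def variance_extracted(loadings, error_variances, factor_variance=1.0):
    """Sum of squared loadings * var(xi), divided by that quantity plus the summed error variances."""
    shared = sum(l ** 2 for l in loadings) * factor_variance
    return shared / (shared + sum(error_variances))


# Hypothetical standardized solution (assumption, not the article's estimates):
# four indicators, each loading 0.85, so each error variance is 1 - 0.85**2.
loadings = [0.85, 0.85, 0.85, 0.85]
errors = [1 - l ** 2 for l in loadings]

print(round(reliability_extracted(loadings, errors), 2))
print(round(variance_extracted(loadings, errors), 2))
```

With these hypothetical inputs the reliability extracted rounds to 0.91 and the variance extracted to 0.72. Because the second index is built from squared loadings, it is the more conservative of the two whenever the loadings are positive.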

DIFFERENTIAL RELIABILITY AND VALIDITY

The reliability of a measure is that part containing no purely random error
(Carmines & Zeller, 1979). In SEM terms, the reliability of an indicator is defined
as the variance in that indicator that is not accounted for by measurement error. It is
commonly represented by the squared multiple correlation coefficient, which
ranges from 0 to 1 (Bollen, 1989; Jöreskog & Sörbom, 1993a). However, because
these coefficients are standardized, they are not useful for comparing reliability
across subpopulations.
Differential validity has been defined as differing test scores for differing sub-
groups of test takers (Cole & Moss, 1989). This can be detected in SEM models by
comparison of factor loadings or unstandardized λ coefficients for the same measurement
model estimated for different subpopulations, by visually inspecting the
coefficients for the subpopulations (Bollen, 1989).

² Variance extracted (Dillon & Goldstein, 1984, p. 480), with the sums taken over the p indicators:

$$\rho_{vr(\xi)} = \frac{\sum_{i=1}^{p} \lambda_{x_i}^{2} \operatorname{var}(\xi)}{\sum_{i=1}^{p} \lambda_{x_i}^{2} \operatorname{var}(\xi) + \sum_{i=1}^{p} \operatorname{var}(\delta_i)}$$

The reliability and validity extracted formulas previously presented can be exam-
ined for rough estimates of the amount of error variance and degree of validity pres-
ent in each subgroup. However, in order to draw meaningful comparisons in which
statistically significant differences in subgroup factor structure, reliability, and va-
lidity are detected, more sophisticated multigroup methods are required.
The example described here demonstrates empirically the SEM technique de-
scribed by Jöreskog and Sörbom (1989) that can be used to detect whether mea-
sures contain significant amounts of either differential reliability or validity,
depending on the populations in which they are used. The method systematically
assesses several hypotheses comparing the factor structure, reliability, validity,
and mean differences in latent variables of a measure as it is applied in different
subpopulations. First, the factor pattern of the measurement model is hypothesized
to be identical or invariant for each group (Hξ = 1). Second, the factor loadings or
λ coefficients for the measurement model are hypothesized to be invariant across
groups (HΛ), indicating that there is not differential validity. Third, given that the
λs are invariant, both these factor loadings and the error terms are hypothesized to
be invariant across groups (HΛΘ). If supported, this hypothesis provides evidence
that reliabilities do not differ for differing subgroups. For the model presented
here, one further hypothesis is tested. The means of latent variables for differing
subpopulations are hypothesized to differ significantly. This would indicate that
members of one subpopulation on average have a tendency to feel significantly
more positively or negatively about their pregnancies.
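
For a two-group comparison, the sequence of hypotheses can be written compactly in matrix form. This is a sketch in conventional LISREL-style notation; the group superscript (g), the intercept vector τ, and the label H_κ for the test of latent means are notational assumptions that do not appear in the text.

```latex
\begin{align*}
x^{(g)} &= \tau^{(g)} + \Lambda^{(g)} \xi^{(g)} + \delta^{(g)}, \qquad g = 1, 2 \\
H_{\xi = 1} &: \ \text{the same pattern of fixed and free elements in } \Lambda^{(1)} \text{ and } \Lambda^{(2)} \\
H_{\Lambda} &: \ \Lambda^{(1)} = \Lambda^{(2)} \\
H_{\Lambda\Theta} &: \ \Lambda^{(1)} = \Lambda^{(2)}, \quad \Theta_{\delta}^{(1)} = \Theta_{\delta}^{(2)} \\
H_{\kappa} &: \ \kappa^{(1)} = \kappa^{(2)}, \qquad \kappa^{(g)} = E\left(\xi^{(g)}\right)
\end{align*}
```

Each hypothesis adds equality constraints to the one before it, so the models are nested. Rejecting H_κ corresponds to concluding that the latent means differ between the two subpopulations.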

DATA AND METHODS

Sample size is a critical concern when working with SEM measurement models,
particularly when variables may not be normally distributed or when distribu-
tions of variables are not known in advance. For the current study, participants
were 495 women who gave birth in the labor and delivery unit of a hospital af-
filiated with a medical school in a large Midwestern metropolitan area. Prelimi-
nary data indicated that its population was more representative of the
metropolitan area population than those of other local hospitals. Twenty-four
percent of mothers who delivered during the time of the study did not partici-
pate. However, there were no statistically significant differences in demograph-
ics between women who did or did not participate. The demographics of the
study sample and the large sample size made it ideal for multigroup comparison
using SEM. Subgroups available for comparison included African American
mothers (58%) to White mothers (42%); married mothers (47%) to unmarried
mothers (53%); and mothers with Medicaid (43%) to mothers with private insurance
(57%). The full sample was interviewed using a structured survey instrument
with items measuring perceptions of pregnancy and prenatal care. In order
to rule out bias due to varying levels of literacy, the questionnaire was read to all
participants, who then circled their responses to each item.

ESTIMATION

Mueller (1997) cautioned that, if the assumptions of multivariate normality are not
met when using distribution-dependent methods such as maximum likelihood
(ML), then biased estimates may result. PRELIS 2 provides estimates of the proba-
bility that variables are bivariate normal and an overall multivariate normality mea-
sure that should be taken into account when determining the proper estimator for a
model (Jöreskog & Sörbom, 1993b). In cases where these assumptions are not met,
the appropriate estimation technique is weighted least squares (WLS) rather than
ML estimation (Jöreskog & Sörbom, 1989). Calculations are based on the
polychoric correlation matrix rather than the covariance matrix. Values of the vari-
ables must be weighted by the inverse of the asymptotic covariance matrix in order
to minimize the sum of squared deviations of the sample from the population
(Bollen, 1989). The assumed underlying bivariate normality of the weighted
polychoric correlation matrix must be confirmed before proceeding with model es-
timation (Muthen, 1993).
Prior to conducting the tests for differential reliability and validity, the measure-
ment model was estimated for the entire sample of 495 women using WLS estima-
tion. An 11-point Likert scale was chosen in anticipation of a continuous response
pattern. However, visual inspection of the variables revealed a bimodal distribution,
with modes of 0 and 10 for all four items. Further analysis with PRELIS 2 revealed
that the items did not meet assumptions of bivariate normality required for ML esti-
mation. Accordingly, the items were dichotomized. This resulted in variables with
no more than 60% of the cases in one category, which is less likely to create biased es-
timates than are more unbalanced dichotomies (Pedhazur & Schmelkin, 1991).
Tests in PRELIS indicated that this transformation yielded bivariate and
multivariate normality for the polychoric correlations.
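
The recoding step can be sketched as follows. The ratings, the cut point of 5, and the helper names are all hypothetical; the article does not report where the 11-point items were split.

```python
def dichotomize(responses, cut):
    """Recode 0-10 ratings to 0/1: 1 if the rating is at or above the cut point."""
    return [1 if r >= cut else 0 for r in responses]


def largest_category_share(binary):
    """Share of cases that fall in the more common of the two categories."""
    share_ones = sum(binary) / len(binary)
    return max(share_ones, 1 - share_ones)


# Hypothetical bimodal ratings piled up at 0 and 10 (not the study data).
ratings = [0, 0, 0, 0, 1, 2, 5, 8, 9, 10, 10, 10]
binary = dichotomize(ratings, cut=5)  # the cut point is an assumption

print(binary)
# Check the balance criterion described in the text (no more than 60% in one category).
print(largest_category_share(binary) <= 0.60)
```

Checking the largest category share after recoding is a quick way to confirm that a chosen cut point does not produce a badly unbalanced dichotomy.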
Measurement models were then estimated within each of the groups to be com-
pared to ascertain that the models held within each group (Byrne, Shavelson, &
Muthen, 1989). The hypothesized measurement model was replicated in each of
three sets of randomly selected subsamples: African American and White compar-
ison, married and unmarried comparison, and Medicaid and private insurance
comparison. All samples met the sample size requirement of N ≥ 200 for WLS esti-
mation (Boomsma, 1987). All samples also met the assumptions of bivariate and
multivariate normality for the polychoric correlations.
Next, the hypotheses of equal factor pattern, equal factor loadings (λ coefficients),
and equal error terms (Θδs) were tested using Jöreskog and Sörbom’s
(1989) guidelines. First, pattern structures were constrained to be invariant across
two groups, with λs and Θδs allowed to vary. Second, both factor patterns and λs
were constrained to be invariant with Θδs allowed to vary. Finally, all three were
constrained to be invariant across groups.
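
The effect of each added set of constraints on the degrees of freedom can be worked out by counting. The sketch below assumes a one-factor model in which one loading per group is fixed to set the scale and the factor variance is free in each group; that identification choice and the function name are assumptions, chosen because they reproduce the degrees of freedom reported in Tables 2 through 4.

```python
def multigroup_df(n_indicators, n_groups, equal_loadings=False, equal_errors=False):
    """Degrees of freedom for a one-factor model fitted to several groups at once."""
    # Distinct variances and covariances available across all groups.
    moments = n_groups * n_indicators * (n_indicators + 1) // 2
    # Assumption: one loading per group is fixed to 1 to scale the factor.
    free_loadings = n_indicators - 1
    loadings = free_loadings if equal_loadings else free_loadings * n_groups
    errors = n_indicators if equal_errors else n_indicators * n_groups
    factor_variances = n_groups
    return moments - (loadings + errors + factor_variances)


df_structure = multigroup_df(4, 2)
df_loadings = multigroup_df(4, 2, equal_loadings=True)
df_loadings_errors = multigroup_df(4, 2, equal_loadings=True, equal_errors=True)

print(df_structure, df_loadings, df_loadings_errors)
```

For four indicators and two groups this gives 4, 7, and 11 degrees of freedom for the three hypotheses: constraining the three free loadings adds 3, and constraining the four error variances adds 4 more.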
The group comparisons are designed to evaluate construct validity across sub-
groups. Because the group comparisons are not designed for hypothesis testing,
estimation of three separate tests for construct validity across subgroups should
not result in compounded probability of Type I error due to group member overlap.
The subsamples are randomly drawn from a larger population in which these de-
mographic groups do naturally overlap.
As Table 1 illustrates, the means of the items differed among subsamples in an ap-
parently nonrandom way. Unmarried women, African American women, and
women whose primary insurer was Medicaid were more likely to view their preg-
nancies in a more negative light than were married women, White women, and
women with private insurance. In order to determine whether these differences are
statistically significant, it is necessary to conduct a test for equal means of the latent
variable. However, estimation of mean differences poses a difficult problem when
WLS estimation must be used with ordinal or nonnormal variables. With WLS the
polychoric correlations are weighted by the inverse of the asymptotic covariance
matrix, and the means are standardized to 0 (Jöreskog & Sörbom, 1993a).
To test for the significance of the difference in means of latent variables re-
quires ML estimation. The kappa (κ) coefficient is the mean vector of ξ. When κ is
constrained to be 0 in the first group and allowed to vary in the second group, the
difference in group means is given by the value of κ in the second group (Jöreskog
& Sörbom, 1989). It is also possible to perform a test of the χ² difference in a
model with κ held invariant in the second group compared to one in which it is al-
lowed to vary in the second group. If the model with unequal means (κ free) is a
better fit to the data than the model with equal means (κ invariant), then the hy-
pothesis of equal means can be rejected. Results of the ML estimation must be in-
terpreted with caution, however, when it is used with nonnormally distributed
variables, as is the case here.

TABLE 1
Means of Indicators for Subsamples

Subsample            x1     x2     x3     x4

African American     4.85   4.70   3.80   6.90
White                2.75   3.05   2.78   4.34
Unmarried            5.39   5.48   5.48   7.78
Married              2.35   2.26   2.26   3.60
Medicaid             5.25   5.25   4.02   7.41
Private insurance    2.60   2.73   1.90   4.19
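
The direction of the differences in Table 1 can be checked item by item. The sketch below simply subtracts the item means of the second group in each pair from those of the first; the values are copied from Table 1, while the variable names are hypothetical.

```python
# Item means (x1..x4) copied from Table 1.
table1 = {
    "African American": [4.85, 4.70, 3.80, 6.90],
    "White": [2.75, 3.05, 2.78, 4.34],
    "Unmarried": [5.39, 5.48, 5.48, 7.78],
    "Married": [2.35, 2.26, 2.26, 3.60],
    "Medicaid": [5.25, 5.25, 4.02, 7.41],
    "Private insurance": [2.60, 2.73, 1.90, 4.19],
}
pairs = [
    ("African American", "White"),
    ("Unmarried", "Married"),
    ("Medicaid", "Private insurance"),
]

# Difference in each item mean, first group minus second group.
differences = {
    (a, b): [round(x - y, 2) for x, y in zip(table1[a], table1[b])]
    for a, b in pairs
}

for (a, b), diffs in differences.items():
    print(f"{a} minus {b}: {diffs}")
```

All twelve differences have the same sign, which is the pattern the text describes as apparently nonrandom. Whether the differences are statistically significant is a separate question that requires a test on the latent means.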
RESULTS

The estimated model for the full sample of 495 mothers yielded evidence of reli-
ability and construct validity. The chi square for the measurement model using the
full sample was 3.58 (p = 0.17), with an AGFI of 0.99, CFI of 1.00, RMSR of 0.02,
and RMSEA of 0.04. Standardized λ coefficients (validity coefficients) for the in-
dicators ranged from 0.81 to 0.93, with highly significant t values. Squared multiple
correlations (reliabilities) ranged from 0.76 to 0.86. Pearson correlations among
items ranged from 0.26 to 0.54. As previously noted, the reliability extracted was
0.91, and the variance extracted (validity) was 0.72.
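
The overall fit figures for the full-sample model can be checked by hand. The sketch below assumes 2 degrees of freedom (10 variances and covariances minus 8 free parameters for a one-factor, four-indicator model; the article does not state the df) and the common single-group RMSEA formula with N − 1 in the denominator. Both are assumptions, and the helper names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate for a positive integer df."""
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite series in x/2.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    return math.erfc(math.sqrt(x / 2)) + 2 * density * total


def rmsea(chi_square, df, n):
    """Single-group RMSEA point estimate (assumes N - 1 in the denominator)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


p_value = chi2_sf(3.58, 2)        # df = 2 is an assumption
rmsea_full = rmsea(3.58, 2, 495)

print(round(p_value, 2), round(rmsea_full, 2))
```

Under these assumptions the p value rounds to 0.17 and the RMSEA to 0.04, in line with the values reported above.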
All hypothesized measurement models for the individual subsamples were
supported, with acceptably low chi squares, all other fit statistics well above
0.90, and low levels of overall model error indicated in the RMSEA. Figures 2
through 7 contain the overall model fit statistics and individual model parameters
for each of these pairs of subsamples. The parameter estimates for the factor
loadings (λs) were all significant and fairly high. There were few differences in
λs between subgroups, with two notable exceptions. The λs for indicator X3 differed
more than the λs for other indicators for all subgroup comparisons. Additionally,
the error terms and squared multiple correlations appeared to differ in
several of the subgroup comparisons. These preliminary results indicated that it
was necessary to proceed to multigroup comparisons of factor structure, factor
loadings, and error terms.

FIGURE 2 Measurement model for African American subsample with t values in parentheses (N = 272).

FIGURE 3 Measurement model for White subsample with t values in parentheses (N = 205).

Tables 2 through 4 contain the results of the tests of the first three hypotheses
for each subgroup comparison (Hξ = 1, HΛ, HΛΘ). The decision to support each hypothesis,
reported at the bottom of each table, was based on several overall measures
of fit (χ², goodness of fit index [GFI], AGFI, normed fit index [NFI], CFI, and
RMSEA) as recommended by Tanaka (1993), rather than on a single measure. All
hypotheses were supported for each subgroup comparison. This allows us to conclude
with some degree of certainty that the factor structure, reliability, and validity
for this measure of feelings about pregnancy did not differ significantly for
African American and White mothers, married and unmarried mothers, and for
mothers with private insurance compared to mothers with Medicaid.

FIGURE 4 Measurement model for married subsample with t values in parentheses (N = 233).

FIGURE 5 Measurement model for unmarried subsample with t values in parentheses (N = 256).

FIGURE 6 Measurement model for Medicaid subsample with t values in parentheses (N = 208).

FIGURE 7 Measurement model for private insurance subsample with t values in parentheses (N = 241).

TABLE 2
Model Fit Statistics for Comparison of African American and White Subsamples for
Hypotheses 1 Through 3

                                          Hypothesis 1: Equal   Hypothesis 2: Equal   Hypothesis 3: Equal Factor
Model Fit Statistics                      Factor Structure      Factor Loadings       Loadings and Error
                                          (Hξ = 1)              (HΛ)                  Terms (HΛΘ)

Chi square (df, p)                        4.99 (4, 0.29)        9.79 (7, 0.20)        10.41 (11, 0.49)
Goodness-of-fit index                     1.00                  0.99                  0.99
Adjusted goodness-of-fit index            0.99                  0.98                  0.99
Normed fit index                          0.99                  0.99                  0.99
Comparative fit index                     1.00                  1.00                  1.00
Root mean square error of approximation   0.02                  0.03                  0.00
Decision based on fit                     Supported             Supported             Supported

TABLE 3
Model Fit Statistics for Comparison of Married and Unmarried Subsamples for Hypotheses
1 Through 3

                                          Hypothesis 1: Equal   Hypothesis 2: Equal   Hypothesis 3: Equal Factor
Model Fit Statistics                      Factor Structure      Factor Loadings       Loadings and Error
                                          (Hξ = 1)              (HΛ)                  Terms (HΛΘ)

Chi square (df, p)                        3.75 (4, 0.44)        7.26 (7, 0.40)        8.57 (11, 0.66)
Goodness-of-fit index                     1.00                  1.00                  0.99
Adjusted goodness-of-fit index            0.99                  0.98                  0.98
Normed fit index                          0.99                  0.98                  0.98
Comparative fit index                     1.00                  1.00                  1.00
Root mean square error of approximation   0.00                  0.01                  0.01
Decision based on fit                     Supported             Supported             Supported

TABLE 4
Model Fit Statistics for Comparison of Medicaid and Private Insurance Subsamples for
Hypotheses 1 Through 3

                                          Hypothesis 1: Equal   Hypothesis 2: Equal   Hypothesis 3: Equal Factor
Model Fit Statistics                      Factor Structure      Factor Loadings       Loadings and Error
                                          (Hξ = 1)              (HΛ)                  Terms (HΛΘ)

Chi square (df, p)                        3.00 (4, 0.50)        6.05 (7, 0.53)        7.43 (11, 0.66)
Goodness-of-fit index                     1.00                  1.00                  1.00
Adjusted goodness-of-fit index            0.99                  0.99                  0.99
Normed fit index                          0.99                  0.99                  0.99
Comparative fit index                     1.00                  1.00                  1.00
Root mean square error of approximation   0.00                  0.00                  0.00
Decision based on fit                     Supported             Supported             Supported

The test for mean differences using ML estimations, however, indicated that the
mean values of ξ did differ significantly for all three subgroup comparisons. Table
5 contains the results of nested models. The chi-square differences in all compari-
sons (κ free compared with κ invariant) were significant, indicating that models
with unequal means were a better fit than models with equal means. Again it
should be noted that these estimations were performed with ML and should be in-
terpreted tentatively.
TABLE 5
Tests for Hypothesis 4, Equality of Means for Three Sets of Subsamples (Kappa Invariant
vs. Kappa Free)

Subgroups                          Δχ²      Δdf    Δκ

African American and White         12.43*   1      0.37
Unmarried and married              21.84*   1      0.29
Medicaid and private insurance     11.52*   1      0.28

*p < .001.
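
The significance levels in Table 5 follow from referring each chi-square difference to a chi-square distribution with 1 degree of freedom. The sketch below uses the closed-form tail probability for that case; the Δχ² values are taken from Table 5, and the function and variable names are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


# Chi-square differences (kappa invariant vs. kappa free) from Table 5.
delta_chi2 = {
    "African American and White": 12.43,
    "Unmarried and married": 21.84,
    "Medicaid and private insurance": 11.52,
}

p_values = {label: chi2_sf_df1(value) for label, value in delta_chi2.items()}

for label, p in p_values.items():
    print(f"{label}: p = {p:.6f}")
```

Each of the three probabilities falls below .001, which agrees with the table note.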

DISCUSSION

This comparison of measurement models for feelings about pregnancy across
three sets of demographic subgroups of mothers demonstrates a method to determine
the suitability of measurement instruments for a wide variety of population sub-
groups. In the example reported here, no significant differences in validity or re-
liability of the hypothesized measurement model, Feelings about Pregnancy,
were found for any of the three subgroup comparisons. Practically, this means
that the measurement model is valid and reliable for use within these
subpopulations. The findings also make it possible to test the hypothesis that at-
titudes or feelings about pregnancy may play an intervening role in the relation-
ship between exogenous variables such as poverty, ethnicity, and marital status,
and mothers’ behavior in beginning and maintaining adequate prenatal care dur-
ing their pregnancies.
The findings regarding equal means are less conclusive. In order to estimate
this model, it was necessary to use the original, continuous metric with ML estima-
tion. Because the item responses were bimodal, the excessive kurtosis of the con-
tinuous variables may have influenced the chi square or other overall fit estimates
(Bollen, 1989). However, estimates of λs and δs in the model using ML were con-
sistent with those calculated using WLS. In any event, this illustrates the impor-
tance of taking into account the clinical or social significance of findings when
considering their statistical significance. The fact that African American mothers,
mothers who were not married, and mothers whose medical bills were paid by
Medicaid were all less likely to view their pregnancies favorably in the early stages
is an important substantive finding, regardless of its statistical significance. It has
implications for further theoretical work using the HBM to predict or explain pre-
natal care use.
Testing for differential reliability and validity is a practical preliminary test
prior to the use of survey instruments for the social sciences, given the grow-
ing diversity and multicultural nature of American society. Used in pilot tests
for new survey instruments, the method could help to ensure that study findings
are not biased due to flaws in the instruments used to measure attitudes,
opinions, beliefs, and other variables that may hold different meanings in dif-
ferent segments of our pluralistic society. Used retrospectively, this method
would be particularly useful with large data sets such as the General Social
Survey, General Accounting Office Surveys, or any other nationally represen-
tative surveys.
Many extant data sets contain variables with ordinal or dichotomous scaling
that do not meet the assumptions of ML. In these cases WLS will be required. This
has implications for the determination of sample size for reliability and validity
studies in these data sets. When planning a new study that includes pretests for dif-
ferential reliability and validity, sample size requirements become even more criti-
cal to address. The findings reported here came from a sample in which the
demographic categories were fairly evenly split, and the sample size was large
enough to facilitate comparisons. Thus, even though the variable distribution dic-
tated the use of WLS, sample sizes of at least 200 were available for the subgroups.
This may not always be the case. If the metropolitan population had been more
representative of the country as a whole, the sample size required to compare Afri-
can American and White mothers using WLS would have been approximately
1700. In this case, a carefully planned oversampling of the underrepresented de-
mographic groups might be one way to reduce the overall sample size.
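
The arithmetic behind this kind of estimate is simple: the total sample must be large enough that the smallest subgroup still reaches the minimum needed for WLS. The sketch below uses the N ≥ 200 requirement cited earlier; the 12% population share is an illustrative assumption, not a figure from the article, and the function name is hypothetical.

```python
import math


def total_sample_needed(min_group_n, smallest_group_share):
    """Total N so that the smallest subgroup is expected to reach min_group_n."""
    return math.ceil(min_group_n / smallest_group_share)


# Smallest group in the study's racial comparison was 42% of the sample.
needed_in_study = total_sample_needed(200, 0.42)

# Hypothetical national sample in which the smaller group is 12% (assumption).
needed_national = total_sample_needed(200, 0.12)

print(needed_in_study, needed_national)
```

Under these assumptions a total of 477 suffices for the study's own split, while a 12% share pushes the requirement to 1,667, the same order as the figure of approximately 1700 given above. Oversampling the smaller group lowers the total because it raises that group's share of the sample.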
Sample size is less problematic when variables are normally distributed and
ML estimation can be used. Bollen (1989) recommended “at least several cases per
free parameter” (p. 268), which would reduce the sample size requirement for a
one-factor model with four indicators considerably. When planning sample size, it
is important for the researcher to consider whether or not all item responses to a
given questionnaire will be normally distributed. If views are polarized on an is-
sue, it is likely that variable distributions will be bimodal, or at least have high lev-
els of kurtosis. In other cases, the majority of respondents may have a positive
response to some items, but a significant number of outlying opinions may skew
the distribution, making it unsuitable for ML estimation. A careful review of re-
search findings using the same or similar questions may shed some light on the ex-
pected distributions. However, unless strong evidence suggests that all responses
will have normal distributions, and that variables within a scale will have bivariate
and multivariate normality, the safe approach appears to be to assume that WLS
will be required and to plan sample size accordingly.

ACKNOWLEDGMENTS

Work for this article was supported by a grant from the Tulane University
Committee on Research.
I wish to thank David F. Gillespie and the anonymous reviewers for their
suggestions.

REFERENCES

Bagozzi, R. P. (1991). Further thoughts on the validity of measures of elation, gladness, and joy. Journal
of Personality and Social Psychology, 61, 98–104.
Bates, A. S., Fitzgerald, J. F., & Wolinsky, F. D. (1994). Reliability and validity of an instrument to
measure maternal health beliefs. Medical Care, 32, 832–846.
Becker, M. H., & Maiman, L. A. (1983). Models of health-related behavior. In D. Mechanic (Ed.), Handbook
of health, health care, and the health professions (pp. 539–566). New York: Free Press.
Blalock, H. M. (1982). Conceptualization and measurement in the social sciences. Newbury Park, CA:
Sage.
Bluestein, D., & Rutledge, C. M. (1992). Determinants of delayed pregnancy among adolescents. The
Journal of Family Practice, 35, 406–410.
Bluestein, D., & Rutledge, C. M. (1993). Psychosocial determinants of late prenatal care: The health
belief model. Family Medicine, 25, 269–272.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation
perspective. Psychological Bulletin, 110, 305–314.
Boomsma, A. (1987). The robustness of maximum likelihood estimation in structural equation models.
In P. Cuttance & R. Ecob (Eds.), Structural equation modeling by example (pp. 160–188).
Cambridge, England: Cambridge University Press.
Byrne, B. M., Shavelson, R. J., & Muthen, B. O. (1989). Testing for equivalence of factor covariance
and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105,
456–466.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. In E. G. Carmines (Ed.),
Sage university papers series on quantitative applications in the social sciences (pp. 107–117).
Newbury Park, CA: Sage.
Cole, N., & Moss, P. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd
ed.). New York: Macmillan.
Dillon, W., & Goldstein, M. (1984). Multivariate analysis: Methods and applications. New York:
Wiley.
Fisher, M.-J., Ewigman, B., Campbell, J., Benfer, R., Furbee, L., & Zweig, S. (1991). Cognitive factors
influencing women to seek care during pregnancy. Family Medicine, 23, 443–446.
Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7 user’s guide. Chicago: Scientific Software
International.
Jöreskog, K. G., & Sörbom, D. (1993a). LISREL 8: Structural equation modeling using the SIMPLIS
command language. Chicago: Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (1993b). New features in PRELIS2. Chicago: Scientific Software
International.
Mueller, R. O. (1997). Structural equation modeling: Back to basics. Structural Equation Modeling, 4,
352–369.
Muthen, B. O. (1993). Goodness of fit with categorical and other nonnormal variables. In K. A.
Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 205–234). Newbury Park,
CA: Sage.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, & analysis: An integrated approach.
Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Poland, M. L., Ager, J. W., & Olson, J. M. (1987). Barriers to receiving adequate prenatal care.
American Journal of Obstetrics and Gynecology, 157, 297–303.
Rosenstock, I. M. (1974). Historical origins of the health belief model. Health Education Monographs,
2, 328–335.
Rosenstock, I. M. (1991). The health belief model: Explaining health behavior through expectancies. In
K. Glanz, F. M. Lewis, & B. K. Rimer (Eds.), Health behavior and health education: Theory,
research, and practice (pp. 39–61). San Francisco: Jossey-Bass.
Sable, M. R., Stockbauer, J. W., Schramm, W. F., & Land, G. H. (1990). Differentiating the barriers to
adequate prenatal care in Missouri, 1987–1988. Public Health Reports, 105, 549–555.
Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S.
Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.
