Learn to Perform Confirmatory Factor Analysis in Stata With Data From the
General Social Survey (2016)

© 2019 SAGE Publications, Ltd. All Rights Reserved.


This PDF has been generated from SAGE Research Methods Datasets.

Student Guide

Introduction
This example introduces confirmatory factor analysis (CFA). CFA allows
researchers to model how a concept or a group of concepts, such as depression,
is measured by a set of observed variables, such as responses to questions
about depression symptoms. The general model allows for multiple latent
variables to be estimated. However, for simplicity, the example illustrates CFA
for a single latent variable.

CFA is used to assess the overall measurement of a concept when there are
multiple items available to measure it. Measuring a concept with multiple items is
generally better than using only one because multiple items better cover the
breadth and depth of a concept. Researchers often simply sum or average multiple
item values to create a score to measure a concept. In contrast, the strength of
CFA is that it models and accounts for measurement errors in indicators, leaving
the latent variable, representing the concept, free of measurement error.

This example describes the specification and assumptions of CFA models and
briefly covers model identification, estimation, assessment of how well the
specified model fits the data, and interpretation of results. We illustrate these
steps using a subset of data from the 2016 General Social Survey
(http://www.gss.norc.org/). Specifically, we perform a CFA for the concept of
depression among adults in the US.

What Is Confirmatory Factor Analysis?


CFA is used to specify and test a measurement model for one or more concepts;
that is, it models the extent to which an unobserved (latent) variable is
measured by multiple items. The item values are assumed to be caused by two
sources: the latent variable and measurement error. A CFA involves several
steps: specification, identification, estimation, model fit and hypothesis
testing, and interpretation of results. Each step is described below. Entire
books have been written on CFA, so it is important to consult them for more
detail on the topics covered here; see the Further Readings section below.

Specification of a CFA Model


A general equation for a CFA model is:

$$x = \Lambda_x \xi + \delta$$

where $x$ represents a set of observed variables (often called items), $\xi$
represents a set of latent variables (the technical term for concepts),
$\Lambda_x$ represents the factor loadings or coefficients connecting the latent
and observed variables, and $\delta$ represents measurement errors associated
with each observed variable. A diagram of this equation for a single latent
variable and multiple observed variables looks like the figure below. This shows
a model of how each item is the combination of two quantities: the extent of its
relationship with the latent variable and its extent of measurement error. By
convention, observed variables, here the x’s, are depicted in boxes and
latent/unobserved variables are depicted in ovals or circles.

In this model, we are assuming that both the latent and observed variables are
continuous. As such, the model is essentially a set of simple regression models
with the x variables as dependent, the latent variable as independent, and δ
as the errors for each equation. Therefore, the typical assumptions of regression
apply in the case of CFA with continuous items. Unlike in ordinary regression
models, however, the errors in CFA models represent measurement error.
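
To make the regression analogy concrete, the sketch below computes the covariance matrix implied by a single-factor model, assuming the measurement errors are uncorrelated with each other and with the latent variable. Under those assumptions the covariance between two items is the product of their loadings times the latent variance, and each item's variance adds its own error variance. The function name and all numbers are hypothetical illustrations, not values from this guide.

```python
def implied_covariance(loadings, factor_var, error_vars):
    """Covariance matrix implied by a one-factor model with uncorrelated errors."""
    p = len(loadings)
    sigma = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            # Shared part: both items depend on the same latent variable.
            sigma[i][j] = loadings[i] * loadings[j] * factor_var
            if i == j:
                # Unique part: measurement error adds only to the item's own variance.
                sigma[i][j] += error_vars[i]
    return sigma


# Hypothetical loadings, latent variance, and error variances for three items.
example = implied_covariance([1.0, 0.5, 2.0], factor_var=2.0,
                             error_vars=[1.0, 1.0, 0.5])
for row in example:
    print(row)
```

Estimation, discussed later, amounts to choosing loadings and variances so that this implied matrix is as close as possible to the covariance matrix observed in the data.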

Measurement error is the result of how the data are collected. For example,
if an item is the set of answers from respondents to a survey question, then
measurement errors can result from poor question wording, question order on
the survey, respondents not understanding the question, or the conditions of
respondents (if respondents were ill at the time of answering the survey), which
can affect their answers. Measurement error is inevitable, yet it is important to
separate the parts of items that reflect the latent variable and the parts that reflect
measurement error.

In addition, CFA, and structural equation models (SEM) generally, are not
recommended with sample sizes less than 200. Power analyses are
recommended for specific models because the necessary sample size is related
to the number of latent and observed variables in a model.

Identification of a CFA Model


To estimate the parameters of a CFA model, the model must be properly identified.
A necessary condition is that the number of estimated (unknown) parameters (q)
be less than or equal to the number of unique variances and covariances among
the observed variables, p(p + 1)/2, where p is the number of observed variables
or x’s in the model. Therefore, for a CFA model to be identified, this criterion
must be met: q ≤ p(p + 1)/2. If there is too little information available on
which to base the parameter estimates, then the model is described as
underidentified, and model parameters cannot be estimated.
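
The counting rule above is simple arithmetic, sketched below. The helper name `counting_rule` is hypothetical. For a single-factor model with five items, the unknowns are either five loadings and five error variances (latent variance fixed to 1) or four loadings, one latent variance, and five error variances (one loading fixed to 1), so q = 10 in both cases. Item means and intercepts are ignored here because they add as many knowns as unknowns. Passing this check is necessary for identification but does not by itself guarantee it.

```python
def counting_rule(p, q):
    """Return (unique variances/covariances, degrees of freedom) for p items and q parameters."""
    moments = p * (p + 1) // 2
    return moments, moments - q


# Five items, ten free parameters: 15 pieces of information, 5 degrees of freedom.
print(counting_rule(5, 10))
# Three items, six free parameters: exactly as many knowns as unknowns.
print(counting_rule(3, 6))
# Two items, four free parameters: negative degrees of freedom, so the rule fails.
print(counting_rule(2, 4))
```

The five degrees of freedom in the first case agree with the model chi-square reported later in this guide, χ²(5).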

For a CFA model to be identified, the latent variable must be given a measurement
scale. This is done in one of two ways, either by setting one of the factor loadings
to 1 or by setting the variance of the latent variable to 1. By setting a factor loading
to 1, the latent variable is scaled to the item with that factor loading. That is, if the
item is measured in years, then the latent variable is measured in years. Choosing
to set the variance of the latent variable to 1 makes it a standardized variable. The
findings are the same with the two methods. Only the factor loading values differ,
with the first method yielding unstandardized values and the other standardized
values. The standardized values put factor loadings on the same scale, which
allows researchers to compare the size of the factor loadings to assess which are
larger or smaller.
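
A small sketch of why the two scaling choices give the same findings: multiplying every loading by the square root of the latent variance converts a solution with one loading fixed to 1 into the solution with the latent variance fixed to 1, and the implied covariances between items do not change. The function name and numbers are hypothetical. Note that this rescales the latent variable only; fully standardized loadings, as reported by most software, also divide by each item's standard deviation.

```python
import math


def to_unit_variance_scaling(marker_loadings, factor_var):
    """Convert loadings from 'first loading fixed to 1' to 'latent variance fixed to 1'."""
    scale = math.sqrt(factor_var)
    return [loading * scale for loading in marker_loadings]


# Hypothetical solution with the first loading fixed to 1 and latent variance 4.
marker = [1.0, 0.5, 1.5]
phi = 4.0
unit = to_unit_variance_scaling(marker, phi)

# The implied covariance between items 1 and 2 is identical under both scalings.
cov_marker = marker[0] * marker[1] * phi
cov_unit = unit[0] * unit[1] * 1.0
print(unit, cov_marker, cov_unit)
```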

Before performing CFA, a researcher must establish the identification status of
a CFA model. There are many rules of identification to help researchers with
determining a CFA model’s identification status. They are not discussed here.
However, for the single latent variable CFA, the model is identified if it includes at
least three observed variables. Again, identification is a complex topic, which must
be tackled for each CFA model.

Estimation of a CFA Model


Multiple methods of estimation have been developed for SEM models generally
and CFA models specifically. They are available for researchers to implement in
statistical software programs, such as Stata, Mplus, AMOS, and R. The choice
of estimation method is quite technical, and again, will not be discussed in detail
here. The references listed in the Further Readings section provide guidance on
choice of estimation technique.

Different estimation methods can produce different analytic results. The level
of measurement of the observed variables in CFA often guides the choice of
estimation method. Here, we are assuming that the observed variables are
continuous. Maximum likelihood (ML) estimation is typically used in this case. It
is a complex iterative technique for choosing the values for coefficients and other
parameters of the model. The process is performed by computer.
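
For readers who want to see what the iterations minimize, the standard ML discrepancy function for covariance structure models can be written as below. This is the textbook form rather than anything specific to this guide; $S$ denotes the sample covariance matrix, $\Sigma(\theta)$ the covariance matrix implied by the model parameters $\theta$, and $p$ the number of observed variables.

```latex
F_{ML}(\theta) = \ln\lvert\Sigma(\theta)\rvert
               + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right)
               - \ln\lvert S\rvert - p
```

The function equals 0 when the implied matrix reproduces the sample matrix exactly. The model chi-square is the minimized value multiplied by the sample size (N or N − 1, depending on the software).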

One problem researchers can have with the estimation of CFA models, as with all
SEM models, is that the estimation process does not converge; in other words,
it cannot reach a final solution. Nonconvergence can be caused by many
factors, such as a poorly specified model or a small sample size. Resolving these
problems may require consulting a researcher with substantial CFA/SEM
experience.

Fit of a CFA Model


The fit of a CFA model, like other SEM models, can be assessed at three levels:
the overall model level, the equation level, and the parameter (factor loading)
level. Each of these levels is discussed below.

Overall Model Level Fit. Many measures of overall model fit have been
developed. Each one indicates whether the modeled relationships among the
latent and observed variables replicate the relationships among the observed
variables in the data. There are hypothesis tests for some of these measures.
Other measures are only descriptive, and others are only useful for comparing
two or more models. No one measure is considered adequate by itself for model
evaluation. Typically reported measures and their cutoffs indicating a good model
are listed here.

• Model Chi-square: A hypothesis test statistic for the null hypothesis that the
model fits perfectly. It assesses the discrepancy between the sample and
fitted covariances. However, it is sensitive to sample size, such that in large
samples, it can be large (and statistically significant) even if the model is a
good one.
Cutoff: A good model is one with a p-value greater than .05, indicating that
the null hypothesis should not be rejected.

• RMSEA: The root mean square error of approximation is a parsimony-adjusted
fit index, meaning that it favors simplicity in models. The closer the
value is to 0, the better the model. It has an associated hypothesis test for
the null hypothesis that the RMSEA is less than .05.
Cutoff: A good model has a value of 0.08 or lower and a p-value greater
than .05, indicating the null hypothesis should not be rejected.

• CFI: The comparative fit index reflects the correlations among observed
variables in the model. Higher correlations among the variables produce
higher CFI values.
Cutoff: A good model has a value of 0.95 or higher.

• TLI: The Tucker–Lewis index reflects how well a model fits compared to a
null model where all of the factor loadings are 1 and the variance of the
latent variable is 0. It is sensitive to sample size and is favored in smaller
samples.
Cutoff: A good model has a value of 0.95 or higher. A value of 0.95 indicates
that the estimated model improves the fit by 95% relative to the null model.

• SRMR: The standardized root mean square residual is the square root
of the average squared standardized difference between the sample
covariances and the covariances predicted by the model.
Cutoff: A good model has a value of 0.08 or lower.

• CD: The coefficient of determination (R²) for the model ranges from 0 to 1,
with values closer to 1 indicating a better fit.
Cutoff: As with equation R² values, there is no agreed-upon cutoff for the
model R².
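
As a worked illustration of how one of these indices relates to the model chi-square, the sketch below computes an RMSEA point estimate from the chi-square, its degrees of freedom, and the sample size. It uses one common form of the formula with N − 1 in the denominator; some software uses N instead, so treat this as an approximation rather than a reproduction of any program's output. The function name is hypothetical. The first call uses the values reported later in this guide (χ²(5) = 4.52, N = 961); the second uses made-up values.

```python
import math


def rmsea(chi2, df, n):
    """RMSEA point estimate; assumes the N - 1 form of the formula."""
    # When chi-square is below its degrees of freedom, the estimate is set to 0.
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


reported = rmsea(4.52, 5, 961)       # chi-square smaller than df
hypothetical = rmsea(25.0, 5, 401)   # made-up poorly fitting model
print(reported, hypothetical)
```

Because the reported chi-square is smaller than its degrees of freedom, the estimate is 0, consistent with the RMSEA of 0.00 listed with the results below.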

Equation Level Fit. The most frequently used equation level fit measures are R²
values. There is an equation for every observed variable or item in a CFA model;
therefore, an R² value is reported for each item. R² values range from 0 to 1.
Higher values indicate better equation level fit. Some software packages provide
hypothesis tests for R² values and others do not.

Parameter Level Fit. Factor loadings or the coefficients linking the latent
variables and the items are the parameters most often assessed. Because they
are fundamentally simple regression coefficients in CFA, the same hypothesis
tests apply to factor loadings as to regression coefficients. The null hypothesis is
that the factor loading is equal to 0. The alternative is usually that the factor
loading is not equal to 0, but one-sided alternative hypothesis tests can be
performed as well. Factor loadings estimated with ML generally use z-tests. The
choice of level of significance for the test is made by the researcher because the
actual p-value is reported. The assessment of the statistical significance of each
factor loading with these tests lets the researcher know if the latent variable is
related to a particular observed variable or item. Non-significant items can be
trimmed from CFA models and the models can then be re-estimated. Such model
fitting is typical for CFA, as for all SEM models.
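
The z-test described above can be sketched in a few lines: the test statistic is the estimate divided by its standard error, the two-sided p-value comes from the standard normal distribution, and a 95% confidence interval is the estimate plus or minus 1.96 standard errors. The function name is hypothetical, and the inputs (a loading of 0.42 with a standard error of 0.03) are illustrative values that match the CES-D 2 row of Table 1 later in this guide.

```python
import math


def wald_z_test(estimate, std_error, z_crit=1.96):
    """Two-sided z-test of H0: parameter = 0, plus a normal-theory confidence interval."""
    z = estimate / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return z, p_value, ci


z, p, ci = wald_z_test(0.42, 0.03)
print(round(z, 2), p < 0.05, [round(bound, 2) for bound in ci])
```

Because the inputs are rounded to two decimals, intervals computed this way can differ from software output in the last digit.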

Interpretation of CFA Model Results


The interpretation of CFA results comes from combining all that is learned from
the assessment of model fit: the actual magnitudes of statistically significant factor
loadings, R² values, and measures of fit, not solely whether they are statistically
significant or not. The example below shows how results are interpreted using a
dataset.

Illustrative Example: Measuring Depression


This example presents the use of CFA to model and assess how well the latent
variable depression is measured by five observed variables. This is relevant
because level of depression is a complex latent concept that is difficult to
measure, like many concepts in social science. Researchers should come to CFA
with a concept they are trying to measure, for which they have a conceptual
definition and substantial knowledge from their own past work or prior research by
others. Then, if collecting their own data via survey, they will design questions to
measure the defined concept and use that data to assess whether the questions
are good measures of the latent concept. If they are using secondary data, they
look for questions asked on existing surveys that can be argued to reasonably and
logically measure the concept as defined.

Here we are in the latter situation. I looked for existing surveys that measured
depression as defined by medical professionals and social scientists, that is,
experts on this concept. I found a set of questions developed by the Center for
Epidemiologic Studies (CES) at the National Institute of Mental Health, which
have been validated and are considered reliable. Then, I proceeded to assess
how well the five questions measure depression for the sample chosen for this
example, using CFA.

It is important to note that CFA cannot empirically tell us whether we are actually
measuring depression. It can only tell us whether the measures we chose, in this
case, the five CES-D questions, are empirically related either strongly or weakly
to the latent variable we have called depression.

The Data
This example uses the General Social Survey (2016) dataset
(http://www.gss.norc.org/). The variables in the dataset comprise responses to
a series of five questions asked of a sample of adults living in the US. The
questions are taken from the frequently used 20-question CES-D depression
scale developed by experts at the CES at the National Institute of Mental Health.
The five CES-D questions were the following: Please tell me how much of the time
during the past week … (1) you felt depressed, (2) your sleep was restless, (3) you
were happy, (4) you felt lonely, and (5) you felt sad. The possible responses were
(1) none or almost none of the time, (2) some of the time, (3) most of the time,
and (4) all or almost all of the time. The responses to the third question (happy)
were reverse coded so that high values on all of the variables indicated higher
depressive symptoms.
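
Reverse coding a 1-to-4 response scale means mapping 1 to 4, 2 to 3, 3 to 2, and 4 to 1, which is the same as subtracting each response from 5. A minimal sketch, with a hypothetical function name, follows.

```python
def reverse_code(value, low=1, high=4):
    """Reverse a response on a low..high scale (1 <-> 4 and 2 <-> 3 by default)."""
    if not low <= value <= high:
        raise ValueError(f"response {value} is outside the {low}-{high} scale")
    return low + high - value


# Each possible response to the 'happy' item and its reversed value.
reversed_responses = [reverse_code(v) for v in [1, 2, 3, 4]]
print(reversed_responses)
```

After this step, a 4 on the reversed item means the respondent was happy none or almost none of the time, so high values point in the same direction as the other four items.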

Only a subset of the full sample (979 of 2,867 people) for the General Social
Survey (2016) was asked the depression questions. Those not asked (1,888) and
those with missing responses (18) were eliminated, leaving 961 people in the
subset for this analysis.
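
The sample sizes above can be checked with simple arithmetic, which is a useful habit when replicating an analysis. The variable names are hypothetical; the counts are the ones stated in the text.

```python
full_sample = 2867      # all 2016 respondents
not_asked = 1888        # respondents who were not asked the depression questions
missing = 18            # respondents with missing responses

asked = full_sample - not_asked
analytic_sample = asked - missing
print(asked, analytic_sample)
```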

Analyzing the Data


As a type of SEM, CFA is performed using SEM modeling techniques. This is
an example of a single latent variable (factor) model with five observed variables
(items). All items are treated as continuous variables here because this is an
introduction to CFA; however, in practice, it is best to treat binary, ordinal, or
nominal level variables as categorical using more complex modeling. The
references provided below include explanations of such modeling.

Presenting Results
The results for the CFA are typically reported in two ways. One way is by including
the results in a figure showing the drawn measurement model with factor loadings
as labels for the arrows from the latent variables to the observed variables, the
R² values under the measurement error circles, and the model fit indices listed in
a note below the figure. The other typical way of reporting the results of CFA is
in tabular form with a column for the names of the observed variables, columns
containing the factor loadings (unstandardized, standardized, or both in two
columns), standard errors, and confidence intervals for each item, and a final
column for the R² values corresponding to each item. Both ways are shown here.

Figure 1: Confirmatory Factor Analysis of Depression (N = 961).

*p < .05.

Table 1: Confirmatory Factor Analysis Results for Depression (N = 961).


Item                  Standardized      Standard   95% confidence   R²
                      factor loading    error      interval         value

CES-D 1               0.80*             0.02       [0.77, 0.84]     0.65
CES-D 2               0.42*             0.03       [0.36, 0.48]     0.18
CES-D 3 (reversed)    0.61*             0.02       [0.56, 0.67]     0.37
CES-D 4               0.62*             0.02       [0.58, 0.67]     0.39
CES-D 5               0.78*             0.02       [0.74, 0.82]     0.61

Model Fit Indices: χ²(5) = 4.52, p = .47; RMSEA = 0.00, p = .98; CFI = 1.00; TLI = 1.00; SRMR = 0.01; CD = 0.83.

*p < .05.

The results, presented above in Figure 1 and Table 1, strongly support the
measurement of depression by the five items. Overall model fit is excellent, with
all measures better than the cutoffs. One R² value is low at 0.18, for the restless
sleep item (cesd2), suggesting that this item is not as strongly related to latent
depression as the other four items, which have moderate to large amounts of
variation explained (0.37–0.65). Finally, all factor loadings (0.42–0.80) are
statistically significant at the .05 level. The restless sleep item has the lowest
factor loading (0.42), confirming that it is the weakest measure of depression in the
model. The factor loadings are interpreted as standardized regression coefficients.
For example, as the standardized latent variable, Depression, increases by 1
standard deviation, the standardized responses to the CES-D 1 item increase by
0.80 standard deviations.
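
In a single-factor model with uncorrelated errors, each item's R² equals the square of its standardized factor loading, which links the two columns of Table 1. The sketch below squares the reported loadings and compares them with the reported R² values. The variable names are hypothetical; the numbers are copied from Table 1. Small gaps are expected because the table rounds the loadings to two decimals.

```python
std_loadings = {"CES-D 1": 0.80, "CES-D 2": 0.42, "CES-D 3": 0.61,
                "CES-D 4": 0.62, "CES-D 5": 0.78}
reported_r2 = {"CES-D 1": 0.65, "CES-D 2": 0.18, "CES-D 3": 0.37,
               "CES-D 4": 0.39, "CES-D 5": 0.61}

# Squared standardized loading = share of item variance explained by the factor.
implied_r2 = {item: loading ** 2 for item, loading in std_loadings.items()}

# Largest discrepancy between squared loadings and the reported R² values.
max_gap = max(abs(implied_r2[item] - reported_r2[item]) for item in std_loadings)

for item in std_loadings:
    print(item, round(implied_r2[item], 2), reported_r2[item])
```

The remaining share of each item's variance (1 minus R²) is attributed to measurement error, which is why the restless sleep item, with R² of 0.18, is the weakest indicator.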

Review
CFA is used to assess how well the latent variable is measured by multiple
observed variables. You should know:

• What CFA is.
• Assumptions associated with CFA.
• Estimation of CFA models.
• CFA model fit assessment and testing of hypotheses.
• Interpretation and presentation of CFA results.

Your Turn
Download this sample dataset to see whether you can replicate these results.

Further Readings
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY:
Wiley.

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York,
NY: Guilford.

Harrington, D. (2009). Confirmatory factor analysis. New York, NY: Oxford
University Press.

Hoyle, R. (1995). Structural equation modeling: Concepts, issues and
applications. Thousand Oaks, CA: SAGE.

Hoyle, R. (2012). Handbook of structural equation modeling. New York, NY:
Guilford.

Kaplan, D. (2000). Structural equation modeling: Foundations and extensions.
Thousand Oaks, CA: SAGE.

Kline, R. (2016). Principles and practice of structural equation modeling. New
York, NY: Guilford.

Loehlin, J. C. (1998). Latent variable models: An introduction to factor, path, and
structural analysis. Mahwah, NJ: Lawrence Erlbaum.

Maruyama, G. M. (1998). Basics of structural equation modeling. Thousand
Oaks, CA: SAGE.
