Contents
▪ Factor Analysis Assumptions
▪ Principal Component Analysis (PCA)
▪ Principal Axis Factoring (PAF)
▪ Rotation methods (i.e., Orthogonal and Oblique)
▪ Numerical, Ordinal, Nominal variables
▪ Categorical Principal Components Analysis (CATPCA)
▪ Reverse-coded variables (Appendix)
Factor Analysis
Factor analysis draws on the assumption that all variables correlate to some degree.
Hence, variables that express the same or similar concepts should be highly
correlated. The role of factor analysis is to identify factors, that is, groups of highly
correlated variables.
Attention: Suppose we want to run a regression analysis. After factor analysis, we will
not introduce the original variables into the regression model but rather the factors
(latent variables).
Please mind that factor analysis usually (though not always) reduces multicollinearity
(an advantage) but also reduces R² (a disadvantage). Factor analysis is not
recommended when the regression model's R² < 0.5.
To be confident that multicollinearity has been addressed after extracting factors from
the original variables, you can rerun the regression model using the factors as
independent variables instead of the original variables.
Note: Latent variables (or factors) are variables that are inferred, not directly observed,
from other variables that are observed.
3. Rotate the extracted factors to reach a terminal (optimal) solution. There are two
factor-rotation methods:
(a) the Orthogonal (e.g., Varimax, Quartimax, Equamax).
Note: the Orthogonal rotation methods assume that factors are not correlated.
(b) the Oblique (e.g., Direct Oblimin, Promax).
Note: the Oblique rotation methods assume that factors are correlated.
Attention: In social science research, the Oblique rotation is preferred as factors
are correlated.
Answer: We rotate the axes to identify factors that fit the actual variables better (the
figure above shows a two-factor analysis).
Further Insight
▪ Principal Component Analysis (PCA) is appropriate if the purpose is no more than
to reduce the number of variables in the data set. In other words, we use PCA to
obtain the minimum number of factors (groups of variables) needed to represent the
original data set.
Attention: PCA assumes no error in the data or the measurement. This is not
consistent with social sciences, as error exists.
Attention: the factors extracted using PCA do not have theoretical support or
validity; they are justified only statistically.
Note: PCA is the most widely used factor extraction approach, as it is less
restrictive than other methods. However, there are many cases where PAF is
preferred; in the social sciences, many scholars prefer PAF over PCA.
Note: If you select PCA, it is better to use the Orthogonal rotation method – most
commonly the Varimax.
▪ Principal Axis Factoring (PAF) is appropriate if the objective is to identify
theoretically meaningful factors.
Attention: PAF allows for error in the data or measurements.
Note: If you select PAF, it is better to use an Oblique rotation – most commonly the
Promax. Kindly mind that Promax starts from an Orthogonal (Varimax) rotation and
then relaxes it to allow the factors to correlate.
Assumptions
Assumption #1: The variables should be continuous. However, it is common to use
factor analysis with ordinal variables as well. We can even use factor
analysis with nominal variables, but only when a few nominal
variables are present in the data set. If a considerable number of
nominal and/or ordinal variables is present, we can use Categorical
Principal Components Analysis (CATPCA) instead.
Assumption #2: Sample size: n ≥ max{150 participants, 10 × number of variables}
For instance, if the number (#) of variables is 10, then 10 × 10 =
100. In that case, the adequate sample size for having a reliable factor
analysis is at least 150 (i.e., the maximum between 150 participants
and 100).
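The rule of thumb above can be written as a one-line check (a minimal sketch; the function name is ours, not part of SPSS):

```python
def required_sample_size(n_variables: int) -> int:
    """Minimum sample size for a reliable factor analysis:
    n >= max(150 participants, 10 x the number of variables)."""
    return max(150, 10 * n_variables)

# With 10 variables: 10 x 10 = 100, so the rule requires max(150, 100) = 150.
print(required_sample_size(10))  # 150
# With 40 variables, 10 x 40 = 400 exceeds 150, so 400 participants are needed.
print(required_sample_size(40))  # 400
```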
Assumption #3: Normality. Each variable should be roughly normally distributed.
Note: Violations of this assumption do not distort the quality of the
factor analysis results. Factor analysis is fairly robust against
violations of the normality assumption.
Assumption #4: Linearity. Roughly linear relationships should be present between the
variables.
Note: This assumption is associated with correlation analysis that is
a foundation for factor analysis.
Assumption #5: No extreme outliers. The presence of extreme observations can distort
the factor analysis results.
Note: Assumptions #1 and #2 are met by the study design, while assumptions #3, #4,
and #5 are tested using SPSS.
Figure 5a. Data set #24 – Factor analysis Assumptions – Boxplots / Outliers
Note: There are no extreme outliers (e.g., denoted by asterisks in SPSS).
Figure 12a. Data set #24 – Factor analysis / PCA – Correlations coefficients cut-off
Note: We change the Absolute value below from .10 (default value) to .30. This
selection simplifies the output and makes it easier to interpret. If you are concerned
about the missing information due to the Absolute value change, you can rerun the
factor analysis with the default value and compare the results.
Note: As several of the correlations in the Correlation Matrix above (Figure 13a) are
higher than 0.3 (absolute values), the data are suitable for factor analysis.
Note: If we have one or more variables that do not show any correlation above 0.3
(absolute value) with the other variables, it is better to remove this variable (or these
variables) from the factor analysis, as it is not suitable.
How to remove a variable from factor analysis? Please go back to Figure 8a and move
the variable that reports correlations lower than 0.3 with all the remaining sample
variables back to the left-hand side box. Such a variable is not suitable for factor
analysis and can be removed from the sample variables that will be used for factor
analysis.
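The same screening can be automated. The sketch below (illustrative code with a hypothetical correlation matrix, not part of the SPSS workflow) flags any variable whose correlations with all other variables fall below the 0.3 cutoff:

```python
import numpy as np

def poorly_correlated(R, names, cutoff=0.3):
    """Flag variables whose absolute correlations with every other
    variable are below the cutoff; such variables are candidates
    for removal before factor analysis."""
    R = np.asarray(R, dtype=float)
    flagged = []
    for j, name in enumerate(names):
        # drop the 1.0 self-correlation before applying the cutoff
        off_diagonal = np.delete(np.abs(R[:, j]), j)
        if (off_diagonal < cutoff).all():
            flagged.append(name)
    return flagged

# Hypothetical 3-variable correlation matrix: C correlates weakly with A and B.
R = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
print(poorly_correlated(R, ["A", "B", "C"]))  # ['C']
```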
Bartlett’s test of sphericity (in other words, a test of homogeneity of variances) also
indicates how suitable the data are for factor analysis. If Bartlett’s test is significant
(i.e., p < 0.05), then the data are suitable for factor analysis. If Bartlett’s test is not
significant (i.e., p ≥ 0.05), then the data are not suitable for factor analysis.
H₀: all correlations between the variables are equal to zero (fail to reject when p ≥ 0.05)
H₁: the correlations between the variables are not all equal to zero (reject H₀ when p < 0.05)
Interpretation: Both the KMO measure and Bartlett’s test of sphericity suggest that the
data are suitable for factor analysis.
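For reference, the statistic behind this output can be reproduced by hand. The sketch below implements the standard chi-square approximation for Bartlett's test (the correlation matrix and sample size are hypothetical; compare the statistic against a chi-square table with p(p−1)/2 degrees of freedom to obtain the p-value):

```python
import numpy as np

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity. H0: the correlation matrix R is
    an identity matrix (all correlations are zero). Returns the
    chi-square statistic and its degrees of freedom."""
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    _, log_det = np.linalg.slogdet(R)  # log|R|, numerically safer than log(det(R))
    chi2 = -(n - 1 - (2 * p + 5) / 6) * log_det
    df = p * (p - 1) // 2
    return chi2, df

# Hypothetical example: two variables correlated at r = .60, n = 150 respondents.
chi2, df = bartlett_sphericity([[1.0, 0.6], [0.6, 1.0]], n=150)
# A chi-square of about 65.8 on 1 degree of freedom is highly significant
# (p < .001), so the data would be deemed suitable for factor analysis.
```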
Note: KMO measures for individual variables are found in the Anti-Image Correlation
matrix above (see Figure 16a). In particular, the diagonal scores (highlighted values)
should ideally be ≥ 0.6. However, given that the overall KMO measure (see Figure
14a) suggests that the sample variables are suitable for factor analysis, we could include
in the factor analysis even variables with (diagonal) scores between 0.5 and 0.6, even
though such scores are regarded as weak.
Attention: Please mind that sometimes variables assigned low KMO scores (i.e.,
≤ 0.5) are reverse coded. Hence, before removing a variable with a low KMO (diagonal)
score from the sample, please check whether it is reverse coded. Details about reverse
coding are found in the appendix of this chapter.
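The overall KMO measure and the per-variable values (the anti-image diagonal) can also be computed directly from the correlation matrix. A sketch, assuming a well-conditioned (invertible) correlation matrix; the function name is ours:

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy.
    Returns (overall KMO, per-variable KMO) computed from the
    correlation matrix R via the partial (anti-image) correlations."""
    R = np.asarray(R, dtype=float)
    inv_R = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv_R))
    partial = -inv_R / np.outer(d, d)  # partial correlations
    np.fill_diagonal(partial, 0.0)
    r2 = R ** 2
    np.fill_diagonal(r2, 0.0)          # ignore the 1s on the diagonal
    p2 = partial ** 2
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    overall = r2.sum() / (r2.sum() + p2.sum())
    return overall, per_variable
```

With only two variables the measure is always exactly 0.5, which is one reason KMO is informative only for three or more variables.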
We also find that after rotation (Orthogonal – Varimax; see the last set of columns on
the right side of Figure 19a), there is no improvement in the explained variance, which
remains 92%.
Interpretation: Based on the Eigenvalue table above (see Figure 19a) and the Scree
Plot (please bear in mind that we retain the number of factors before the last inflection
point of the graph – in Figure 20a, the last inflection point is the third; thus, we could
retain two factors), we retain two factors.
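The eigenvalue rule used alongside the scree plot (retain factors with eigenvalue greater than one, the Kaiser criterion) is easy to check numerically. In this sketch, the correlation matrix is an illustrative two-block structure, not the actual Data set #24:

```python
import numpy as np

def retained_factors(R):
    """Kaiser criterion: count the eigenvalues of the correlation
    matrix that exceed 1 (each retained factor should explain more
    variance than a single standardised variable)."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(R, dtype=float))
    return int((eigenvalues > 1).sum())

# Five variables in two uncorrelated blocks (within-block r = .9):
R = np.array([
    [1.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
])
print(retained_factors(R))  # 2, mirroring a two-factor solution
```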
▪ Also, taking into account the questions in factor 2 (column #2), the name for the
second factor could be: “Smoke in restaurants” or another related title.
Attention: If you decide to run a regression analysis using these survey data, you
should introduce the two factors (latent variables) in the regression model instead of
the five variables.
Figure 24a. Data set #24 – Factor analysis / PCA – Setting the number of factors
(user-defined process)
Attention: Suppose that you want to set the number of factors (e.g., you prefer to have
a smaller number of factors in your analysis), then you can follow the same pathway
for running factor analysis in SPSS and select the Extraction option. As shown in Figure
24a, you should select “Fixed number of factors” instead of “Based on Eigenvalues”
and set the number of factors to extract in the highlighted box. Then, we should rerun
factor analysis.
PCA indicated that two factors with eigenvalues greater than one should be retained;
together, they explained 92.02% of the total variance.
A Varimax orthogonal rotation was employed and yielded the following results:
                                                     Factor 1   Factor 2
I think people should have the right to smoke           .985
I think smoking is acceptable                           .983
I don't care if people smoke around me                  .939
I don't think people should smoke around food                      .939
I don't think people should smoke in restaurants                   .933
Suppose that the factors are correlated (looking back at the results of the previous
analysis, a relationship between the two extracted factors is a plausible scenario). Also,
suppose that there is a possibility of error in the data or measurements.
In other words, let us apply Principal Axis Factoring (PAF) together with the Oblique
(Promax) rotation method to Data set #24.
As shown below, the results are similar to those obtained from the PCA and Orthogonal
(Varimax) rotation method. Specifically:
Figure 1b. Data set #24 – Factor analysis Principal Axis Factoring (Promax)
Figure 2b. Data set #24 – Factor analysis Principal Axis Factoring (Promax)
Figure 3b. Data set #24 – Factor analysis Principal Axis Factoring (Promax)
Figure 4b. Data set #24 – Factor analysis Principal Axis Factoring (Promax)
Figure 5b. Data set #24 – Factor analysis Principal Axis Factoring (Promax)
Figure 6b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Figure 7b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Figure 8b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Figure 9b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Figure 10b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Figure 11b. Data set #24 – PCA (left-hand side) vs PAF (right-hand side)
Note: When all variables in the data set are numeric, and they are linearly related, PCA
and CATPCA will render exactly the same results.
Assumptions
Assumption #1: The analysis is based on positive integer data (i.e., ordinal or nominal,
including dichotomous, or both ordinal and nominal).
Background information: Data set #25 consists of 25 ordinal variables (i.e., questions)
measured on a 7-point scale (i.e., 1 – Strongly agree; 2 – Agree; 3 – Agree somewhat;
4 – Undecided; 5 – Disagree somewhat; 6 – Disagree; 7 – Strongly disagree).
Figure 7c. Data set #25 – Factor analysis – CATPCA (Orthogonal rotation: Varimax)
Figure 14c. Data set #25 – Factor analysis – CATPCA (Eigenvalues & % of Variance
explained)
Note: After a few trials with the number of dimensions, we decided that the
optimal number is 8, which explains 78.3% of the total variance. For instance, we also
set the number of dimensions to 9; the % of variance explained increased, but the
eigenvalue assigned to the 9th dimension was lower than one, which is not acceptable.
Dimension:    1      2      3      4      5      6      7      8
Q3:         0.841  0.114 -0.026  0.094  0.128  0.062  0.161  0.061
Appendix
Reverse coding commonly applies to survey items (i.e., questions) with “negative”
meaning. For instance,
a. When we try to validate the consistency of the survey participants’ answers, it is
common to rephrase a “positive” question in a “negative” way. Example:
1. I would like to have more quantitative methods courses in the DBA program
St. Disagree Disagree Neutral Agree St. Agree
1 2 3 4 5
2. I don’t believe that extra quantitative methods courses in the DBA program
would add any value to DBA candidates’ knowledge.
St. Disagree Disagree Neutral Agree St. Agree
1 2 3 4 5
Let a DBA candidate answer “5” to question #1 and “1” to question #2. Both
answers convey the same meaning; hence, they are consistent. However, the
numeric scores are not consistent. Therefore, if we used correlation analysis, the
correlation coefficient would be very low (in fact, negative). To tackle this problem,
we should reverse code question #2. Then, the scale would read:
St. Disagree Disagree Neutral Agree St. Agree
5 4 3 2 1
After reverse coding, both answers and scores are consistent for both questions.
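On a scale running from low to high, reverse coding is simply new score = (low + high) − old score. A minimal sketch (the helper name is ours):

```python
def reverse_code(score, low=1, high=5):
    """Reverse-code a Likert item: new score = (low + high) - old score."""
    return (low + high) - score

# Question #2 answered "1" (Strongly disagree with the negative item)
# becomes "5" after reverse coding, consistent with the "5" on question #1.
print(reverse_code(1))  # 5
print(reverse_code(4))  # 2
print(reverse_code(3))  # 3 (the midpoint is unchanged)
```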
b. A survey item may express a “negative” meaning while most of the other items in
the same survey have “positive” meanings. Consider the following questions:
1. I take responsibility of my decisions.
St. Disagree Disagree Neutral Agree St. Agree
1 2 3 4 5
The answers to the three questions would be: “5”, “5”, and “5”.