
Principal Components Analysis in SPSS

Factor Analysis

Automobile manufacturer: what makes customers choose a particular model of car?


• Several aspects like mileage, easy loan facility, roof height, leg space,
maintenance, road clearance, steering functionality, brakes, lighting,
luggage space may be investigated
• By using factor analysis, these variables may be grouped into factors such as economy (mileage, easy loan facility), comfort (roof height, leg space, maintenance, luggage space), and technology (steering functionality, brakes, lighting, and road clearance)
• Instead of concentrating on so many parameters, management can devise a strategy to optimize these three factors for the growth of the business
PCA

• Variable-reduction technique: does not make a distinction between independent and dependent variables
• Shares many similarities to Exploratory Factor Analysis (EFA)
• Aim: reduce a larger set of variables into a smaller set of 'artificial' variables (Principal Components) that account for most of the variance in the original variables
• Conceptually different from factor analysis
• Often used interchangeably with factor analysis in practice
• Included within the Factor procedure in SPSS Statistics
Coverage

• Characteristics of PCA, main assumptions


• Procedure required in SPSS to analyse data using PCA
• How to interpret two of the assumptions of PCA
• Linearity between all variables
• Sampling adequacy
• How to assess the PCA result (communalities, extracting and retaining
components, forced factor extraction)
Coverage

• How to choose how many principal components to extract


• How to determine a 'good result'
• How to create component scores or component-based scores
• Can be used for later testing
• Determining how many components to extract; how to make them load
appropriately on the variables
• Iterative process involving multiple criteria: ultimately, subjective reasoning
• There is no single right answer: what matters is confidence in the steps taken
• How to report the results
• How to assign a score to each component for each participant
• Using component scores and component-based scores
PCA: Background & Requirements

To run PCA, assumptions to be met:


1. Relates to the choice of study design
a. Multiple variables that are measured at the continuous level (ordinal data is very
frequently used)
2. Reflects the nature of the data
a. There should be a linear relationship between all variables
• PCA: based on Pearson correlation coefficients: expects a linear relationship between variables
• In practice, this assumption is somewhat relaxed with the use of ordinal data for variables
• Needs to be tested before PCA is run
• Can be tested using a matrix scatterplot, though this is often considered overkill
• A matrix scatterplot can contain hundreds of pairwise relationships
• Instead, randomly select a few possible relationships between variables and test those (see the sketch below)
• Non-linear relationships can be transformed
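As a rough illustration of the "random pairs" approach (SPSS would do this through its Graphs menus; this Python sketch is merely an equivalent, and assumes the 25 items sit in a pandas DataFrame named df, a hypothetical name):

```python
# Illustrative sketch only, not the SPSS procedure: spot-check linearity by
# plotting a few randomly chosen variable pairs from the questionnaire data.
import random

import matplotlib.pyplot as plt
import pandas as pd

def spot_check_linearity(df: pd.DataFrame, n_pairs: int = 4, seed: int = 0) -> None:
    rng = random.Random(seed)
    cols = list(df.columns)
    fig, _ = plt.subplots(1, n_pairs, figsize=(4 * n_pairs, 4))
    for ax in fig.axes:
        x, y = rng.sample(cols, 2)        # pick a random pair of variables
        ax.scatter(df[x], df[y], alpha=0.4)
        ax.set_xlabel(x)
        ax.set_ylabel(y)
    plt.tight_layout()
    plt.show()
```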
Assumptions

• There should be no outliers


• These can have a disproportionate influence on the results
• Outliers here: cases with component scores more than 3 standard deviations from the mean (see the sketch below)
• There should be a large sample size for PCA to produce a reliable result
• Many different rules-of-thumb have been proposed
• These use either absolute sample-size numbers or a multiple of the number of variables in the sample
• A minimum of 150 cases, or 5 to 10 cases per variable, is generally recommended
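A minimal sketch of the 3-standard-deviation screen, assuming the retained components' scores (e.g., SPSS's saved FAC1_1-style variables) have been exported to a NumPy array of shape (cases x components):

```python
import numpy as np

def flag_outliers(scores: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return row indices of cases whose score on any component lies more
    than `threshold` standard deviations from that component's mean."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    return np.where(np.any(np.abs(z) > threshold, axis=1))[0]
```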
Steps to run PCA

• Initial extraction of the components


• Determining the number of 'meaningful' components to retain
• Rotation to a final solution
• Interpreting the rotated solution
• Computing component scores or component-based scores
• Reporting the results
PCA: Study Designs

1. Removing superfluous/unrelated variables


• Allows grouping of variables that all load on the same component
• A component that loads on only one variable ⇒ that variable is not related to the other variables in the data set
• It might not be measuring anything of importance to the particular study (i.e., it is measuring some other construct)
2. Reducing redundancy in a set of variables
• Measurement of multiple variables; belief that some of the variables are measuring the same
underlying construct ⇒ have variables that are highly correlated
• PCA allows the reduction of many correlated variables into a single artificial variable
• Principal Component can then be used in later analyses
3. Removing multicollinearity
• Two or more variables that are highly correlated ⇒ reduce the highlighted correlated variables into
principal components
• Can then be used to generate a component score which can be used in lieu of the original variables
Data

A company director wanted to hire employees for his company


• Looking for someone who would display high levels of motivation, dependability,
enthusiasm and commitment
• Selection of candidates for interview: a questionnaire comprising 25 questions
• Questions were phrased so that the required qualities were represented therein
• All questions were Likert-style items with seven possible responses
• Questions associated with motivation: Qu3, Qu4, Qu5, Qu6, Qu7, Qu8, Qu12 and Qu13
• Dependability: Qu2, Qu14, Qu15, Qu16, Qu17, Qu18 and Qu19
• Enthusiasm: Qu20, Qu21, Qu22, Qu23, Qu24 and Qu25
• Commitment: Qu1, Qu9, Qu10 and Qu11
• Questionnaire administered to 315 potential candidates
• Requirement: determine a score for each candidate so that these scores could be used
to grade the potential recruits
PCA: Procedure

Requires running through a series of steps to arrive at the best possible solution; many of the steps are performed simultaneously in SPSS
• During the procedure, required to select all the options necessary to successfully
complete a PCA
• May need to re-run the analysis with different optional inputs
• Depending on the results of the initial run-through of the procedure

Analyze → Dimension Reduction → Factor


PCA: Procedure (Assumptions)

Checking to make sure that the data to be analysed can actually be analysed
using this test: Assumptions
• Linearity between all variables: evaluated using a correlation matrix
• Correlation matrix of all the variables in the PCA
• Examine the correlations to check if there are any variables that are not strongly
correlated with any other variable (threshold: r ≥ 0.3)
• Scan the correlation matrix for any variable that does not have at least one correlation with another
variable with r ≥ 0.3
• In this data set, all variables have at least one correlation with another variable greater than the 0.3 cut-off, although Qu20 has no correlations greater than 0.4
• If there are variables whose correlations with all other variables fall below 0.3, consider removing them from the analysis (a sketch of this screen appears below)
• If the variable is not correlated with any other variables, it is likely measuring something different
• Transfer that variable out of the list of variables in the Factor Analysis dialogue box
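A Python sketch of the screening step (not SPSS output; df is the hypothetical DataFrame of the 25 items used earlier, and 0.3 is the cut-off described above):

```python
import numpy as np
import pandas as pd

def weakly_correlated_items(df: pd.DataFrame, cutoff: float = 0.3) -> list[str]:
    """List variables with no correlation >= cutoff with any other variable;
    these are the candidates for removal before running PCA."""
    corr = df.corr().abs()                 # absolute Pearson correlation matrix
    np.fill_diagonal(corr.values, 0.0)     # ignore each variable's self-correlation
    return [col for col in corr.columns if corr[col].max() < cutoff]
```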
PCA: Procedure (Assumptions)

Sampling adequacy: methods to detect


• Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy for the overall data set
• KMO measure for each individual variable
• Bartlett's test of sphericity

KMO measure:
• Used as an index of whether there are linear relationships between the variables
• Indicates whether it is appropriate to run a PCA on the current data set
• Value can range from 0 to 1: values above 0.6 are suggested as a requirement for sampling adequacy
• Values above 0.8 are considered good and indicative of PCA being useful
• Values close to zero ⇒ weak correlations between the variables (a computational sketch appears below)
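The KMO statistic has a standard closed form: the sum of squared correlations divided by that sum plus the sum of squared partial correlations. A NumPy sketch (assuming this mirrors what SPSS computes; R is the items' Pearson correlation matrix):

```python
import numpy as np

def kmo(R: np.ndarray) -> tuple[float, np.ndarray]:
    """Overall and per-variable KMO measures from a correlation matrix R."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)        # partial correlations via the inverse
    np.fill_diagonal(partial, 0.0)
    r2, p2 = R**2, partial**2
    np.fill_diagonal(r2, 0.0)              # exclude the diagonal 1's
    overall = r2.sum() / (r2.sum() + p2.sum())
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    return overall, per_variable
```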
PCA: Procedure (Assumptions)

KMO measures for individual variables: found on the diagonal of the Anti-image Matrices table, under the "Anti-image Correlation" section

Individual KMO measures should be as close to 1 as possible
• Values above 0.5 are an absolute minimum; greater than 0.8 is considered good
• For variables with a low KMO measure (< 0.5), consider removing them from the analysis
PCA: Procedure (Assumptions)

Bartlett's test of sphericity:


• Tests the null hypothesis that the correlation matrix is an identity matrix
• One that has 1's on the diagonal and 0's on all the off-diagonal elements
⇒ There are no correlations between any of the variables
• If there are no correlations between variables:
⇒ The variables cannot be reduced to a smaller number of components
⇒ There is no point in running a principal components analysis (a sketch of the test appears below)
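Bartlett's test has a standard chi-square approximation based on the determinant of the correlation matrix; an identity matrix has determinant 1, so its log is 0 and the statistic vanishes. A sketch (assuming this matches the statistic SPSS reports; R is the correlation matrix, n the number of cases):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(R: np.ndarray, n: int) -> tuple[float, float]:
    """Chi-square statistic and p-value for H0: R is an identity matrix."""
    p = R.shape[0]
    statistic = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)   # sf = 1 - CDF
```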
PCA: Procedure (Interpreting Results)

Involves re-runs: focus on


• Communality: proportion of each variable's variance that is accounted for
by the PCA: can also be expressed as a percentage
• Extracting and retaining components
• PCA: produce as many components as there are variables
• Purpose of PCA: explain as much of the variance in the variables using as few components
• Post components' extraction: four major criteria to decide on the number of components to retain
• Eigenvalue-one criterion
• Scree plot test
• Proportion of total variance accounted for
• Interpretability criterion

• Requires some degree of subjective analysis


• Forced factor extraction:
• Approach to instruct SPSS on the number of components to retain (rather than the
eigenvalue-one criterion)
PCA: Interpreting Results

PCA: Produce as many components as variables


• Will explain all the variance in each variable if all components are left in the solution: communality value of 1 (⇒ 100% of the variance is explained)
• SPSS: this communality value, when all components are retained, is located under the "Initial" column
• Communalities of the variables when only the retained components are kept are reported in the "Extraction" column (see the sketch below)
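A sketch of where these communality numbers come from in standard PCA algebra (assumed, not SPSS's own code): loadings are eigenvectors of the correlation matrix scaled by the square roots of their eigenvalues, and a variable's communality is its row sum of squared loadings.

```python
import numpy as np

def communalities(R: np.ndarray, n_retained: int) -> np.ndarray:
    """Communality of each variable when n_retained components are kept."""
    eigvals, eigvecs = np.linalg.eigh(R)            # ascending order
    order = np.argsort(eigvals)[::-1]               # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :n_retained] * np.sqrt(eigvals[:n_retained])
    # With all components retained, every row sum equals 1.0 (the "Initial"
    # column); with fewer retained it drops below 1 (the "Extraction" column).
    return (loadings**2).sum(axis=1)
```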
PCA: Interpreting Results

Extracting components: to explain as much of the variance using as few components


• First component will explain the greatest amount of total variance
• Each subsequent component accounting for relatively less of the total variance
• Only the first few components will need to be retained for interpretation
• Will account for the majority of the total variance
• The amount of variance each component accounts for (its contribution towards the total variance)
• Presented in the Total Variance Explained table; under the "Initial Eigenvalues" columns
PCA: Interpreting Results

Eigenvalue: measure of the variance that is accounted for by a component


• Eigenvalue of one represents the variance of one variable
• With 25 variables there is a total of 25 eigenvalues of variance
For the 1st component:
• It explains 6.730 eigenvalues' worth of variance ("Total" column)
⇒ It explains 6.730 / 25 × 100 = 26.9% of the total variance
• Reported in the "% of Variance" column
• Together, all 25 components explain all of the total variance (see the sketch below)
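A worked sketch of this arithmetic (df is again the hypothetical DataFrame of the 25 items): the eigenvalues of a correlation matrix sum to the number of variables, so each eigenvalue's share of that sum is its "% of Variance".

```python
import numpy as np

R = df.corr().to_numpy()                          # df: the 25 questionnaire items (assumed)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]    # eigenvalues, largest first
pct = eigvals / eigvals.sum() * 100               # eigvals.sum() == 25 here
# A first eigenvalue of 6.730 gives 6.730 / 25 * 100 = 26.9% of the total variance.
```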
PCA: Interpreting Results

Four major criteria for choosing components to retain:


• Eigenvalue-one criterion (a.k.a. Kaiser criterion): the default in SPSS
• Eigenvalue less than one ⇒ the component explains less variance than a single variable would
• Hence it shouldn't be retained
• Problem: when the eigenvalue is close to 1
• Here: 5 components retained
• Proportion of total variance accounted for: percentage of variance explained by each component
• Individual: explained by each component
• Cumulative: explained by a set number of components
• A component should only be retained if it explains at least 5% to 10% of the total variance
• Alternatively, retain all components needed to explain at least 60% or 70% of the total variance
• Here: 4 components (using a 60% cumulative cut-off)
PCA: Interpreting Results

Scree plot:
• Plot of the total variance explained by each
component (its "eigenvalue") against its respective
component
• Components to retain: those before the (last)
inflection point of the graph
• Inflection point: meant to represent the point where the graph
begins to level out
• Subsequent components add little to the total variance
• Here: 4 components retained (a plotting sketch appears below)
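SPSS draws the scree plot when requested in the Extraction dialogue; a matplotlib sketch of the same plot, using the descending eigenvalue array from the previous sketch:

```python
import matplotlib.pyplot as plt
import numpy as np

def scree_plot(eigvals: np.ndarray) -> None:
    """Plot each component's eigenvalue against its component number."""
    components = np.arange(1, len(eigvals) + 1)
    plt.plot(components, eigvals, marker="o")
    plt.axhline(1.0, linestyle="--")      # eigenvalue-one reference line
    plt.xlabel("Component number")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()
```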
PCA: Interpreting Results

Interpretability criterion: need to inspect the rotated component matrix
• Shows how the retained, rotated components load
on each variable: suppress all coefficients < 0.3
• To achieve: "Simple structure"
• When each variable has only one component that loads
strongly on it
• Each component loads strongly on at least three variables
• Complex structure: many components loading on the same individual variables
• Example here: Components 1 and 5 both load on variable Qu18 (a rotation sketch appears below)
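For the curious, a minimal NumPy sketch of the varimax rotation itself (the raw criterion, omitting the Kaiser normalization that SPSS applies by default), followed by the < 0.3 suppression used to eyeball simple structure:

```python
import numpy as np

def varimax(loadings: np.ndarray, max_iter: int = 100, tol: float = 1e-6) -> np.ndarray:
    """Orthogonally rotate a (variables x components) loading matrix."""
    p, k = loadings.shape
    rotation, criterion = np.eye(k), 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD of the gradient of the raw varimax criterion
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        if criterion != 0.0 and s.sum() / criterion < 1 + tol:
            break                          # criterion no longer improving
        criterion = s.sum()
    return loadings @ rotation

# Suppress small coefficients when inspecting a rotated matrix, e.g.:
# display = np.where(np.abs(rotated_loadings) < 0.3, np.nan, rotated_loadings)
```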
PCA: Interpreting Results

Deciding on the number of components to retain


• No single correct answer: likely 4 or 5 components
• Extracting five components: simple structure was not attained
⇒ Re-run the principal components analysis
• Force SPSS Statistics to extract (retain) only four components
• Instead of the default five produced by the eigenvalue-one criterion
PCA: Interpreting Results

• Total Variance Explained table: with four components extracted
• Rotation Sums of Squared Loadings: slightly different
values
• Cumulative percentage: 59.9% of the total
• All extracted components now explain more than 5% of
the total variance
• Rotated Component Matrix: Simple structure
PCA: Interpreting Results
Rotated Component Matrix: reproduced in full (with communalities) on the next slide

Reporting:
• A principal components analysis (PCA) was run on a 25-question questionnaire that measured desired employee characteristics on 315 job candidates.
• The suitability of PCA was assessed prior to analysis. Inspection of the correlation matrix showed that all variables had at least one correlation coefficient greater than 0.3. The overall Kaiser-Meyer-Olkin (KMO) measure was 0.83, with individual KMO measures all greater than 0.7.
• Bartlett's test of sphericity was statistically significant (p < .0005), indicating that the data was likely factorisable.
PCA: Interpreting Results
• PCA revealed five components that had eigenvalues greater than one and which explained 26.9%, 13.4%, 11.6%, 8.1% and 4.2% of the total variance, respectively.
• Visual inspection of the scree plot indicated that four components should be retained. In addition, a four-component solution met the interpretability criterion. As such, four components were retained. The four-component solution explained 59.9% of the total variance. A Varimax orthogonal rotation was employed to aid interpretability. The rotated solution exhibited 'simple structure'.
• The interpretation of the data was consistent with the personality attributes the questionnaire was designed to measure, with strong loadings of motivation items on Component 1, dependability items on Component 2, enthusiasm items on Component 3 and commitment items on Component 4.
• Component loadings and communalities of the rotated solution are presented below.

Rotated Component Matrix (a): Rotated Component Coefficients

Items   Comp 1   Comp 2   Comp 3   Comp 4   Communalities
Qu13     .780     .107     .109     .034        .648
Qu3      .774     .108     .158     .119        .542
Qu12     .768     .192     .010     .039        .649
Qu4      .765     .104     .241     .059        .657
Qu6      .745     .192     .018     .088        .470
Qu5      .660     .160     .086     .034        .600
Qu7      .651     .166    -.052     .138        .473
Qu8      .614     .043     .081     .273        .460
Qu15     .192     .820     .111    -.052        .584
Qu14     .187     .819     .057    -.035        .754
Qu19     .096     .784    -.007     .135        .658
Qu17     .088     .751     .160    -.092        .629
Qu18     .161     .749     .120     .005        .632
Qu2      .091     .723     .026     .100        .710
Qu16     .207     .720     .091     .004        .725
Qu24    -.006     .023     .807    -.092        .570
Qu21     .060    -.027     .765    -.142        .605
Qu25     .071     .192     .761     .040        .602
Qu22     .055    -.013     .761    -.033        .642
Qu23     .147     .228     .708     .166        .300
Qu20     .163     .120     .508     .031        .610
Qu10     .226    -.007     .054     .837        .583
Qu11     .011    -.055     .057     .807        .603
Qu1      .117    -.005    -.018     .796        .660
Qu9      .227     .134    -.193     .691        .622

Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
(a) Rotation converged in 5 iterations.
PCA: Interpreting Results

Component scores and component-based scores


• Assign a score to each component for each participant on completion of
analysis
• Score that reflects an individual's 'motivation' (for example)
• Two common methods to achieving a score that reflects the variables that
are associated with each of the components
• Component scores
• Component-based scores
PCA: Interpreting Results

Component scores: scores calculated by SPSS


• Linear composite of the optimally-weighted original variables
• SPSS determines the regression weights, multiplies each participant's response by the respective weight, and then sums the products
• The resulting sum (the component score) is the score the individual achieves on that particular retained component
• This procedure (the equations are different for each component) is performed for all retained components
• The component scores are added to the end of the data file in SPSS
• Column names: FAC1_1 = component 1, iteration 1 (a sketch of the underlying calculation appears below)
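A sketch of the regression-method calculation (W = R^-1 A is the textbook formula; that it matches SPSS's internals exactly is an assumption). Z is the standardized data matrix (cases x variables), A the rotated loading matrix:

```python
import numpy as np

def component_scores(Z: np.ndarray, R: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Regression-method component scores: one column per retained component."""
    weights = np.linalg.solve(R, A)   # regression weights W = R^-1 A
    return Z @ weights                # each case's score on each component
```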
PCA: Interpreting Results

Component-based scores:
• Composite score: summation of the scores on all the variables that loaded
strongly on a particular component
• For example, Qu1, Qu9, Qu10 and Qu11 loaded on Component 4
• These are associated with commitment items
• The scores for each of these questions would be summed to generate a component-based score for commitment (see the sketch below)
• Difference between component-based scores and component scores: the original variables are not multiplied by optimal weights in a component-based score
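A sketch of the commitment example, assuming the raw responses live in the hypothetical DataFrame df used earlier; the item list comes from the rotated component matrix:

```python
import pandas as pd

commitment_items = ["Qu1", "Qu9", "Qu10", "Qu11"]          # items loading on Component 4
df["commitment_score"] = df[commitment_items].sum(axis=1)  # unweighted sum of raw scores
```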
