
6/14/2023

PART II: REDUCING DATA COMPLEXITY

SESSIONS 1+2: PRINCIPAL COMPONENT ANALYSIS

Predictive Analytics – Forecasting Future Events


MIM Program – Summer 2023
IE Business School
Matthias Seifert, Ph.D.
Professor of Decision Sciences
Email: matthias.seifert@ie.edu

PURPOSE

There are two main applications of Principal Component Analysis:

1. Data reduction: reduce the number of variables to a smaller
number of factors.

2. Exploration: detect structure in the relationships between
variables, that is, classify variables.


FACTOR ANALYSIS

Factor analysis is not about making predictions from variables!
→ It is about finding relationships between whole sets of
variables, and quantifying the strength of those relationships.

Principal Component Analysis (PCA):

❖ Goal: re-express the data by reducing the number of
variables to a more manageable subset
❖ The new variables/dimensions
– Are linear combinations of the original ones
– Are uncorrelated with one another
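The two defining properties above (the new dimensions are linear combinations of the originals, and uncorrelated with one another) can be checked in a few lines of NumPy. The data here are synthetic, not from any example in these slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations of 3 variables, two of them strongly correlated
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]

Xc = X - X.mean(axis=0)                     # center the data
cov = np.cov(Xc, rowvar=False)              # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition
order = np.argsort(eigvals)[::-1]           # sort by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                       # component scores: linear combinations
pc_corr = np.corrcoef(scores, rowvar=False) # correlations between the new variables
print(np.round(pc_corr, 6))                 # numerically the identity matrix
```

The correlation matrix of the component scores comes out as (numerically) the identity, confirming that the components are uncorrelated even though the original variables were not.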

APPLICATIONS OF PCA

❖ A retail chain may want to understand how consumers select stores for
their purchases based on 80 different characteristics (e.g., store type,
services offered, etc.).
→ 80 variables may be too many to understand how consumers make
decisions and to develop action plans. Instead, a few more general
dimensions would be helpful for creating profiles (e.g., "salesperson",
"product range").

❖ In economics, we may characterize states or countries based on many
measured variables.
→ In reality, a few well-chosen indices (based on these variables) may account
for the economic differences among countries.

❖ In multiple regression, sometimes our explanatory (predictor) variables are
highly correlated (or simply too numerous).
→ It may be advantageous to use a few uncorrelated indices as predictors
rather than the correlated explanatory variables.


EXAMPLE: ESSENTIAL FACIAL FEATURES (IVANCEVIC, 2003)

Six orthogonal factors represent 76.5% of the total variability in
facial recognition (in order of importance):

❖ upper-lip
❖ eyebrow-position
❖ nose-width
❖ eye-position
❖ eye/eyebrow-length
❖ face-width


WHY PRINCIPAL COMPONENT ANALYSIS?

1. We study phenomena that cannot be directly observed
– e.g., ego, personality, intelligence in psychology
– Underlying factors that govern the observed data

2. We have too many observations and dimensions
– To reason about or obtain insights from
– To visualize
– We want to reduce noise in the data
– We want a better representation of the data without losing
much information

CONCEPTUAL MODEL OF PRINCIPAL COMPONENT ANALYSIS

PCA uses correlations among many items to search for common clusters.


BASIC CONCEPT

• Identify the areas of variance in the data that best discriminate between the
key underlying phenomena observed
– Areas of greatest "signal" in the data

• If two items or dimensions are highly correlated or dependent
– They are likely to represent highly related phenomena
– If they tell us about the same underlying variance in the data,
combining them to form a single measure is reasonable

• So we want to combine related variables, and focus on uncorrelated or
independent ones, especially those along which the observations have high
variance

• We want a smaller set of variables that explains most of the variance in the
original data, in a more compact and insightful form

FACTOR, LOADING, EIGENVALUE

❖ Factor/Component: a variable (or construct) that is not
directly observable but needs to be inferred from the
input variables

❖ Factor Loading: a "correlation coefficient" showing the
importance (strength of association) of each variable in
defining the factor

❖ Eigenvalue: represents the variance in the original
variables that is explained by a factor (standardized so
that the average variable has an eigenvalue of 1.0)
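The standardization mentioned above follows from working with the correlation matrix: each standardized variable has variance 1, so the eigenvalues sum to the number of variables and their average is exactly 1.0. A small NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
X[:, 1] += 0.8 * X[:, 0]                      # introduce some correlation

R = np.corrcoef(X, rowvar=False)              # 5 x 5 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]         # eigenvalues, largest first

# The eigenvalues sum to the number of variables (the trace of R),
# so the average eigenvalue is exactly 1.0.
print(round(eigvals.sum(), 6))                # → 5.0
print(round(eigvals.mean(), 6))               # → 1.0
```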


HOW DOES IT WORK?

• Find the orthogonal directions of greatest variance in the data

• Data points are represented in a rotated orthogonal coordinate system: the
origin is the mean of the data points, and the axes are given by the
eigenvectors.

PRINCIPAL COMPONENTS

• The first principal component is the direction of greatest variability (covariance)
in the data
– The direction of greatest variability is the one in which the average square
of the projection is greatest

• The second is the next orthogonal (uncorrelated) direction of greatest variability
– So first remove all the variability along the first component, and
then find the next direction of greatest variability

• And so on …

Geometrically: centering followed by rotation
• A linear transformation


HOW MANY PRINCIPAL COMPONENTS ARE THERE?

To explain all the variation in the original data, we would need (in
general) all q principal components.

But it is practically sufficient to explain most of the variation in the
original data.

This can usually be done using merely the "first few" principal
components.

If we retain only the first m < q components, how do we choose m?

DIMENSIONALITY REDUCTION

We can ignore the components of lesser significance.

You do lose some information, but if the discarded eigenvalues are small,
you don't lose much:
– q dimensions in the original data
– calculate all q eigenvectors and eigenvalues
– choose only the first m eigenvectors, based on their eigenvalues
– the final data set has only m dimensions
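The four steps above can be sketched directly in NumPy. The synthetic data below are built so that most of the variance lies along two directions, making the discarded eigenvalues small:

```python
import numpy as np

rng = np.random.default_rng(2)
# q = 4 original dimensions, driven by two strong underlying directions
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.05 * rng.normal(size=300),
                     base[:, 1], base[:, 1] + 0.05 * rng.normal(size=300)])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]             # eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 2                                          # keep only the first m eigenvectors
Z = Xc @ eigvecs[:, :m]                        # final data set has only m dimensions
retained = eigvals[:m].sum() / eigvals.sum()   # share of variance kept
print(Z.shape, round(retained, 3))             # discarded eigenvalues are small
```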


NUMBER OF COMPONENTS TO BE OBTAINED

There are three criteria for choosing the appropriate
number of components.

First criterion:

❖ Simply choose the number of components based on
the scree plot, where there is an obvious "elbow".

❖ That should put us in the range of 70% - 90% of
variance explained.

NUMBER OF COMPONENTS TO BE OBTAINED

Second criterion (Kaiser rule):

Keep only the components whose eigenvalues are at least the
average eigenvalue.

→ This is also the average sample variance of the original variables.

→ When PCA is done on the correlation matrix, this average is
1, so Kaiser (1958) suggested keeping components with
eigenvalues of at least 1.
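A minimal NumPy sketch of the Kaiser rule. The six variables here are synthetic, built from two underlying factors plus noise (illustrative data, not the cereal survey):

```python
import numpy as np

rng = np.random.default_rng(3)
# 6 variables built from 2 underlying factors plus noise
f = rng.normal(size=(400, 2))
X = np.column_stack([f[:, 0] + 0.3 * rng.normal(size=400) for _ in range(3)] +
                    [f[:, 1] + 0.3 * rng.normal(size=400) for _ in range(3)])

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser rule: on the correlation matrix the average eigenvalue is 1,
# so retain the components whose eigenvalues are at least 1.
m = int(np.sum(eigvals >= 1.0))
print(m)                                       # → 2
```

The rule recovers the two underlying factors used to generate the data.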


NUMBER OF COMPONENTS TO BE OBTAINED

Third criterion:

Retain the first m components sufficient to explain a
specified percentage (70%? 80%? 90%?) of the total
variance of the original variables.
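The third criterion amounts to scanning the cumulative share of the eigenvalues until it crosses the chosen threshold. A sketch with synthetic data (three underlying factors, nine variables) and an assumed 80% threshold:

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=(300, 3))
# 9 variables, 3 per underlying factor, plus noise
X = np.column_stack([f[:, i % 3] + 0.5 * rng.normal(size=300) for i in range(9)])

eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
cum_share = np.cumsum(eigvals) / eigvals.sum()

threshold = 0.80                               # e.g., require 80% of total variance
m = int(np.argmax(cum_share >= threshold)) + 1 # smallest m reaching the threshold
print(m, np.round(cum_share[:m], 3))
```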

HOW MANY FACTORS?

A subjective process ... seek to explain maximum variance using the
fewest factors, considering:

1. Theory – what is predicted/expected?
2. Eigenvalues > 1? (Kaiser's criterion)
3. Scree plot – where does it drop off?
4. Interpretability of the last factor?
5. Try several different solutions
(consider FA type, rotation, number of factors)
6. Factors must be meaningfully interpretable and make
theoretical sense


ASSUMPTIONS

1. GIGO
2. Sample size
3. Levels of measurement
4. Normality
5. Linearity
6. Outliers
7. Factorability

GARBAGE IN, GARBAGE OUT

• Screen the data

• Use variables that theoretically "go together"


ASSUMPTION TESTING:
SAMPLE SIZE

Some guidelines:

Minimum: N > 4 cases per variable
e.g., with 20 variables, you should have > 80 cases (1:4)

Ideal: N > 20 cases per variable
e.g., with 20 variables, you would ideally have > 400 cases (1:20)

Total N > 200 preferable

ASSUMPTION TESTING:
LEVEL OF MEASUREMENT

All variables must be suitable for correlational analysis, i.e., they
should be ratio/metric data or at least Likert data with several
interval levels.


ASSUMPTION TESTING:
NORMALITY

FA is robust to violations of the assumption of normality.

If the variables are normally distributed, the solution is
enhanced.

ASSUMPTION TESTING:
LINEARITY

Because FA is based on correlations between variables, it is
important to check that there are linear relations amongst the
variables (i.e., check scatterplots).


ASSUMPTION TESTING:
OUTLIERS

❖ FA is sensitive to outlying cases

❖ Identify outliers, then remove or transform them

EXAMPLE: READY-TO-EAT CEREALS (RTE_CEREALS.CSV)

❖ Study sponsored by Kellogg Australia
❖ Survey of 121 consumers' perceptions of their favorite brands of cereal
❖ Each respondent was asked to evaluate 3 preferred brands on each of 25 different
attributes
❖ 5-point Likert scales to indicate the extent to which the brand possesses an attribute

Brands:
1. All Bran               7. Purina Muesli
2. Cerola Muesli          8. Rice Bubbles
3. Just Right             9. Special K
4. Kellogg's Corn Flakes  10. Sustain
5. Komplete               11. Vitabrit
6. NutriGrain             12. Weetbix

Attributes:
1. Filling        8. Energy       15. Calories    22. Quality
2. Natural        9. Fun          16. Plain       23. Treat
3. Fibre         10. Kids         17. Crisp       24. Boring
4. Sweet         11. Soggy        18. Regular     25. Nutritious
5. Easy          12. Economical   19. Sugar
6. Salt          13. Health       20. Fruit
7. Satisfying    14. Family       21. Process

Objective: explain which brands the consumer would consider purchasing


ASSUMPTION TESTING:
FACTORABILITY

Check the factorability of the correlation matrix (i.e., how suitable
is the data for factor analysis?) by one or more of the following
methods:

❖ Correlation matrix: correlations > .3?

❖ When using more advanced statistical software tools:
❖ Anti-image matrix diagonals > .5?
❖ Measures of sampling adequacy (MSAs):
– Bartlett's test significant?
– Kaiser-Meyer-Olkin test > .5?

ASSUMPTION TESTING:
FACTORABILITY (CORRELATIONS)

Are there SOME correlations over .3? If so, proceed
with PCA.

This takes some effort with a large number of variables, but it is
accurate.
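With many variables, counting the off-diagonal correlations above .3 is easy to automate. A sketch on synthetic data (two correlated pairs of variables):

```python
import numpy as np

rng = np.random.default_rng(5)
f = rng.normal(size=(200, 2))
# 4 variables forming two correlated pairs
X = np.column_stack([f[:, 0], f[:, 0] + 0.4 * rng.normal(size=200),
                     f[:, 1], f[:, 1] + 0.4 * rng.normal(size=200)])

R = np.corrcoef(X, rowvar=False)
off_diag = np.abs(R[np.triu_indices_from(R, k=1)])   # unique off-diagonal correlations
n_large = int(np.sum(off_diag > 0.3))
print(n_large, "of", off_diag.size, "correlations exceed .3")
```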


ASSUMPTION TESTING:
FACTORABILITY: MEASURES OF SAMPLING ADEQUACY

❖ Global diagnostic indicators: the correlation matrix is factorable if

❖ Bartlett's test of sphericity is significant, and/or

❖ the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy
is > .5
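Bartlett's test of sphericity tests whether the correlation matrix is an identity matrix (i.e., no correlations to factor). Below is a hand-rolled sketch using the standard chi-square approximation, applied to synthetic correlated data; packages such as factor_analyzer offer this test and the KMO measure ready-made:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-square approximation: -[(n-1) - (2p+5)/6] * ln|R|
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

rng = np.random.default_rng(6)
f = rng.normal(size=(120, 1))
X = f + 0.5 * rng.normal(size=(120, 4))        # 4 correlated variables
stat, p_value = bartlett_sphericity(X)
print(round(p_value, 4))                        # significant -> matrix is factorable
```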

STEPS / PROCESS FOR CONDUCTING PRINCIPAL COMPONENT ANALYSIS

1. Test assumptions
2. Select type of analysis
3. Determine no. of factors
(eigenvalues, scree plot, % variance explained)
4. Select items
(check factor loadings to identify which items belong in which factor; drop items one by
one; repeat)
5. Name and define factors
6. Examine correlations amongst factors
7. Analyse internal reliability
8. Compute composite scores
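Steps 7 and 8 can be sketched with Cronbach's alpha and a simple mean composite. The three items below are synthetic and assumed to measure one underlying factor; this is an illustration, not the cereal data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (cases x items) array."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(7)
trait = rng.normal(size=200)
# Three items assumed to load on the same underlying factor
items = np.column_stack([trait + 0.5 * rng.normal(size=200) for _ in range(3)])

alpha = cronbach_alpha(items)                  # step 7: internal reliability
composite = items.mean(axis=1)                 # step 8: composite score per respondent
print(round(alpha, 2), composite.shape)
```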


EXPLAINED VARIANCE

5 factors explain 62.4% of the variance in the items

SCREE PLOT



IDEAL FACTOR SOLUTION

Each variable loads onto one factor.

[Diagram: Variables 1-5 each linked to exactly one of Factors 1-3]

COMMON FACTOR SOLUTION…

Variables may load onto more than one factor.

[Diagram: Variables 1-5 linked to several of Factors 1-3]


INITIAL SOLUTION:
UNROTATED FACTOR STRUCTURE

❖ We seldom see a simple unrotated factor structure

❖ Many variables load on 2 or more factors

❖ Some variables may not load highly on any factor (check: low
communality)

❖ Rotation of the factor-loading matrix helps to find a more interpretable
factor structure.

TWO BASIC TYPES OF FACTOR ROTATION

❖ Orthogonal (Varimax): produces uncorrelated factors
❖ Oblique (Oblimin): allows correlations between factors
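Varimax rotation can be implemented from scratch with the standard SVD-based iteration. The unrotated loading matrix below is hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)                               # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:  # converged
            break
    return loadings @ R

# Hypothetical unrotated loadings: 4 variables on 2 factors
A = np.array([[0.7, 0.5], [0.6, 0.6], [0.6, -0.5], [0.7, -0.6]])
A_rot = varimax(A)

# An orthogonal rotation preserves each variable's communality
print(np.allclose((A ** 2).sum(axis=1), (A_rot ** 2).sum(axis=1)))   # → True
```

Because the rotation is orthogonal, the communalities (row sums of squared loadings) are unchanged; only the distribution of loading across factors is simplified.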


WHY ROTATE A FACTOR LOADING MATRIX?

❖ After rotation, the vectors (lines of best fit) are rearranged
to pass optimally through clusters of shared variance

❖ A rotated factor structure is simpler & more easily
interpretable:
❖ each variable loads strongly on only one factor
❖ each factor shows at least 3 strong loadings
❖ all loadings are either strong or weak, with no intermediate
loadings

ORTHOGONAL VS. OBLIQUE ROTATIONS

❖ Consider the purpose of the factor analysis

❖ If in doubt, try both

❖ Consider interpretability


INTERPRETABILITY

❖ It is dangerous to be driven by factor loadings only. Think
carefully, and be guided by theory and common sense in
selecting the factor structure.

❖ You must be able to understand and interpret a factor if
you're going to extract it.

❖ There may be more than one good solution! e.g., in
personality research:
❖ 2-factor model
❖ 5-factor model
❖ 16-factor model

FACTOR LOADINGS & ITEM SELECTION

A factor structure is most interpretable when:

1. Each variable loads strongly (> +.40) on only one factor

2. Each factor shows 3 or more strong loadings; more loadings =
greater reliability

3. Most loadings are either high or low, with few intermediate values.
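The conventions above (main loading > .40, and no strong loading on a second factor, using .30 as the cross-loading limit) can be applied mechanically to a rotated loading matrix. The loadings below are hypothetical:

```python
import numpy as np

# Hypothetical rotated loading matrix: 5 items x 2 factors
loadings = np.array([
    [0.72, 0.10],   # clean loading on factor 1
    [0.65, 0.05],
    [0.45, 0.38],   # cross-loads -> candidate for elimination
    [0.08, 0.81],
    [0.12, 0.67],
])

main = np.abs(loadings).max(axis=1)                  # strongest loading per item
cross = np.sort(np.abs(loadings), axis=1)[:, -2]     # next-strongest loading per item
keep = (main > 0.40) & (cross < 0.30)                # apply both conventions
print(keep)                                          # the cross-loading item is flagged
```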


FACTOR MATRIX

[Factor-loading matrix for the cereal data; the four factors can be labelled:
Healthful, Artificial, Popularity, Interesting/exciting]


HOW TO USE THE PCA SOLUTION FOR FOLLOW-UP ANALYSES

New product development:
Goal: introduce a new cereal brand that scores high on factor 1 and low on
factor 2.
Who are our main competitors in the market?

HOW DO I ELIMINATE ITEMS?

A subjective process; consider:

1. Size of the main loading (min = .4)
2. Size of cross-loadings (max = .3?)
3. Meaning of the item (face validity)
4. Contribution it makes to the factor
5. Eliminate 1 variable at a time, then re-run, before
deciding which (if any) items to eliminate next
6. Number of items already in the factor


FACTOR LOADINGS & ITEM SELECTION

Comrey & Lee (1992):
loadings > .70 - excellent
         > .63 - very good
         > .55 - good
         > .45 - fair
         > .32 - poor

Cut-off for acceptable loadings:

❖ Look for a gap in the loadings
(e.g., .8, .7, .6, .3, .2)

❖ Choose the cut-off so that factors can be interpreted above, but not
below, the cut-off

PRINCIPAL COMPONENT ANALYSIS IN PRACTICE

❖ Eliminate poor items one at a time, retesting the possible
solutions

❖ Check the factor structure across sub-groups (e.g., gender) if there is
sufficient data

❖ You will probably come up with a different solution from someone
else!

❖ Check/consider reliability analysis


LIMITATIONS

Factor analysis is a highly subjective process regarding the determination
of the number of factors and the interpretation of those factors.

There are no statistical tests regularly employed in factor analysis. As a
result, it is sometimes difficult to know whether the results are merely an
accident or reflect something meaningful.

Best practice:
Consequently, a standard procedure in factor analysis should be to divide
the sample randomly into two or more groups and independently run a
factor analysis on each group. If the same factors emerge in each
analysis, confidence that the results do not represent a statistical
accident is increased.
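The split-sample practice can be checked numerically: run the analysis on each random half and compare the resulting factors with Tucker's congruence coefficient (values near |1| mean the same factor emerged). A sketch on synthetic data with one dominant underlying factor:

```python
import numpy as np

def first_component(X):
    """First principal component (eigenvector of the sample covariance)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(8)
f = rng.normal(size=(400, 1))
X = f + 0.6 * rng.normal(size=(400, 5))        # one dominant underlying factor

idx = rng.permutation(400)                     # random split into two halves
v1 = first_component(X[idx[:200]])
v2 = first_component(X[idx[200:]])

# Tucker's congruence coefficient between the two solutions
congruence = abs(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(congruence, 3))                    # near 1 -> same factor in both halves
```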

HOMEZILLA: ATTRACTING HOMEBUYERS THROUGH BETTER PHOTOS

CASE OVERVIEW

❖ Toronto-based company offering web-listing services for real
estate agents to attract home shoppers in Canada

❖ Sandy Ward: CEO

❖ Real estate agents post property information on the Homezilla
website. Home buyers search for properties by area,
price, number of bedrooms, square footage, etc.

❖ Goal: how to keep users' attention on the photos as long as
possible?

❖ Success: a user views the photos for more than 3 seconds.

GROSS STATE PRODUCT EXAMPLE

The data (expressed in millions of dollars) are measures of gross state
product (GSP) for each of 13 different areas of economic activity in 1996:

1. Agriculture
2. Mining
3. Construction
4. Manufacturing (durable goods)
5. Manufacturing (nondurable goods)
6. Transportation
7. Communications
8. Electricity, gas, sanitation
9. Wholesale trade
10. Retail trade
11. Fiduciary, insurance, real estate
12. Services
13. Government

The sample contains 50 observations, one for each of the 50 US states.

PLOTTING FACTOR SCORES: HOW SHOULD WE LABEL THE FACTORS?