
6/14/2023

PART II: REDUCING DATA COMPLEXITY

SESSIONS 1+2: PRINCIPAL COMPONENT ANALYSIS

Predictive Analytics – Forecasting Future Events


MIM Program – Summer 2023
IE Business School
Matthias Seifert, Ph.D.
Professor of Decision Sciences
Email: matthias.seifert@ie.edu

PURPOSE

There are two main applications of Principal Component Analysis:

1. Data reduction: reduce the number of variables to a smaller
number of factors.

2. Exploration: detect structure in the relationships between
variables, that is, classify variables.


FACTOR ANALYSIS

Factor analysis is not about making predictions from variables!
→ It is about finding relationships between whole sets of
variables, and quantifying the strength of those relationships.

Principal Component Analysis (PCA):

❖ Goal: re-express the data by reducing the number of
variables to a more manageable subset
❖ The new variables/dimensions
– Are linear combinations of the original ones
– Are uncorrelated with one another
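The two defining properties above (the new dimensions are linear combinations of the originals, and uncorrelated with one another) can be checked in a few lines of NumPy. The data here are synthetic, not from any example in these slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations of 3 variables, two of them strongly correlated
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]

Xc = X - X.mean(axis=0)                     # center the data
cov = np.cov(Xc, rowvar=False)              # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition
order = np.argsort(eigvals)[::-1]           # sort by variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                       # component scores: linear combinations
pc_corr = np.corrcoef(scores, rowvar=False) # correlations between the new variables
print(np.round(pc_corr, 6))                 # numerically the identity matrix
```

The correlation matrix of the component scores comes out as (numerically) the identity, confirming that the components are uncorrelated even though the original variables were not.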

APPLICATIONS OF PCA

❖ A retail chain may want to understand how consumers select stores for
their purchases based on 80 different characteristics (e.g., store type,
services offered, etc.).
→ 80 variables may be too many to understand how consumers make
decisions and to develop action plans. Instead, a few more general
dimensions would be helpful for creating profiles (e.g., "salesperson",
"product range").

❖ In economics, we may characterize states or countries based on many
measured variables.
→ In reality, a few well-chosen indices (based on these variables) may account
for the economic differences among countries.

❖ In multiple regression, sometimes our explanatory (predictor) variables are
highly correlated (or simply too numerous).
→ It may be advantageous to use a few uncorrelated indices as predictors
rather than the correlated explanatory variables.


EXAMPLE: ESSENTIAL FACIAL FEATURES (IVANCEVIC, 2003)

Six orthogonal factors represent 76.5% of the total variability in
facial recognition (in order of importance):

❖ upper-lip
❖ eyebrow-position
❖ nose-width
❖ eye-position
❖ eye/eyebrow-length
❖ face-width


WHY PRINCIPAL COMPONENT ANALYSIS?

1. We study phenomena that cannot be directly observed
– e.g., ego, personality, intelligence in psychology
– Underlying factors that govern the observed data

2. We have too many observations and dimensions
– To reason about or obtain insights from
– To visualize
– We want to reduce noise in the data
– We want a better representation of the data without losing
much information

CONCEPTUAL MODEL OF PRINCIPAL COMPONENT ANALYSIS

PCA uses correlations among many items to search for common clusters.


BASIC CONCEPT

• Identify the areas of variance in the data that best discriminate between the
key underlying phenomena observed
– Areas of greatest "signal" in the data

• If two items or dimensions are highly correlated or dependent
– They are likely to represent highly related phenomena
– If they tell us about the same underlying variance in the data,
combining them to form a single measure is reasonable

• So we want to combine related variables, and focus on uncorrelated or
independent ones, especially those along which the observations have high
variance

• We want a smaller set of variables that explains most of the variance in the
original data, in a more compact and insightful form

FACTOR, LOADING, EIGENVALUE

❖ Factor/Component: a variable (or construct) that is not
directly observable but needs to be inferred from the
input variables

❖ Factor Loading: a "correlation coefficient" showing the
importance (strength of association) of each variable in
defining the factor

❖ Eigenvalue: represents the variance in the original
variables that is explained by a factor (standardized so
that the average variable has an eigenvalue of 1.0)
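The standardization mentioned above follows from working with the correlation matrix: each standardized variable has variance 1, so the eigenvalues sum to the number of variables and their average is exactly 1.0. A small NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
X[:, 1] += 0.8 * X[:, 0]                      # introduce some correlation

R = np.corrcoef(X, rowvar=False)              # 5 x 5 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]         # eigenvalues, largest first

# The eigenvalues sum to the number of variables (the trace of R),
# so the average eigenvalue is exactly 1.0.
print(round(eigvals.sum(), 6))                # → 5.0
print(round(eigvals.mean(), 6))               # → 1.0
```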


HOW DOES IT WORK?

• Find the orthogonal directions of greatest variance in the data

• Data points are represented in a rotated orthogonal coordinate system: the
origin is the mean of the data points, and the axes are given by the
eigenvectors.

PRINCIPAL COMPONENTS

• The first principal component is the direction of greatest variability (covariance)
in the data
– The direction of greatest variability is the one in which the average square
of the projection is greatest

• The second is the next orthogonal (uncorrelated) direction of greatest variability
– So first remove all the variability along the first component, and
then find the next direction of greatest variability

• And so on …

Geometrically: centering followed by rotation
• A linear transformation


HOW MANY PRINCIPAL COMPONENTS ARE THERE?

To explain all the variation in the original data, we would need (in
general) all q principal components.

But it is practically sufficient to explain most of the variation in the
original data.

This can usually be done using merely the "first few" principal
components.

If we retain only the first m < q components, how do we choose m?

DIMENSIONALITY REDUCTION

We can ignore the components of lesser significance.

You do lose some information, but if the discarded eigenvalues are small,
you don't lose much:
– q dimensions in the original data
– calculate all q eigenvectors and eigenvalues
– choose only the first m eigenvectors, based on their eigenvalues
– the final data set has only m dimensions
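The four steps above can be sketched directly in NumPy. The synthetic data below are built so that most of the variance lies along two directions, making the discarded eigenvalues small:

```python
import numpy as np

rng = np.random.default_rng(2)
# q = 4 original dimensions, driven by two strong underlying directions
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0], base[:, 0] + 0.05 * rng.normal(size=300),
                     base[:, 1], base[:, 1] + 0.05 * rng.normal(size=300)])

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]             # eigenvalues, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 2                                          # keep only the first m eigenvectors
Z = Xc @ eigvecs[:, :m]                        # final data set has only m dimensions
retained = eigvals[:m].sum() / eigvals.sum()   # share of variance kept
print(Z.shape, round(retained, 3))             # discarded eigenvalues are small
```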


NUMBER OF COMPONENTS TO BE OBTAINED

There are three criteria for choosing the appropriate
number of components.

First criterion:

❖ Simply choose the number of components based on
the scree plot, where there is an obvious "elbow".

❖ That should put us in the range of 70% - 90% of
variance explained.

NUMBER OF COMPONENTS TO BE OBTAINED

Second criterion (Kaiser rule):

Keep only the components whose eigenvalues are at least the
average eigenvalue.

→ This is also the average sample variance of the original variables.

→ When PCA is done on the correlation matrix, this average is
1, so Kaiser (1958) suggested keeping components with
eigenvalues of at least 1.
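A minimal NumPy sketch of the Kaiser rule. The six variables here are synthetic, built from two underlying factors plus noise (illustrative data, not the cereal survey):

```python
import numpy as np

rng = np.random.default_rng(3)
# 6 variables built from 2 underlying factors plus noise
f = rng.normal(size=(400, 2))
X = np.column_stack([f[:, 0] + 0.3 * rng.normal(size=400) for _ in range(3)] +
                    [f[:, 1] + 0.3 * rng.normal(size=400) for _ in range(3)])

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser rule: on the correlation matrix the average eigenvalue is 1,
# so retain the components whose eigenvalues are at least 1.
m = int(np.sum(eigvals >= 1.0))
print(m)                                       # → 2
```

The rule recovers the two underlying factors used to generate the data.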


NUMBER OF COMPONENTS TO BE OBTAINED

Third criterion:

Retain the first m components sufficient to explain a
specified percentage (70%? 80%? 90%?) of the total
variance of the original variables.
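The third criterion amounts to scanning the cumulative share of the eigenvalues until it crosses the chosen threshold. A sketch with synthetic data (three underlying factors, nine variables) and an assumed 80% threshold:

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=(300, 3))
# 9 variables, 3 per underlying factor, plus noise
X = np.column_stack([f[:, i % 3] + 0.5 * rng.normal(size=300) for i in range(9)])

eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
cum_share = np.cumsum(eigvals) / eigvals.sum()

threshold = 0.80                               # e.g., require 80% of total variance
m = int(np.argmax(cum_share >= threshold)) + 1 # smallest m reaching the threshold
print(m, np.round(cum_share[:m], 3))
```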

HOW MANY FACTORS?

A subjective process ... seek to explain maximum variance using the
fewest factors, considering:

1. Theory – what is predicted/expected?
2. Eigenvalues > 1? (Kaiser's criterion)
3. Scree plot – where does it drop off?
4. Interpretability of the last factor?
5. Try several different solutions
(consider FA type, rotation, number of factors)
6. Factors must be meaningfully interpretable and make
theoretical sense


ASSUMPTIONS

1. GIGO
2. Sample size
3. Levels of measurement
4. Normality
5. Linearity
6. Outliers
7. Factorability

GARBAGE IN, GARBAGE OUT

• Screen the data

• Use variables that theoretically "go together"


ASSUMPTION TESTING:
SAMPLE SIZE

Some guidelines:

Minimum: N > 4 cases per variable
e.g., with 20 variables, you should have > 80 cases (1:4)

Ideal: N > 20 cases per variable
e.g., with 20 variables, you would ideally have > 400 cases (1:20)

Total N > 200 preferable

ASSUMPTION TESTING:
LEVEL OF MEASUREMENT

All variables must be suitable for correlational analysis, i.e., they
should be ratio/metric data or at least Likert data with several
interval levels.


ASSUMPTION TESTING:
NORMALITY

FA is robust to violations of the assumption of normality.

If the variables are normally distributed, the solution is
enhanced.

ASSUMPTION TESTING:
LINEARITY

Because FA is based on correlations between variables, it is
important to check that there are linear relations amongst the
variables (i.e., check scatterplots).


ASSUMPTION TESTING:
OUTLIERS

❖ FA is sensitive to outlying cases

❖ Identify outliers, then remove or transform them

EXAMPLE: READY-TO-EAT CEREALS (RTE_CEREALS.CSV)

❖ Study sponsored by Kellogg Australia
❖ Survey of 121 consumers' perceptions of their favorite brands of cereal
❖ Each respondent was asked to evaluate 3 preferred brands on each of 25 different
attributes
❖ 5-point Likert scales to indicate the extent to which the brand possesses an attribute

Brands:
1. All Bran               7. Purina Muesli
2. Cerola Muesli          8. Rice Bubbles
3. Just Right             9. Special K
4. Kellogg's Corn Flakes  10. Sustain
5. Komplete               11. Vitabrit
6. NutriGrain             12. Weetbix

Attributes:
1. Filling        8. Energy       15. Calories    22. Quality
2. Natural        9. Fun          16. Plain       23. Treat
3. Fibre         10. Kids         17. Crisp       24. Boring
4. Sweet         11. Soggy        18. Regular     25. Nutritious
5. Easy          12. Economical   19. Sugar
6. Salt          13. Health       20. Fruit
7. Satisfying    14. Family       21. Process

Objective: explain which brands the consumer would consider purchasing


ASSUMPTION TESTING:
FACTORABILITY

Check the factorability of the correlation matrix (i.e., how suitable
is the data for factor analysis?) by one or more of the following
methods:

❖ Correlation matrix: correlations > .3?

❖ When using more advanced statistical software tools:
❖ Anti-image matrix diagonals > .5?
❖ Measures of sampling adequacy (MSAs):
– Bartlett's test significant?
– Kaiser-Meyer-Olkin test > .5?

ASSUMPTION TESTING:
FACTORABILITY (CORRELATIONS)

Are there SOME correlations over .3? If so, proceed
with PCA.

This takes some effort with a large number of variables, but it is
accurate.
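With many variables, counting the off-diagonal correlations above .3 is easy to automate. A sketch on synthetic data (two correlated pairs of variables):

```python
import numpy as np

rng = np.random.default_rng(5)
f = rng.normal(size=(200, 2))
# 4 variables forming two correlated pairs
X = np.column_stack([f[:, 0], f[:, 0] + 0.4 * rng.normal(size=200),
                     f[:, 1], f[:, 1] + 0.4 * rng.normal(size=200)])

R = np.corrcoef(X, rowvar=False)
off_diag = np.abs(R[np.triu_indices_from(R, k=1)])   # unique off-diagonal correlations
n_large = int(np.sum(off_diag > 0.3))
print(n_large, "of", off_diag.size, "correlations exceed .3")
```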


ASSUMPTION TESTING:
FACTORABILITY: MEASURES OF SAMPLING ADEQUACY

❖ Global diagnostic indicators: the correlation matrix is factorable if

❖ Bartlett's test of sphericity is significant, and/or

❖ the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy
is > .5
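Bartlett's test of sphericity tests whether the correlation matrix is an identity matrix (i.e., no correlations to factor). Below is a hand-rolled sketch using the standard chi-square approximation, applied to synthetic correlated data; packages such as factor_analyzer offer this test and the KMO measure ready-made:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Chi-square approximation: -[(n-1) - (2p+5)/6] * ln|R|
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)

rng = np.random.default_rng(6)
f = rng.normal(size=(120, 1))
X = f + 0.5 * rng.normal(size=(120, 4))        # 4 correlated variables
stat, p_value = bartlett_sphericity(X)
print(round(p_value, 4))                        # significant -> matrix is factorable
```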

STEPS / PROCESS FOR CONDUCTING PRINCIPAL COMPONENT ANALYSIS

1. Test assumptions
2. Select type of analysis
3. Determine no. of factors
(eigenvalues, scree plot, % variance explained)
4. Select items
(check factor loadings to identify which items belong in which factor; drop items one by
one; repeat)
5. Name and define factors
6. Examine correlations amongst factors
7. Analyse internal reliability
8. Compute composite scores
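Steps 7 and 8 can be sketched with Cronbach's alpha and a simple mean composite. The three items below are synthetic and assumed to measure one underlying factor; this is an illustration, not the cereal data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (cases x items) array."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(7)
trait = rng.normal(size=200)
# Three items assumed to load on the same underlying factor
items = np.column_stack([trait + 0.5 * rng.normal(size=200) for _ in range(3)])

alpha = cronbach_alpha(items)                  # step 7: internal reliability
composite = items.mean(axis=1)                 # step 8: composite score per respondent
print(round(alpha, 2), composite.shape)
```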


EXPLAINED VARIANCE

5 factors explain 62.4% of the variance in the items

SCREE PLOT



IDEAL FACTOR SOLUTION

Each variable loads onto one factor.

[Diagram: Variables 1-5 each linked to exactly one of Factors 1-3]

COMMON FACTOR SOLUTION…

Variables may load onto more than one factor.

[Diagram: Variables 1-5 linked to several of Factors 1-3]


INITIAL SOLUTION:
UNROTATED FACTOR STRUCTURE

❖ We seldom see a simple unrotated factor structure

❖ Many variables load on 2 or more factors

❖ Some variables may not load highly on any factor (check: low
communality)

❖ Rotation of the factor-loading matrix helps to find a more interpretable
factor structure.

TWO BASIC TYPES OF FACTOR ROTATION

❖ Orthogonal (Varimax): produces uncorrelated factors
❖ Oblique (Oblimin): allows correlations between factors
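Varimax rotation can be implemented from scratch with the standard SVD-based iteration. The unrotated loading matrix below is hypothetical, chosen only to illustrate the mechanics:

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)                               # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:  # converged
            break
    return loadings @ R

# Hypothetical unrotated loadings: 4 variables on 2 factors
A = np.array([[0.7, 0.5], [0.6, 0.6], [0.6, -0.5], [0.7, -0.6]])
A_rot = varimax(A)

# An orthogonal rotation preserves each variable's communality
print(np.allclose((A ** 2).sum(axis=1), (A_rot ** 2).sum(axis=1)))   # → True
```

Because the rotation is orthogonal, the communalities (row sums of squared loadings) are unchanged; only the distribution of loading across factors is simplified.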


WHY ROTATE A FACTOR LOADING MATRIX?

❖ After rotation, the vectors (lines of best fit) are rearranged
to pass optimally through clusters of shared variance

❖ A rotated factor structure is simpler & more easily
interpretable:
❖ each variable loads strongly on only one factor
❖ each factor shows at least 3 strong loadings
❖ all loadings are either strong or weak, with no intermediate
loadings

ORTHOGONAL VS. OBLIQUE ROTATIONS

❖ Consider the purpose of the factor analysis

❖ If in doubt, try both

❖ Consider interpretability


INTERPRETABILITY

❖ It is dangerous to be driven by factor loadings only. Think
carefully, and be guided by theory and common sense in
selecting the factor structure.

❖ You must be able to understand and interpret a factor if
you're going to extract it.

❖ There may be more than one good solution! e.g., in
personality research:
❖ 2-factor model
❖ 5-factor model
❖ 16-factor model

FACTOR LOADINGS & ITEM SELECTION

A factor structure is most interpretable when:

1. Each variable loads strongly (> +.40) on only one factor

2. Each factor shows 3 or more strong loadings; more loadings =
greater reliability

3. Most loadings are either high or low, with few intermediate values.
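The conventions above (main loading > .40, and no strong loading on a second factor, using .30 as the cross-loading limit) can be applied mechanically to a rotated loading matrix. The loadings below are hypothetical:

```python
import numpy as np

# Hypothetical rotated loading matrix: 5 items x 2 factors
loadings = np.array([
    [0.72, 0.10],   # clean loading on factor 1
    [0.65, 0.05],
    [0.45, 0.38],   # cross-loads -> candidate for elimination
    [0.08, 0.81],
    [0.12, 0.67],
])

main = np.abs(loadings).max(axis=1)                  # strongest loading per item
cross = np.sort(np.abs(loadings), axis=1)[:, -2]     # next-strongest loading per item
keep = (main > 0.40) & (cross < 0.30)                # apply both conventions
print(keep)                                          # the cross-loading item is flagged
```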


FACTOR MATRIX

[Factor-loading matrix for the cereal data; the four factors can be labelled:
Healthful, Artificial, Popularity, Interesting/exciting]


HOW TO USE THE PCA SOLUTION FOR FOLLOW-UP ANALYSES

New product development:
Goal: introduce a new cereal brand that scores high on factor 1 and low on
factor 2.
Who are our main competitors in the market?

HOW DO I ELIMINATE ITEMS?

A subjective process; consider:

1. Size of the main loading (min = .4)
2. Size of cross-loadings (max = .3?)
3. Meaning of the item (face validity)
4. Contribution it makes to the factor
5. Eliminate 1 variable at a time, then re-run, before
deciding which (if any) items to eliminate next
6. Number of items already in the factor


FACTOR LOADINGS & ITEM SELECTION

Comrey & Lee (1992):
loadings > .70 - excellent
         > .63 - very good
         > .55 - good
         > .45 - fair
         > .32 - poor

Cut-off for acceptable loadings:

❖ Look for a gap in the loadings
(e.g., .8, .7, .6, .3, .2)

❖ Choose the cut-off so that factors can be interpreted above, but not
below, the cut-off

PRINCIPAL COMPONENT ANALYSIS IN PRACTICE

❖ Eliminate poor items one at a time, retesting the possible
solutions

❖ Check the factor structure across sub-groups (e.g., gender) if there is
sufficient data

❖ You will probably come up with a different solution from someone
else!

❖ Check/consider reliability analysis


LIMITATIONS

Factor analysis is a highly subjective process regarding the determination
of the number of factors and the interpretation of those factors.

There are no statistical tests regularly employed in factor analysis. As a
result, it is sometimes difficult to know whether the results are merely an
accident or reflect something meaningful.

Best practice:
Consequently, a standard procedure in factor analysis should be to divide
the sample randomly into two or more groups and independently run a
factor analysis on each group. If the same factors emerge in each
analysis, confidence that the results do not represent a statistical
accident is increased.
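The split-sample practice can be checked numerically: run the analysis on each random half and compare the resulting factors with Tucker's congruence coefficient (values near |1| mean the same factor emerged). A sketch on synthetic data with one dominant underlying factor:

```python
import numpy as np

def first_component(X):
    """First principal component (eigenvector of the sample covariance)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(8)
f = rng.normal(size=(400, 1))
X = f + 0.6 * rng.normal(size=(400, 5))        # one dominant underlying factor

idx = rng.permutation(400)                     # random split into two halves
v1 = first_component(X[idx[:200]])
v2 = first_component(X[idx[200:]])

# Tucker's congruence coefficient between the two solutions
congruence = abs(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(congruence, 3))                    # near 1 -> same factor in both halves
```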

HOMEZILLA: ATTRACTING HOMEBUYERS THROUGH BETTER PHOTOS

CASE OVERVIEW

❖ Toronto-based company offering web-listing services for real
estate agents to attract home shoppers in Canada

❖ Sandy Ward: CEO

❖ Real estate agents post property information on the Homezilla
website. Home buyers search for properties by area,
price, number of bedrooms, square footage, etc.

❖ Goal: how to keep users' attention on the photos as long as
possible?

❖ Success: a user views the photos for more than 3 seconds.

GROSS STATE PRODUCT EXAMPLE

The data (expressed in millions of dollars) are measures of gross state
product (GSP) for each of 13 different areas of economic activity in 1996:

1. Agriculture
2. Mining
3. Construction
4. Manufacturing (durable goods)
5. Manufacturing (nondurable goods)
6. Transportation
7. Communications
8. Electricity, gas, sanitation
9. Wholesale trade
10. Retail trade
11. Fiduciary, insurance, real estate
12. Services
13. Government

The sample contains 50 observations, one for each of the 50 US states.

PLOTTING FACTOR SCORES: HOW SHOULD WE LABEL THE FACTORS?