
Dimension Reduction
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero
Dimensionality Reduction
• Dimensionality reduction is a set of techniques used to reduce the amount of data necessary for prediction while still providing accurate models
• Many data mining techniques are not effective for high-dimensional data
• Two high-level approaches
  • Feature selection - selectively exclude dimensions from consideration
  • Feature extraction - mathematically combine dimensions to produce intrinsic/latent dimensions
[Figure: observed variables such as airplane deicing costs, snowplow costs, and heat stroke may all reflect a single latent dimension: temperature?]
Main Approaches

• Feature Selection
  • Manual Feature Selection – The data analyst can examine the available dimensions and exclude those they feel are not useful for modeling purposes
  • Feature Selection Based on an Objective Function – A modeling approach is used to identify the features that appear to have the most influence on the dependent variable

• Feature Extraction – Maps high-dimensional data onto a lower-dimensional subspace (i.e., combines variables)
The Curse of Dimensionality (Bellman 1961)
• The curse of dimensionality is based on the fact that statistical methods count observations that occur in a given space
• As dimensions increase
  • The data needed to make accurate inferences grows exponentially
  • The observations become sparser (more spread out)
• Model performance often suffers as dimensionality increases
[Figure: classifier performance plotted against dimensionality for two sample sizes, with n2 > n1]
Goal

• Simplify the data by removing unnecessary dimensions (noise)


• Improve speed of learning
• Improve predictive accuracy

Applications
• Dimension reduction is useful anytime there is high-dimensional data that may be simplified
• Text mining
• Image retrieval
• Intelligent character recognition
• Facial recognition
• Dimension reduction is used in the business world when producing
models with large datasets
• Customers
• Products
• Etc.

Principal Components Analysis

Dimension Reduction and PCA

• Principal Components Analysis (PCA) is a feature extraction method which takes a classical linear approach to dimension reduction
• PCA projects high-dimensional data onto a lower-dimensional subspace using a linear transformation
• All dimension reduction techniques involve some degree of information loss
• The goal of PCA is to reduce dimensionality while retaining as much information (variation) as possible in the dataset

$$\boldsymbol{X} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix} \;\Rightarrow\; \text{PCA} \;\Rightarrow\; \boldsymbol{X}' = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}, \quad \text{where } K \ll N$$
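For a concrete sense of this mapping, here is a minimal sketch using scikit-learn's PCA (not part of the slides; the data, `K`, and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 100 observations measured on N = 5 correlated dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

K = 2                                    # number of retained components (K << N)
pca = PCA(n_components=K)
X_prime = pca.fit_transform(X)           # projected data, shape (100, K)

print(X_prime.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)     # share of the variation each component retains
```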
PCA Requirements
1. PCA only involves predictors, not target variables.
2. PCA can only be performed on dimensions which are numeric in nature.

Some Matrix Terminology

• Matrix: a rectangular array of rows and columns
  • If the number of rows is equal to the number of columns, the matrix is square
• Principal (diagonal): the diagonal from the upper left to the lower right of a matrix
  • Principal elements: the elements of that diagonal
• Trace: the sum of the principal elements

$$\begin{bmatrix} 5 & 2 & 4 \\ -3 & 6 & 2 \\ 3 & -3 & 1 \end{bmatrix} \qquad trace = (5 + 6 + 1) = 12$$

Useful reference for linear algebra: Lay, D. C. (2003). Linear Algebra and its Applications, 4th edition.
More Matrix Terminology

• Diagonal matrix: a matrix in which all non-diagonal elements are zero

$$\begin{bmatrix} 5 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

• Identity/unity matrix: a scalar matrix in which all diagonal elements equal 1

$$\boldsymbol{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

• Scalar matrix: a diagonal matrix in which all diagonal elements are equal

$$k\boldsymbol{I} = \begin{bmatrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & k \end{bmatrix}$$

• Transpose matrix: $\boldsymbol{A}^T$ is obtained from $\boldsymbol{A}$ by converting rows to columns and columns to rows

$$\boldsymbol{A} = \begin{bmatrix} 5 & 4 \\ 2 & 3 \\ 3 & 1 \end{bmatrix} \Rightarrow \boldsymbol{A}^T = \begin{bmatrix} 5 & 2 & 3 \\ 4 & 3 & 1 \end{bmatrix}$$
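These definitions map directly onto NumPy. A minimal sketch using the matrices above (the off-diagonal entries of the 3 x 3 example are reconstructed as best as the slide allows; only the trace is essential):

```python
import numpy as np

M = np.array([[5, 2, 4],
              [-3, 6, 2],
              [3, -3, 1]])
print(np.trace(M))        # sum of the principal elements: 5 + 6 + 1 = 12

I = np.eye(3)             # 3 x 3 identity matrix
kI = 7 * np.eye(3)        # scalar matrix with k = 7 (k chosen for illustration)

A = np.array([[5, 4],
              [2, 3],
              [3, 1]])
print(A.T)                # transpose: rows become columns
```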
Matrix Operations

$$\boldsymbol{A} = \begin{bmatrix} 3 & 6 \\ 5 & 8 \\ -2 & 9 \end{bmatrix} \qquad \boldsymbol{B} = \begin{bmatrix} -6 & 1 \\ 0 & 9 \\ 8 & 3 \end{bmatrix}$$

Addition: $\boldsymbol{Z} = \boldsymbol{A} + \boldsymbol{B} \iff z_{i,j} = a_{i,j} + b_{i,j}$

$$\boldsymbol{A} + \boldsymbol{B} = \begin{bmatrix} 3 & 6 \\ 5 & 8 \\ -2 & 9 \end{bmatrix} + \begin{bmatrix} -6 & 1 \\ 0 & 9 \\ 8 & 3 \end{bmatrix} = \begin{bmatrix} -3 & 7 \\ 5 & 17 \\ 6 & 12 \end{bmatrix}$$

Subtraction: $\boldsymbol{Z} = \boldsymbol{A} - \boldsymbol{B} \iff z_{i,j} = a_{i,j} - b_{i,j}$

$$\boldsymbol{A} - \boldsymbol{B} = \begin{bmatrix} 3 & 6 \\ 5 & 8 \\ -2 & 9 \end{bmatrix} - \begin{bmatrix} -6 & 1 \\ 0 & 9 \\ 8 & 3 \end{bmatrix} = \begin{bmatrix} 9 & 5 \\ 5 & -1 \\ -10 & 6 \end{bmatrix}$$
Matrix Operations

• Multiplication by a scalar $b$: $\boldsymbol{Z} = \boldsymbol{A} * b \iff z_{i,j} = a_{i,j} * b$

$$\boldsymbol{A} = \begin{bmatrix} 2 & 6 \\ 0 & 5 \end{bmatrix}, \quad b = 3$$

$$\boldsymbol{A} * b = \begin{bmatrix} 2 & 6 \\ 0 & 5 \end{bmatrix} * 3 = \begin{bmatrix} 6 & 18 \\ 0 & 15 \end{bmatrix}$$

$$2 * 3 = 6 \qquad 6 * 3 = 18 \qquad 0 * 3 = 0 \qquad 5 * 3 = 15$$
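A quick NumPy check of these element-wise operations, using the matrices from the two slides above:

```python
import numpy as np

A = np.array([[3, 6], [5, 8], [-2, 9]])
B = np.array([[-6, 1], [0, 9], [8, 3]])

print(A + B)        # element-wise addition:    [[-3  7] [ 5 17] [ 6 12]]
print(A - B)        # element-wise subtraction: [[ 9  5] [ 5 -1] [-10  6]]

C = np.array([[2, 6], [0, 5]])
print(C * 3)        # scalar multiplication:    [[ 6 18] [ 0 15]]
```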
Matrix Operations

• Multiplication: $\boldsymbol{A} * \boldsymbol{B}$ is defined if the number of columns in $\boldsymbol{A}$ equals the number of rows in $\boldsymbol{B}$

$$\boldsymbol{Z} = \boldsymbol{A} * \boldsymbol{B} \iff z_{i,j} = a_{i,1} * b_{1,j} + a_{i,2} * b_{2,j} + a_{i,3} * b_{3,j} + \dots + a_{i,m} * b_{n,j}$$

$$\boldsymbol{A} = \begin{bmatrix} 4 & 1 & 9 \\ 6 & 2 & 8 \\ 7 & 3 & 5 \\ 11 & 10 & 12 \end{bmatrix} \qquad \boldsymbol{B} = \begin{bmatrix} 2 & 9 \\ 5 & 12 \\ 8 & 10 \end{bmatrix}$$

$$\boldsymbol{A} * \boldsymbol{B} = \begin{bmatrix} 4 & 1 & 9 \\ 6 & 2 & 8 \\ 7 & 3 & 5 \\ 11 & 10 & 12 \end{bmatrix} * \begin{bmatrix} 2 & 9 \\ 5 & 12 \\ 8 & 10 \end{bmatrix} = \begin{bmatrix} 85 & 138 \\ 86 & 158 \\ 69 & 149 \\ 168 & 339 \end{bmatrix}$$

e.g., first row: $4*2 + 1*5 + 9*8 = 85$ and $4*9 + 1*12 + 9*10 = 138$
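The same product in NumPy, as a check of the worked example:

```python
import numpy as np

A = np.array([[4, 1, 9],
              [6, 2, 8],
              [7, 3, 5],
              [11, 10, 12]])          # 4 x 3
B = np.array([[2, 9],
              [5, 12],
              [8, 10]])               # 3 x 2

print(A @ B)    # 4 x 2 result: [[ 85 138] [ 86 158] [ 69 149] [168 339]]
```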
Determinant
• Determinant: A function that associates a scalar to a square matrix
• Singularity
  • A matrix with a nonzero determinant is called non-singular (it has an inverse)
  • Conversely, a matrix with a zero determinant is called singular (it has no inverse)
• The determinant of matrix A is denoted |A| or det(A)
• Laplace's Formula

$$|\boldsymbol{A}| = \sum_{j=1}^{n} a_{i,j} C_{i,j} = \sum_{j=1}^{n} a_{i,j} (-1)^{i+j} M_{i,j}$$

• Where
  • $M_{i,j}$: the $i,j$ minor of $\boldsymbol{A}$, obtained by removing row $i$ and column $j$
  • $C_{i,j}$: the scalar cofactor, $C_{i,j} = (-1)^{i+j} M_{i,j}$

Determinant Example

2 x 2 Matrix:

$$\boldsymbol{A} = \begin{bmatrix} 13 & 5 \\ 2 & 4 \end{bmatrix}$$

$$|\boldsymbol{A}| = a_{1,1}a_{2,2} - a_{1,2}a_{2,1} = (13)(4) - (5)(2) = 42$$

3 x 3 Matrix:

$$\boldsymbol{B} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

$$|\boldsymbol{B}| = 1 \cdot \begin{vmatrix} 5 & 6 \\ 8 & 9 \end{vmatrix} - 2 \cdot \begin{vmatrix} 4 & 6 \\ 7 & 9 \end{vmatrix} + 3 \cdot \begin{vmatrix} 4 & 5 \\ 7 & 8 \end{vmatrix}$$

$$|\boldsymbol{B}| = 1 \cdot ((5*9) - (8*6)) - 2 \cdot ((4*9) - (7*6)) + 3 \cdot ((4*8) - (7*5)) = 0$$
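Verifying both determinants with NumPy:

```python
import numpy as np

A = np.array([[13, 5], [2, 4]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(np.linalg.det(A))   # 42.0 (up to floating-point rounding)
print(np.linalg.det(B))   # ~0.0, so B is singular and has no inverse
```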
PCA Methodology
• Step 1: Calculate the mean of each dimension (variable)
• Step 2: Calculate the variance/covariance matrix
  a. Calculate the variance of each attribute
  b. Calculate the covariance of the attributes
  c. Construct the matrix
• Step 3: Compute the Eigenvalues of the covariance matrix and order them from largest to smallest
• Step 4: Compute the Eigenvectors of the covariance matrix associated with those Eigenvalues
• Step 5: Keep the terms corresponding to the K largest Eigenvalues

Useful reference for PCA: Alpaydin, E. (2020). Introduction to machine learning. MIT press.
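A compact NumPy sketch of these five steps, applied to the 10-observation dataset used in the worked example that follows (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def pca(X, K):
    """Minimal PCA sketch following Steps 1-5 above."""
    # Step 1: mean of each dimension
    means = X.mean(axis=0)
    # Step 2: variance/covariance matrix of the centered data
    X_centered = X - means
    cov = np.cov(X_centered, rowvar=False)    # uses n-1 in the denominator
    # Steps 3-4: eigenvalues and eigenvectors, reordered from largest to smallest
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: covariance matrices are symmetric
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the K components with the largest eigenvalues and project
    scores = X_centered @ eigvecs[:, :K]
    return scores, eigvals, eigvecs

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
scores, eigvals, eigvecs = pca(X, K=1)
print(eigvals)    # approximately [1.284, 0.049]
```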
Example Data

x1      x2
2.500   2.400
0.500   0.700
2.200   2.900
1.900   2.200
3.100   3.000
2.300   2.700
2.000   1.600
1.000   1.100
1.500   1.600
1.100   0.900

[Figure: scatter plot of x2 against x1 for the ten observations]
Step 1: Calculate Means

x1            x2
2.500         2.400
0.500         0.700
2.200         2.900
1.900         2.200
3.100         3.000
2.300         2.700
2.000         1.600
1.000         1.100
1.500         1.600
1.100         0.900
Sum: 18.100   19.100

$$\bar{x}_1 = \frac{\sum_{i=1}^{n} x_{1i}}{n} = \frac{18.1}{10} = 1.810 \qquad \bar{x}_2 = \frac{\sum_{i=1}^{n} x_{2i}}{n} = \frac{19.1}{10} = 1.910$$
Step 2a: Calculate Variances

(1) x1    (2) x2    (3) x1 − x̄1    (4) x2 − x̄2    (5) (x1 − x̄1)²    (6) (x2 − x̄2)²
2.500     2.400      0.690           0.490           0.476             0.240
0.500     0.700     -1.310          -1.210           1.716             1.464
2.200     2.900      0.390           0.990           0.152             0.980
1.900     2.200      0.090           0.290           0.008             0.084
3.100     3.000      1.290           1.090           1.664             1.188
2.300     2.700      0.490           0.790           0.240             0.624
2.000     1.600      0.190          -0.310           0.036             0.096
1.000     1.100     -0.810          -0.810           0.656             0.656
1.500     1.600     -0.310          -0.310           0.096             0.096
1.100     0.900     -0.710          -1.010           0.504             1.020
Sum:                                                  5.549             6.449

$$s_{x_1}^2 = \frac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)^2}{(n-1)} = \frac{5.549}{9} = 0.617 \qquad s_{x_2}^2 = \frac{\sum_{i=1}^{n}(x_{2i} - \bar{x}_2)^2}{(n-1)} = \frac{6.449}{9} = 0.717$$
Step 2b: Calculate Covariance

(1) x1    (2) x2    (3) x1 − x̄1    (4) x2 − x̄2    (5) (x1 − x̄1)(x2 − x̄2)
2.500     2.400      0.690           0.490           0.338
0.500     0.700     -1.310          -1.210           1.585
2.200     2.900      0.390           0.990           0.386
1.900     2.200      0.090           0.290           0.026
3.100     3.000      1.290           1.090           1.406
2.300     2.700      0.490           0.790           0.387
2.000     1.600      0.190          -0.310          -0.059
1.000     1.100     -0.810          -0.810           0.656
1.500     1.600     -0.310          -0.310           0.096
1.100     0.900     -0.710          -1.010           0.717
Sum:                                                  5.539

$$COV_{x_1 x_2} = \frac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{(n-1)} = \frac{5.539}{9} = 0.615$$

Note: $COV_{x_1 x_2} = COV_{x_2 x_1}$
Step 2c: Construct Variance/Covariance Matrix

$$\boldsymbol{A} = \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix}$$

• 0.617: variance of x1
• 0.717: variance of x2
• 0.615: covariance of x1 and x2
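Steps 1 and 2 can be verified in NumPy; `ddof=1` gives the n − 1 denominator used above, and `X` is the example dataset:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

print(X.mean(axis=0))               # [1.81 1.91]
print(X.var(axis=0, ddof=1))        # approximately [0.617 0.717]
print(np.cov(X, rowvar=False))      # approximately [[0.617 0.615] [0.615 0.717]]
```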
Step 3: Compute Eigenvalues

• An Eigenvalue is a measure of the variation within the data along a particular path (Eigenvector)
• Given
  • $\lambda$: scalar (this is what we are solving for... our eigenvalues)
  • $\boldsymbol{I}$: identity matrix
  • $\boldsymbol{A}$: non-singular matrix (our variance/covariance matrix)
• For what value of $\lambda$ is $\det(\boldsymbol{A} - \lambda\boldsymbol{I}) = 0$? In other words: for what value of $\lambda$ does the matrix $\boldsymbol{A} - \lambda\boldsymbol{I}$ not have an inverse?

$$\{1\} \quad \det(\boldsymbol{A} - \lambda\boldsymbol{I}) = \left| \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| = 0$$

$$\{2\} \quad = \left| \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right| = 0$$
Step 3: Compute Eigenvalues

$$\{1\} \quad \left| \begin{bmatrix} 0.617 - \lambda & 0.615 \\ 0.615 & 0.717 - \lambda \end{bmatrix} \right| = 0$$

$$\{2\} \quad (0.617 - \lambda)(0.717 - \lambda) - 0.615^2 = 0 \qquad \text{(Characteristic Equation)}$$

$$\{3\} \quad (0.442) - (0.617\lambda) - (0.717\lambda) + \lambda^2 - 0.615^2 = 0$$

$$\{4\} \quad \lambda^2 - 1.333\lambda + 0.063 = 0$$
Step 3: Compute Eigenvalues

$$\{1\} \quad 1\lambda^2 - 1.333\lambda + 0.063 = 0$$

$$\{2\} \quad \lambda = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad a = 1, \; b = -1.333, \; c = 0.063$$

$$\{3\} \quad \lambda_1 = \frac{-(-1.333) + \sqrt{(-1.333)^2 - 4(1)(0.063)}}{2(1)} = 1.284$$

$$\{4\} \quad \lambda_2 = \frac{-(-1.333) - \sqrt{(-1.333)^2 - 4(1)(0.063)}}{2(1)} = 0.050$$
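The same eigenvalues can be obtained directly in NumPy, either as roots of the characteristic polynomial or from the covariance matrix itself:

```python
import numpy as np

A = np.array([[0.617, 0.615],
              [0.615, 0.717]])

print(np.roots([1, -1.333, 0.063]))   # roots of the characteristic equation: about 1.284 and 0.049
print(np.linalg.eigvalsh(A))          # eigenvalues of A, ascending: about [0.050, 1.284]
```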
Observation

The sum of the Eigenvalues is the trace of the matrix:

$$\boldsymbol{A} = \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix}$$

$$trace(\boldsymbol{A}) = (0.617 + 0.717) = 1.334$$
$$(\lambda_1 + \lambda_2) = (1.284 + 0.050) = 1.334$$

You can write Eigenvalues as a percentage of trace(A):

$$\frac{\lambda_1}{trace(\boldsymbol{A})} * 100 = \frac{1.284}{1.334} * 100 = 96.25\%$$
$$\frac{\lambda_2}{trace(\boldsymbol{A})} * 100 = \frac{0.050}{1.334} * 100 = 3.75\%$$

The largest Eigenvalue ($\lambda_1$) is referred to as the principal Eigenvalue.
Another Observation

The product of the Eigenvalues is the determinant of the matrix:

$$\boldsymbol{A} = \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix}$$

$$det(\boldsymbol{A}) = (0.617 * 0.717) - (0.615 * 0.615) = 0.064$$
$$(\lambda_1 * \lambda_2) = (1.284 * 0.050) = 0.064$$
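Both observations are easy to confirm numerically:

```python
import numpy as np

A = np.array([[0.617, 0.615],
              [0.615, 0.717]])
eigvals = np.linalg.eigvalsh(A)              # ascending order

print(np.trace(A), eigvals.sum())            # both approximately 1.334
print(np.linalg.det(A), eigvals.prod())      # both approximately 0.064
print(eigvals / np.trace(A) * 100)           # approximately [3.75, 96.25] percent of the trace
```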
Step 4: Calculate Eigenvectors

• An Eigenvector is the magnitude and direction of a path through the data
• For each Eigenvalue $\lambda_i$, there is a set of associated Eigenvectors
  • The number of Eigenvectors in the set is infinite
  • Eigenvectors corresponding to different Eigenvalues are linearly independent
• Given $\lambda_i$, find a vector $\vec{Z}$ such that $(\boldsymbol{A} - \lambda_i \boldsymbol{I})\vec{Z} = 0$

$$\{1\} \quad \left( \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \lambda_i \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \vec{Z} = 0$$

$$\{2\} \quad \left( \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \begin{bmatrix} \lambda_i & 0 \\ 0 & \lambda_i \end{bmatrix} \right) \vec{Z} = 0$$

$$\{3\} \quad \begin{bmatrix} 0.617 - \lambda_i & 0.615 \\ 0.615 & 0.717 - \lambda_i \end{bmatrix} \vec{Z} = 0$$
Step 4: Calculate Eigenvectors

• Consider $\lambda_1 = 1.284$

$$\{1\} \quad \begin{bmatrix} 0.617 - 1.284 & 0.615 \\ 0.615 & 0.717 - 1.284 \end{bmatrix} \vec{Z} = 0$$

$$\{2\} \quad \begin{bmatrix} -0.667 & 0.615 \\ 0.615 & -0.567 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$\{3\} \quad -0.667 z_1 + 0.615 z_2 = 0$$
$$\{4\} \quad 0.615 z_1 + (-0.567) z_2 = 0$$

These are linearly dependent ({4} is {3} multiplied by -0.922)

$$\{5\} \quad z_1 = 0.922 z_2$$

z2: 1, 2, 3, ...  →  z1: 0.922, 1.844, 2.766, ...
Step 4: Calculate Eigenvectors

• Consider $\lambda_2 = 0.050$

$$\{1\} \quad \begin{bmatrix} 0.617 - 0.050 & 0.615 \\ 0.615 & 0.717 - 0.050 \end{bmatrix} \vec{Z} = 0$$

$$\{2\} \quad \begin{bmatrix} 0.567 & 0.615 \\ 0.615 & 0.667 \end{bmatrix} \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

$$\{3\} \quad 0.567 z_1 + 0.615 z_2 = 0$$
$$\{4\} \quad 0.615 z_1 + 0.667 z_2 = 0$$

$$\{5\} \quad z_1 = -1.085 z_2$$

z2: 1, 2, 3, ...  →  z1: -1.085, -2.170, -3.255, ...
Normalizing Eigenvectors

• Dealing with infinite sets is difficult
• Statistical packages typically normalize Eigenvectors:

$$[z_{1s}, z_{2s}] = \frac{[z_1, z_2]}{\sqrt{z_1^2 + z_2^2}}$$

For $\lambda_1$, using $[z_1, z_2] = [0.922, 1]$:

$$z_{1s} = \frac{z_1}{\sqrt{z_1^2 + z_2^2}} = \frac{0.922}{\sqrt{0.922^2 + 1^2}} = 0.678 \qquad z_{2s} = \frac{z_2}{\sqrt{z_1^2 + z_2^2}} = \frac{1}{\sqrt{0.922^2 + 1^2}} = 0.735$$

For $\lambda_2$, using $[z_1, z_2] = [-1.085, 1]$:

$$z_{1s} = \frac{z_1}{\sqrt{z_1^2 + z_2^2}} = \frac{-1.085}{\sqrt{(-1.085)^2 + 1^2}} = -0.735 \qquad z_{2s} = \frac{z_2}{\sqrt{z_1^2 + z_2^2}} = \frac{1}{\sqrt{(-1.085)^2 + 1^2}} = 0.678$$
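In NumPy, `np.linalg.eigh` returns eigenvectors that are already normalized to unit length, so Step 4 and the normalization above collapse into one call (signs may be flipped, which does not change the component):

```python
import numpy as np

A = np.array([[0.617, 0.615],
              [0.615, 0.717]])
eigvals, eigvecs = np.linalg.eigh(A)       # columns of eigvecs are unit-length eigenvectors

print(eigvals)                             # about [0.050, 1.284] (ascending)
print(eigvecs[:, 1])                       # eigenvector for λ1, about ±[0.678, 0.735]
print(eigvecs[:, 0])                       # eigenvector for λ2, about ±[-0.735, 0.678]
print(np.linalg.norm(eigvecs, axis=0))     # [1. 1.] (already normalized)
```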
Eigenvalues and Eigenvectors

• For our example:

Component   Eigenvalue   Normalized Eigenvectors
PC1         1.284        z1s = 0.678,  z2s = 0.735
PC2         0.050        z1s = -0.735, z2s = 0.678

• It is clear that PC1 is associated with significantly more variation than PC2
• Discarding PC2 allows us to reduce dimensions (2 → 1) and doesn't cost us much information
Calculating Principal Component Values

New reduced dimensions are created by multiplying the centered data (a matrix of the original dimensions minus their means) by the matrix of retained Eigenvectors:

$$\underset{\text{Centered data}}{\begin{bmatrix} 0.69 & 0.49 \\ -1.31 & -1.21 \\ 0.39 & 0.99 \\ 0.09 & 0.29 \\ 1.29 & 1.09 \\ 0.49 & 0.79 \\ 0.19 & -0.31 \\ -0.81 & -0.81 \\ -0.31 & -0.31 \\ -0.71 & -1.01 \end{bmatrix}} \; \underset{\text{Eigenvector}}{\begin{bmatrix} 0.6779 \\ 0.7352 \end{bmatrix}} = \underset{\text{PC1}}{\begin{bmatrix} 0.828 \\ -1.778 \\ 0.992 \\ 0.274 \\ 1.676 \\ 0.913 \\ -0.099 \\ -1.145 \\ -0.438 \\ -1.224 \end{bmatrix}}$$
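The same projection in NumPy, where `X` is the example dataset and `pc1` is the retained normalized eigenvector from above:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_centered = X - X.mean(axis=0)

pc1 = np.array([0.6779, 0.7352])     # retained (normalized) eigenvector
scores = X_centered @ pc1            # one PC1 value per observation

print(np.round(scores, 3))
# [ 0.828 -1.778  0.992  0.274  1.676  0.913 -0.099 -1.145 -0.438 -1.224]
```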
Choosing K

• In an example with two dimensions this is easy... what if we had two hundred dimensions? How many should we keep?
• There are two common approaches for choosing how many principal components to keep (a short sketch of both follows below)
  • Threshold – Determine how much of the information you want to retain, and keep enough components to satisfy that threshold

$$\frac{\sum_{i=1}^{K} \lambda_i}{trace} > Threshold \quad (e.g.,\ 0.9\ \text{or}\ 0.95)$$

  • Scree Plot – Order the Eigenvalues in descending order and plot the kth Eigenvalue against k

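A minimal sketch of both approaches; here `eigvals` holds the two eigenvalues from the worked example, but in practice it would hold all eigenvalues of your covariance matrix (matplotlib is used only for the scree plot):

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([1.284, 0.050])            # eigenvalues of the covariance matrix
eigvals = np.sort(eigvals)[::-1]              # descending order

# Threshold approach: smallest K whose retained share of the trace exceeds the threshold
share = np.cumsum(eigvals) / eigvals.sum()    # trace = sum of the eigenvalues
K = int(np.argmax(share > 0.95)) + 1
print(K, share)                               # K = 1, shares approximately [0.963, 1.0]

# Scree plot: kth eigenvalue against k; look for the "elbow"
plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("k")
plt.ylabel("Eigenvalue")
plt.show()
```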
A Note on PCA and Classification


• PCA works well when the target variable is interval
• If the target is nominal, care must be used if PCA will be employed on the
independent variables
• Projection axes chosen by PCA may not give good discrimination power
• PCA maintains what is common in the data, not what differentiates the classes
• Thus, PCA may reduce the efficacy of classification algorithms if not used with
care
• Linear discriminant analysis may be advisable in these situations

Some Issues
• Covariance is sensitive to large values
• Dimensions with large scales dominate
• Such dimensions are likely to become principal components
• Normalization can help reduce this issue (we'll see this in the future; a short sketch follows below)
• PCA assumes the underlying subspace is linear (i.e., that variables are numeric) and
thus transformations may be required

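Because covariance is scale-sensitive, a common precaution is to standardize each dimension before running PCA (equivalently, work from the correlation matrix). A minimal scikit-learn sketch; the data and names are illustrative, not from the slides:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),          # small-scale dimension
                     rng.normal(0, 1000, 200)])      # large-scale dimension that would otherwise dominate

pipeline = make_pipeline(StandardScaler(), PCA(n_components=1))
scores = pipeline.fit_transform(X)                   # PCA on standardized (unit-variance) dimensions
```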
Some Useful Resources


• Eigenvalue and Eigenvector Calculator
• Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction

