Dimension Reduction
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero
Dimensionality Reduction
• Dimensionality reduction is a set of techniques used to reduce the amount of data necessary for prediction while still providing accurate models
• Many data mining techniques are not effective for high-dimensional data
• Two high-level approaches (see the sketch after this list):
  • Feature selection – selectively exclude dimensions from consideration
  • Feature extraction – mathematically combine dimensions to produce intrinsic/latent dimensions

[Figure: airplane deicing costs, snowplow costs, and heat stroke, all driven by a possible latent dimension: temperature?]
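As a concrete contrast between the two approaches, here is a minimal sketch assuming scikit-learn is available; the synthetic dataset and the choice of five dimensions are illustrative only, not from the slides:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

# Hypothetical 50-dimensional dataset with 5 truly informative features
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Feature selection: keep 5 of the original columns, discard the rest
X_selected = SelectKBest(f_regression, k=5).fit_transform(X, y)

# Feature extraction: combine all 50 columns into 5 new latent dimensions
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (200, 5) (200, 5)
```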
Main Approaches
• Feature Selection
• Manual Feature Selection – The data analyst can examine the available
dimensions and exclude those they feel are not useful for modeling purposes
• Feature Selection based on Objective Function – A modeling approach is
used to identify the features that appear to have the most influence on the
dependent variable
The Curse of Dimensionality
(Bellman 1961)
• The curse of dimensionality is based on the fact that statistical methods count observations that occur in a given space
• As dimensions increase:
  • The data needed to make accurate inferences grows exponentially
  • The observations become sparser (more spread out)
• Model performance often suffers as dimensionality increases

[Figure: classifier performance vs. dimensionality for two sample sizes, $n_2 > n_1$]
Goal
[Figure: the goal of dimension reduction]
Applications
• Dimension reduction is useful anytime there is high-dimensional data that may be simplified
• Text mining
• Image retrieval
• Intelligent character recognition
• Facial recognition
• Dimension reduction is used in the business world when producing
models with large datasets
• Customers
• Products
• Etc.
Principal Component Analysis (PCA) transforms the N original dimensions into K intrinsic dimensions:

$$\mathbf{X} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix} \;\Rightarrow\; PCA \;\Rightarrow\; \mathbf{X}' = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix}, \quad \text{where } K \ll N$$
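A minimal sketch of this N → K mapping, assuming scikit-learn's PCA; the shapes are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

N, K = 100, 3                                  # K << N
X = np.random.default_rng(0).random((500, N))  # 500 observations, N attributes each
X_prime = PCA(n_components=K).fit_transform(X)
print(X.shape, "->", X_prime.shape)            # (500, 100) -> (500, 3)
```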
PCA Requirements
1. PCA only involves predictors, not target variables.
2. PCA can only be performed on dimensions that are numeric in nature.
The trace of a square matrix is the sum of its diagonal elements:

$$\mathrm{trace}\begin{bmatrix} 5 & 2 & 4 \\ -3 & 6 & 2 \\ 3 & -3 & 1 \end{bmatrix} = (5 + 6 + 1) = 12$$
Useful reference for linear algebra: Lay, D. C. (2003). Linear Algebra and its Applications 4th edition.
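The same trace, checked with NumPy:

```python
import numpy as np

A = np.array([[ 5,  2,  4],
              [-3,  6,  2],
              [ 3, -3,  1]])
print(np.trace(A))  # 12 = 5 + 6 + 1
```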
A diagonal matrix has nonzero entries only on its main diagonal; the identity matrix $\mathbf{I}$ is the diagonal matrix of ones, and $k\mathbf{I}$ scales it by a constant $k$:

$$\begin{bmatrix} 5 & 0 & 0 \\ 0 & 6 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad \mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad k\mathbf{I} = \begin{bmatrix} k & 0 & 0 \\ 0 & k & 0 \\ 0 & 0 & k \end{bmatrix}$$

The transpose swaps rows and columns:

$$\mathbf{A} = \begin{bmatrix} 5 & 4 \\ 2 & 3 \\ 3 & 1 \end{bmatrix} \;\Rightarrow\; \mathbf{A}^{T} = \begin{bmatrix} 5 & 2 & 3 \\ 4 & 3 & 1 \end{bmatrix}$$
Matrix Operations
Addition: $\mathbf{Z} = \mathbf{A} + \mathbf{B} \iff z_{i,j} = a_{i,j} + b_{i,j}$
Subtraction: $\mathbf{Z} = \mathbf{A} - \mathbf{B} \iff z_{i,j} = a_{i,j} - b_{i,j}$

$$\mathbf{A} = \begin{bmatrix} 3 & 6 \\ 5 & 8 \\ -2 & 9 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} -6 & 1 \\ 0 & 9 \\ 8 & 3 \end{bmatrix}$$

$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} -3 & 7 \\ 5 & 17 \\ 6 & 12 \end{bmatrix}, \qquad \mathbf{A} - \mathbf{B} = \begin{bmatrix} 9 & 5 \\ 5 & -1 \\ -10 & 6 \end{bmatrix}$$
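Both operations, checked element-wise with NumPy:

```python
import numpy as np

A = np.array([[ 3, 6], [5, 8], [-2, 9]])
B = np.array([[-6, 1], [0, 9], [ 8, 3]])
print(A + B)  # [[-3  7] [ 5 17] [ 6 12]]
print(A - B)  # [[ 9  5] [ 5 -1] [-10  6]]
```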
Matrix Operations
• Multiplication by a scalar $b$: every element of $\mathbf{A}$ is multiplied by $b$

$$\mathbf{A} = \begin{bmatrix} 2 & 6 \\ 0 & 5 \end{bmatrix}, \quad b = 3$$

$$\mathbf{A} * b = \begin{bmatrix} 2 & 6 \\ 0 & 5 \end{bmatrix} * 3 = \begin{bmatrix} 6 & 18 \\ 0 & 15 \end{bmatrix} \qquad (2*3=6,\ 6*3=18,\ 0*3=0,\ 5*3=15)$$
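The same element-wise scaling in NumPy:

```python
import numpy as np

A = np.array([[2, 6], [0, 5]])
print(A * 3)  # [[ 6 18] [ 0 15]]
```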
Matrix Operations
• Multiplication: $\mathbf{Z} = \mathbf{A} * \mathbf{B}$ is defined if the number of columns in $\mathbf{A}$ equals the number of rows in $\mathbf{B}$

$$\mathbf{Z} = \mathbf{A}\mathbf{B} \iff z_{i,j} = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + a_{i,3} b_{3,j} + \dots + a_{i,m} b_{m,j}$$

$$\mathbf{A} = \begin{bmatrix} 4 & 1 & 9 \\ 6 & 2 & 8 \\ 7 & 3 & 5 \\ 11 & 10 & 12 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} 2 & 9 \\ 5 & 12 \\ 8 & 10 \end{bmatrix}$$

$$\mathbf{A} * \mathbf{B} = \begin{bmatrix} 85 & 138 \\ 86 & 158 \\ 69 & 149 \\ 168 & 339 \end{bmatrix}$$

For example, the first row of the result: $z_{1,1} = 4*2 + 1*5 + 9*8 = 85$ and $z_{1,2} = 4*9 + 1*12 + 9*10 = 138$.
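The full product, checked with NumPy (`@` is matrix multiplication):

```python
import numpy as np

A = np.array([[ 4,  1,  9],
              [ 6,  2,  8],
              [ 7,  3,  5],
              [11, 10, 12]])
B = np.array([[2,  9],
              [5, 12],
              [8, 10]])
print(A @ B)  # [[ 85 138] [ 86 158] [ 69 149] [168 339]]
```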
Determinant
• Determinant: A function that associates a scalar to a square matrix
• Singularity
• A matrix with a nonzero determinant is called non-singular (it has an inverse)
• Conversely, a matrix with a zero determinant is called singular (it has no inverse)
• The determinant of matrix A is denoted |A| or det(A)
• Laplace's Formula

$$|\mathbf{A}| = \sum_{j=1}^{n} a_{i,j} C_{i,j} = \sum_{i=1}^{n} a_{i,j} (-1)^{i+j} M_{i,j}$$

• Where
  • $M_{i,j}$: the $i,j$ minor of $\mathbf{A}$, obtained by removing row $i$ and column $j$
  • $C_{i,j} = (-1)^{i+j} M_{i,j}$: the scalar cofactor of $a_{i,j}$
Determinant Example
$$|\mathbf{A}| = \sum_{j=1}^{n} a_{i,j} C_{i,j} = \sum_{i=1}^{n} a_{i,j} (-1)^{i+j} M_{i,j}$$

2 × 2 Matrix:

$$\mathbf{A} = \begin{bmatrix} 13 & 5 \\ 2 & 4 \end{bmatrix}, \qquad |\mathbf{A}| = a_{1,1} a_{2,2} - a_{1,2} a_{2,1} = (13)(4) - (5)(2) = 42$$

3 × 3 Matrix:

$$\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

$$|\mathbf{B}| = 1 \begin{vmatrix} 5 & 6 \\ 8 & 9 \end{vmatrix} - 2 \begin{vmatrix} 4 & 6 \\ 7 & 9 \end{vmatrix} + 3 \begin{vmatrix} 4 & 5 \\ 7 & 8 \end{vmatrix}$$

$$|\mathbf{B}| = 1 \left( (5*9) - (8*6) \right) - 2 \left( (4*9) - (7*6) \right) + 3 \left( (4*8) - (7*5) \right) = 0$$
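Both determinants, checked with NumPy (note the floating-point output):

```python
import numpy as np

A = np.array([[13, 5], [2, 4]])
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.linalg.det(A))  # ~42.0
print(np.linalg.det(B))  # ~0: B is singular, so it has no inverse
```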
PCA Methodology
• Step 1: Calculate the mean of each dimension (variable)
• Step 2: Calculate the variance/covariance matrix
a. Calculate the variance of each attribute
b. Calculate the covariance of the attributes
c. Construct the matrix
• Step 3: Compute the Eigenvalues of the covariance matrix and order them from largest to smallest
• Step 4: Compute the Eigenvectors of the covariance matrix associated with those Eigenvalues
• Step 5: Keep the terms corresponding to the K largest Eigenvalues (a NumPy sketch of all five steps follows the reference below)
Useful reference for PCA: Alpaydin, E. (2020). Introduction to machine learning. MIT press.
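A from-scratch sketch of these five steps in NumPy; the function name `pca` is illustrative, not from the slides:

```python
import numpy as np

def pca(X, K):
    # Step 1: mean of each dimension (variable)
    Xc = X - X.mean(axis=0)                  # centered data
    # Step 2: sample variance/covariance matrix (n - 1 denominator)
    cov = np.cov(Xc, rowvar=False)
    # Steps 3-4: eigenvalues and eigenvectors, reordered largest to smallest
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the K largest and project the centered data onto them
    return Xc @ eigvecs[:, :K], eigvals
```

Applied to the example data introduced next, this should reproduce the eigenvalues (1.284, 0.050) derived by hand in the following slides.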
Example Data
[Figure: scatter plot of the example data, x₁ vs. x₂]
Example Data
x₁      x₂
2.500   2.400
0.500   0.700
2.200   2.900
1.900   2.200
3.100   3.000
2.300   2.700
2.000   1.600
1.000   1.100
1.500   1.600
1.100   0.900
Step 1: With column sums of 18.100 and 19.100 over the $n = 10$ observations:

$$\bar{x}_1 = \frac{\sum_{i=1}^{n} x_{1i}}{n} = \frac{18.1}{10} = 1.810, \qquad \bar{x}_2 = \frac{\sum_{i=1}^{n} x_{2i}}{n} = \frac{19.1}{10} = 1.910$$
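Step 1 checked in NumPy:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(X.mean(axis=0))  # [1.81 1.91]
```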
Step 2a: For each observation, compute the deviations from the means and their squares:

x₁      x₂      x₁−x̄₁    x₂−x̄₂    (x₁−x̄₁)²  (x₂−x̄₂)²
2.500   2.400    0.690    0.490    0.476     0.240
0.500   0.700   -1.310   -1.210    1.716     1.464
2.200   2.900    0.390    0.990    0.152     0.980
1.900   2.200    0.090    0.290    0.008     0.084
3.100   3.000    1.290    1.090    1.664     1.188
2.300   2.700    0.490    0.790    0.240     0.624
2.000   1.600    0.190   -0.310    0.036     0.096
1.000   1.100   -0.810   -0.810    0.656     0.656
1.500   1.600   -0.310   -0.310    0.096     0.096
1.100   0.900   -0.710   -1.010    0.504     1.020
Sum:                               5.549     6.449

$$s^2_{x_1} = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}{n-1} = \frac{5.549}{9} = 0.617, \qquad s^2_{x_2} = \frac{\sum_{i=1}^{n} (x_{2i} - \bar{x}_2)^2}{n-1} = \frac{6.449}{9} = 0.717$$
Steps 2b–c: The covariance of $x_1$ and $x_2$ fills the off-diagonal entries:

$$s_{x_1 x_2} = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n-1} = \frac{5.539}{9} = 0.615$$

$$\mathbf{A} = \begin{bmatrix} s^2_{x_1} & s_{x_1 x_2} \\ s_{x_1 x_2} & s^2_{x_2} \end{bmatrix} = \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix}$$

The diagonal entries are the variances of $x_1$ and $x_2$; the off-diagonal entries are the covariance of $x_1$ and $x_2$.
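The same matrix from `np.cov`, which also uses the $n-1$ denominator:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(np.cov(X, rowvar=False))  # [[0.617 0.615] [0.615 0.717]] (rounded)
```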
• An Eigenvalue is a measure of the variation within the data along a particular direction (Eigenvector)
• Given $\det(\mathbf{A} - \lambda \mathbf{I}) = 0$:
  • $\lambda$: scalar (this is what we are solving for… our eigenvalues)
  • $\mathbf{I}$: identity matrix
  • $\mathbf{A}$: non-singular matrix (our variance/covariance matrix)
• For what value of $\lambda$ is $\det(\mathbf{A} - \lambda \mathbf{I}) = 0$? In other words: for what value of $\lambda$ does the matrix $(\mathbf{A} - \lambda \mathbf{I})$ not have an inverse?

$$\{1\} \quad \det(\mathbf{A} - \lambda \mathbf{I}) = \left| \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| = 0$$

$$\{2\} \quad = \left| \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right| = 0$$
$$\{1\} \quad \left| \begin{bmatrix} 0.617 - \lambda & 0.615 \\ 0.615 & 0.717 - \lambda \end{bmatrix} \right| = 0$$

$$\{2\} \quad (0.617 - \lambda)(0.717 - \lambda) - 0.615^2 = 0 \quad \leftarrow \text{the Characteristic Equation}$$
Expanding the characteristic equation gives a quadratic, solved with the quadratic formula $\lambda = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$:

$$\{2\} \quad \lambda^2 - 1.333\lambda + 0.063 = 0, \quad \text{so } a = 1,\ b = -1.333,\ c = 0.063$$

$$\{3\} \quad \lambda_1 = \frac{-(-1.333) + \sqrt{(-1.333)^2 - 4(1)(0.063)}}{2(1)} = 1.284$$

$$\{4\} \quad \lambda_2 = \frac{-(-1.333) - \sqrt{(-1.333)^2 - 4(1)(0.063)}}{2(1)} = 0.050$$
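The same eigenvalues, obtained numerically; `np.roots` solves the rounded quadratic and `np.linalg.eigvals` works directly on the matrix:

```python
import numpy as np

A = np.array([[0.617, 0.615], [0.615, 0.717]])
print(np.linalg.eigvals(A))          # ~[1.284 0.050] (order not guaranteed)
print(np.roots([1, -1.333, 0.063]))  # ~[1.284 0.049] from the rounded quadratic
```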
Observation

$$\mathbf{A} = \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix}$$

$$\mathrm{trace}(\mathbf{A}) = (0.617 + 0.717) = 1.334 \qquad (\lambda_1 + \lambda_2) = (1.284 + 0.050) = 1.334$$

Another Observation

$$\det(\mathbf{A}) = (0.617 * 0.717) - (0.615 * 0.615) = 0.064 \qquad (\lambda_1 \cdot \lambda_2) = (1.284)(0.050) = 0.064$$

The sum of the eigenvalues equals the trace, and the product of the eigenvalues equals the determinant.
To find the Eigenvectors, solve $(\mathbf{A} - \lambda_i \mathbf{I}) \vec{z} = \vec{0}$ for each eigenvalue $\lambda_i$:

$$\{1\} \quad \left( \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \lambda_i \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \vec{z} = \vec{0}$$

$$\{2\} \quad \left( \begin{bmatrix} 0.617 & 0.615 \\ 0.615 & 0.717 \end{bmatrix} - \begin{bmatrix} \lambda_i & 0 \\ 0 & \lambda_i \end{bmatrix} \right) \vec{z} = \vec{0}$$

$$\{3\} \quad \begin{bmatrix} 0.617 - \lambda_i & 0.615 \\ 0.615 & 0.717 - \lambda_i \end{bmatrix} \vec{z} = \vec{0}$$
Consider $\lambda_1 = 1.284$:

$$\{3\} \quad -0.667 z_1 + 0.615 z_2 = 0$$
$$\{4\} \quad 0.615 z_1 + (-0.567) z_2 = 0$$

These are linearly dependent ({4} is {3} × (−0.922)), so any vector satisfying

$$\{5\} \quad z_1 = 0.922 z_2$$

is an eigenvector for $\lambda_1$; for example:

z₁:  0.922   1.844   2.766   …
z₂:  1       2       3       …
Consider $\lambda_2 = 0.050$:

$$\{3\} \quad 0.567 z_1 + 0.615 z_2 = 0$$
$$\{4\} \quad 0.615 z_1 + 0.667 z_2 = 0$$

These are again linearly dependent, so any vector satisfying

$$\{5\} \quad z_1 = -1.085 z_2$$

is an eigenvector for $\lambda_2$; for example:

z₁:  −1.085   −2.170   −3.255   …
z₂:  1        2        3        …
Normalizing Eigenvectors
• Dealing with infinite sets is difficult
• Statistical packages typically normalize Eigenvectors to unit length:

$$[z_{1s}, z_{2s}] = \frac{[z_1, z_2]}{\sqrt{z_1^2 + z_2^2}}$$

For $\lambda_1$, taking $[z_1, z_2] = [0.922, 1]$:

$$z_{1s} = \frac{z_1}{\sqrt{z_1^2 + z_2^2}} = \frac{0.922}{\sqrt{0.922^2 + 1^2}} = 0.678, \qquad z_{2s} = \frac{1}{\sqrt{0.922^2 + 1^2}} = 0.735$$

For $\lambda_2$, taking $[z_1, z_2] = [-1.085, 1]$:

$$z_{1s} = \frac{-1.085}{\sqrt{(-1.085)^2 + 1^2}} = -0.735, \qquad z_{2s} = \frac{1}{\sqrt{(-1.085)^2 + 1^2}} = 0.678$$
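NumPy returns eigenvectors already normalized to unit length, so the hand normalization above matches its output up to sign:

```python
import numpy as np

A = np.array([[0.617, 0.615], [0.615, 0.717]])
eigvals, eigvecs = np.linalg.eigh(A)  # eigh: eigenvalues ascending, unit-length vectors
print(eigvecs[:, ::-1])               # columns ~ [0.678, 0.735] and [-0.735, 0.678]
                                      # (signs may flip; only the direction matters)
```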
• It is clear that PC1 is associated with significantly more variation than PC2 ($\lambda_1 = 1.284$ vs. $\lambda_2 = 0.050$)
• Discarding PC2 allows us to reduce dimensions (2 → 1) and doesn't cost us much information
New reduced dimensions are created by multiplying the centered data (the original dimensions minus their means) by the matrix of retained Eigenvectors:

$$\begin{bmatrix} 0.69 & 0.49 \\ -1.31 & -1.21 \\ 0.39 & 0.99 \\ 0.09 & 0.29 \\ 1.29 & 1.09 \\ 0.49 & 0.79 \\ 0.19 & -0.31 \\ -0.81 & -0.81 \\ -0.31 & -0.31 \\ -0.71 & -1.01 \end{bmatrix} \cdot \begin{bmatrix} 0.678 \\ 0.735 \end{bmatrix} = \begin{bmatrix} 0.828 \\ -1.778 \\ 0.992 \\ 0.274 \\ 1.676 \\ 0.913 \\ -0.099 \\ -1.145 \\ -0.438 \\ -1.224 \end{bmatrix}$$
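The same projection in NumPy, centering the data and multiplying by the retained eigenvector:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
Xc = X - X.mean(axis=0)          # centered data
v1 = np.array([0.678, 0.735])    # retained eigenvector for PC1
print(Xc @ v1)                   # [ 0.828 -1.778  0.992  0.274  1.676
                                 #   0.913 -0.099 -1.145 -0.438 -1.224]
```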
Choosing K
• In an example with two dimensions this is easy… what if we had two hundred
dimensions? How many should we keep?
• There are two common approaches for choosing how many principal components to
keep
• Threshold – Determine how much of the information you want to retain, keep
enough components to satisfy that threshold
$$\frac{\sum_{i=1}^{K} \lambda_i}{\mathrm{trace}(\mathbf{A})} > \mathrm{Threshold} \quad (e.g.,\ 0.9 \text{ or } 0.95)$$
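A sketch of the threshold rule with hypothetical eigenvalues, already sorted largest to smallest:

```python
import numpy as np

eigvals = np.array([4.2, 2.1, 1.0, 0.4, 0.2, 0.1])  # hypothetical, sorted descending
explained = np.cumsum(eigvals) / eigvals.sum()       # eigvals.sum() equals the trace
K = int(np.argmax(explained > 0.90)) + 1             # smallest K clearing the threshold
print(K, np.round(explained, 3))  # 3 [0.525 0.787 0.912 0.962 0.987 1.   ]
```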
Some Issues
• Covariance is sensitive to large values
• Dimensions with large scales dominate
• Such dimensions are likely to become principal components
• Normalization can help reduce this issue (we'll see this in the future; a quick sketch follows below)
• PCA assumes the underlying subspace is linear (i.e., that variables are numeric) and
thus transformations may be required
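One common remedy, sketched with scikit-learn: standardize each dimension before PCA so scale differences do not dominate the covariance (this is equivalent to running PCA on the correlation matrix):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize first so large-scale dimensions don't dominate the components
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
```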