
STATISTICAL PROGRAMMING - PYTHON

Machine Learning:
dimensionality reduction

Pablo Monfort, Instituto de Empresa


Summary

▪ Machine Learning overview


▪ Unsupervised learning
▪ clustering
▪ KMeans
▪ hierarchical clustering
▪ dimensionality reduction
▪ feature selection
▪ feature extraction
▪ PCA: principal component analysis
▪ LDA: linear discriminant analysis
▪ association rules
▪ Supervised learning
▪ classification
▪ regression
Machine learning: algorithms

Supervised learning:
- Classification: classification trees…
- Regression: linear regression, logistic regression, stepwise regression,
regression trees…

Unsupervised learning:
- Clustering: k-means, k-medians, hierarchical clustering…
- Dimensionality reduction: principal component analysis, discriminant
analysis…
- Association rules: Apriori algorithm…

Machine Learning: use cases

Supervised learning:
- Classification: customer retention, fraud detection, image classification…
- Regression: market forecasting, population growth prediction…

Unsupervised learning:
- Clustering: customer segmentation, recommender systems…
- Dimensionality reduction: structure discovery, big data visualization…
- Association rules: targeted marketing…

Dimensionality reduction: overview
Goal:
reduce the number of variables under consideration by obtaining a set of principal variables.

How does it work?

The data are transformed from the high-dimensional space to a space with fewer dimensions. Depending on the transformation applied, several algorithms arise:
- PCA (does not consider labels. Unsupervised. Optimized over all the data)
- kernel PCA (nonlinear extension of PCA)
- LDA (considers labels. Supervised. Optimized per segment)
- GDA (nonlinear extension of LDA)…

Uses:
a dimensionality reduction algorithm can be applied:
- prior to a K-nearest neighbors algorithm, to avoid the curse of dimensionality (see the sketch below)
- prior to regressions, to avoid overfitting caused by strong correlations
- for image processing
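
A minimal sketch of the first use, assuming scikit-learn and its bundled digits dataset (the choice of 10 components is illustrative only): scale the data, project it onto a few principal components, then classify with K-nearest neighbors.

# Minimal sketch: PCA before K-nearest neighbors (scikit-learn assumed installed).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 64 original variables (8x8 pixels)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, reduce the 64 variables to 10 principal components, then apply KNN.
model = make_pipeline(StandardScaler(), PCA(n_components=10), KNeighborsClassifier())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))           # accuracy on the held-out part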
PCA: overview

[Figure: PCA on a two-dimensional dataset with two components. The 1st principal component is PC1 = x − y; the 2nd component is x + y.]
PCA: description
Goal:
- Reduce the dimensionality (number of variables) while maximizing the variance explained.

How does it work?
- The first component is a linear combination of the original variables chosen to have the largest possible variance over the data.
- The second component is built in the same way, but orthogonal to the previous component(s); likewise for the third, fourth… components.
- At the end, an orthogonal basis of vectors (the principal components) has been created.
[The number of components will be the minimum of the number of variables and the number of observations minus one.]

Advantages:
- Reduces the number of variables, for simpler understanding and visualization
- Improves the application of a regression analysis or of K-nearest neighbors

Disadvantages:
- The meaning of a principal component can be hard to understand and explain
- High computational cost with large amounts of data (solution: incremental PCA)

[Figure: example with 1st component 2x + y and 2nd component x − 2y.]
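
To make the construction concrete, a minimal sketch on simulated data (NumPy and scikit-learn assumed) checking that the components are orthonormal linear combinations of the original variables and that transforming is just centering plus projecting:

# Minimal sketch: the principal components are orthonormal linear combinations
# of the original variables (simulated data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 observations, 3 variables

pca = PCA(n_components=3).fit(X)

print(pca.components_)                           # each row: coefficients of one component
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))   # orthonormal basis

# Transforming by hand: center the data, then project onto the components.
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X)))   # True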
PCA: description

[Figure: three two-component examples.
- Example 1: total variance explained using only 1 PC: 50%
- Example 2: total variance explained using only 1 PC: 65%
- Example 3: total variance explained using only 1 PC: 90%]

Remember to scale the data!
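
PCA is variance-driven, so variables measured on large scales dominate the components. A minimal sketch on simulated data (StandardScaler from scikit-learn assumed) showing how the cumulative explained variance changes with and without scaling:

# Minimal sketch: scaling changes the explained-variance profile (simulated data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1000, size=200),            # one variable on a much larger scale
    rng.normal(scale=1, size=200),
    rng.normal(scale=1, size=200),
])

print(np.cumsum(PCA().fit(X).explained_variance_ratio_))
# Without scaling, the large-scale variable dominates the first component (~100%).

X_scaled = StandardScaler().fit_transform(X)
print(np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_))
# After scaling, the variance is spread roughly evenly across the three components.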

PCA

from sklearn.decomposition import PCA

pca = PCA(…)
Arguments in PCA:
- n_components = number of components to keep
- svd_solver = 'auto' (default), 'randomized'…
- whiten = True or False (default: False)

pca.fit(data)
Attributes:
- pca.explained_variance_ratio_ # fraction of variance explained by each component
  (cumulative: np.cumsum(pca.explained_variance_ratio_))
- pca.components_ # coefficients of the linear transformation of the original data
  that produce the components
- pca.n_components_ # number of components kept

data_pca = pca.transform(data) # data coordinates in the principal-component space
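
Putting the calls above together, a minimal runnable sketch on simulated data (the argument values are illustrative choices, not the only valid ones):

# Minimal end-to-end sketch of the PCA calls listed above (simulated data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(150, 4))                 # 150 observations, 4 variables

pca = PCA(n_components=2, svd_solver='randomized', whiten=False, random_state=0)
pca.fit(data)

print(pca.explained_variance_ratio_)             # variance explained by each component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative explained variance
print(pca.components_)                           # linear-combination coefficients
print(pca.n_components_)                         # 2

data_pca = pca.transform(data)                   # coordinates in the new space
print(data_pca.shape)                            # (150, 2)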

PCA: exercise
Programming challenge DRED.1
Taking into consideration the iris dataset:

1. How many principal components can we consider?
2. How do you expect the cumulative percentage of explained variance to evolve with the number of components? Calculate it.
3. Consider the number of components necessary to explain at least 99% of the variance. Give the equations to calculate these components.
4. Calculate the new values for this decomposition and plot them.
5. Repeat steps 3 and 4 taking 95% of the variance.

Programming challenge DRED.2

1. Build a dataset with 4 variables where 100% of the explained variance is reached with only 2 components.
2. Can you describe the variance explained in a dataset with 4 variables (X, Y, Z, T) where cor(Z, T) = 1?
3. How many principal components should we take to explain 100% of the variance for a dataset with 4 variables (X, Y, Z, T) where Z = Y, T = 7, X ~ N(1, 1) and Y ~ N(1, 1)? And if X ~ N(1, 1000) and Y ~ N(1, 1000)?
PCA: exercise

Programming challenge DRED.3


The digits dataset contains 1,797 images of size 8×8 representing the digits 0 to 9.
Taking this dataset into consideration:

1. How many principal components are required to explain at least 45% of the variance?
2. Calculate the new values for this decomposition. What are the equations to calculate these new coordinates?

PCA: exercise

Programming challenge DRED.4


Taking into consideration the previous digits dataset:

1. How much of the variance is explained taking only 2 principal components?
2. Calculate the new values for this decomposition.
3. Plot all the digit records using these 2 principal components, coloring each point by its target class.

LDA vs PCA
LDA is a supervised classification method based on linear combinations of the variables, while PCA is a dimensionality reduction method per se.

PCA seeks the directions that maximize the total variance of the data, without using class labels. The idea is to project the data onto the dimensions along which the points spread out the most.

LDA seeks the directions that maximize the inter-class variance relative to the intra-class variance, using the labels. The idea is to make the different groups of data as distinguishable as possible.
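
A minimal sketch of the contrast, assuming scikit-learn and its bundled iris dataset: PCA never sees the labels, LDA requires them.

# Minimal sketch: PCA ignores class labels, LDA uses them (iris dataset assumed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)     # unsupervised: y is not used
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised

print(X_pca.shape, X_lda.shape)                  # both (150, 2), but the axes differ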

LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(…)
Arguments in LinearDiscriminantAnalysis:
- n_components = number of components (at most the number of classes minus one)
- solver = 'svd' (default), 'eigen'…

lda.fit(data, target_vector)
Attributes:
- lda.explained_variance_ratio_ # fraction of variance explained by each component
  (cumulative: np.cumsum(lda.explained_variance_ratio_))
- lda.scalings_ # coefficients of the linear transformation of the original data
  that produce the components (lda.coef_ holds the classifier weights)

projected_data = lda.transform(data) # new data coordinates using the LDA components
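
A minimal runnable sketch of these calls, assuming scikit-learn and its bundled iris dataset (three classes, so at most two discriminant components):

# Minimal sketch of the LDA calls listed above (iris dataset assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data, target_vector = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)     # at most n_classes - 1 = 2 here
lda.fit(data, target_vector)

print(lda.explained_variance_ratio_)                 # variance explained per discriminant
print(np.cumsum(lda.explained_variance_ratio_))
print(lda.scalings_)                                 # projection used by transform
print(lda.coef_)                                     # classifier weights, one row per class

projected_data = lda.transform(data)
print(projected_data.shape)                          # (150, 2)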

Programming challenge DRED.5


Repeat the previous exercise using LDA instead of PCA.
Session Wrap-up

Glossary

PCA(…)
.fit()
.explained_variance_ratio_
.components_
.n_components_
.transform(data)

