
Principal Component Analysis, or PCA

• PCA is a dimensionality-reduction method.

• It is often used to reduce the dimensionality of large data sets.

How?
• By transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
Principal Component Analysis, or PCA
• Reducing the number of variables of a data set naturally comes at the
expense of accuracy.

• The trick in dimensionality reduction is to trade a little accuracy
for simplicity.

• Smaller data sets are easier to explore and visualize, and they make
analysis much easier and faster for machine learning algorithms, which
have fewer extraneous variables to process.
Idea of PCA
• Reduce the number of variables of a data set, while preserving as
much information as possible (a minimal example follows below).
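As a concrete illustration (not from the original slides), here is a minimal sketch using scikit-learn's PCA; the data matrix X and all variable names are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 200 samples described by 10 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Reduce to 2 principal components while keeping as much
# variance (information) as possible.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```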
Principal Component Analysis (PCA)

• Given a set of points, how do we know if they can be compressed like in
the previous example?
– The answer is to look into the correlation between the points.
– The tool for doing this is called PCA.
PCA
• By finding the eigenvalues and eigenvectors of the covariance matrix, we
find that the eigenvectors with the largest eigenvalues correspond to the
directions of greatest variance in the dataset.
• The eigenvector with the largest eigenvalue is the first principal
component (see the sketch below).
• PCA is a useful statistical technique that has found application in:
– fields such as face recognition and image compression
– finding patterns in data of high dimension.
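To make the eigen-decomposition concrete, here is a minimal NumPy sketch (illustrative, not from the slides); the data matrix X and all names are assumptions:

```python
import numpy as np

# Hypothetical data: 100 samples, 3 correlated variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
X[:, 1] += 0.9 * X[:, 0]          # make two variables correlated

# 1. Center the data and compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen-decompose the covariance matrix (eigh suits symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort by eigenvalue, largest first; the leading eigenvector is the
#    first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the top 2 components: 3 variables reduced to 2.
X_reduced = X_centered @ eigenvectors[:, :2]
print(X_reduced.shape)   # (100, 2)
```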
Imbalanced Data Set
• Classification predictive modeling involves predicting a class label for
a given observation.

• An imbalanced classification problem is a classification problem where
the distribution of examples across the known classes is biased or skewed.

• The distribution can vary from a slight bias to a severe imbalance where
there is one example in the minority class for hundreds, thousands, or
millions of examples in the majority class or classes.
Example
• Cancer Prediction

No Cancer – 900 records --- Majority Class
Yes Cancer – 100 records --- Minority Class

If 1000 records are given, biased towards No Cancer, a model that always
predicts "No Cancer" still achieves 90% accuracy.

Most algorithms work towards the majority class.

In business problems, the minority class is often the focus class,
e.g. Spam vs. Non-Spam.

If accuracy is taken as the metric, algorithms tend to be biased towards
the majority class (see the sketch below).
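A minimal sketch of this accuracy pitfall, using the 900/100 counts above (illustrative, not from the slides):

```python
import numpy as np

# 900 "No Cancer" (0, majority) and 100 "Cancer" (1, minority) labels.
y_true = np.array([0] * 900 + [1] * 100)

# A useless model that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.0%}")               # 90% despite learning nothing
print(f"minority recall: {minority_recall:.0%}") # 0% of cancers detected
```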


Methods to handle
• Undersampling

Randomly keep only 100 of the 900 No Cancer (NC) records, alongside all
100 Cancer (C) records:

100 – NC
100 – C
====
200 – perfectly balanced
====

• In ML, data is very important; losing 800 majority-class records is not
recommended (a sketch follows below).
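A minimal random-undersampling sketch in NumPy (illustrative; libraries such as imbalanced-learn provide a RandomUnderSampler built on the same idea):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 900 majority (0) and 100 minority (1) rows.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Randomly keep only as many majority rows as there are minority rows;
# the other 800 majority rows are discarded (lost data).
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
kept = np.concatenate([kept_majority, minority_idx])

X_balanced, y_balanced = X[kept], y[kept]
print(np.bincount(y_balanced))   # [100 100] -> 200, perfectly balanced
```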
Methods to handle
• Oversampling

900 – NC
900 – C
====
1800 – perfectly balanced --- focus is on the minority class
====

Cancer (minority) class: take records at random (e.g. 30 at a time) and
duplicate them until the count reaches 900 (random duplication). Some
records may be duplicated more than others; of the 900 minority records,
800 are duplicates.

• In ML, data is very important and losing data is not recommended;
oversampling keeps every original record (a sketch follows below).
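A minimal random-oversampling sketch (illustrative; imbalanced-learn's RandomOverSampler implements the same idea):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 900 majority (0) and 100 minority (1) rows.
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep the 100 originals and add 800 random duplicates so the minority
# count reaches 900; some records get duplicated more often than others.
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx),
                   replace=True)
kept = np.concatenate([majority_idx, minority_idx, extra])

X_balanced, y_balanced = X[kept], y[kept]
print(np.bincount(y_balanced))   # [900 900] -> 1800, perfectly balanced
```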
Undersampling vs Oversampling
Methods to handle
• SMOTE (Synthetic Minority Oversampling Technique)

SMOTE
• Take the difference between two minority-class vectors, multiply it by
a random number between 0 and 1, and plot the new data point at the
result.

• The new point is a synthetic data point.

• Repeat the process till you reach the desired number of points
(a sketch follows below).
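A minimal sketch of the SMOTE interpolation step (illustrative; the full algorithm, e.g. imbalanced-learn's SMOTE, picks the second vector from the k nearest neighbours of the first):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points: 5 samples with 2 features.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                     [1.1, 1.3], [0.9, 0.7]])

def smote_point(x, neighbour, rng):
    """Synthesize a point on the line segment between x and neighbour:
    new = x + gap * (neighbour - x), with gap drawn uniformly from [0, 1]."""
    gap = rng.uniform(0.0, 1.0)
    return x + gap * (neighbour - x)

# Repeat the process till we reach the desired number of synthetic points.
synthetic = []
for _ in range(10):
    i, j = rng.choice(len(minority), size=2, replace=False)
    synthetic.append(smote_point(minority[i], minority[j], rng))

print(np.array(synthetic).shape)   # (10, 2) new synthetic minority points
```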
