Principal Component Analysis (PCA)
• Is a dimensionality-reduction method.
How?
• By transforming a large set of variables into a smaller one that still
contains most of the information in the large set.
Idea of PCA
• Reduce the number of variables of a data set, while preserving as
much information as possible.
• Reducing the number of variables of a data set naturally comes at the
expense of some accuracy, but the trade-off is worthwhile: smaller data
sets are easier to explore and visualize, and they make analyzing data
much easier and faster for machine learning algorithms, with no
extraneous variables to process.
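The idea above can be sketched in a few lines. This uses scikit-learn's `PCA` (an assumption; the notes do not name a library), and the correlated data is synthetic, for illustration only:

```python
# Minimal PCA sketch: reduce 10 correlated variables to 2 components.
# Assumes scikit-learn is installed; the data is synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples of 10 variables that really only have 3 underlying factors
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (200, 10)
print(X_reduced.shape)  # (200, 2)
# Fraction of the original variance ("information") the 2 components keep
print(pca.explained_variance_ratio_.sum())
```

`explained_variance_ratio_` quantifies the accuracy-vs-size trade-off the notes mention: the closer its sum is to 1, the less information the reduction lost.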
• If 1000 records are given, biased towards NC (900 NC / 100 C), a model
that always predicts NC still reports 90% accuracy.
• In business problems the minority class is the focus class,
e.g. Spam / Non-Spam.
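The 90% accuracy claim above can be verified directly. A sketch in plain Python, with hypothetical labels (900 majority "NC" records, 100 minority "C" records):

```python
# Why accuracy misleads on imbalanced data: a trivial model that
# always predicts the majority class still scores 90%.
labels = ["NC"] * 900 + ["C"] * 100
predictions = ["NC"] * len(labels)  # always predict the majority

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.9 -- yet every minority (focus-class) record is missed
```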
Under Sampling: keep only 100 of the 900 NC records
100 – NC
100 – C
====
200 -- perfectly balanced
========
• ML data is very important; losing data, as under sampling does, is
not recommended.
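A minimal random under-sampling sketch in plain Python (no library assumed), using the 900 NC / 100 C counts from the notes:

```python
# Random under-sampling: drop majority "NC" records until the
# classes balance. The data here is hypothetical (label, id) pairs.
import random

random.seed(0)
majority = [("NC", i) for i in range(900)]
minority = [("C", i) for i in range(100)]

# Keep a random majority subset equal in size to the minority
kept_majority = random.sample(majority, len(minority))
balanced = kept_majority + minority

print(len(balanced))  # 200 -- balanced, but 800 NC records were thrown away
```

The drawback the notes flag is visible in the last line: 800 real records are discarded.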
Methods to handle
• Over Sampling (Random Duplication)
900 – NC
900 – C
===================================
1800 -- perfectly balanced --- focus is on the minority class
===================================
Random Duplication: from the 100 minority (Cancer) records, take 30
records at random and duplicate them, repeating till the minority
class reaches 900.
• Few records may be duplicated more, few records less.
• Of the 900 minority records, 800 are duplicates.
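The random-duplication steps above can be sketched in plain Python (the (label, id) records are hypothetical):

```python
# Random over-sampling: duplicate randomly chosen minority ("C")
# records until both classes have 900.
import random

random.seed(0)
majority = [("NC", i) for i in range(900)]
minority = [("C", i) for i in range(100)]

oversampled = list(minority)
while len(oversampled) < len(majority):
    # Random duplication: some records get copied more often than others
    oversampled.append(random.choice(minority))

balanced = majority + oversampled
print(len(balanced))           # 1800 -- perfectly balanced
print(len(oversampled) - 100)  # 800 of the 900 minority records are duplicates
```

No data is lost, but the 800 duplicates add no new information, which is the gap SMOTE addresses below.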
Under Sampling vs Over Sampling
Methods to handle
• SMOTE (Synthetic Minority Oversampling Technique)
SMOTE
• Pick a minority-class sample and one of its nearest minority
neighbours, calculate the linear distance (difference) between the two
vectors, multiply it by a random number between 0 and 1, and plot the
new synthetic data point at that offset from the original sample.
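The interpolation step above can be sketched with NumPy. This is a simplified sketch, not the full algorithm: real SMOTE chooses among k nearest neighbours, while here we take the single nearest for brevity, and the minority samples are synthetic:

```python
# SMOTE-style synthetic point: new = x + rand(0,1) * (neighbour - x),
# i.e. a random point on the line segment between the two vectors.
import numpy as np

rng = np.random.default_rng(0)
minority = rng.normal(size=(10, 2))  # 10 minority samples, 2 features

def smote_point(samples, idx, rng):
    x = samples[idx]
    # Linear distance from x to every other minority sample
    d = np.linalg.norm(samples - x, axis=1)
    d[idx] = np.inf                   # exclude the sample itself
    neighbour = samples[np.argmin(d)]  # nearest minority neighbour
    # Multiply the difference vector by a random number in [0, 1)
    return x + rng.random() * (neighbour - x)

new_point = smote_point(minority, 0, rng)
print(new_point)  # a synthetic sample, not a duplicate of any original
```

Unlike random duplication, every generated point is new, which avoids the exact-copy problem noted in the over-sampling section.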