
UNIT 2: DATA PREPROCESSING, ANALYSIS AND VISUALIZATION

DATA SCIENCE AND MACHINE LEARNING

AMAL YADAV
MPIT AMROHA
DATA PREPROCESSING
Data Preprocessing in ML

 Data preprocessing is the process of preparing raw
data and making it suitable for a machine learning
model.
 It is the first and crucial step in creating a machine
learning model.
 Real-world data generally contains noise and missing
values, and may be in an unusable format that cannot
be fed directly to machine learning models.
 Preprocessing also increases the accuracy and
efficiency of a machine learning model.
Data Preprocessing Techniques

1) Data Cleaning:
(a) Missing Data:
 Dropping rows/columns: drop rows/columns that contain NaN values
 Checking for duplicates: keep only the first instance
 Estimating missing values: fill with the feature’s mean, median or mode
(b) Noisy Data:
 Binning method: divide the data into equal-sized bins, then replace the
values in each bin with the bin mean or the boundary values
 Clustering: related data points are grouped into clusters; values that fall
outside the clusters can be treated as outliers
 Regression: data can be smoothed out by fitting it to a regression function
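The missing-data steps above can be sketched with pandas (library assumed to be available; the column names and values here are made up for illustration):

```python
# Hypothetical toy DataFrame with a missing value and a duplicate row.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 35, 25],
                   "salary": [50000, 60000, None, 50000]})

# Dropping rows having NaN values
dropped = df.dropna()

# Checking for duplicates: keep the first instance only
deduped = df.drop_duplicates(keep="first")

# Estimating missing values with each feature's mean
filled = df.fillna(df.mean(numeric_only=True))
```

The same `fillna` call works with `df.median(numeric_only=True)` or a mode value when the mean is not appropriate.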
Data Preprocessing Techniques

2) Data Transformation: This stage converts the data into a format that can be
used in the data analysis process.
a) Normalization:
 Scales the data values into a specified range, such as -1.0 to 1.0 or 0.0 to 1.0.
b) Concept Hierarchy Generation:
 Attributes are converted from a lower level to a higher level in a hierarchy.
 For example, the attribute “city” can be converted to “country”.
c) Smoothing:
 Removes noise from the data; techniques include binning, clustering, and
regression.
d) Aggregation:
 Applies summary or aggregation operations to the data.
 Daily sales data, for example, might be aggregated to compute monthly and
annual totals.
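The normalization step above can be sketched in plain Python as a min-max rescale to the 0.0-to-1.0 range (the input values are made up):

```python
# Min-max normalization: map values linearly onto [0.0, 1.0].
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_normalize([10, 20, 30, 40])
# smallest value maps to 0.0, largest to 1.0
```

Rescaling to -1.0 to 1.0 is the same idea with the output shifted and doubled.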
Data Preprocessing Techniques

3) Data Integration:
 Combines data from multiple sources into a coherent data store as part
of a data analysis task.
 These sources may include multiple databases.

4) Data Reduction:
a) Dimensionality Reduction:
 A dataset may have hundreds of features, also known as dimensions;
here we minimize the number of features.
b) Numerosity Reduction:
 Data is replaced or estimated using alternative, smaller data
representations.
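As one possible sketch of dimensionality reduction, principal component analysis (PCA) from scikit-learn can project many features down to a few (the library and the synthetic data here are assumptions, not part of the slides):

```python
# PCA sketch: reduce 10 synthetic features to 3 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 features

pca = PCA(n_components=3)        # keep only 3 dimensions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (100, 3)
```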
Preprocessing Techniques in ML

• Mean removal:
 It involves removing the mean from each feature so that it is centered on
zero. Mean removal helps in removing any bias from the features.
• Scaling:
 The values of different features can vary over widely different ranges, so
it is important to scale them so that each feature falls within a specified
range.
• Normalization
 Normalization involves adjusting the values in the feature vector so as to
measure them on a common scale.
• Binarization
 Binarization is used to convert a numerical feature vector into a Boolean
vector.
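The four techniques above can be sketched with scikit-learn’s preprocessing module (library assumed to be available; the data matrix is made up):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, -2.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])

# Mean removal: center each feature on zero (also unit variance by default)
X_centered = preprocessing.scale(X)

# Scaling: map each feature into a specified range, here [0, 1]
X_scaled = preprocessing.MinMaxScaler().fit_transform(X)

# Normalization: scale each sample (row) to unit L1 norm
X_normalized = preprocessing.normalize(X, norm="l1")

# Binarization: threshold values into a Boolean 0/1 vector
X_binarized = preprocessing.Binarizer(threshold=0.5).fit_transform(X)
```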
Preprocessing Techniques in ML

• One Hot Encoding:
 You may need to deal with discrete values that are few and scattered,
and that do not need to be stored as-is. In such situations you can use the
One Hot Encoding technique.
 If the number of distinct values is k, it transforms the feature into a
k-dimensional vector in which only one value is 1 and all other values are 0.
• Label Encoding
 Label encoding refers to converting word labels into numbers so that
algorithms can operate on them.
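Both encodings can be sketched with scikit-learn (library assumed; the color labels are a made-up example):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(["red", "green", "blue", "green"])

# Label encoding: word labels -> integers (classes sorted alphabetically)
le = LabelEncoder()
encoded = le.fit_transform(labels)

# One-hot encoding: k=3 distinct values -> 3-dimensional 0/1 vectors
ohe = OneHotEncoder()
onehot = ohe.fit_transform(labels.reshape(-1, 1)).toarray()
```

With three distinct colors, each one-hot row has exactly one 1 and two 0s.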
Data Analysis

1) Loading the dataset

2) Summarizing the dataset

 See Jupyter Notebook for example
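A minimal sketch of the two steps with pandas (the dataset here is built in memory for illustration; in the notebook a CSV file would typically be loaded with `pd.read_csv`):

```python
import pandas as pd

# 1) Loading the dataset (in-memory stand-in for pd.read_csv("data.csv"))
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3],
                   "species": ["setosa", "setosa", "virginica"]})

# 2) Summarizing the dataset
print(df.shape)       # dimensions: (rows, columns)
print(df.head())      # peek at the first rows
print(df.describe())  # summary statistics for numeric columns
```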


Data Visualization

1) Univariate Plots

2) Multivariate Plots
Data Visualization

Univariate Plots
• Univariate analysis explores each variable separately.
• Examples: a) Histogram b) Bar Chart c) Pie Chart

Multivariate Plots
• Multivariate analysis is required when more than two variables have to be
analyzed simultaneously.
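A univariate histogram and a multivariate scatter plot can be sketched with matplotlib (library assumed; the data is synthetic, and the Agg backend is used so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # render to file, no display needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=20)      # univariate: distribution of one variable
ax1.set_title("Histogram")
ax2.scatter(x, y, s=10)   # multivariate: two variables together
ax2.set_title("Scatter")
fig.savefig("plots.png")
```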
Training vs Test Data

 The main difference between training data and testing data is


that training data is the subset of original data that is used to
train the machine learning model, whereas testing data is used to
check the accuracy of the model.
 The training dataset is generally larger in size compared to the
testing dataset. The general ratios of splitting train and test
datasets are 80:20, 70:30, or 90:10.
 Training data is well known to the model as it is used to train the
model, whereas testing data is like unseen/new data to the
model.
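The 80:20 split described above can be sketched with scikit-learn’s `train_test_split` (library assumed; the data is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features
y = np.arange(50)

# 80:20 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))    # 40 10
```

A 70:30 or 90:10 split is the same call with `test_size=0.3` or `test_size=0.1`.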
Training vs Test Data

Purpose:
• Training data: The machine learning model is trained using the training
data. The more training data a model has, the more accurate its predictions
can be.
• Testing data: Used to evaluate the model’s performance.

Exposure:
• Training data: By using the training data, the model gains knowledge and
becomes more accurate in its predictions.
• Testing data: Not exposed to the model until evaluation. This guarantees
that the model cannot learn the testing data by heart and produce flawless
forecasts.

Distribution:
• Training data: Its distribution should be similar to the distribution of the
actual data that the model will encounter.
• Testing data: Its distribution should likewise reflect the real-world data the
model will see.

Use:
• Training data: Used to fit the model’s parameters.
• Testing data: The performance of the model is assessed by making
predictions on the testing data and comparing them to the actual labels.

Size:
• Training data: Typically larger.
• Testing data: Typically smaller.
Attributes and Their Types in Data Analytics
Performance Measures

 Performance metrics in machine learning are used to evaluate the


performance of a machine learning model.
 To evaluate the performance of a classification model:
1) Accuracy
2) Confusion Matrix
3) Precision
4) Recall
5) F-Score
6) AUC-ROC (Area Under the ROC Curve)
Performance Measures

1) Accuracy:
• The number of correct predictions divided by the total number of
predictions.

2) Precision:
• The number of true positive instances divided by the sum of true positive
and false positive instances.

3) Recall (Sensitivity):
• The number of true positive instances divided by the sum of true positive
and false negative instances.
Performance Measures

4) F1 Score:
• The F1 score is the harmonic mean of precision and recall. It is a balanced
measure that takes both precision and recall into account.

5) ROC AUC Score:
• The ROC AUC (Receiver Operating Characteristic Area Under the Curve)
score measures the ability of a classifier to distinguish between positive and
negative instances.

6) Confusion Matrix:
• A confusion matrix is a table used to evaluate the performance of a
classification model. Its cells are:

1. True Positive (TP): the prediction is positive, and it is positive in reality.
2. True Negative (TN): the prediction is negative, and it is negative in reality.
3. False Positive (FP): the prediction is positive, but it is negative in reality.
4. False Negative (FN): the prediction is negative, but it is positive in reality.
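The metrics above can be computed directly from the confusion-matrix counts; here is a minimal sketch in plain Python (the labels are made up):

```python
# Hypothetical true labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count the four confusion-matrix cells.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

With these labels, TP=3, TN=3, FP=1, FN=1, so all four metrics come out to 0.75.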
THANK YOU
