
FACULTY DEVELOPMENT PROGRAM
ON
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING FOR ENGINEERING APPLICATIONS

Dr. G. Madhu
Associate Professor
E-mail: madhu_g@vnrvjiet.in
Pre-Processing in Machine Learning

Objectives
• Recognize the importance of data preparation in Machine Learning
• Identify the meaning and aspects of feature engineering
• Standardize data features with feature scaling
• Analyze missing data and work through examples
• Explain dimensionality reduction with Principal Component Analysis (PCA)

AI/ML for Engineering Applications
Dr.G.Madhu
What is Pre-Processing?
• Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and crucial step when creating a machine learning model.
• In other words, data collected from different sources arrives in a raw format that is not feasible for analysis.
• The set of steps used to fix this is known as Data Pre-processing. It includes:
  – Data Cleaning
  – Data Integration
  – Data Transformation
  – Data Reduction
Why Data Preprocessing?
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.
Why Data Preprocessing?
• Data in the real world is dirty:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42”, Birthday=“03/07/1997”
    • e.g., rating was “1,2,3”, now rating is “A, B, C”
    • e.g., discrepancy between duplicate records
Steps in Data Pre-processing in Machine Learning
• Acquire the dataset
• Import all required libraries
• Import the dataset
• Identify and handle the missing values
• Encode the categorical data
• Split the dataset
• Feature scaling
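The steps listed above can be sketched end-to-end in a few lines of pandas. This is a minimal illustration, not the slides' own code: the toy DataFrame stands in for the "acquire the dataset" step, and all column names and values are made-up assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for the acquired/imported data (hypothetical values)
df = pd.DataFrame({
    "age":       [25.0, 32.0, np.nan, 41.0, 29.0, 55.0],
    "salary":    [40000.0, 52000.0, 61000.0, np.nan, 48000.0, 90000.0],
    "country":   ["India", "USA", "India", "UK", "USA", "UK"],
    "purchased": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Identify and handle the missing values (mean imputation)
num_cols = ["age", "salary"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Encode the categorical data
df = pd.get_dummies(df, columns=["country"])
df["purchased"] = df["purchased"].map({"No": 0, "Yes": 1})

# Split the dataset into training and test sets (a simple shuffled split)
shuffled = df.sample(frac=1.0, random_state=0)
train, test = shuffled.iloc[:4].copy(), shuffled.iloc[4:].copy()

# Feature scaling (standardization), fitted on the training set only
mean, std = train[num_cols].mean(), train[num_cols].std()
train[num_cols] = (train[num_cols] - mean) / std
test[num_cols] = (test[num_cols] - mean) / std
print(train.shape, test.shape)
```

Note that the scaler statistics are computed from the training set only and reused on the test set, which mirrors the split-before-scale ordering of the steps above.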
Acquire the Dataset
• To build and develop Machine Learning models, you must first acquire the relevant dataset.
• This dataset will comprise data gathered from multiple, disparate sources, which are then combined in a proper format to form a dataset.
• Dataset formats differ according to use case.
• For example, a business dataset will be entirely different from a medical dataset.
Import all Required Libraries
The three core Python libraries used for data pre-processing in Machine Learning are:
• NumPy – the fundamental package for scientific computation in Python.
• Pandas – an excellent open-source Python library for data manipulation and analysis.
• Matplotlib – a Python 2D plotting library used to plot any type of chart in Python.
Import the Dataset
• In this step, you need to import the dataset(s) that you have gathered for the ML project at hand.
• However, before you can import the dataset(s), you must set the current directory as the working directory.
• You can set the working directory in the Spyder IDE in three simple steps.
Identifying and Handling the Missing Values
• In data pre-processing, it is pivotal to identify and correctly handle missing values; failing to do this, you might draw inaccurate and faulty conclusions and inferences from the data.
• Needless to say, this will hamper your ML project.
Encoding the Categorical Data
• Categorical data refers to information that has specific categories within the dataset.
Splitting the Dataset
• Every dataset for a Machine Learning model must be split into two separate sets – a training set and a test set.
Feature Scaling
• Feature scaling is a method to standardize the independent variables of a dataset within a specific range.
• In other words, feature scaling limits the range of variables so that you can compare them on common grounds.
• Feature scaling prevents one variable from dominating the others and improves the overall performance of the model.
• In the dataset, you can notice that the age and salary columns do not have the same scale.
• In such a scenario, if you compare any two values from the age and salary columns, the salary values will dominate the age values and deliver incorrect results.
• Thus, you must remove this issue by performing feature scaling for Machine Learning.
You can perform feature scaling in Machine Learning in three ways:
• Standardization
• Min-Max Scaler
• Robust Scaler – similar to the Min-Max Scaler; however, it uses the interquartile range instead of the min and max.
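The three scaling methods can be sketched directly with NumPy. The formulas below correspond to scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler; the feature values are illustrative, with one deliberate outlier to show why the robust variant exists.

```python
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 90.0])  # note the outlier, 90

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Min-Max scaling: squeeze values into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: centre on the median and divide by the interquartile
# range, so the outlier has far less influence on the result
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standardized)
print(min_max)
print(robust)
```

Because the outlier stretches both the mean/std and the min/max, the first two methods compress the ordinary values together, while the robust scaler keeps them well separated.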
What is a Missing Value?
• Missing values mean that some variables do not have a measurement.

Instances × Attributes:
A1    A2   A3   A4   Class
12    ?    2    5    Yes
14.5  B    4    7    Yes
10.7  A    6    9    No
For example, the famous “Wine Quality” dataset contains quite a lot of missing values:

Illustrative Example: Missing Value Problem

Record  A1  A2  A3  A4  Decision
R1      15  1   1   9   CLASS-1
R2      20  3   2   7   CLASS-2
R3      25  ?   2   5   CLASS-1
R4      30  4   1   9   CLASS-2
R5      35  2   1   7   CLASS-2
R6      25  4   2   9   CLASS-1
R7      15  2   2   ?   CLASS-2
R8      20  3   2   7   CLASS-1
R9      25  2   1   9   CLASS-2

Step-1: Form two clusters (k=2) with the given dataset, which has four features and nine data samples. The records with missing values (R3 and R7) are left out of the randomly chosen initial clusters:
Cluster-1: R4, R5, R6, R9
Cluster-2: R1, R2, R8
Step-2: Calculate each cluster centroid as the attribute-wise mean of the records in the cluster.

In Cluster-1 (R4, R5, R6, R9):
A1: Centroid(A1) = (30+35+25+25)/4 = 115/4 = 28.75
A2: Centroid(A2) = (4+2+4+2)/4 = 12/4 = 3
A3: Centroid(A3) = (1+1+2+1)/4 = 5/4 = 1.25
A4: Centroid(A4) = (9+7+9+9)/4 = 34/4 = 8.5
In Cluster-2 (R1, R2, R8):
A1: Centroid(A1) = (15+20+20)/3 = 55/3 = 18.3333
A2: Centroid(A2) = (1+3+3)/3 = 7/3 = 2.3333
A3: Centroid(A3) = (1+2+2)/3 = 5/3 = 1.6666
A4: Centroid(A4) = (9+7+7)/3 = 23/3 = 7.6666
Step-3: Calculate the Euclidean distance D(Ri, C) = √(Σj (Aij − Cj)²) from the Cluster-1 centroid (28.75, 3, 1.25, 8.5) to each complete data sample:

D(R1, Centroid-1) = 13.91492005
D(R2, Centroid-1) = 8.909264841
D(R4, Centroid-1) = 1.695582496
D(R5, Centroid-1) = 6.33442973
D(R6, Centroid-1) = 3.984344363
D(R8, Centroid-1) = 8.909264841
D(R9, Centroid-1) = 3.921096785
Step-4: Calculate the Euclidean distance from the Cluster-2 centroid (18.3333, 2.3333, 1.6666, 7.6666) to each complete data sample:

D(R1, Centroid-2) = 3.887301263
D(R2, Centroid-2) = 1.943650632
D(R4, Centroid-2) = 11.87901979
D(R5, Centroid-2) = 16.69663972
D(R6, Centroid-2) = 7.007932014
D(R8, Centroid-2) = 1.943650632
D(R9, Centroid-2) = 6.839428176
After calculating the Euclidean distances from both centroids, we obtain the following table (the R3 and R7 rows use the distances computed over their non-missing attributes, as in Step-6):

Record  C1 Distance   C2 Distance
R1      13.91492005   3.887301263
R2      8.909264841   1.943650632
R3      5.18411       7.187952884
R4      1.695582496   11.87901979
R5      6.33442973    16.69663972
R6      3.984344363   7.007932014
R7      13.80670127   3.366501646
R8      8.909264841   1.943650632
R9      3.921096785   6.839428176
Step-5: For each record, form the mapping as the sum of its two centroid distances:

Record  C1 Distance   C2 Distance   Mapping (C1 + C2)
R1      13.91492005   3.887301263   17.80222
R2      8.909264841   1.943650632   10.85292
R3      5.18411       7.187952884   12.37206
R4      1.695582496   11.87901979   13.5746
R5      6.33442973    16.69663972   23.03107
R6      3.984344363   7.007932014   10.99228
R7      13.80670127   3.366501646   17.1732
R8      8.909264841   1.943650632   10.85292
R9      3.921096785   6.839428176   10.76052
Step-6: Calculate the Euclidean distance from the records with missing values to centroids C1 and C2, ignoring the missing attribute.

For Cluster-1:
D(R3, Centroid-1) = √((25−28.75)² + (2−1.25)² + (5−8.5)²) = 5.18411
D(R7, Centroid-1) = √((15−28.75)² + (2−3)² + (2−1.25)²) = 13.80670127

For Cluster-2:
D(R3, Centroid-2) = √((25−18.3333)² + (2−1.6666)² + (5−7.6666)²) = 7.187952884
D(R7, Centroid-2) = √((15−18.3333)² + (2−2.3333)² + (2−1.6666)²) = 3.366501646
Step-7: Again form the mapping of these two distances for the missing-value records:

Record  C1 Distance   C2 Distance   Mapping
R3      5.18411       7.187953      12.37206
R7      13.8067       3.366502      17.1732

Step-8: Calculate the difference between each record's mapping and the mapping of the missing-value records (R3 and R7):

Record  vs R3 mapping   vs R7 mapping
R1      5.430158091     0.629018396
R2      -1.51914775     -6.320287445
R3      0.042025902     -4.759113793
R4      1.202539061     -3.598600634
R5      10.65900622     5.857866527
R6      -1.379786846    -6.180926541
R7      4.801139695     0
R8      -1.51914775     -6.320287445
R9      -1.611538261    -6.412677956
Step-9: Select the minimum distance for both records R3 and R7. From Step-8 we conclude that R9 is nearest to both R3 and R7 (R9's mapping difference is the lowest among all the other records):

R3  25  ?  2  5  CLASS-1
R9  25  2  1  9  CLASS-2

R7  15  2  2  ?  CLASS-2
R9  25  2  1  9  CLASS-2

So, impute R9's corresponding values into the missing cells: A2 = 2 for R3 and A4 = 9 for R7.
Step-10: Finally, the complete dataset:

Record  A1  A2  A3  A4  Decision
R1      15  1   1   9   CLASS-1
R2      20  3   2   7   CLASS-2
R3      25  2   2   5   CLASS-1
R4      30  4   1   9   CLASS-2
R5      35  2   1   7   CLASS-2
R6      25  4   2   9   CLASS-1
R7      15  2   2   9   CLASS-2
R8      20  3   2   7   CLASS-1
R9      25  2   1   9   CLASS-2
How to Handle Missing Data with Python Libraries
• The StringIO() function allows us to read a string assigned to csv_data into a pandas DataFrame via the read_csv() function, as if it were a regular CSV file on our hard drive.
• We can use the isnull() method to check whether a cell contains a numeric value (False) or whether data is missing (True).
• We may want to use the sum() method, which returns the number of missing values per column.
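Putting the two ideas together, a minimal sketch looks as follows; the CSV string here is an illustrative stand-in, since the slides' own screenshot is not reproduced.

```python
import pandas as pd
from io import StringIO

# An in-memory CSV string with two missing cells (columns C and D)
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

# read_csv() treats the StringIO object exactly like a CSV file on disk
df = pd.read_csv(StringIO(csv_data))

print(df.isnull())        # True marks a missing cell, False a present value
print(df.isnull().sum())  # number of missing values per column
```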
Eliminating samples/features with missing cells via pandas.DataFrame.dropna()
• The pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.
• We can drop columns that have at least one NaN in any row by setting the axis argument to 1, where axis : {0 or 'index', 1 or 'columns'}.
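A short sketch of the dropna() variants just described, on a small illustrative DataFrame (the slides' own example data is not shown, so these values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 5.0, 10.0],
                   "B": [2.0, 6.0, 11.0],
                   "C": [3.0, np.nan, 12.0],
                   "D": [4.0, 8.0, np.nan]})

print(df.dropna())              # drop rows with any NaN
print(df.dropna(axis=1))        # drop columns with any NaN -> A and B remain
print(df.dropna(how="all"))     # drop rows only if ALL cells are NaN
print(df.dropna(thresh=4))      # keep rows with at least 4 non-NaN values
print(df.dropna(subset=["C"]))  # drop rows where column C is NaN
```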
Estimating Missing Values via Interpolation
• Mean imputation is a method that replaces the missing values with the mean value of the entire feature column.
• While this method maintains the sample size and is easy to use, the variability in the data is reduced, so the standard deviations and variance estimates tend to be underestimated.
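A sketch of mean imputation with pandas, on made-up values; note how the imputed column's standard deviation shrinks, which is exactly the underestimation effect described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 5.0, 10.0],
                   "C": [3.0, np.nan, 12.0]})

# Fill each column's NaNs with that column's mean
filled = df.fillna(df.mean())
print(filled["C"].tolist())

# The imputed column's standard deviation is smaller than the original's:
# mean imputation reduces the variability of the data
print(df["C"].std(), filled["C"].std())
```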
Dealing with categorical data
• Here are examples of categorical data:
  – The blood type of a person: A, B, AB, or O.
  – The state that a resident of the United States lives in.
  – T-shirt size (ordinal): XL > L > M
  – T-shirt color (nominal).
Ordinal Feature Mapping
• In order for our learning algorithm to interpret the ordinal features correctly, we should convert the categorical string values into integers.
• However, since there is no convenient function that can automatically derive the correct order of the labels of our size feature, we have to define the mapping manually.
• Let's assume that we know the difference between features, such as XL = L + 1 = M + 2.
• If we want to transform the integer values back to the original string representation, we can define a reversed dictionary, inv_size_mapping.
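The manual mapping and its inverse can be sketched as follows, using the XL = L + 1 = M + 2 assumption from the slide on an illustrative size column:

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL", "M"]})

# Manually defined ordinal mapping: M < L < XL
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_mapping)
print(df["size"].tolist())

# inv_size_mapping reverses the key-value pairs to recover the strings
inv_size_mapping = {v: k for k, v in size_mapping.items()}
print(df["size"].map(inv_size_mapping).tolist())
```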
Class Labels Encoding
• Class labels are not ordinal, so we can simply assign each distinct label an integer.

Reverse the key-value pairs
• To map the encoded integers back to the original class labels, reverse the key-value pairs of the mapping dictionary.
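A sketch of class-label encoding and its reversal; the label values here are illustrative assumptions, since the slides' own example table is not reproduced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"classlabel": ["class1", "class2", "class1"]})

# Enumerate the distinct labels; the particular integers do not matter
# because class labels are not ordinal
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df["classlabel"]))}
df["classlabel"] = df["classlabel"].map(class_mapping)
print(df["classlabel"].tolist())

# Reverse the key-value pairs to restore the original labels
inv_class_mapping = {v: k for k, v in class_mapping.items()}
print(df["classlabel"].map(inv_class_mapping).tolist())
```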
Dimensionality Reduction in Machine Learning

What is dimensionality reduction?
• Reducing the dimension of the feature space is called “dimensionality reduction.”
• There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:
  – Feature Elimination
  – Feature Extraction
Principal Component Analysis
• PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics.
• It was later independently developed and named by Harold Hotelling in the 1930s.
What is Principal Component Analysis (PCA)?
• Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.
• PCA enables you to identify correlations and patterns in a data set, so that it can be transformed into a lower dimension without loss of any important information.
• Principal component analysis is a technique for feature extraction.
Step-by-Step Computation of PCA
The following steps need to be performed for dimensionality reduction using PCA:
– Standardization of the data
– Computing the covariance matrix
– Calculating the eigenvectors and eigenvalues
– Computing the Principal Components
– Reducing the dimensions of the data set
Step 1: Normalize the data
• The first step is to normalize the data so that PCA works properly.
• This is done by subtracting the respective means from the numbers in the respective columns.
• So if we have two dimensions X and Y, all X become x − x̄ and all Y become y − ȳ.
• This produces a dataset whose mean is zero.
Step 2: Computing the covariance matrix
• PCA helps to identify the correlations and dependencies among the features in a data set.
• A covariance matrix expresses the correlation between the different variables in the data set.
• It is essential to identify heavily dependent variables, because they contain biased and redundant information which reduces the overall performance of the model.
• Mathematically, a covariance matrix is a p × p matrix, where p represents the number of dimensions of the data set.
• Each entry in the matrix represents the covariance of the corresponding pair of variables.
• For a 2-dimensional data set with variables a and b, the covariance matrix is the 2×2 matrix:

      | cov(a,a)  cov(a,b) |
      | cov(b,a)  cov(b,b) |

  where cov(a,b) = Σᵢ (aᵢ − ā)(bᵢ − b̄) / (n − 1).

How to Compute Covariance
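The covariance computation can be sketched with NumPy, both by hand from the formula above and with np.cov; the data values for variables a and b are illustrative.

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 6.0])

# Manual computation: cov(a,b) = sum((a - mean_a)*(b - mean_b)) / (n - 1)
cov_ab = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

# np.cov stacks the variables as rows and returns the 2x2 covariance matrix
C = np.cov(a, b)
print(cov_ab)
print(C)  # the off-diagonal entries equal cov_ab
```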
Step 3: Calculating the Eigenvectors and Eigenvalues
• The next step is to calculate the eigenvalues and eigenvectors of the covariance matrix.
• This is possible because it is a square matrix.
• λ is an eigenvalue of a matrix A if it is a solution of the characteristic equation:
      det(λI − A) = 0
  where I is the identity matrix of the same dimension as A.
• For each eigenvalue λ, a corresponding eigenvector v can be found by solving:
      (λI − A)v = 0
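In practice the decomposition is done numerically rather than by solving the characteristic equation by hand. A sketch with NumPy on an illustrative symmetric covariance matrix:

```python
import numpy as np

# An assumed 2x2 covariance matrix for the example
C = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# eigh is the routine for symmetric matrices; it returns eigenvalues in
# ascending order with the eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)

# Each pair satisfies C v = lambda v, i.e. (lambda I - C) v = 0
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(C @ v, lam * v))
```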
Step 4: Choosing components and forming a feature vector
• We order the eigenvalues from largest to smallest, so that the components are given in order of significance. Here comes the dimensionality reduction part.
• If we have a dataset with n variables, then we have the corresponding n eigenvalues and eigenvectors.
• It turns out that the eigenvector corresponding to the highest eigenvalue is the principal component of the dataset, and it is our call how many eigenvalues we choose to proceed with in our analysis.
• To reduce the dimensions, we choose the first p eigenvalues and ignore the rest.
• We do lose some information in the process, but if the discarded eigenvalues are small, we do not lose much.
• Next, we form a feature vector, which is a matrix of vectors – in our case, the eigenvectors.
• Since we have just 2 dimensions in the running example, we can either choose the eigenvector corresponding to the greater eigenvalue or simply take both:

      FeatureVector = (eig1, eig2)
Step 5: Forming Principal Components
• This is the final step, where we actually form the principal components using all the math we did up to here.
• For this, we take the transpose of the feature vector and left-multiply it with the transpose of the scaled version of the original dataset:

      NewData = FeatureVectorᵀ × ScaledDataᵀ

• Here, NewData is the matrix consisting of the principal components,
• FeatureVector is the matrix we formed using the eigenvectors we chose to keep, and
• ScaledData is the scaled version of the original dataset.
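Steps 1 through 5 can be tied together in a compact NumPy sketch; the 2-D data here is randomly generated for illustration, and keeping p = 1 component is an arbitrary choice for the example.

```python
import numpy as np

# Illustrative 2-D data with correlated columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])

# Step 1: centre each column (subtract the column means)
scaled = X - X.mean(axis=0)

# Step 2: covariance matrix of the centred data (variables as columns)
C = np.cov(scaled, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: order components by decreasing eigenvalue, keep the first p
order = np.argsort(eigvals)[::-1]
p = 1
feature_vector = eigvecs[:, order[:p]]

# Step 5: NewData = FeatureVector^T x ScaledData^T
new_data = feature_vector.T @ scaled.T
print(new_data.shape)  # one row of principal-component scores per kept component
```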