
FACULTY DEVELOPMENT PROGRAM
ON
ARTIFICIAL INTELLIGENCE & MACHINE LEARNING FOR ENGINEERING APPLICATIONS

Dr. G. Madhu
Associate Professor
E-mail: madhu_g@vnrvjiet.in
Pre-Processing in Machine Learning

Objectives
• Recognize the importance of data preparation in Machine Learning
• Identify the meaning and aspects of feature engineering
• Standardize data features with feature scaling
• Analyze missing data and work through examples
• Explain dimensionality reduction with Principal Component Analysis (PCA)

AI/ML for Engineering Applications
Dr.G.Madhu
What is Pre-Processing?
• Data pre-processing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and crucial step when creating a machine learning model.
• In other words, data collected from different sources arrives in a raw format that is not feasible for analysis.
• The set of steps used to fix this is known as Data Pre-processing. It includes:
  – Data Cleaning
  – Data Integration
  – Data Transformation
  – Data Reduction
Why Data Preprocessing?
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing is the required task of cleaning the data and making it suitable for a machine learning model; it also increases the accuracy and efficiency of the model.
Why Data Preprocessing?
• Data in the real world is dirty:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“ ”
  – noisy: containing errors or outliers
    • e.g., Salary=“-10”
  – inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42”, Birthday=“03/07/1997”
    • e.g., rating was “1,2,3”, now rating is “A, B, C”
    • e.g., discrepancy between duplicate records
Steps in Data Pre-processing in Machine Learning
• Acquire the dataset
• Import all required libraries
• Import the dataset
• Identify and handle the missing values
• Encode the categorical data
• Split the dataset
• Feature scaling
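The steps listed above can be sketched end-to-end in a few lines of pandas. This is a minimal illustration, not the slides' own code: the toy DataFrame stands in for the "acquire the dataset" step, and all column names and values are made-up assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for the acquired/imported data (hypothetical values)
df = pd.DataFrame({
    "age":       [25.0, 32.0, np.nan, 41.0, 29.0, 55.0],
    "salary":    [40000.0, 52000.0, 61000.0, np.nan, 48000.0, 90000.0],
    "country":   ["India", "USA", "India", "UK", "USA", "UK"],
    "purchased": ["No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Identify and handle the missing values (mean imputation)
num_cols = ["age", "salary"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Encode the categorical data
df = pd.get_dummies(df, columns=["country"])
df["purchased"] = df["purchased"].map({"No": 0, "Yes": 1})

# Split the dataset into training and test sets (a simple shuffled split)
shuffled = df.sample(frac=1.0, random_state=0)
train, test = shuffled.iloc[:4].copy(), shuffled.iloc[4:].copy()

# Feature scaling (standardization), fitted on the training set only
mean, std = train[num_cols].mean(), train[num_cols].std()
train[num_cols] = (train[num_cols] - mean) / std
test[num_cols] = (test[num_cols] - mean) / std
print(train.shape, test.shape)
```

Note that the scaler statistics are computed from the training set only and reused on the test set, which mirrors the split-before-scale ordering of the steps above.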
Acquire the Dataset
• To build and develop Machine Learning models, you must first acquire the relevant dataset.
• This dataset will comprise data gathered from multiple, disparate sources, which are then combined in a proper format to form a dataset.
• Dataset formats differ according to use case.
• For example, a business dataset will be entirely different from a medical dataset.
Import all Required Libraries
The three core Python libraries used for data pre-processing in Machine Learning are:
• NumPy – the fundamental package for scientific computation in Python.
• Pandas – an excellent open-source Python library for data manipulation and analysis.
• Matplotlib – a Python 2D plotting library used to plot any type of chart in Python.
Import the Dataset
• In this step, you need to import the dataset(s) that you have gathered for the ML project at hand.
• However, before you can import the dataset(s), you must set the current directory as the working directory.
• You can set the working directory in the Spyder IDE in three simple steps.
Identifying and Handling the Missing Values
• In data pre-processing, it is pivotal to identify and correctly handle missing values; failing to do this, you might draw inaccurate and faulty conclusions and inferences from the data.
• Needless to say, this will hamper your ML project.
Encoding the Categorical Data
• Categorical data refers to information that has specific categories within the dataset.
Splitting the Dataset
• Every dataset for a Machine Learning model must be split into two separate sets – a training set and a test set.
Feature Scaling
• Feature scaling is a method to standardize the independent variables of a dataset within a specific range.
• In other words, feature scaling limits the range of variables so that you can compare them on common grounds.
• Feature scaling prevents one variable from dominating the others and improves the overall performance of the model.
• In the dataset, you can notice that the age and salary columns do not have the same scale.
• In such a scenario, if you compare any two values from the age and salary columns, the salary values will dominate the age values and deliver incorrect results.
• Thus, you must remove this issue by performing feature scaling for Machine Learning.
You can perform feature scaling in Machine Learning in three ways:
• Standardization
• Min-Max Scaler
• Robust Scaler – similar to the Min-Max Scaler; however, it uses the interquartile range instead of the min and max.
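The three scaling methods can be sketched directly with NumPy. The formulas below correspond to scikit-learn's StandardScaler, MinMaxScaler, and RobustScaler; the feature values are illustrative, with one deliberate outlier to show why the robust variant exists.

```python
import numpy as np

x = np.array([20.0, 25.0, 30.0, 35.0, 90.0])  # note the outlier, 90

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Min-Max scaling: squeeze values into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: centre on the median and divide by the interquartile
# range, so the outlier has far less influence on the result
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(standardized)
print(min_max)
print(robust)
```

Because the outlier stretches both the mean/std and the min/max, the first two methods compress the ordinary values together, while the robust scaler keeps them well separated.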
What is a Missing Value?
• Missing values mean that some variables do not have a measurement.

Instances × Attributes:
A1    A2   A3   A4   Class
12    ?    2    5    Yes
14.5  B    4    7    Yes
10.7  A    6    9    No
For example, the famous “Wine Quality” dataset contains quite a lot of missing values:

Illustrative Example: Missing Value Problem

Record  A1  A2  A3  A4  Decision
R1      15  1   1   9   CLASS-1
R2      20  3   2   7   CLASS-2
R3      25  ?   2   5   CLASS-1
R4      30  4   1   9   CLASS-2
R5      35  2   1   7   CLASS-2
R6      25  4   2   9   CLASS-1
R7      15  2   2   ?   CLASS-2
R8      20  3   2   7   CLASS-1
R9      25  2   1   9   CLASS-2

Step-1: Form two clusters (k=2) with the given dataset, which has four features and nine data samples. The records with missing values (R3 and R7) are left out of the randomly chosen initial clusters:
Cluster-1: R4, R5, R6, R9
Cluster-2: R1, R2, R8
Step-2: Calculate each cluster centroid as the attribute-wise mean of the records in the cluster.

In Cluster-1 (R4, R5, R6, R9):
A1: Centroid(A1) = (30+35+25+25)/4 = 115/4 = 28.75
A2: Centroid(A2) = (4+2+4+2)/4 = 12/4 = 3
A3: Centroid(A3) = (1+1+2+1)/4 = 5/4 = 1.25
A4: Centroid(A4) = (9+7+9+9)/4 = 34/4 = 8.5
In Cluster-2 (R1, R2, R8):
A1: Centroid(A1) = (15+20+20)/3 = 55/3 = 18.3333
A2: Centroid(A2) = (1+3+3)/3 = 7/3 = 2.3333
A3: Centroid(A3) = (1+2+2)/3 = 5/3 = 1.6666
A4: Centroid(A4) = (9+7+7)/3 = 23/3 = 7.6666
Step-3: Calculate the Euclidean distance D(Ri, C) = √(Σj (Aij − Cj)²) from the Cluster-1 centroid (28.75, 3, 1.25, 8.5) to each complete data sample:

D(R1, Centroid-1) = 13.91492005
D(R2, Centroid-1) = 8.909264841
D(R4, Centroid-1) = 1.695582496
D(R5, Centroid-1) = 6.33442973
D(R6, Centroid-1) = 3.984344363
D(R8, Centroid-1) = 8.909264841
D(R9, Centroid-1) = 3.921096785
Step-4: Calculate the Euclidean distance from the Cluster-2 centroid (18.3333, 2.3333, 1.6666, 7.6666) to each complete data sample:

D(R1, Centroid-2) = 3.887301263
D(R2, Centroid-2) = 1.943650632
D(R4, Centroid-2) = 11.87901979
D(R5, Centroid-2) = 16.69663972
D(R6, Centroid-2) = 7.007932014
D(R8, Centroid-2) = 1.943650632
D(R9, Centroid-2) = 6.839428176
After calculating the Euclidean distances from both centroids, we obtain the following table (the R3 and R7 rows use the distances computed over their non-missing attributes, as in Step-6):

Record  C1 Distance   C2 Distance
R1      13.91492005   3.887301263
R2      8.909264841   1.943650632
R3      5.18411       7.187952884
R4      1.695582496   11.87901979
R5      6.33442973    16.69663972
R6      3.984344363   7.007932014
R7      13.80670127   3.366501646
R8      8.909264841   1.943650632
R9      3.921096785   6.839428176
Step-5: For each record, form the mapping as the sum of its two centroid distances:

Record  C1 Distance   C2 Distance   Mapping (C1 + C2)
R1      13.91492005   3.887301263   17.80222
R2      8.909264841   1.943650632   10.85292
R3      5.18411       7.187952884   12.37206
R4      1.695582496   11.87901979   13.5746
R5      6.33442973    16.69663972   23.03107
R6      3.984344363   7.007932014   10.99228
R7      13.80670127   3.366501646   17.1732
R8      8.909264841   1.943650632   10.85292
R9      3.921096785   6.839428176   10.76052
Step-6: Calculate the Euclidean distance from the records with missing values to centroids C1 and C2, ignoring the missing attribute.

For Cluster-1:
D(R3, Centroid-1) = √((25−28.75)² + (2−1.25)² + (5−8.5)²) = 5.18411
D(R7, Centroid-1) = √((15−28.75)² + (2−3)² + (2−1.25)²) = 13.80670127

For Cluster-2:
D(R3, Centroid-2) = √((25−18.3333)² + (2−1.6666)² + (5−7.6666)²) = 7.187952884
D(R7, Centroid-2) = √((15−18.3333)² + (2−2.3333)² + (2−1.6666)²) = 3.366501646
Step-7: Again form the mapping of these two distances for the missing-value records:

Record  C1 Distance   C2 Distance   Mapping
R3      5.18411       7.187953      12.37206
R7      13.8067       3.366502      17.1732

Step-8: Calculate the difference between each record's mapping and the mapping of the missing-value records (R3 and R7):

Record  vs R3 mapping   vs R7 mapping
R1      5.430158091     0.629018396
R2      -1.51914775     -6.320287445
R3      0.042025902     -4.759113793
R4      1.202539061     -3.598600634
R5      10.65900622     5.857866527
R6      -1.379786846    -6.180926541
R7      4.801139695     0
R8      -1.51914775     -6.320287445
R9      -1.611538261    -6.412677956
Step-9: Select the minimum distance for both records R3 and R7. From Step-8 we conclude that R9 is nearest to both R3 and R7 (R9's mapping difference is the lowest among all the other records):

R3  25  ?  2  5  CLASS-1
R9  25  2  1  9  CLASS-2

R7  15  2  2  ?  CLASS-2
R9  25  2  1  9  CLASS-2

So, impute R9's corresponding values into the missing cells: A2 = 2 for R3 and A4 = 9 for R7.
Step-10: Finally, the complete dataset:

Record  A1  A2  A3  A4  Decision
R1      15  1   1   9   CLASS-1
R2      20  3   2   7   CLASS-2
R3      25  2   2   5   CLASS-1
R4      30  4   1   9   CLASS-2
R5      35  2   1   7   CLASS-2
R6      25  4   2   9   CLASS-1
R7      15  2   2   9   CLASS-2
R8      20  3   2   7   CLASS-1
R9      25  2   1   9   CLASS-2
How to Handle Missing Data with Python Libraries
• The StringIO() function allows us to read a string assigned to csv_data into a pandas DataFrame via the read_csv() function, as if it were a regular CSV file on our hard drive.
• We can use the isnull() method to check whether a cell contains a numeric value (False) or whether data is missing (True).
• We may want to use the sum() method, which returns the number of missing values per column.
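Putting the two ideas together, a minimal sketch looks as follows; the CSV string here is an illustrative stand-in, since the slides' own screenshot is not reproduced.

```python
import pandas as pd
from io import StringIO

# An in-memory CSV string with two missing cells (columns C and D)
csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""

# read_csv() treats the StringIO object exactly like a CSV file on disk
df = pd.read_csv(StringIO(csv_data))

print(df.isnull())        # True marks a missing cell, False a present value
print(df.isnull().sum())  # number of missing values per column
```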
Eliminating samples/features with missing cells via pandas.DataFrame.dropna()
• The pandas dropna() method allows the user to analyze and drop rows/columns with null values in different ways.
• We can drop columns that have at least one NaN in any row by setting the axis argument to 1, where axis : {0 or 'index', 1 or 'columns'}.
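A short sketch of the dropna() variants just described, on a small illustrative DataFrame (the slides' own example data is not shown, so these values are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 5.0, 10.0],
                   "B": [2.0, 6.0, 11.0],
                   "C": [3.0, np.nan, 12.0],
                   "D": [4.0, 8.0, np.nan]})

print(df.dropna())              # drop rows with any NaN
print(df.dropna(axis=1))        # drop columns with any NaN -> A and B remain
print(df.dropna(how="all"))     # drop rows only if ALL cells are NaN
print(df.dropna(thresh=4))      # keep rows with at least 4 non-NaN values
print(df.dropna(subset=["C"]))  # drop rows where column C is NaN
```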
Estimating Missing Values via Interpolation
• Mean imputation is a method that replaces the missing values with the mean value of the entire feature column.
• While this method maintains the sample size and is easy to use, the variability in the data is reduced, so the standard deviations and variance estimates tend to be underestimated.
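A sketch of mean imputation with pandas, on made-up values; note how the imputed column's standard deviation shrinks, which is exactly the underestimation effect described above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 5.0, 10.0],
                   "C": [3.0, np.nan, 12.0]})

# Fill each column's NaNs with that column's mean
filled = df.fillna(df.mean())
print(filled["C"].tolist())

# The imputed column's standard deviation is smaller than the original's:
# mean imputation reduces the variability of the data
print(df["C"].std(), filled["C"].std())
```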
Dealing with categorical data
• Here are examples of categorical data:
  – The blood type of a person: A, B, AB, or O.
  – The state that a resident of the United States lives in.
  – T-shirt size (ordinal): XL > L > M
  – T-shirt color (nominal).
Ordinal Feature Mapping
• In order for our learning algorithm to interpret the ordinal features correctly, we should convert the categorical string values into integers.
• However, since there is no convenient function that can automatically derive the correct order of the labels of our size feature, we have to define the mapping manually.
• Let's assume that we know the difference between features, such as XL = L + 1 = M + 2.
• If we want to transform the integer values back to the original string representation, we can define a reversed dictionary, inv_size_mapping.
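The manual mapping and its inverse can be sketched as follows, using the XL = L + 1 = M + 2 assumption from the slide on an illustrative size column:

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "L", "XL", "M"]})

# Manually defined ordinal mapping: M < L < XL
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_mapping)
print(df["size"].tolist())

# inv_size_mapping reverses the key-value pairs to recover the strings
inv_size_mapping = {v: k for k, v in size_mapping.items()}
print(df["size"].map(inv_size_mapping).tolist())
```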
Class Labels Encoding
• Class labels are not ordinal, so we can simply assign each distinct label an integer.

Reverse the key-value pairs
• To map the encoded integers back to the original class labels, reverse the key-value pairs of the mapping dictionary.
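A sketch of class-label encoding and its reversal; the label values here are illustrative assumptions, since the slides' own example table is not reproduced.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"classlabel": ["class1", "class2", "class1"]})

# Enumerate the distinct labels; the particular integers do not matter
# because class labels are not ordinal
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df["classlabel"]))}
df["classlabel"] = df["classlabel"].map(class_mapping)
print(df["classlabel"].tolist())

# Reverse the key-value pairs to restore the original labels
inv_class_mapping = {v: k for k, v in class_mapping.items()}
print(df["classlabel"].map(inv_class_mapping).tolist())
```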
Dimensionality Reduction in Machine Learning

What is dimensionality reduction?
• Reducing the dimension of the feature space is called “dimensionality reduction.”
• There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:
  – Feature Elimination
  – Feature Extraction
Principal Component Analysis
• PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics.
• It was later independently developed and named by Harold Hotelling in the 1930s.
What is Principal Component Analysis (PCA)?
• Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.
• PCA enables you to identify correlations and patterns in a data set, so that it can be transformed into a lower dimension without loss of any important information.
• Principal component analysis is a technique for feature extraction.
Step-by-Step Computation of PCA
The following steps need to be performed for dimensionality reduction using PCA:
– Standardization of the data
– Computing the covariance matrix
– Calculating the eigenvectors and eigenvalues
– Computing the Principal Components
– Reducing the dimensions of the data set
Step 1: Normalize the data
• The first step is to normalize the data so that PCA works properly.
• This is done by subtracting the respective means from the numbers in the respective columns.
• So if we have two dimensions X and Y, all X become x − x̄ and all Y become y − ȳ.
• This produces a dataset whose mean is zero.
Step 2: Computing the covariance matrix
• PCA helps to identify the correlations and dependencies among the features in a data set.
• A covariance matrix expresses the correlation between the different variables in the data set.
• It is essential to identify heavily dependent variables, because they contain biased and redundant information which reduces the overall performance of the model.
• Mathematically, a covariance matrix is a p × p matrix, where p represents the number of dimensions of the data set.
• Each entry in the matrix represents the covariance of the corresponding pair of variables.
• For a 2-dimensional data set with variables a and b, the covariance matrix is the 2×2 matrix:

      | cov(a,a)  cov(a,b) |
      | cov(b,a)  cov(b,b) |

  where cov(a,b) = Σᵢ (aᵢ − ā)(bᵢ − b̄) / (n − 1).

How to Compute Covariance
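The covariance computation can be sketched with NumPy, both by hand from the formula above and with np.cov; the data values for variables a and b are illustrative.

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 2.0, 6.0])

# Manual computation: cov(a,b) = sum((a - mean_a)*(b - mean_b)) / (n - 1)
cov_ab = np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

# np.cov stacks the variables as rows and returns the 2x2 covariance matrix
C = np.cov(a, b)
print(cov_ab)
print(C)  # the off-diagonal entries equal cov_ab
```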
Step 3: Calculating the Eigenvectors and Eigenvalues
• The next step is to calculate the eigenvalues and eigenvectors of the covariance matrix.
• This is possible because it is a square matrix.
• λ is an eigenvalue of a matrix A if it is a solution of the characteristic equation:
      det(λI − A) = 0
  where I is the identity matrix of the same dimension as A.
• For each eigenvalue λ, a corresponding eigenvector v can be found by solving:
      (λI − A)v = 0
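In practice the decomposition is done numerically rather than by solving the characteristic equation by hand. A sketch with NumPy on an illustrative symmetric covariance matrix:

```python
import numpy as np

# An assumed 2x2 covariance matrix for the example
C = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# eigh is the routine for symmetric matrices; it returns eigenvalues in
# ascending order with the eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)

# Each pair satisfies C v = lambda v, i.e. (lambda I - C) v = 0
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(C @ v, lam * v))
```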
Step 4: Choosing components and forming a feature vector
• We order the eigenvalues from largest to smallest, so that the components are given in order of significance. Here comes the dimensionality reduction part.
• If we have a dataset with n variables, then we have the corresponding n eigenvalues and eigenvectors.
• It turns out that the eigenvector corresponding to the highest eigenvalue is the principal component of the dataset, and it is our call how many eigenvalues we choose to proceed with in our analysis.
• To reduce the dimensions, we choose the first p eigenvalues and ignore the rest.
• We do lose some information in the process, but if the discarded eigenvalues are small, we do not lose much.
• Next, we form a feature vector, which is a matrix of vectors – in our case, the eigenvectors.
• Since we have just 2 dimensions in the running example, we can either choose the eigenvector corresponding to the greater eigenvalue or simply take both:

      FeatureVector = (eig1, eig2)
Step 5: Forming Principal Components
• This is the final step, where we actually form the principal components using all the math we did up to here.
• For this, we take the transpose of the feature vector and left-multiply it with the transpose of the scaled version of the original dataset:

      NewData = FeatureVectorᵀ × ScaledDataᵀ

• Here, NewData is the matrix consisting of the principal components,
• FeatureVector is the matrix we formed using the eigenvectors we chose to keep, and
• ScaledData is the scaled version of the original dataset.
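Steps 1 through 5 can be tied together in a compact NumPy sketch; the 2-D data here is randomly generated for illustration, and keeping p = 1 component is an arbitrary choice for the example.

```python
import numpy as np

# Illustrative 2-D data with correlated columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0], [0.0, 1.0]])

# Step 1: centre each column (subtract the column means)
scaled = X - X.mean(axis=0)

# Step 2: covariance matrix of the centred data (variables as columns)
C = np.cov(scaled, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# Step 4: order components by decreasing eigenvalue, keep the first p
order = np.argsort(eigvals)[::-1]
p = 1
feature_vector = eigvecs[:, order[:p]]

# Step 5: NewData = FeatureVector^T x ScaledData^T
new_data = feature_vector.T @ scaled.T
print(new_data.shape)  # one row of principal-component scores per kept component
```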