Machine Learning With Python

PART - 3
Debdeep Chaudhuri | Machine Learning | 08.07.22

cdebdeep@gmail.com
Objective
This is part 3 of my ongoing series on Machine Learning / Deep Learning / AI.
In the 2nd part we used Pandas to load our data frame from a .csv file & removed the noise/outliers in the data.
We used ColumnTransformer to encode the data with OneHotEncoding/OrdinalEncoding, then we built our very 1st ML model from that data & ran a prediction & a score/accuracy test.
Now in this part 3 we will go one step further.
We will make our steps more concise: we will create our own function for removing outliers & use Pipelines on top of that.
Already sounds interesting!!!
Let’s start our journey then …..
STEPS
For this purpose, we will work through the following steps:
1. Create our own function to remove outliers in the data.
2. Use this function with Pipelines.
3. Use ColumnTransformer in Pipelines.
4. Build our ML model & predict the car price.
5. Validate the model & find out its score / accuracy.
For this demo I will be writing my Python code in Visual Studio Code, but you can use any Python notebook editor you like.
I am not going to write down steps like “How to install Python / Anaconda” or “How to set up VS Code to work with Python notebooks”, because you can find many articles on the web for this purpose using any search engine such as Google.
I have shared links for this purpose in the “IF YOU WANT MORE” section at the very end of this article.
Creating a pandas data frame from a downloaded .csv file.
First we need to import some libraries to do our work in Python NoteBook.
Code Section (Copy & Paste)
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as snb
import matplotlib as mat
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
Copy & paste the above code into your Python notebook & run it.
If you are using VS Code like me, do not forget to choose the kernel before running the code, as per the following image:
Now we are ready to create our pandas data frame from the .csv file.
For this purpose, do the following:
First go to this URL: https://www.kaggle.com/datasets/shaistashaikh/carprice-assignment
Download the files & save them on your disk; having them available offline will help.
Now load the file from your hard drive, making sure to give the right path to it.
It will be a lot easier if you save this .csv file in the same folder where you are saving this notebook.
df = pd.read_csv('CarPrice_Assignment.csv')
df.info()
If we run the code, we will get the following results
The output shows that our data frame contains 205 rows, indexed from 0 to 204, & 26 columns.
For this demo I have not taken a screenshot of the full output; I am showing only 20 columns.
But we are not going to analyze every column in this demo, so we will take a few columns from it.
dfOriginal = df[['CarName','fueltype','carbody','enginetype','cylindernumber','enginesize','fuelsystem','horsepower','price']]
dfOriginal
dfOriginal.head(2)
Here we are taking 8 input columns & one output column, the price column that we will predict.
By running the above code, we get the following result
So we have our initial data frame ready to start our work.
But before jumping into the world of building an ML model, it is always recommended to take a close look at the data, in our case our data frame.
dfOriginal.info()
By running the code, we get the following output
As expected, it is a pandas DataFrame object with 205 rows & 9 columns.
Among these 9 columns we will take the first 8 as the input data for our model.
Now look very carefully at the first 8 columns. What do we find?
Well, we are looking closely at the nature / data type of these columns.
Among these 8 columns, “enginesize” & “horsepower” are numerical; the rest are of type string & categorical in nature.
Why is this so important for us!!
Well my friend, an ML model only works with numerical values, not other data types, so we need to convert our string-type categorical values to numbers.
CHECKING FOR NOISE / OUTLIER IN DATA

This is the very 1st step we need to perform.
Noise can also take the form of missing values.
If you run “df.isnull().sum()”, it will show you whether any null values (NaN) are present in the data frame.
Fortunately, we do not have any in our working data frame.
But if you do have them in your working data frame, then you need to handle them first.
If you are wondering how to do it, read my 1st article; I have already explained it there ☺
Now we will run one particular code block on our numerical columns.
Just run it, see the result & I will explain it to you.
dfOriginal.describe()
The Output
Now look at the “enginesize” & “horsepower” columns carefully.
Just look at the rows “min”, “25%”, “50%”, “75%” & “max”.
Let’s pick the “enginesize” column values.
From min to 75% the values change, but not by much (61, 97, 120, 141); from 75% to max, however, we can observe a huge jump (141 to 326).
You may be wondering what this indicates!!
Well, the 25% value is the point below which 25% of the data points fall.
It means that 25% of our enginesize values lie between 61 & 97, and so on for 50% & 75%.
So the 75% row tells us that 75% of our enginesize values lie between 61 & 141.
But the massive region from 141 (75%) to 326 (max) holds only 25% of the enginesize values.
Min~75% (61~141) 75%~Max (141~326)
So the narrow bucket (61 to 141) holds far more data points than the wide bucket (141 to 326).
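The reading of the describe() rows can be checked on a tiny made-up series. This is only a sketch: the numbers below are illustrative and are not taken from the car data set.

```python
import pandas as pd

# Hypothetical engine sizes; the last value is deliberately far from the rest.
sizes = pd.Series([61, 90, 97, 110, 120, 130, 141, 326])

# The same 25% / 50% / 75% marks that describe() reports.
q25, q50, q75 = sizes.quantile([0.25, 0.50, 0.75])

# Share of the values at or below the 75% mark, and above it.
share_low = (sizes <= q75).mean()
share_high = (sizes > q75).mean()

print(q25, q50, q75)
print(share_low, share_high)
```

Three quarters of the values sit in the short stretch up to the 75% mark, while the long stretch from there to the maximum holds only the remaining quarter.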
So we may have noise / outliers in our data. Let’s plot some graphs to understand it more clearly.
Just run this code
snb.boxplot(data=dfOriginal,x='enginesize')
The output we get
This is the box plot from Seaborn.
Seaborn is an excellent library if you want to analyze your data graphically.
In the above image you can see one bar at around 60 & another at around 200. Together they define a range.
The plot shows us the interquartile distribution of our data points.
Do not be afraid that we are into statistics now ☺ .
For now, just notice that we have some points beyond the 200 mark.
These data points are called OUTLIERS or NOISE in the data.
In ML it is better to have data points that lie as close together as possible, or to pick a segment for your analysis where most of the data points are found.
Whatever data lies outside this dense region you may call outliers; they usually make no significant contribution to ML model building & prediction.
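The points drawn beyond the bars of a box plot follow a simple rule: by default Seaborn marks anything further than 1.5 times the interquartile range (IQR) from the box. Here is a small sketch of that rule on made-up numbers (not the car data):

```python
import pandas as pd

sizes = pd.Series([61, 90, 97, 110, 120, 130, 141, 326])  # hypothetical values

q1 = sizes.quantile(0.25)
q3 = sizes.quantile(0.75)
iqr = q3 - q1  # width of the box

# Conventional 1.5 * IQR fences; values outside them are plotted as single points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]
print(outliers.tolist())
```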
We can also run the Histogram to view the distribution of data points.
snb.histplot(data=dfOriginal,x='enginesize',kde=True)
Now if you look at the histogram, you can see that more data points are plotted towards the left side.
This is called right skewness of the data.
In a nutshell, most data points sit on the left, with a long tail stretching to the right.
We can also get the value of the skewness by running the code
“dfOriginal['enginesize'].skew()”
For us the value is 1.95; the positive value indicates that the data is right-skewed.
So we need to exclude those outliers from our data as far as we can.
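To get a feel for the sign of the skewness value, here is a minimal sketch with two made-up series: one symmetric, and one where a single large value stretches the right tail.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 3, 4, 50])  # one large value on the right

# skew() is close to 0 for symmetric data and positive for a long right tail.
print(symmetric.skew())
print(right_skewed.skew())
```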
You may ask me: well, that is all theory, man, but how do we do it!!
UDF
In my previous article I showed you the steps, breaking down every one of them; now it is time to define our own custom function to do the job ☺
A function is useful when we want a reusable component & clean code that is easy to maintain in the future.
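class CustomOutliresRemover(BaseEstimator,TransformerMixin):
    # Drops every row whose absolute z-score in the selected columns is >= `by`.
    def __init__(self,by=2,columns=None):
        self.by=by
        self.columns=columns

    def fit(self,X,Y=None):
        return self

    def transform(self,X,Y=None):
        column_to_transfer = list(X.columns)
        if self.columns:
            column_to_transfer=self.columns
        # z_scores() was not defined anywhere, so the z-score is computed directly here.
        values = X[column_to_transfer]
        zscores = (values - values.mean()) / values.std(ddof=0)
        abs_z_scores = np.abs(zscores)
        # Keep a row only if all selected columns are within the limit.
        filtered_entries = (abs_z_scores < self.by).all(axis=1)
        X = X[filtered_entries.values]
        return X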
We will use this UDF of ours on the columns ‘enginesize’ & ‘horsepower’ in a Pipeline; for now, leave it & let us focus on the column ‘CarName’.
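To see the z-score rule on its own, outside any class, here is a minimal sketch on a made-up data frame (the values are invented for illustration). It assumes the population standard deviation (ddof=0) and the same limit of 2 used for `by`.

```python
import numpy as np
import pandas as pd

# Made-up data: nine ordinary cars and one with a very large engine.
cars = pd.DataFrame({
    "enginesize": [90, 95, 100, 105, 110, 115, 120, 125, 130, 400],
    "price": [7, 8, 9, 10, 11, 12, 13, 14, 15, 60],
})

col = cars["enginesize"]
z = (col - col.mean()) / col.std(ddof=0)  # how many standard deviations from the mean

kept = cars[np.abs(z) < 2]  # drop rows 2 or more standard deviations away
print(len(cars), len(kept))
```

Only the row with the 400 engine is more than 2 standard deviations from the mean, so it is the only one dropped.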
Working with the CarName column now.
We will rename those CarName values whose frequency is 2 or less.
Let's see what we have in the CarName column: the frequency distribution of the data.
CarNameCount = dfOriginal['CarName'].value_counts()
CarNameCount
Result
Now we want to find out how many ‘CarName’ values occur 2 times or fewer.
lessCount = 0
# items() yields (name, count) pairs; iteritems() was removed in pandas 2.0
for v in CarNameCount.items():
    if v[1] <=2:
        lessCount = lessCount+1
print(lessCount)
The Result is: 137.
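The same count can also be taken without a loop. A small sketch on made-up names (not the real CarName column):

```python
import pandas as pd

names = pd.Series(["audi", "audi", "audi", "bmw", "bmw", "saab", "volvo"])  # toy data
counts = names.value_counts()

# Comparing the counts gives True/False per name; summing counts the True values.
rare = int((counts <= 2).sum())
print(rare)
```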
So now we want all these ‘CarName’ values to be replaced by a name of our own, such as ‘unknown’.
Let’s make a function for that ….
Creating our own function to convert a CarName to a custom name when its frequency count is at or below the threshold we choose.
class CustomGroupNameDefiner(BaseEstimator,TransformerMixin):
    # Replaces values that occur `treshhold` times or fewer with `renameTo`.
    def __init__(self,columns=None,treshhold=2,renameTo='uncommon'):
        self.columns=columns
        self.treshhold=treshhold
        self.renameTo=renameTo

    def fit(self,X,Y=None):
        return self

    def transform(self,X,Y=None):
        X = X.copy()  # do not modify the caller's data frame
        # Handle each column on its own so value_counts() returns plain values.
        for column in (self.columns or []):
            counts = X[column].value_counts()
            replindex = counts[counts<=self.treshhold].index
            X[column] = X[column].mask(X[column].isin(replindex),self.renameTo)
        return X
So we have successfully created our UDFs.
Time to call them using Pipeline.
Now you may ask me ‘What is a Pipeline?’ or ‘Why do we need one?’.
Well, I am not going to give you the definition; you can use any search engine to find that out.
I am going to answer in my own way!
If you have already read the 2nd part of this ongoing series, you have seen the mammoth number of steps we went through to get the result, right?
We are here to make our life simple, not complicated.
PIPELINE WITH UDF
So, if we can cut down those steps, will it not be handy! On top of that, if we want to make predictions with our ML model, we have to follow the same steps again with the new set of test data.
So if we can create a wrapper around our steps & reuse that wrapper in the future, it will be helpful for us.
Pipelines are nothing but these wrappers!
So let’s create one for us now.
pipl1 = Pipeline(steps=[
( "trans1",CustomOutliresRemover(2,columns=['enginesize']) ),
( "trans2",CustomOutliresRemover(2,columns=['horsepower']) ),
( "trans3",CustomGroupNameDefiner(columns=['CarName'],treshhold=2,renameTo='unknown') )
])
dfx1 = pipl1.fit_transform(dfOriginal)
dfx1
Just run the code , & see the magic … ☺
Result
Wow !!! we have converted 3 steps to one step …. , enjoy ….. ☺
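To make the "wrapper" idea concrete, here is a minimal sketch showing that a Pipeline gives the same result as calling the steps one after another by hand. The two steps (`drop_big`, `add_ratio`) and the data are hypothetical stand-ins, not the UDFs from this article.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({"enginesize": [100, 120, 400], "horsepower": [70, 90, 300]})

# Two hypothetical steps, each a plain function wrapped as a transformer.
drop_big = FunctionTransformer(lambda d: d[d["enginesize"] < 300])
add_ratio = FunctionTransformer(
    lambda d: d.assign(ratio=d["horsepower"] / d["enginesize"])
)

pipe = Pipeline(steps=[("drop_big", drop_big), ("add_ratio", add_ratio)])

by_pipeline = pipe.fit_transform(df)                    # one call
by_hand = add_ratio.transform(drop_big.transform(df))   # the same steps, chained manually

print(by_pipeline.equals(by_hand))
```

One caution: a Pipeline does not adjust a separate target (y) when a step drops rows, so row-dropping steps like these are best kept to the usage shown here, where the whole data frame, price included, passes through fit_transform.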
PIPELINE WITH COLUMNTRANSFORMERS
Now run ‘dfx1.nunique()’ & see the result
The ‘CarName’ column has been reduced to only 11 unique entries.
Now we will convert all the string / categorical values to numbers with the help of OneHotEncoding & OrdinalEncoding.
We will apply OrdinalEncoding to the column ‘cylindernumber’, so let’s have a close look at the values of this column.
dfx1['cylindernumber'].value_counts()
Result
So we got our unique values in ‘cylindernumber’, let’s build our transformer then.
tfr3 = ColumnTransformer(
[
('ordle', OrdinalEncoder(categories=[['two','three','four','five','six']]),['cylindernumber'] )
],
remainder='passthrough')
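Here is a small sketch of what the OrdinalEncoder does with an explicit category list, using a few made-up rows: each word is replaced by its position in the list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"cylindernumber": ["four", "six", "two", "four"]})  # toy rows

# The list fixes the order, so each word maps to its position in the list.
enc = OrdinalEncoder(categories=[["two", "three", "four", "five", "six"]])
codes = enc.fit_transform(df)

print(codes.ravel().tolist())
```

With the default settings, a value that is not in the list (for example ‘eight’) raises an error during fit, so the list has to cover every value left in the column.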
Now let’s build our ‘OneHotEncoding’ transformer.
tfr4 = ColumnTransformer(
[
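# sparse_output is the parameter name from scikit-learn 1.2 onwards; older versions use sparse=False
( 'ohe', OneHotEncoder(sparse_output=False,drop='first'),[1,2,3,4,6] )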
],
remainder='passthrough')
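A minimal sketch of OneHotEncoder with drop='first' on made-up rows. The encoder is used here on its own, outside a ColumnTransformer, and the default sparse result is converted with toarray() so the example does not depend on the sparse parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"fueltype": ["gas", "diesel", "gas"]})  # toy rows

enc = OneHotEncoder(drop="first")
# The default output is a SciPy sparse matrix; toarray() makes it dense for printing.
encoded = enc.fit_transform(df).toarray()

print(enc.categories_[0].tolist())  # categories in sorted order
print(encoded.tolist())             # one column left: 1.0 for 'gas', 0.0 for 'diesel'
```

Dropping the first category keeps one column fewer than there are categories; the dropped category is represented by all zeros.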
So we have our two column transformers ready; let’s build a pipeline for them.
Wait, did you observe one thing?
For tfr3 we specified our column by its name, ‘cylindernumber’, but for tfr4 we used the indexes of the columns, not their names!! Furthermore, if you look closely at the index values, we started with 1 instead of 0!! The ‘CarName’ index is 0 in dfx1, so why?
Let me answer these two questions before we move on….
My friends, in the next step we are going to create this pipeline. In it we will apply ‘tfr3’, the OrdinalEncoding, 1st, & then apply ‘tfr4’, the OneHotEncoding, to the outcome of the 1st step.
By default, a ColumnTransformer yields a NumPy array, not a data frame, & an array does not have column names like a data frame does.
So if we want to work with the output of the 1st step, ‘tfr3’, then we have to work with an array & refer to its columns by index.
That is why, while defining ‘tfr4’, we used the indexes of the columns & not their names.
Now coming to the 2nd question …
Yes, in ‘dfx1’ the ‘CarName’ column has the index value 0, but when we pass ‘dfx1’ through ‘tfr3’ the index values change.
To prove my theory, let’s do one small experiment here.
What we have done here is that, after creating ‘tfr3’, we applied it to our ‘dfx1’ & printed the very 1st element of the resulting array.
As you can clearly see, the value at index position 0 is ‘2.0’ & the value at position 1 is ‘unknown’.
This is because the output of our ‘tfr3’, the OrdinalEncoder, takes index 0 in the array, which pushes every column that came before ‘cylindernumber’ in dfx1 one position to the right.
So the index of each of those columns has increased by 1.
That is why, in the result array of ‘tfr3’, the index of ‘CarName’ becomes 1 instead of 0.
Complicated!! Yes, a bit … ☺
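The column shift can be reproduced on a tiny made-up data frame (three columns only, with invented values), which may be easier to follow than the full data set:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the column to encode in the middle, as in dfx1.
df = pd.DataFrame({
    "CarName": ["unknown", "audi"],
    "cylindernumber": ["four", "six"],
    "price": [100.0, 200.0],
})

ct = ColumnTransformer(
    [("ordle",
      OrdinalEncoder(categories=[["two", "three", "four", "five", "six"]]),
      ["cylindernumber"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)

# Transformed columns come first, then the passthrough columns in their original order.
print(out[0].tolist())
```

The encoded ‘cylindernumber’ lands at index 0 and ‘CarName’ moves from index 0 to index 1, which is the shift described above.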
So now let’s build our pipeline for these two ColumnTransformers.
pipl2 = Pipeline(steps=[
('tfr3' ,tfr3),
('tfr4' ,tfr4 )
])
Now we have a new pipeline with the ColumnTransformers ready …
So let’s call it.
dfx2 = pipl2.fit_transform(dfx1)
dfx2
Result
So this is the final output from which we will now create our ML model.
From here the steps are the same as in my previous article (the 2nd one) of this series.
So I will not explain them again here … please read my 2nd article for that.
from sklearn.model_selection import train_test_split

# Every column except the last is an input feature; the last column is the price.
X_Train,X_Test,Y_Train,Y_Test = train_test_split(dfx2[:,:-1],dfx2[:,-1],test_size=0.2)
print(X_Train.shape)
print(X_Test.shape)
from sklearn.linear_model import LinearRegression
lirM = LinearRegression()
lirM.fit(X_Train,Y_Train)
arryMyPredection = lirM.predict(X_Test)
print(arryMyPredection.shape)
print(" ")
print('The predicted values are :')
print(arryMyPredection)
print("The actual Price is : ")

print(Y_Test)
print(" ")
Y_Test.shape
lirM.score(X_Test,Y_Test)
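Well, our model scores about 0.76. For LinearRegression, score() returns R² (the coefficient of determination), so our model explains about 76% of the variation in price …. ☺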
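For the curious, the number returned by score() can be reproduced by hand. This is a minimal sketch on made-up data (not the car prices), using the usual definition R² = 1 - SS_res / SS_tot:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: y is roughly 2*x + 1 with a little noise.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = np.sum((y - pred) ** 2)      # variation the model fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand, model.score(X, y))
```

Both numbers agree, which shows that score() measures explained variation rather than a percentage of correct predictions.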
That’s all for now my friends.
If you want, you can make our UDFs part of a Python package, so that if needed in the future we can simply call our UDFs from that package.
I leave this part up to you …. ☺
If you like my article, please leave your comments, give it a like & share it in your social media circle.
Until then Good Bye …. Happy Coding … ☺ Thanks ……….
IF YOU WANT MORE
https://www.python.org/
https://www.anaconda.com/products/distribution
https://numpy.org/learn/
https://pandas.pydata.org/docs/index.html
https://scikit-learn.org/stable/
https://code.visualstudio.com/docs/python/python-tutorial
https://www.geeksforgeeks.org/introduction-machine-learning-using-python/
https://www.tutorialspoint.com/machine_learning_with_python/index.htm
https://en.wikipedia.org/wiki/Machine_learning
https://azure.microsoft.com/en-in/overview/what-is-machine-learning-platform/
https://www.tensorflow.org/
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html