Machine Learning with Python

PART - 3

Debdeep Chaudhuri | Machine Learning | 08.07.22


Objective
This is part 3 of my ongoing article series on Machine Learning / Deep Learning / AI.

In the 2nd part we used Pandas to load our data frame from a .csv file and removed the noise/outliers in the data.

We used ColumnTransformer to encode the data with OneHotEncoding/OrdinalEncoding, then built our very first ML model out of that data, made predictions & ran a score/accuracy test.

Now, in part 3, we will take things one step further.

We will make our steps more concise: we will create our own functions for removing outliers & use Pipelines on top of them.

Already sounds interesting!!!

Let’s start our journey then …..

STEPS
For this purpose, we will work through the following steps:

1. Create our own function to remove outliers in the data.
2. Use these functions with Pipelines.
3. Use ColumnTransformer in Pipelines.
4. Build our ML model & predict the car price.
5. Validate the model & find out its score/accuracy.

For this demo I will be writing my Python code using Visual Studio Code, but you can use any Python notebook editor you like.

I am not going to write down steps like "How to install Python/Anaconda" or "How to set up VS Code to work with Python notebooks", because you can find many articles online using any search engine.

I will share links for this purpose in the "If You Want More" section at the very end of this article.

Creating a pandas data frame from a downloaded .csv file.

First we need to import some libraries to do our work in the Python notebook.

Code Section (Copy & Paste)

import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as snb
import matplotlib as mat

# zscore & the classes below are needed later for our custom transformers & pipelines.
from scipy.stats import zscore
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer

Copy & paste the above code into your Python notebook & run it. Note that we are importing zscore, BaseEstimator, TransformerMixin & Pipeline up front, because our custom transformers & pipelines later in this article will need them.

If you are using VS Code like me, do not forget to choose the kernel before running the code, as per the following image:

Now we are ready to create our pandas data frame from the .csv file.

For this purpose, do the following:

First, go to this URL: https://www.kaggle.com/datasets/shaistashaikh/carprice-assignment

Download the files & save them on your disk; having them available offline will help.

Now load the file from your hard drive; make sure to give the right path.

It will be a lot easier if you save this .csv file in the same folder where you are saving this notebook.

Code Section (Copy & Paste)

df = pd.read_csv('CarPrice_Assignment.csv')
df.info()

If we run the code, we will get the following results

The output shows that our data frame contains 205 rows, indexed from 0 to 204, & 26 columns.

For this demo I have not taken a screenshot of the full output; I am showing only 20 of the columns.

But we are not going to analyze all the columns for this demo, so we will take a few columns from it.

Code Section (Copy & Paste)

dfOriginal = df[['CarName','fueltype','carbody','enginetype','cylindernumber',
                 'enginesize','fuelsystem','horsepower','price']]
dfOriginal.head(2)

Here we are taking 8 input columns & one output column, 'price', which we will predict.

By running the above code, we get the following result

So we have our initial data frame ready to start our work.

But before jumping into the world of building an ML model, it is always recommended to have a close look at the data; in our case, our data frame.

Code Section (Copy & Paste)

dfOriginal.info()

By running the code, we get the following output

As expected, it is a pandas DataFrame object & it has 205 rows & 9 columns.

Among these 9 columns we will take the first 8 as the input data for our model's predictions.

Now look very carefully at the first 8 columns. What do we find?

Well, we are closely looking at the nature / data type of these columns.

Among these 8 columns, "enginesize" & "horsepower" are numerical; the rest are of type string & categorical in nature.

Why is this so important for us?

Well my friend, ML models only work with numerical values, not other data types, so we need to convert our string-type categorical/ordinal values to numbers.

CHECKING FOR NOISE / OUTLIERS IN DATA


This is the very first step we need to perform.

Noise can also take the form of missing values.

If you run the code "df.isnull().sum()", it will show you whether any null values (NaN) are present in the data frame.

Fortunately, we do not have any in our working data frame.

But if you do have them in your working data frame, then you need to handle them first.

If you are wondering how to do it, read my 1st article; I have already explained it there ☺
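As a quick reminder, here is a minimal sketch of one way to do it with SimpleImputer (which we imported earlier); the choice of columns here is purely illustrative:

# A minimal sketch, assuming numeric columns with gaps: fill NaNs with the column mean.
imputer = SimpleImputer(strategy='mean')
dfOriginal[['enginesize', 'horsepower']] = imputer.fit_transform(dfOriginal[['enginesize', 'horsepower']])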

Now we will run one particular code block on our numerical columns.

Just run it, see the result & I will explain it to you.

Code Section (Copy & Paste)

dfOriginal.describe()

The Output

Now look at the "enginesize" & "horsepower" columns carefully.

Just look at the rows "min", "25%", "50%", "75%" & "max".

Let’s pick up the “enginesize” column values.

The values from min to 75% change gradually (61, 97, 120, 141), but from 75% to max we observe a huge jump (141 to 326).

You may be wondering what this indicates!

Well, the "25%" row is the first quartile: 25% of the enginesize values lie at or below it.

That means 25% of our data points lie between the values 61 & 97, & the 50% & 75% rows read the same way.

So the 75% row tells us that 75% of our data points have an enginesize between 61 & 141.

But in the massive region from 141 (75%) to 326 (max), only 25% of the data points reside.

Min~75% (61~141) vs. 75%~Max (141~326)

So the smaller value range holds far more data points than the larger one.
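You can verify these quartile boundaries yourself; a minimal sketch:

# Quartile boundaries for enginesize; the same numbers that
# dfOriginal.describe() reports in the 25% / 75% rows.
q1 = dfOriginal['enginesize'].quantile(0.25)
q3 = dfOriginal['enginesize'].quantile(0.75)
print(q1, q3, q3 - q1)  # q3 - q1 is the inter-quartile range (IQR)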

So we have possible noise / outliers in our data. Let's plot a graph to understand it more clearly.

Just run this code

Code Section (Copy & Paste)

snb.boxplot(data=dfOriginal,x='enginesize')

The output we get

This is Seaborn's boxplot.

Seaborn is an excellent library if you want to analyze your data graphically.

Now in the above image you can see one whisker around 60 & another around 200; together they define a range.

It is showing us the inter-quartile distribution of our data points.

Do not be afraid, we are only dipping into statistics here ☺.

For now, just notice that we have some points beyond the 200 mark.

These data points are called OUTLIERS or NOISE in the data.

In ML it is better to work with data points that lie close together, or to pick the segment of the data where most of the points reside for your analysis.

Whatever lies far from this dense region can be called an outlier, & it makes no significant contribution to ML model building & prediction.

We can also plot a histogram to view the distribution of the data points.

Code Section (Copy & Paste)

snb.histplot(data=dfOriginal,x='enginesize',kde=1)

Now if you look at the histogram, you can see that more data points are plotted towards the left side.

This is called right skewness of the data.

In a nutshell, more data points sit on the left, with a long thin tail stretching to the right.

We can also get the value of the skewness if we run the code

“dfOriginal['enginesize'].skew()”

For us the value is 1.95; a positive value indicates that the data is right skewed.

So we need to exclude these outliers from our data as much as possible.

Now you may ask me: well, that is the theory, man, but how do we actually do it?

UDF
In my previous article I showed you the steps, breaking down every one of them; now it is time to define our own custom function to do the job ☺

A function is necessary when we want a reusable component & clean code that is easy to maintain in the future.

Code Section (Copy & Paste)

class CustomOutlierRemover(BaseEstimator, TransformerMixin):

    def __init__(self, by=2, columns=None):
        # 'by' is the z-score cut-off; rows beyond it are treated as outliers.
        self.by = by
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # If no columns are given, apply the filter to every column.
        columns_to_transform = self.columns if self.columns else list(X.columns)

        # Absolute z-scores of the selected columns (zscore comes from scipy.stats).
        abs_z_scores = np.abs(zscore(X[columns_to_transform]))

        # Keep only the rows whose z-scores are all below the cut-off.
        keep = (abs_z_scores < self.by).all(axis=1)

        return X[keep]
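Before we wire this class into a Pipeline, here is a minimal standalone sanity check (the exact row counts depend on your data):

# Hypothetical quick test: how many rows survive the z-score filter?
remover = CustomOutlierRemover(by=2, columns=['enginesize'])
print(dfOriginal.shape, '->', remover.fit_transform(dfOriginal).shape)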

We will use this UDF of ours on the columns 'enginesize' & 'horsepower' in a Pipeline; for now let us leave that & focus on the column 'CarName'.

Working with the CarName column now.

We will convert those CarName values whose frequency is 2 or less.

Let's see what we have in the CarName column: the frequency distribution of its values.

Code Section (Copy & Paste)

CarNameCount = dfOriginal['CarName'].value_counts()
CarNameCount

Result

Now we want to find out how many 'CarName' values have a count (occurrence) of 2 or less.

Code Section (Copy & Paste)

lessCount = 0
for name, count in CarNameCount.items():  # items() replaces the deprecated iteritems()
    if count <= 2:
        lessCount = lessCount + 1

print(lessCount)

The Result is: 137.
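As a side note, pandas can do the same count in one line; this sketch should print the same 137:

print((CarNameCount <= 2).sum())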

So now we want all these 'CarName' values to be replaced by a name of our own, like 'unknown'.

Let’s make a function for that ….

Creating our own function to convert CarName values to a custom name when their frequency count is less than the value we want.

Code Section (Copy & Paste)

class CustomGroupNameDefiner(BaseEstimator, TransformerMixin):

    def __init__(self, columns=None, threshold=2, rename_to='uncommon'):
        self.columns = columns
        self.threshold = threshold
        self.rename_to = rename_to

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = X.copy()
        # If no columns are given, apply the renaming to every column.
        columns_to_transform = self.columns if self.columns else list(X.columns)
        for col in columns_to_transform:
            counts = X[col].value_counts()
            # Values whose frequency is at or below the threshold get renamed.
            rare_values = counts[counts <= self.threshold].index
            X[col] = X[col].replace(rare_values, self.rename_to)
        return X
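Again, a minimal standalone check before we call it from a Pipeline:

# Hypothetical quick test: CarName should collapse to far fewer unique values.
definer = CustomGroupNameDefiner(columns=['CarName'], threshold=2, rename_to='unknown')
print(definer.fit_transform(dfOriginal)['CarName'].nunique())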

So we have successfully created our UDFs.

Time to call them using Pipeline.

Now you may ask me 'What is a Pipeline?' or 'Why do we need them?'.

Well, I am not going to give you the definition of that, you can use any search engine & find that out.

I am going to answer you in my own way!

If you have already read the 2nd part of this ongoing series, you have seen the mammoth number of steps we went through to get the result, right?

We are here to make our life simple not complicated.

PIPELINE WITH UDF
So, if we can cut down those steps, will it not be handy? On top of that, whenever we want to make a prediction with our ML model, we have to follow the same steps again with the new set of test data.

So if we can create a wrapper around our steps & reuse that wrapper in the future, it will be very helpful for us.

Pipelines are nothing but these wrappers!

So let’s create one for us now.

Code Section (Copy & Paste)

pipl1 = Pipeline(steps=[
    ( "trans1", CustomOutlierRemover(2, columns=['enginesize']) ),
    ( "trans2", CustomOutlierRemover(2, columns=['horsepower']) ),
    ( "trans3", CustomGroupNameDefiner(columns=['CarName'], threshold=2, rename_to='unknown') )
])
dfx1 = pipl1.fit_transform(dfOriginal)
dfx1

Just run the code & see the magic … ☺

Result

Wow!!! We have converted 3 steps into a single step …. Enjoy ….. ☺

PIPELINE WITH COLUMNTRANSFORMERS
Now run ‘dfx1.nunique()’ & see the result.

The ‘CarName’ column has been reduced to only 11 unique entries.

Now we will convert all the string/ordinal values to numeric with the help of OneHotEncoding &
OrdinalEncoding.

We will apply OrdinalEncoding on the column ‘cylindernumber’, so let’s have a close look at the values of this column.

Code Section (Copy & Paste)

dfx1['cylindernumber'].value_counts()

Result

So we got our unique values in ‘cylindernumber’; let’s build our transformer then.

Code Section (Copy & Paste)

tfr3 = ColumnTransformer(
[
('ordle', OrdinalEncoder(categories=[['two','three','four','five','six']]),['cylindernumber'] )
],
remainder='passthrough')

Now let’s build our ‘OneHotEncoding’ transformer.

Code Section (Copy & Paste)

tfr4 = ColumnTransformer(
    [
        # Note: on scikit-learn >= 1.2 the argument is sparse_output=False instead of sparse=False.
        ( 'ohe', OneHotEncoder(sparse=False, drop='first'), [1,2,3,4,6] )
    ],
    remainder='passthrough')

So we have our two column transformers ready; let’s build a pipeline for them.

Wait, did you observe one thing?

For tfr3 we specified our column by its name, ‘cylindernumber’, but for tfr4 we used the indices of the columns, not their names! Furthermore, if you look closely at the index values, we started with 1 instead of 0, even though the ‘CarName’ index is 0 in dfx1. Why so?

Let me answer these two questions for you before we move on….

My friends, in the next step we are going to create this pipeline. Here we apply ‘tfr3’ (the OrdinalEncoding) first, & then we apply ‘tfr4’ (the OneHotEncoding) to the outcome of the first step.

Now, any ColumnTransformer yields a NumPy array, not a data frame, & an array has no column names like a data frame does.

So if we want to work with the output of the first step, ‘tfr3’, we have to address the array by column index.

That is why, while defining ‘tfr4’, we used the indices of the columns instead of their names.

Now coming to the 2nd question …

Yes, in ‘dfx1’ the ‘CarName’ column has index 0, but when we pass ‘dfx1’ through ‘tfr3’ the index values change.

To prove this, let’s do one small experiment here.
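The original experiment was shown as a screenshot; here is a minimal sketch of the same idea:

# Apply tfr3 alone to dfx1 & print the very first row of the resulting array.
out = tfr3.fit_transform(dfx1)
print(out[0])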

What we have done here: after creating ‘tfr3’ we applied it to our ‘dfx1’ & printed the very first row of the array outcome.

As you can clearly see, the value at index 0 is ‘2.0’ & the value at index 1 is ‘unknown’.

This is because, in the output of ‘tfr3’, the OrdinalEncoder column has taken index 0 in the array, which pushed every column that came before ‘cylindernumber’ in dfx1 one position to the right.

So the index of those columns has increased by 1.

That is why, in the result array of ‘tfr3’, the index of ‘CarName’ becomes 1 instead of 0.

Complicated!! Yes a bit … ☺

So now let’s build our pipeline for these two ColumnTransformers.

Code Section (Copy & Paste)

pipl2 = Pipeline(steps=[
('tfr3' ,tfr3),
('tfr4' ,tfr4 )
])

Now we have our new pipeline with the ColumnTransformers ready …

So let’s call it.

Code Section (Copy & Paste)

dfx2 = pipl2.fit_transform(dfx1)
dfx2

Result

So this is the final output from which we will now create our ML model.

From here the steps are the same as in my previous article (the 2nd one) of this series.

So I will not explain them again here … please read my 2nd article for details.

Code Section (Copy & Paste)

from sklearn.model_selection import train_test_split

# All columns except the last one are the features; the last column is the price.
X_Train, X_Test, Y_Train, Y_Test = train_test_split(dfx2[:, :-1], dfx2[:, -1], test_size=0.2)
print(X_Train.shape)
print(X_Test.shape)

from sklearn.linear_model import LinearRegression

lirM = LinearRegression()

lirM.fit(X_Train, Y_Train)

Code Section (Copy & Paste)

predictions = lirM.predict(X_Test)
print(predictions.shape)

print(" ")
print('The predicted values are:')
print(predictions)

Code Section (Copy & Paste)

print("The actual Price is : ")


print(Y_Test)
print(" ")
Y_Test.shape

Code Section (Copy & Paste)

lirM.score(X_Test,Y_Test)

Well, our model scores about 0.76 on the test set; for LinearRegression this score is the R² value, so the model explains roughly 76% of the variance in price …. ☺
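By the way, you can verify what score() returns with a minimal sketch using sklearn's r2_score:

from sklearn.metrics import r2_score

# This should print the same value as lirM.score(X_Test, Y_Test).
print(r2_score(Y_Test, lirM.predict(X_Test)))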

That’s all for now my friends.

If you want, you can make our UDFs part of a Python package, so that in the future, whenever needed, we can simply import them from that package.

I leave this part up to you …. ☺
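As a starting hint, a minimal sketch, assuming you save the two classes in a file named mytransformers.py (a hypothetical module name) next to your notebooks:

# mytransformers.py (hypothetical) would contain CustomOutlierRemover
# & CustomGroupNameDefiner; then any notebook can simply do:
from mytransformers import CustomOutlierRemover, CustomGroupNameDefiner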

If you like my article, please leave your comments, give it a like & share it in your social media circle.

Until then Good Bye …. Happy Coding … ☺ Thanks ……….

IF YOU WANT MORE
https://www.python.org/

https://www.anaconda.com/products/distribution

https://numpy.org/learn/

https://pandas.pydata.org/docs/index.html

https://scikit-learn.org/stable/

https://code.visualstudio.com/docs/python/python-tutorial

https://www.geeksforgeeks.org/introduction-machine-learning-using-python/

https://www.tutorialspoint.com/machine_learning_with_python/index.htm

https://en.wikipedia.org/wiki/Machine_learning

https://azure.microsoft.com/en-in/overview/what-is-machine-learning-platform/

https://www.tensorflow.org/

https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
