PART - 3
In the 2nd part we used Pandas to load our data frame from a .csv file & removed the noise/outliers from the data.
In this part we will make those steps more concise: we will create our own functions for removing outliers & use Pipelines on top of that.
STEPS
For this purpose, we will be working on the following steps:
For this demo I will be writing my Python code in Visual Studio Code, but you can use any Python notebook editor you like.
I am not going to write down steps like “How to install Python/Anaconda” or “How to set up VS Code to work with Python notebooks”, because you can find many articles on the web for this using any search engine like Google.
I will share links for this purpose in the “If You Want More” section at the very end of this article.
Creating a pandas data frame from a downloaded .csv file.
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as snb  # used for the plots later in this article
Copy & paste the above code into your Python notebook & run it.
If you are using VS Code like me, do not forget to choose the kernel before running the code, as per the following image:
Now we are ready to create our pandas data frame from the .csv file.
For this purpose, do the following:
Download the file & save it on your disk; having it available offline will help.
Now load it from your hard drive & make sure to give the right path to the file.
It will be a lot easier if you save this .csv file in the same folder where you are saving this notebook.
df = pd.read_csv('CarPrice_Assignment.csv')
df.info()
The output shows that our data frame contains 205 rows, indexed from 0 to 204, & 26 columns.
For this demo I have not taken a screenshot of the full output; I am showing only 20 columns.
But we are not going to analyse all the columns in this demo, so we will take only a few of them.
dfOriginal = df[['CarName','fueltype','carbody','enginetype','cylindernumber','enginesize','fuelsystem','horsepower','price']]
dfOriginal
dfOriginal.head(2)
Here we are taking 8 input columns & one output column, the price column that we will predict.
But before starting the work or jumping into building an ML model, it is always recommended to have a close look at the data, in our case our data frame.
dfOriginal.info()
As expected it is a pandas DataFrame object & it has 205 rows & 9 columns.
Among these 9 columns we will take the first 8 as the input data for our model.
Now look very carefully at the first 8 columns. What do we find?
Well, we are looking closely at the nature / data type of these columns.
Among these 8 columns, “enginesize” & “horsepower” are numerical; the rest are of type string & categorical in nature.
Well my friend, as an ML model only works with numerical values & not other data types, we need to convert our string-type categorical values to numerical ones.
If you run the code “df.isnull().sum()”, it will show you whether any null values (NaN) are present in the data frame or not.
If you do have them in your working data frame, then you need to handle them first.
If you are wondering how to do that, read my 1st article; I have already explained it there ☺
Now we will run one particular code block on our numerical columns.
Just run it, see the result & I will explain it to you.
dfOriginal.describe()
The Output
Just look at the rows “min”, “25%”, “50%”, “75%” & “max” for enginesize.
The values from min to 75% change, but not by very much (61, 97, 120, 141), whereas from 75% to max we can observe a huge change in value (141 to 326).
Well, the 25% value is the point below which 25% of the data points lie.
It means that 25% of our enginesize values reside between 61 & 97, & so on for 50% & 75%.
So the 75% value tells us that 75% of our enginesize data points reside between 61 & 141.
But in the massive region from 141 (75%) to 326 (max), only 25% of the data points reside.
So the narrow range (61 to 141) holds far more data points than the wide range (141 to 326).
So we possibly have noise / outliers in our data. Let’s plot some graphs to understand it more clearly.
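The same reading of the quartiles can be sketched in a few lines of code. This is only an illustration: the numbers in `sizes` are hypothetical values I made up to mimic a long right tail, they are not rows from CarPrice_Assignment.csv.

```python
import pandas as pd

# Hypothetical engine sizes, chosen only to mimic a long right tail;
# these are NOT rows from CarPrice_Assignment.csv.
sizes = pd.Series(
    [61, 90, 97, 98, 110, 120, 122, 136, 141, 181, 209, 326],
    name="enginesize",
)

# The same three quartiles that describe() reports as 25%, 50% and 75%.
q1, q2, q3 = sizes.quantile([0.25, 0.50, 0.75])

# Width of the value range covered by the lower 75% vs. the upper 25%.
lower_width = q3 - sizes.min()
upper_width = sizes.max() - q3

# Share of rows that sit above the 75% mark.
share_above_q3 = (sizes > q3).mean()

print(f"25%={q1}, 50%={q2}, 75%={q3}")
print(f"lower 75% spans {lower_width}, upper 25% spans {upper_width}")
print(f"share of rows above the 75% mark: {share_above_q3:.2f}")
```

If the upper span is much wider than the lower one while holding only a quarter of the rows, that is the same hint of outliers we read from the describe() table.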
snb.boxplot(data=dfOriginal,x='enginesize')
Seaborn is an excellent library if you want to analyse your data in a graphical way.
In the above image you can see one bar around 60 & another around 200. Together they define a range.
For now, just notice that we have some points beyond the 200 mark.
In ML it is better to have data points that are as close together as possible, or to pick a segment for your analysis where most of the data points are found.
Whatever data resides outside this dense region you may call outliers; they contribute little to building the ML model & to its predictions.
We can also plot a histogram to view the distribution of the data points.
snb.histplot(data=dfOriginal,x='enginesize',kde=True)
Now if you look at the histogram, you can see that more data points are plotted towards the left side.
In a nutshell, more data points are available on the left than on the right.
We can also get the value of the skewness if we run the code
“dfOriginal['enginesize'].skew()”
For us the value is 1.95; the positive value indicates that the data is right skewed.
So we need to exclude those far-right points (the outliers) from our data as much as possible.
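To get a feel for what the sign of the skewness tells us, here is a small sketch with two made-up series (both are hypothetical, not taken from our data set): one with a few large values on the right & one that is perfectly symmetric.

```python
import pandas as pd

# Hypothetical values: most are small, a few are very large (long right tail).
right_tailed = pd.Series([90, 92, 97, 98, 108, 109, 110, 120, 122, 130, 209, 326])

# Hypothetical values spread evenly around their mean.
symmetric = pd.Series([90, 100, 110, 120, 130, 140, 150])

skew_right = right_tailed.skew()  # positive: the tail is on the right
skew_sym = symmetric.skew()       # zero: no tail on either side

print("right-tailed skew:", skew_right)
print("symmetric skew   :", skew_sym)
```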
The question you may ask me: well, this is the theory man, but how do we do it!!
UDF
In my previous article I showed you the steps, breaking down every one of them; now it is time to define our own custom function to do the job ☺
A function is necessary when we want a reusable component & want to write clean code for easy maintenance in the future.
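from sklearn.base import BaseEstimator, TransformerMixin
from scipy.stats import zscore as z_scores  # assumption: the original does not show where z_scores comes from

class CustomOutliresRemover(BaseEstimator, TransformerMixin):
    """Drop rows whose absolute z-score is `by` or more in any chosen column."""
    def __init__(self, by=2, columns=None):
        self.by = by
        self.columns = columns

    def fit(self, X, Y=None):
        return self

    def transform(self, X, Y=None):
        # Fall back to every column when none are given (they must all be numeric).
        column_to_transfer = self.columns if self.columns else list(X.columns)
        abs_z_scores = np.abs(np.asarray(z_scores(X[column_to_transfer])))
        # Keep a row only if all selected columns are below the cut-off.
        filtered_entries = (abs_z_scores < self.by).all(axis=1)
        return X[filtered_entries]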
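If you want to see the z-score rule on its own, outside any class, the following is a minimal sketch on a toy data frame. The frame `toy` & its values are hypothetical, & the z-score is computed by hand with the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the last row is deliberately extreme.
toy = pd.DataFrame({
    "enginesize": [97, 98, 108, 109, 110, 120, 122, 130, 136, 400],
    "label": list("abcdefghij"),
})

col = toy["enginesize"]

# z-score = (value - mean) / standard deviation (population std, ddof=0).
z = (col - col.mean()) / col.std(ddof=0)

# Keep rows whose absolute z-score is below the cut-off (2, like by=2).
kept = toy[np.abs(z) < 2]

print(kept)
```

Only the row with the extreme value is dropped; all the rows close to the mean stay in the frame.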
We will use this UDF of ours on the columns ‘enginesize’ & ‘horsepower’ in a Pipeline; for now leave it & let us focus on the column ‘CarName’.
Let's see what we have in the CarName column, the frequency distribution of the data.
Code Section (Copy & Paste)
CarNameCount = dfOriginal['CarName'].value_counts()
CarNameCount
Result
Now we want to find out how many ‘CarName’ values have a count (occurrence) of 2 or less.
lessCount = 0
for name, count in CarNameCount.items():  # iteritems() was removed in pandas 2.0
    if count <= 2:
        lessCount = lessCount + 1
print(lessCount)
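The same count can also be taken without a loop. This is just an alternative sketch; the `names` series below is hypothetical & only stands in for the CarName column.

```python
import pandas as pd

# Hypothetical names, standing in for the CarName column.
names = pd.Series([
    "toyota corolla", "toyota corolla", "toyota corolla",
    "peugeot 504", "peugeot 504",
    "audi 100ls",
    "bmw x3",
])
counts = names.value_counts()

# Comparing gives True/False per name; summing counts the True values,
# i.e. how many distinct names occur 2 times or fewer.
less_count = int((counts <= 2).sum())
print(less_count)
```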
So now we want all these ‘CarName’ values to be replaced by our own name, like ‘unknown’.
Creating our own function to convert to a custom name every CarName whose frequency count is at or below the value we want.
class CustomGroupNameDefiner(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, treshhold=2, renameTo='uncommon'):
        self.columns = columns
        self.treshhold = treshhold
        self.renameTo = renameTo

    def fit(self, X, Y=None):
        return self

    def transform(self, X, Y=None):
        X = X.copy()  # do not modify the caller's data frame
        # Fall back to every column when none are given.
        column_to_transfer = self.columns if self.columns else list(X.columns)
        for column in column_to_transfer:
            counts = X[column].value_counts()
            # Values that occur `treshhold` times or fewer get the common name.
            replindex = counts[counts <= self.treshhold].index
            X.loc[X[column].isin(replindex), column] = self.renameTo
        return X
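To see the renaming idea in isolation, here is a small sketch on a toy data frame (the frame & its values are hypothetical). It uses where() & isin() to swap every rarely seen name for ‘unknown’; this is one possible way to do it, not necessarily the only one.

```python
import pandas as pd

# Hypothetical toy frame standing in for the CarName column.
toy = pd.DataFrame({
    "CarName": ["toyota corolla"] * 3 + ["peugeot 504"] * 2 + ["audi 100ls"],
})

counts = toy["CarName"].value_counts()

# Names that occur 2 times or fewer.
rare = counts[counts <= 2].index

# Keep the name where it is not rare, otherwise write 'unknown'.
grouped = toy["CarName"].where(~toy["CarName"].isin(rare), "unknown")

print(grouped.value_counts())
```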
Well, I am not going to give you the definition of that; you can use any search engine & find it out.
If you have already read the 2nd part of this ongoing series, you have already seen the mammoth steps we had to go through to get the result, right?
PIPELINE WITH UDF
So, if we can cut down those steps, will it not be handy! And on top of that, if we want to do some prediction using our ML model, then we have to follow the same steps again with the new set of test data.
So if we can create a wrapper around our steps & follow that wrapper in the future as well, it will be helpful for us.
from sklearn.pipeline import Pipeline

pipl1 = Pipeline(steps=[
    ("trans1", CustomOutliresRemover(2, columns=['enginesize'])),
    ("trans2", CustomOutliresRemover(2, columns=['horsepower'])),
    ("trans3", CustomGroupNameDefiner(columns=['CarName'], treshhold=2, renameTo='unknown'))
])
dfx1 = pipl1.fit_transform(dfOriginal)
dfx1
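If the idea of a Pipeline is still new to you, the sketch below shows the bare mechanism: each step receives the output of the step before it, in the order given. The two helper functions & the toy frame are hypothetical, & I use scikit-learn's FunctionTransformer here only to keep the example short; it is not what our UDFs are built on.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def drop_large(df):
    # Hypothetical step 1: keep only rows with a small engine size.
    return df[df["enginesize"] < 200]


def upper_names(df):
    # Hypothetical step 2: upper-case the names of the rows that are left.
    out = df.copy()
    out["CarName"] = out["CarName"].str.upper()
    return out


toy = pd.DataFrame({"CarName": ["a", "b", "c"], "enginesize": [100, 326, 120]})

pipe = Pipeline(steps=[
    ("filter", FunctionTransformer(drop_large)),
    ("rename", FunctionTransformer(upper_names)),
])

# Step 1 removes row "b"; step 2 then only sees rows "a" and "c".
result = pipe.fit_transform(toy)
print(result)
```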
Result
PIPELINE WITH COLUMNTRANSFORMERS
Now run ‘dfx1.nunique()’ & see the result.
Now we will convert all the string/ordinal values to numeric with the help of OneHotEncoding & OrdinalEncoding.
We will apply OrdinalEncoding to the column ‘cylindernumber’, so let’s have a close look at the values of this column.
dfx1['cylindernumber'].value_counts()
Result
So we got our unique values in ‘cylindernumber’, let’s build our transformer then.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

tfr3 = ColumnTransformer(
    [
        ('ordle', OrdinalEncoder(categories=[['two','three','four','five','six']]), ['cylindernumber'])
    ],
    remainder='passthrough')
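Here is a small sketch of what the OrdinalEncoder does with the category order we gave it. The input array `cyl` is hypothetical; the point is only that each word is replaced by its position in our list, so ‘two’ becomes 0, ‘three’ becomes 1, ‘four’ becomes 2 & so on.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The order of this list decides the number each category gets.
order = [["two", "three", "four", "five", "six"]]
encoder = OrdinalEncoder(categories=order)

# Hypothetical cylinder values, one column, four rows.
cyl = np.array([["four"], ["six"], ["two"], ["four"]], dtype=object)

codes = encoder.fit_transform(cyl)
print(codes.ravel())
```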
Now let’s build our ‘OneHotEncoding’ transformer.
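from sklearn.preprocessing import OneHotEncoder

tfr4 = ColumnTransformer(
    [
        # sparse_output needs scikit-learn 1.2+; older versions use sparse=False
        ('ohe', OneHotEncoder(sparse_output=False, drop='first'), [1,2,3,4,6])
    ],
    remainder='passthrough')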
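To see what drop='first' does, here is a small sketch on a hypothetical fuel column with two categories. The encoder sorts the categories, drops the first one (‘diesel’) & keeps one 0/1 column for the other (‘gas’). I call toarray() on the default sparse result so the sketch does not depend on the sparse / sparse_output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fuel types, one column, three rows.
fuel = np.array([["gas"], ["diesel"], ["gas"]], dtype=object)

encoder = OneHotEncoder(drop="first")

# The default output is a sparse matrix; toarray() makes it a plain array.
encoded = encoder.fit_transform(fuel).toarray()

print(encoder.categories_)
print(encoded)
```

With two categories we get one column instead of two, because the dropped category is already implied when the remaining column is 0.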
So we have our two column transformers ready; let’s build a pipeline for them.
For tfr3 we specified our column by its name, ‘cylindernumber’, but for tfr4 we used the index of the columns, not the names!! Furthermore, if you look closely at the index values, we have started with 1 instead of 0 here!! The ‘CarName’ index is 0 in dfx1, so why?
Let me answer these two questions for you before we move on….
My friends, in the next step we are going to create this pipeline. There we are going to use ‘tfr3’, the OrdinalEncoding, 1st & then apply ‘tfr4’, the OneHotEncoding, to the outcome of the 1st step.
Now, by default any ColumnTransformer will yield a NumPy array, not a data frame, & an array does not have column names like a data frame.
So if we want to work with the output of the 1st step, ‘tfr3’, then we have to work with an array by calling its column index.
That is why, while defining our ‘tfr4’, we used the index of the columns & not their names.
Yes, in ‘dfx1’ the ‘CarName’ column has the index value 0, but when we pass our ‘dfx1’ through ‘tfr3’ the index values change.
Here what we have done is that after creating ‘tfr3’ we applied it to our ‘dfx1’ & printed the very 1st element of the resulting array.
As you can clearly see, the value at index position 0 is ‘2.0’ & at position 1 it is ‘unknown’.
It is because the output of our ‘tfr3’, the OrdinalEncoder, has taken the 0 index place in the array, which pushed every column that came before ‘cylindernumber’ in dfx1 one position to the right.
That is why in the result array of ‘tfr3’ the index of ‘CarName’ becomes 1 instead of 0.
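You can reproduce this index shift on a tiny example. The sketch below uses a hypothetical three-column frame (not our real dfx1) with the same kind of transformer as ‘tfr3’: the encoded ‘cylindernumber’ lands at position 0 & the passthrough columns follow in their original order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: 'cylindernumber' is the LAST column here.
toy = pd.DataFrame({
    "CarName": ["unknown", "toyota corolla"],
    "fueltype": ["gas", "diesel"],
    "cylindernumber": ["four", "six"],
})

ct = ColumnTransformer(
    [
        ("ordle",
         OrdinalEncoder(categories=[["two", "three", "four", "five", "six"]]),
         ["cylindernumber"]),
    ],
    remainder="passthrough",
)

out = ct.fit_transform(toy)

# Transformed columns come first, the untouched ones are appended after them.
first_row = out[0]
print(first_row)
```

So ‘CarName’, which was at index 0 in the frame, is found at index 1 in the array.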
So now let’s build our pipeline for these two ColumnTransformers.
pipl2 = Pipeline(steps=[
('tfr3' ,tfr3),
('tfr4' ,tfr4 )
])
Now we have our new pipeline with the ColumnTransformers ready …
dfx2 = pipl2.fit_transform(dfx1)
dfx2
Result
So this is the final output from which we will now create our ML model.
From here the steps are the same as in my previous article (the 2nd one) of this series.
So I will not explain them again here … please read my 2nd article for this.
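For readers who do not have the 2nd article at hand, here is a rough sketch of how the split into train & test data could look. Everything in it is an assumption on my side: the array `data` is randomly generated & only stands in for ‘dfx2’, I assume the target (price) sits in the last column, & the test size & random_state are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for dfx2: 50 rows, 3 feature columns + target last.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))
target = features @ np.array([2.0, -1.0, 0.5]) + 10.0
data = np.column_stack([features, target])

# Everything except the last column is input, the last column is the output.
X = data[:, :-1]
Y = data[:, -1]

# Assumed split: 80% of the rows for training, 20% for testing.
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_Train, Y_Train)
r2 = model.score(X_Test, Y_Test)  # R^2 on the held-out rows
print(X_Train.shape, X_Test.shape, r2)
```

The score is close to 1 here only because the made-up target is an exact linear function of the features; with real data you should expect a lower value.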
Code Section (Copy & Paste)
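from sklearn.linear_model import LinearRegression

# X_Train, X_Test, Y_Train & Y_Test come from the train/test split described in the 2nd article.
lirM = LinearRegression()
lirM.fit(X_Train,Y_Train)
arryMyPredection = lirM.predict(X_Test)
print(arryMyPredection.shape)
print(" ")
print('The predicted values are :')
print(arryMyPredection)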
Code Section (Copy & Paste)
lirM.score(X_Test,Y_Test)
If you want, you can make our UDFs part of a Python package, so that in future, if needed, we can simply call our UDFs from that package.
If you like my article, please leave your comments, give it a like & share in your social media circle.
IF YOU WANT MORE
https://www.python.org/
https://www.anaconda.com/products/distribution
https://numpy.org/learn/
https://pandas.pydata.org/docs/index.html
https://scikit-learn.org/stable/
https://code.visualstudio.com/docs/python/python-tutorial
https://www.geeksforgeeks.org/introduction-machine-learning-using-python/
https://www.tutorialspoint.com/machine_learning_with_python/index.htm
https://en.wikipedia.org/wiki/Machine_learning
https://azure.microsoft.com/en-in/overview/what-is-machine-learning-platform/
https://www.tensorflow.org/
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html