
04/09/2021 AP19110010030_Assignment-4 - Jupyter Notebook

Kilaru Sravan

AP19110010030

CSE-A
In [1]:

import pandas as pd
import numpy as np

In [2]:

# read the data from the CSV file
df = pd.read_csv('ipl.csv')
df  # display the DataFrame
Out[2]:

            batsman  total_runs  out  numberofballs    average  strikerate
0           V Kohli        5426  152           4111  35.697368  131.987351
1          SK Raina        5386  160           3916  33.662500  137.538304
2         RG Sharma        4902  161           3742  30.447205  130.999466
3         DA Warner        4717  114           3292  41.377193  143.286756
4          S Dhawan        4601  137           3665  33.583942  125.538881
...             ...         ...  ...            ...        ...         ...
511        ND Doshi           0    1             13   0.000000    0.000000
512         J Denly           0    1              1   0.000000    0.000000
513         S Ladda           0    2              9   0.000000    0.000000
514  V Pratap Singh           0    1              1   0.000000    0.000000
515       S Kaushik           0    1              1   0.000000    0.000000

516 rows × 6 columns

In [3]:

df.isnull().sum()
Out[3]:

batsman 0

total_runs 0

out 0

numberofballs 0

average 34

strikerate 0

dtype: int64


Removing the Unwanted data


In [4]:

df=df.drop(['batsman','numberofballs','strikerate'], axis=1)
df
Out[4]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

516 rows × 3 columns

In [5]:

df.isnull().sum()
Out[5]:

total_runs 0

out 0

average 34

dtype: int64

In [6]:

df.isnull().sum().sum()
Out[6]:

34

Handling the missing data

Forward fill(ffill)
In [7]:

df1 = df.copy()


In [8]:

df1.ffill(inplace = True)
df1
Out[8]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

516 rows × 3 columns

In [9]:

df1.isnull().sum().sum()
Out[9]:

Backward fill(bfill)
In [10]:

df2 = df.copy()


In [11]:

df2.bfill(inplace = True)
df2
Out[11]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

516 rows × 3 columns
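
The printed head and tail of df1 and df2 look the same, so the difference between the two fills is easier to see on a tiny example. The sketch below uses a made-up Series (its values are assumptions, not taken from ipl.csv) to show which neighbour each method copies from.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps; not data from ipl.csv
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

forward = s.ffill()   # each NaN takes the last valid value before it
backward = s.bfill()  # each NaN takes the next valid value after it

print(forward.tolist())
print(backward.tolist())
```

A gap at the very end of a column has no later value to copy, so it survives bfill (and a gap at the very start survives ffill). Checking isnull() afterwards, as the notebook does, is therefore still worthwhile.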

In [12]:

df2.isnull().sum().sum()
Out[12]:

Using mean, median, mode to replace missing values (fillna)
In [13]:

df3 = df.copy()


In [14]:

df3.fillna(np.mean(df3["average"]),inplace = True)
df3
Out[14]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

516 rows × 3 columns

In [15]:

# Series.median() skips NaN, whereas np.median() would return NaN if any were left
df3.fillna(df3["average"].median(), inplace=True)

In [16]:

# mode() returns a Series (there can be several modes), so take its first entry
df3['average'] = df3['average'].fillna(df3['average'].mode()[0])
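
To compare the three statistics side by side, each fill can be applied to its own copy of the column. This is a minimal sketch on a made-up Series (the values are hypothetical), not a rerun of the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'average' column
avg = pd.Series([30.0, np.nan, 20.0, 20.0, np.nan, 50.0], name="average")

# fillna returns a new Series each time, so the three results are independent
filled_mean = avg.fillna(avg.mean())            # mean of the non-missing values
filled_median = avg.fillna(avg.median())        # median of the non-missing values
filled_mode = avg.fillna(avg.mode().iloc[0])    # first (smallest) mode

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```

Here the mean is 30.0, the median 25.0 and the mode 20.0, so the same two gaps receive three different replacement values.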

In [17]:

df3.isnull().sum().sum()
Out[17]:

Filling with a global constant


In [18]:

df4 = df.copy()


In [19]:

df4.fillna(0.00000,inplace = True)
df4
Out[19]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

516 rows × 3 columns

In [20]:

df4.isnull().sum().sum()
Out[20]:

Dropping rows with missing values (dropna)


In [21]:

df5 = df.copy()


In [22]:

df5.dropna(axis=0,inplace=True)
df5
Out[22]:

     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000

482 rows × 3 columns
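
dropna keeps the original row labels, which is why the output above still ends at label 515 although only 482 rows remain. The sketch below shows this on a small made-up frame (hypothetical values) and how reset_index can renumber the rows if contiguous labels are wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing average
toy = pd.DataFrame({
    "total_runs": [100, 5, 0],
    "out": [4, 1, 1],
    "average": [25.0, np.nan, 0.0],
})

cleaned = toy.dropna(axis=0)                 # drops the row labelled 1
renumbered = cleaned.reset_index(drop=True)  # labels become 0, 1

print(cleaned.index.tolist())
print(renumbered.index.tolist())
```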

In [23]:

df5.isnull().sum().sum()
Out[23]:

Converting noisy data into smooth data


In [24]:

df6 = df.copy()

In [25]:

data = df6['average'] # selecting the column whose noisy values will be smoothed
data = data[:20] # initially taking only the first 20 rows for convenience
data = np.sort(data)
print(data)

[22.5511811 25.27522936 26.58695652 27.17647059 27.22641509 28.33333333
29.06140351 29.29090909 30.05263158 30.44720497 31.2173913 31.48507463
32.76923077 33.58394161 33.6625 35.69736842 37.71186441 41.13636364
41.37719298 42.44230769]

Creating bins manually



In [26]:

b1=np.zeros((5,4))
b2=np.zeros((5,4))

Mean bin

In [27]:

for i in range(0, 20, 4):
    k = int(i/4)
    mean = (data[i] + data[i+1] + data[i+2] + data[i+3])/4
    for j in range(4):
        b1[k,j] = mean
print("Mean Bin:\n", b1)
Mean Bin:

[[25.39745939 25.39745939 25.39745939 25.39745939]

[28.47801526 28.47801526 28.47801526 28.47801526]

[30.80057562 30.80057562 30.80057562 30.80057562]

[33.9282602 33.9282602 33.9282602 33.9282602 ]

[40.66693218 40.66693218 40.66693218 40.66693218]]
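
The same smoothing by bin means can be written without explicit loops. This is a sketch assuming equal-size bins of four sorted values; the sample array is the first eight sorted values printed earlier, and the variable names are illustrative.

```python
import numpy as np

# First eight sorted values from the printed array, used as sample input
data = np.array([22.5511811, 25.27522936, 26.58695652, 27.17647059,
                 27.22641509, 28.33333333, 29.06140351, 29.29090909])

bins = data.reshape(-1, 4)                 # one row per bin of four values
means = bins.mean(axis=1, keepdims=True)   # column vector of bin means
smoothed = np.broadcast_to(means, bins.shape).copy()  # repeat each mean across its bin

print(smoothed)
```

The two rows reproduce the first two rows of the Mean Bin output above.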

Boundary bin

In [28]:

for i in range(0, 20, 4):
    k = int(i/4)
    for j in range(4):
        # Replace each value with the nearer bin boundary:
        # data[i] is the bin minimum, data[i+3] the bin maximum
        if (data[i+j] - data[i]) < (data[i+3] - data[i+j]):
            b2[k,j] = data[i]
        else:
            b2[k,j] = data[i+3]
print("Boundary Bin:\n", b2)
Boundary Bin:

[[22.5511811 27.17647059 27.17647059 27.17647059]

[27.22641509 29.29090909 29.29090909 29.29090909]

[30.05263158 30.05263158 31.48507463 31.48507463]

[32.76923077 32.76923077 32.76923077 35.69736842]

[37.71186441 42.44230769 42.44230769 42.44230769]]
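
Smoothing by bin boundaries replaces every value with whichever end of its bin is closer. The following is a vectorised sketch of that rule, assuming equal-size bins of four sorted values whose first and last entries serve as the boundaries. The sample array is the first eight sorted values printed earlier, and sending ties to the upper boundary is an arbitrary choice.

```python
import numpy as np

# First eight sorted values from the printed array, used as sample input
data = np.array([22.5511811, 25.27522936, 26.58695652, 27.17647059,
                 27.22641509, 28.33333333, 29.06140351, 29.29090909])

bins = data.reshape(-1, 4)   # one row per bin of four sorted values
lo = bins[:, [0]]            # lower boundary of each bin (column vector)
hi = bins[:, [-1]]           # upper boundary of each bin (column vector)

# Pick the lower boundary where it is strictly closer, otherwise the upper one
smoothed = np.where(bins - lo < hi - bins, lo, hi)

print(smoothed)
```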

Using inbuilt functions to create bins


In [29]:

min_value = df6['average'].min()
max_value = df6['average'].max()

In [30]:

bins = np.linspace(min_value,max_value,30)
labels = bins[1:]


In [31]:

df6['Average_Bin'] = pd.cut(df6['average'], bins=bins, labels=labels, include_lowest=True)
df6
Out[31]:

     total_runs  out    average  Average_Bin
0          5426  152  35.697368    36.413793
1          5386  160  33.662500    36.413793
2          4902  161  30.447205    33.379310
3          4717  114  41.377193    42.482759
4          4601  137  33.583942    36.413793
...         ...  ...        ...          ...
511           0    1   0.000000     3.034483
512           0    1   0.000000     3.034483
513           0    2   0.000000     3.034483
514           0    1   0.000000     3.034483
515           0    1   0.000000     3.034483

516 rows × 4 columns

Finally, we can see that a new column, 'Average_Bin', has been added at the
end of the data set. It holds the smoothed values of the 'average' column:
each value is replaced by the upper edge of the bin it falls into.
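
How pd.cut maps values to bin labels is easier to follow with round numbers. The sketch below applies the same pattern (equal-width edges from np.linspace, upper edges as labels) to a made-up Series; the values and the choice of four bins are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical values; not data from ipl.csv
values = pd.Series([0.0, 1.0, 4.0, 6.0, 10.0])

# Five equally spaced edges give four bins: 0-2.5, 2.5-5, 5-7.5, 7.5-10
edges = np.linspace(values.min(), values.max(), 5)
labels = edges[1:]  # label each bin by its upper edge

# include_lowest=True makes the first bin include the minimum value itself
binned = pd.cut(values, bins=edges, labels=labels, include_lowest=True)

print(binned.astype(float).tolist())
```

Without include_lowest=True the minimum value would fall outside the first interval, which is open on the left, and would come back as NaN.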
