AP19110010030 Assignment-4 Lab

04/09/2021 AP19110010030_Assignment-4 - Jupyter Notebook
Kilaru Sravan
AP19110010030
CSE-A
In [1]:
import pandas as pd
import numpy as np
In [2]:
# read the data from the CSV file

df = pd.read_csv('ipl.csv')
df  # display the DataFrame
Out[2]:
            batsman  total_runs  out  numberofballs    average  strikerate
0           V Kohli        5426  152           4111  35.697368  131.987351
1          SK Raina        5386  160           3916  33.662500  137.538304
2         RG Sharma        4902  161           3742  30.447205  130.999466
3         DA Warner        4717  114           3292  41.377193  143.286756
4          S Dhawan        4601  137           3665  33.583942  125.538881
...             ...         ...  ...            ...        ...         ...
511        ND Doshi           0    1             13   0.000000    0.000000
512         J Denly           0    1              1   0.000000    0.000000
513         S Ladda           0    2              9   0.000000    0.000000
514  V Pratap Singh           0    1              1   0.000000    0.000000
515       S Kaushik           0    1              1   0.000000    0.000000
516 rows × 6 columns
In [3]:
df.isnull().sum()
Out[3]:
batsman           0
total_runs        0
out               0
numberofballs     0
average          34
strikerate        0
dtype: int64
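To make clear what `isnull().sum()` is counting, here is a minimal sketch on a made-up three-row frame. The name `toy` and its values are illustrative assumptions, not rows from ipl.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one NaN in 'average', none elsewhere.
toy = pd.DataFrame({
    "total_runs": [120, 45, 8],
    "out": [4, 3, 0],
    "average": [30.0, 15.0, np.nan],
})

per_column = toy.isnull().sum()   # missing values counted column by column
total = int(per_column.sum())     # a second sum() collapses that to one number

print(per_column)
print(total)
```

The first call gives one count per column; chaining a second `sum()` gives the single total for the whole frame.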
Removing the Unwanted data

In [4]:
df=df.drop(['batsman','numberofballs','strikerate'], axis=1)
df
Out[4]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
516 rows × 3 columns
In [5]:
df.isnull().sum()
Out[5]:
total_runs     0
out            0
average       34
dtype: int64
In [6]:
df.isnull().sum().sum()
Out[6]:
34
Handling the missing data
Forward fill(ffill)
In [7]:
df1 = df.copy()
In [8]:
df1.ffill(inplace = True)
df1
Out[8]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
516 rows × 3 columns
In [9]:
df1.isnull().sum().sum()
Out[9]:
Backward fill(bfill)
In [10]:
df2 = df.copy()
In [11]:
df2.bfill(inplace = True)
df2
Out[11]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
516 rows × 3 columns
In [12]:
Out[12]:
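The difference between the two directional fills is easiest to see on a short series. The sketch below uses made-up values (the name `s` and the numbers are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical values; NaN marks the gaps to be filled.
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

forward = s.ffill()    # each gap takes the last valid value before it
backward = s.bfill()   # each gap takes the next valid value after it

print(forward.tolist())   # [10.0, 10.0, 10.0, 40.0, 40.0]
print(backward.tolist())  # [10.0, 40.0, 40.0, 40.0, nan]
```

Note that a gap at the very end survives `bfill()` (and a gap at the very start would survive `ffill()`), so it is worth re-running `isnull().sum()` after either fill.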
Using mean, median, mode to replace missing values (fillna)
In [13]:
df3 = df.copy()
In [14]:
df3.fillna(np.mean(df3["average"]),inplace = True)
df3
Out[14]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
516 rows × 3 columns
In [15]:
df3 = df.copy()  # fresh copy: after the mean fill above, df3 has no NaN left to replace
df3["average"] = df3["average"].fillna(df3["average"].median())
In [16]:
df3 = df.copy()  # fresh copy again; mode() returns a Series, so take its first value
df3["average"] = df3["average"].fillna(df3["average"].mode()[0])
In [17]:
Out[17]:
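As a small illustration of the three statistical fills, the sketch below applies each one to the same made-up series (the name `s` and its values are assumptions). Each fill starts from the original series, because once one fill has run there are no gaps left for the next one to act on.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps.
s = pd.Series([10.0, 20.0, 20.0, 50.0, np.nan, np.nan])

mean_filled = s.fillna(s.mean())       # mean of the known values: 25.0
median_filled = s.fillna(s.median())   # median of the known values: 20.0
mode_filled = s.fillna(s.mode()[0])    # mode() returns a Series; take its first entry

print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Passing `s.mode()` directly to `fillna` would align the returned Series by index rather than use it as a single fill value, which is why the first entry is selected explicitly.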
Filling with a global constant

In [18]:
df4 = df.copy()
In [19]:
df4.fillna(0.00000,inplace = True)
df4
Out[19]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
516 rows × 3 columns
In [20]:
Out[20]:
Dropping the missing value rows(dropna)

In [21]:
df5 = df.copy()
In [22]:
df5.dropna(axis=0,inplace=True)
df5
Out[22]:
     total_runs  out    average
0          5426  152  35.697368
1          5386  160  33.662500
2          4902  161  30.447205
3          4717  114  41.377193
4          4601  137  33.583942
...         ...  ...        ...
511           0    1   0.000000
512           0    1   0.000000
513           0    2   0.000000
514           0    1   0.000000
515           0    1   0.000000
In [23]:
Out[23]:
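Filling with a constant and dropping rows treat the same gap very differently. A rough sketch on a made-up frame (the name `toy` and its values are assumptions) shows the effect on the row count and on the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one incomplete row.
toy = pd.DataFrame({
    "out": [4, 3, 0],
    "average": [30.0, 15.0, np.nan],
})

zero_filled = toy.fillna(0.0)   # keeps every row; the gap becomes 0.0
dropped = toy.dropna(axis=0)    # removes any row that contains a NaN

print(len(zero_filled), zero_filled["average"].mean())  # 3 15.0
print(len(dropped), dropped["average"].mean())          # 2 22.5
```

The constant keeps the data set at full length but pulls the mean towards the chosen value, while `dropna` leaves the remaining values untouched at the cost of losing rows.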
Converting noisy data into smooth data

In [24]:
df6 = df.copy()
In [25]:
data = df6['average'] # select the column whose noisy values will be smoothed
data = data[:20] # start with the first 20 rows for convenience
data = np.sort(data)
print(data)
[22.5511811 25.27522936 26.58695652 27.17647059 27.22641509 28.33333333
29.06140351 29.29090909 30.05263158 30.44720497 31.2173913 31.48507463
32.76923077 33.58394161 33.6625 35.69736842 37.71186441 41.13636364
41.37719298 42.44230769]
Creating bins manually

In [26]:
b1=np.zeros((5,4))
b2=np.zeros((5,4))
Mean bin
In [27]:
for i in range(0, 20, 4):
    k = i // 4  # bin index
    mean = (data[i] + data[i+1] + data[i+2] + data[i+3]) / 4
    for j in range(4):
        b1[k, j] = mean
print("Mean Bin:\n", b1)
Mean Bin:
[[25.39745939 25.39745939 25.39745939 25.39745939]
[28.47801526 28.47801526 28.47801526 28.47801526]
[30.80057562 30.80057562 30.80057562 30.80057562]
[33.9282602 33.9282602 33.9282602 33.9282602 ]
[40.66693218 40.66693218 40.66693218 40.66693218]]
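The same smoothing by bin means can be sketched without explicit loops. The version below re-enters the 20 sorted averages printed earlier and relies on NumPy reshaping; the names `bins`, `bin_means` and `smoothed` are introduced here for illustration.

```python
import numpy as np

# The 20 sorted averages printed earlier in the notebook.
data = np.array([
    22.5511811, 25.27522936, 26.58695652, 27.17647059,
    27.22641509, 28.33333333, 29.06140351, 29.29090909,
    30.05263158, 30.44720497, 31.2173913, 31.48507463,
    32.76923077, 33.58394161, 33.6625, 35.69736842,
    37.71186441, 41.13636364, 41.37719298, 42.44230769,
])

bins = data.reshape(5, 4)                          # 5 equal-frequency bins of 4 values
bin_means = bins.mean(axis=1, keepdims=True)       # one mean per bin, shape (5, 1)
smoothed = np.broadcast_to(bin_means, bins.shape)  # repeat each mean across its bin

print(smoothed)
```

Reshaping works here only because the data is already sorted and its length is an exact multiple of the bin size.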
Boundary bin
In [28]:
for i in range(0, 20, 4):
    k = i // 4  # bin index
    lo, hi = data[i], data[i+3]  # bin boundaries: smallest and largest value of the bin
    for j in range(4):
        # replace each value with whichever boundary is closer
        if (data[i+j] - lo) < (hi - data[i+j]):
            b2[k, j] = lo
        else:
            b2[k, j] = hi
print("Boundary Bin:\n", b2)
Boundary Bin:
[[22.5511811 27.17647059 27.17647059 27.17647059]
[27.22641509 29.29090909 29.29090909 29.29090909]
[30.05263158 30.05263158 31.48507463 31.48507463]
[32.76923077 32.76923077 32.76923077 35.69736842]
[37.71186441 42.44230769 42.44230769 42.44230769]]
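Smoothing by bin boundaries can also be written in vectorised form. This sketch assumes the two boundaries of each bin are its smallest and largest values (the first and last entries of each sorted group of four) and sends ties to the upper boundary; the names `lo`, `hi` and `smoothed` are introduced here for illustration.

```python
import numpy as np

# The 20 sorted averages printed earlier in the notebook.
data = np.array([
    22.5511811, 25.27522936, 26.58695652, 27.17647059,
    27.22641509, 28.33333333, 29.06140351, 29.29090909,
    30.05263158, 30.44720497, 31.2173913, 31.48507463,
    32.76923077, 33.58394161, 33.6625, 35.69736842,
    37.71186441, 41.13636364, 41.37719298, 42.44230769,
])

bins = data.reshape(5, 4)   # 5 equal-frequency bins of 4 sorted values
lo = bins[:, [0]]           # smallest value of each bin, shape (5, 1)
hi = bins[:, [-1]]          # largest value of each bin, shape (5, 1)

# Each value moves to the closer boundary; equal distances go to hi.
smoothed = np.where(bins - lo < hi - bins, lo, hi)

print(smoothed)
```

With these boundaries the first and last entry of every bin stay unchanged, and only the two interior values move.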
Using inbuilt functions to create bins

In [29]:
min_value = df6['average'].min()
max_value = df6['average'].max()
In [30]:
bins = np.linspace(min_value,max_value,30)
labels = bins[1:]
In [31]:
df6['Average_Bin'] = pd.cut(df6['average'], bins=bins, labels=labels, include_lowest=True)

df6
Out[31]:
     total_runs  out    average  Average_Bin
0          5426  152  35.697368    36.413793
1          5386  160  33.662500    36.413793
2          4902  161  30.447205    33.379310
3          4717  114  41.377193    42.482759
4          4601  137  33.583942    36.413793
...         ...  ...        ...          ...
511           0    1   0.000000     3.034483
512           0    1   0.000000     3.034483
513           0    2   0.000000     3.034483
514           0    1   0.000000     3.034483
515           0    1   0.000000     3.034483
Finally, a new column, Average_Bin, has been added at the end of the
data set. It holds the smoothed version of the 'average' column: each
value is replaced by the upper edge of the bin it falls into.
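How `pd.cut` maps a value to a label is easier to follow with only a few bins. The sketch below uses made-up values from 0 to 30 and three equal-width bins; the names `values`, `edges` and `binned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning 0 to 30.
values = pd.Series([0.0, 4.0, 12.5, 30.0])

edges = np.linspace(values.min(), values.max(), 4)  # 4 edges give 3 bins: 0, 10, 20, 30
labels = edges[1:]                                  # label each bin by its upper edge

binned = pd.cut(values, bins=edges, labels=labels, include_lowest=True)

print(binned.tolist())  # [10.0, 10.0, 20.0, 30.0]
```

There is always one label fewer than there are edges, and `include_lowest=True` is what keeps the minimum value (0.0 here) inside the first bin instead of turning it into NaN.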
