Perform the following operations using Python on the Air Quality and Heart Disease data sets:

1. Data cleaning
2. Data integration
3. Data transformation
4. Error correcting
5. Data model building

Roll no: 22
Class: TEIT
Batch: A

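import pandas as pd
# Shareable Google Drive link to the air quality CSV file
url="https://drive.google.com/file/d/1xWOnjyeDdxfYSqGPHMc89zkI_8N888bj/view?usp=sharing"
# Rewrite it as a direct-download link; the file ID is the second-to-last path segment
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
# Fields are separated by ';' and decimals are written with ',', so some numeric columns load as strings
i=pd.read_csv(url,sep = ';')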

      Date        Time      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2
0     10/03/2004  18.00.00  2,6     1360.0       150.0     11,9      1046.0         166.0    1056.0
1     10/03/2004  19.00.00  2       1292.0       112.0     9,4       955.0          103.0    1174.0
2     10/03/2004  20.00.00  2,2     1402.0       88.0      9,0       939.0          131.0    1140.0
3     10/03/2004  21.00.00  2,2     1376.0       80.0      9,2       948.0          172.0    1092.0
4     10/03/2004  22.00.00  1,6     1272.0       51.0      6,5       836.0          131.0    1205.0
...   ...         ...       ...     ...          ...       ...       ...            ...      ...
9466  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN      NaN
9467  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN      NaN
9468  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN      NaN
9469  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN      NaN
9470  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN      NaN
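
The link rewrite used when loading the file can be wrapped in a small helper. This is a minimal sketch, assuming the share link has the form `.../file/d/<id>/view...`; the function name `drive_direct_url` is hypothetical and not part of the original notebook.

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a 'uc?id=<id>' link."""
    parts = share_url.split('/')
    # In a link of the assumed form, the file ID is the second-to-last segment.
    file_id = parts[-2]
    return 'https://drive.google.com/uc?id=' + file_id


share = "https://drive.google.com/file/d/1xWOnjyeDdxfYSqGPHMc89zkI_8N888bj/view?usp=sharing"
direct = drive_direct_url(share)
print(direct)
```

The helper only rearranges the string; it does not check that the link is reachable or that it really has the assumed shape.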

Data Cleaning

i.isna().any()

Date True

Time True

CO(GT) True

PT08.S1(CO) True

NMHC(GT) True

C6H6(GT) True

PT08.S2(NMHC) True

NOx(GT) True

PT08.S3(NOx) True

NO2(GT) True

PT08.S4(NO2) True

PT08.S5(O3) True

T True

RH True

AH True

Unnamed: 15 True

Unnamed: 16 True

dtype: bool

i.isnull().sum()

Date 114

Time 114

CO(GT) 114

PT08.S1(CO) 114

NMHC(GT) 114

C6H6(GT) 114

PT08.S2(NMHC) 114

NOx(GT) 114

PT08.S3(NOx) 114

NO2(GT) 114

PT08.S4(NO2) 114

PT08.S5(O3) 114

T 114

RH 114

AH 114

Unnamed: 15 9471

Unnamed: 16 9471

dtype: int64

i.drop(['Unnamed: 15','Unnamed: 16'], inplace =True, axis =1)

      Date        Time      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  P
0     10/03/2004  18.00.00  2,6     1360.0       150.0     11,9      1046.0         166.0
1     10/03/2004  19.00.00  2       1292.0       112.0     9,4       955.0          103.0
2     10/03/2004  20.00.00  2,2     1402.0       88.0      9,0       939.0          131.0
3     10/03/2004  21.00.00  2,2     1376.0       80.0      9,2       948.0          172.0
4     10/03/2004  22.00.00  1,6     1272.0       51.0      6,5       836.0          131.0
...   ...         ...       ...     ...          ...       ...       ...            ...
9466  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN
9467  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN
9468  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN
9469  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN
9470  NaN         NaN       NaN     NaN          NaN       NaN       NaN            NaN

9471 rows × 15 columns
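
The null counts show two columns that are empty in every row and 114 rows that are empty in every column. As an alternative to naming the empty columns by hand, the sketch below uses `dropna` with `how='all'` on a small made-up frame. The frame `toy` and its values are illustrative, not taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame: one fully empty column and one fully empty row (illustrative values).
toy = pd.DataFrame({
    'T': ['13,6', '13,3', np.nan],
    'RH': ['48,9', '47,7', np.nan],
    'Unnamed: 15': [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing, then rows in which every value is missing.
cleaned = toy.dropna(axis=1, how='all').dropna(axis=0, how='all')
print(cleaned.shape)
```

With `how='all'`, a row or column that has at least one valid value is kept, so partially filled records are not lost.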

i.duplicated().sum()

113

# keep=False drops every copy of a repeated row, so all 114 all-NaN rows are removed
a=i.drop_duplicates(subset=None, keep=False, ignore_index=True)

a.duplicated().sum()

a.isna().sum()

Date 0

Time 0

CO(GT) 0

PT08.S1(CO) 0

NMHC(GT) 0

C6H6(GT) 0

PT08.S2(NMHC) 0

NOx(GT) 0

PT08.S3(NOx) 0

NO2(GT) 0

PT08.S4(NO2) 0

PT08.S5(O3) 0

T 0

RH 0

AH 0

dtype: int64
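
`keep=False` discards every copy of a repeated row, while the default `keep='first'` retains one copy. The sketch below shows the difference on made-up data (the frame `toy` and its values are illustrative). This is why the `drop_duplicates` call above leaves no missing values behind.

```python
import numpy as np
import pandas as pd

# One valid row followed by three identical all-NaN rows (illustrative values).
toy = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan],
                    'y': [2.0, np.nan, np.nan, np.nan]})

# duplicated() marks repeats after the first occurrence, so 3 identical rows count as 2.
n_marked = int(toy.duplicated().sum())

# keep='first' leaves one copy of the repeated row; keep=False drops all copies.
kept_first = toy.drop_duplicates(keep='first', ignore_index=True)
kept_none = toy.drop_duplicates(keep=False, ignore_index=True)
print(n_marked, len(kept_first), len(kept_none))
```

Note that `keep=False` would also delete genuine measurements if two valid rows happened to be identical, so `dropna(how='all')` may be the more targeted choice when the goal is only to remove empty rows.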

Error Correction

a.describe()

       PT08.S1(CO)  NMHC(GT)     PT08.S2(NMHC)  NOx(GT)      PT08.S3(NOx)  NO2(GT)      PT08.
count  9357.000000  9357.000000  9357.000000    9357.000000  9357.000000   9357.000000  9357
mean   1048.990061  -159.090093  894.595276     168.616971   794.990168    58.148873    1391
std    329.832710   139.789093   342.333252     257.433866   321.993552    126.940455   467
min    -200.000000  -200.000000  -200.000000    -200.000000  -200.000000   -200.000000  -200
25%    921.000000   -200.000000  711.000000     50.000000    637.000000    53.000000    1185
50%    1053.000000  -200.000000  895.000000     141.000000   794.000000    96.000000    1446
75%    1221.000000  -200.000000  1105.000000    284.000000   960.000000    133.000000   1662
max    2040.000000  1189.000000  2214.000000    1479.000000  2683.000000   340.000000   2775

a = a.fillna(a.median(numeric_only=True))

a.isnull().sum()

Date 0

Time 0

CO(GT) 0

PT08.S1(CO) 0

NMHC(GT) 0

C6H6(GT) 0

PT08.S2(NMHC) 0

NOx(GT) 0

PT08.S3(NOx) 0

NO2(GT) 0

PT08.S4(NO2) 0

PT08.S5(O3) 0

T 0

RH 0

AH 0

dtype: int64

a.head(3)

   Date        Time      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  PT08
0  10/03/2004  18.00.00  2,6     1360.0       150.0     11,9      1046.0         166.0
1  10/03/2004  19.00.00  2       1292.0       112.0     9,4       955.0          103.0
2  10/03/2004  20.00.00  2,2     1402.0       88.0      9,0       939.0          131.0
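
The `describe()` output shows a minimum of -200 in every listed column, and the `fillna` call changes nothing because no NaN values remain at that point. If -200 is a placeholder for a missing reading (an assumption; the notebook does not say so), one possible correction is to turn it into NaN first and then impute the median. A sketch on made-up values:

```python
import pandas as pd

# Illustrative readings; -200 is assumed to mark a missing measurement.
toy = pd.DataFrame({'NOx(GT)': [166.0, -200.0, 131.0, 172.0],
                    'NO2(GT)': [113.0, 92.0, -200.0, 122.0]})

# Mask the assumed sentinel so it becomes NaN and no longer distorts the statistics.
marked = toy.mask(toy == -200)

# Impute each column with its own median, computed from the remaining valid values.
filled = marked.fillna(marked.median(numeric_only=True))
print(filled)
```

Applying this to the real data would change the statistics and the model score reported later, so it is shown here only as an option.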


Transformation

a['Date']=pd.to_datetime(a.Date, format='%d/%m/%Y')

a['MONTH']= a['Date'].dt.month  

a['Hour']=a['Time'].apply(lambda x: int(x.split('.')[0]))

# These columns use ',' as the decimal mark (e.g. '2,6'); convert them to floats
for col in ['CO(GT)', 'C6H6(GT)', 'T', 'RH', 'AH']:
    a[col] = a[col].str.replace(',', '.', regex=False).astype(float)

a.head(3)

   Date        Time      CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  PT08.S3(N
0  2004-03-10  18.00.00  2.6     1360.0       150.0     11.9      1046.0         166.0    10
1  2004-03-10  19.00.00  2.0     1292.0       112.0     9.4       955.0          103.0    11
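
The string-to-float conversions in this section are needed because the file writes decimals with a comma. A possible alternative is to let `read_csv` handle them through its `decimal` parameter. The sketch below uses an in-memory sample rather than the real file, and the sample rows are illustrative.

```python
import io
import pandas as pd

# Two illustrative rows in the same layout: ';' separators and ',' decimals.
sample = io.StringIO(
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;2;13,3\n"
)

# decimal=',' makes the parser read '2,6' as the float 2.6.
df = pd.read_csv(sample, sep=';', decimal=',')
print(df.dtypes)

# Hour of day taken from the 'HH.MM.SS' time string.
df['Hour'] = df['Time'].str.split('.').str[0].astype(int)
print(df)
```

With this approach the per-column `replace(',', '.')` step is not needed, since the affected columns already arrive as floats.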

Data Integration

x=i.iloc[0:3,[13,14]]

y=i.iloc[0:3,[12,13]]

x.merge(y,how="right")

RH AH T

0 48,9 0,7578 13,6

1 47,7 0,7255 13,3

2 54,0 0,7502 11,9
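
Because the two slices share only the `RH` column, `merge` joins on it implicitly. The sketch below retypes the same three rows by hand and names the key explicitly with `on=`. The frame names `left` and `right` are illustrative.

```python
import pandas as pd

# The three rows shown in the merge output, entered by hand.
left = pd.DataFrame({'RH': ['48,9', '47,7', '54,0'],
                     'AH': ['0,7578', '0,7255', '0,7502']})
right = pd.DataFrame({'T': ['13,6', '13,3', '11,9'],
                      'RH': ['48,9', '47,7', '54,0']})

# how='right' keeps every row of `right`; on='RH' names the join key explicitly.
merged = left.merge(right, how='right', on='RH')
print(merged)
```

Stating the key guards against an accidental join on several columns if the two frames later come to share more column names.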

Model Building

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler 

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

X=a.drop(['Date','Time','RH'],axis=1)

Y=a['RH']

X.head()

   CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  PT08.S3(NOx)  NO2(GT)  PT
0  2.6     1360.0       150.0     11.9      1046.0         166.0    1056.0        113.0
1  2.0     1292.0       112.0     9.4       955.0          103.0    1174.0        92.0
2  2.2     1402.0       88.0      9.0       939.0          131.0    1140.0        114.0
3  2.2     1376.0       80.0      9.2       948.0          172.0    1092.0        122.0
4  1.6     1272.0       51.0      6.5       836.0          131.0    1205.0        116.0


Y.head()

0 48.9

1 47.7

2 54.0

3 60.0

4 59.6

Name: RH, dtype: float64

X_train, X_test, y_train, y_test=train_test_split(X,Y,test_size=0.3)

model= LinearRegression()

model.fit(X_train,y_train)

LinearRegression()

prediction = model.predict(X_test)

y_test

1979 81.3

4661 43.0

2592 30.4

4348 35.2

3735 44.4

...

478 30.6

8947 25.5

2227 18.2

7752 33.9

34 65.9

Name: RH, Length: 2808, dtype: float64

Accuracy Calculation

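# For LinearRegression, score() returns the coefficient of determination (R^2), not a classification accuracy
model.score(X_test,y_test)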

0.9812791411366613
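
For a regressor, `score` reports the coefficient of determination R², so the value above measures explained variance rather than a share of correct predictions. The dependency-free sketch below computes that quantity on made-up numbers. The helper name `r_squared` is hypothetical, and the target and prediction values are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [48.9, 47.7, 54.0, 60.0, 59.6]   # illustrative targets
y_pred = [49.5, 47.0, 55.0, 59.0, 60.1]   # illustrative predictions
print(round(r_squared(y_true, y_pred), 3))
```

A value of 1 means the predictions match the targets exactly, and 0 means the model does no better than always predicting the mean of the targets.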
