1. Data cleaning
2. Data integration
3. Data transformation
4. Error correction
5. Data model building
Roll no: 22
Class: TEIT
Batch: A
import pandas as pd

# Build a direct-download URL from the Google Drive share link.
url = "https://drive.google.com/file/d/1xWOnjyeDdxfYSqGPHMc89zkI_8N888bj/view?usp=sharing"
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
# The AirQuality CSV is semicolon-separated.
i = pd.read_csv(url, sep=';')
Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2
... ... ... ... ... ... ... ... ... ...
9466 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN NaN NaN NaN NaN
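As a side note, `read_csv` can parse the comma decimal marks used in this dataset directly via its `decimal` parameter, which would avoid the manual string replacement done later. A minimal sketch on an inline sample in the same layout (sample values, not the actual Drive file):

```python
import io

import pandas as pd

# Two semicolon-separated rows shaped like the AirQuality file,
# with commas as decimal marks.
sample = (
    "Date;Time;CO(GT);T\n"
    "10/03/2004;18.00.00;2,6;13,6\n"
    "10/03/2004;19.00.00;2,0;13,3\n"
)

# decimal=',' tells pandas to treat commas as decimal points.
df = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")
print(df["CO(GT)"].dtype)  # float64: no manual replace(',', '.') needed
```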
Data Cleaning
i.isna().any()
Date True
Time True
CO(GT) True
PT08.S1(CO) True
NMHC(GT) True
C6H6(GT) True
PT08.S2(NMHC) True
NOx(GT) True
PT08.S3(NOx) True
NO2(GT) True
PT08.S4(NO2) True
PT08.S5(O3) True
T True
RH True
AH True
Unnamed: 15 True
Unnamed: 16 True
dtype: bool
i.isnull().sum()
Date 114
Time 114
CO(GT) 114
PT08.S1(CO) 114
NMHC(GT) 114
C6H6(GT) 114
PT08.S2(NMHC) 114
NOx(GT) 114
PT08.S3(NOx) 114
NO2(GT) 114
PT08.S4(NO2) 114
PT08.S5(O3) 114
T 114
RH 114
AH 114
Unnamed: 15 9471
Unnamed: 16 9471
dtype: int64
i.drop(['Unnamed: 15', 'Unnamed: 16'], axis=1, inplace=True)
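The two `Unnamed` columns are entirely empty (9471 missing values each, as counted above). Instead of naming them explicitly, all-NaN columns can also be dropped generically; a small sketch:

```python
import numpy as np
import pandas as pd

# A frame with one entirely empty column, like the 'Unnamed: 15/16'
# columns produced by trailing semicolons in the CSV.
df = pd.DataFrame({"CO(GT)": [2.6, 2.0], "Unnamed: 15": [np.nan, np.nan]})

# dropna(axis=1, how='all') removes every column that is all-NaN.
df = df.dropna(axis=1, how="all")
print(list(df.columns))  # ['CO(GT)']
```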
i.duplicated().sum()
113
a = i.drop_duplicates(subset=None, keep=False, ignore_index=True)
a.duplicated().sum()
0
a.isna().sum()
Date 0
Time 0
CO(GT) 0
PT08.S1(CO) 0
NMHC(GT) 0
C6H6(GT) 0
PT08.S2(NMHC) 0
NOx(GT) 0
PT08.S3(NOx) 0
NO2(GT) 0
PT08.S4(NO2) 0
PT08.S5(O3) 0
T 0
RH 0
AH 0
dtype: int64
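Note the effect of `keep=False` used above: it discards every row that has a duplicate, whereas the default `keep='first'` retains one copy of each. A toy comparison:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 2, 3]})

# keep='first' (default): one copy of each duplicated row survives.
print(len(df.drop_duplicates()))            # 3 rows: 1, 2, 3
# keep=False: every copy of a duplicated row is dropped.
print(len(df.drop_duplicates(keep=False)))  # 2 rows: 2, 3
```

Here, `keep=False` is why the 113 duplicated rows (which were the all-NaN trailing rows) vanish entirely rather than leaving one representative behind.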
Error Correction
a.describe()
a = a.fillna(a.median(numeric_only=True))  # numeric_only avoids errors on the Date/Time string columns
a.isnull().sum()
Date 0
Time 0
CO(GT) 0
PT08.S1(CO) 0
NMHC(GT) 0
C6H6(GT) 0
PT08.S2(NMHC) 0
NOx(GT) 0
PT08.S3(NOx) 0
NO2(GT) 0
PT08.S4(NO2) 0
PT08.S5(O3) 0
T 0
RH 0
AH 0
dtype: int64
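Median imputation as used above replaces each missing value with its column's median, which is robust to the outliers visible in `describe()`. A self-contained sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NOx(GT)": [166.0, np.nan, 103.0],
                   "T": [13.6, 13.3, np.nan]})

# Fill each NaN with the median of its own column;
# numeric_only=True skips any non-numeric columns.
df = df.fillna(df.median(numeric_only=True))
print(df.isna().sum().sum())  # 0 missing values remain
```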
a.head(3)
Data Transformation
a['Date'] = pd.to_datetime(a.Date, format='%d/%m/%Y')
a['MONTH'] = a['Date'].dt.month
# Hour is the leading field of the 'HH.MM.SS' time string.
a['Hour'] = a['Time'].apply(lambda x: int(x.split('.')[0]))
# These columns use commas as decimal marks; convert them to floats.
a['CO(GT)'] = a['CO(GT)'].apply(lambda x: float(x.replace(',', '.')))
a['C6H6(GT)'] = a['C6H6(GT)'].apply(lambda x: float(x.replace(',', '.')))
a['T'] = a['T'].apply(lambda x: float(x.replace(',', '.')))
a['RH'] = a['RH'].apply(lambda x: float(x.replace(',', '.')))
a['AH'] = a['AH'].apply(lambda x: float(x.replace(',', '.')))
a.head(3)
        Date      Time  CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  NOx(GT)  ...
0 2004-03-10  18.00.00     2.6       1360.0     150.0      11.9         1046.0    166.0  ...
1 2004-03-10  19.00.00     2.0       1292.0     112.0       9.4          955.0    103.0  ...
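The transformation steps above (parse the date, derive month and hour, convert comma-decimal strings to floats) can be reproduced on a tiny frame to see each result:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["10/03/2004", "11/03/2004"],
                   "Time": ["18.00.00", "09.00.00"],
                   "T": ["13,6", "13,3"]})

# Parse day/month/year dates and derive the month.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["MONTH"] = df["Date"].dt.month
# Hour is the first field of the 'HH.MM.SS' string.
df["Hour"] = df["Time"].apply(lambda t: int(t.split(".")[0]))
# Comma decimal mark -> float.
df["T"] = df["T"].apply(lambda v: float(v.replace(",", ".")))

print(df.loc[0, "MONTH"], df.loc[0, "Hour"], df.loc[0, "T"])  # 3 18 13.6
```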
Data Integration
x = i.iloc[0:3, [13, 14]]   # columns RH, AH
y = i.iloc[0:3, [12, 13]]   # columns T, RH
x.merge(y, how="right")
RH AH T
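With no `on=` argument, `merge` joins on the column the two frames share (here RH); `how="right"` keeps every row of the right frame. A toy version showing the resulting column layout (sample values, not the dataset's):

```python
import pandas as pd

x = pd.DataFrame({"RH": [48.9, 47.7], "AH": [0.76, 0.73]})
y = pd.DataFrame({"T": [13.6, 13.3], "RH": [48.9, 47.7]})

# Join on the shared key column RH; keep all rows of y.
merged = x.merge(y, how="right")
print(list(merged.columns))  # ['RH', 'AH', 'T']
```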
Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features: every column except the target (RH) and the raw Date/Time strings.
X = a.drop(['Date', 'Time', 'RH'], axis=1)
Y = a['RH']
Y.head()
0    48.9
1    47.7
2    54.0
3    60.0
4    59.6
Name: RH, dtype: float64
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
model= LinearRegression()
model.fit(X_train,y_train)
LinearRegression()
prediction = model.predict(X_test)
y_test
1979 81.3
4661 43.0
2592 30.4
4348 35.2
3735 44.4
...
478 30.6
8947 25.5
2227 18.2
7752 33.9
34 65.9
Accuracy Calculation
# For a regression model, .score() returns the R² coefficient of determination.
model.score(X_test, y_test)
0.9812791411366613
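Since this is regression, classification accuracy does not apply; `model.score` returns R², the same value `sklearn.metrics.r2_score` computes. A sketch on synthetic data (the coefficients and seed here are arbitrary assumptions, not from the lab):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# For regressors, score() is exactly the R² of the predictions.
print(np.isclose(model.score(X_te, y_te),
                 r2_score(y_te, model.predict(X_te))))  # True
```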