Rainfall - Prediction - Ipynb - Colaboratory

21/11/2023, 00:01 Rainfall_prediction.ipynb - Colaboratory
Rainfall prediction
from google.colab import drive

drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remoun
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
The dataset contains about 10 years of daily weather observations from numerous weather stations across Australia.
Location : The common name of the location of the weather station
MinTemp : The minimum temperature in degrees Celsius
MaxTemp : The maximum temperature in degrees Celsius
Rainfall : The amount of rainfall recorded for the day in mm
Evaporation : The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine : The number of hours of bright sunshine in the day.
WindGustDir : The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed : The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am : Direction of the wind at 9am
WindDir3pm : Direction of the wind at 3pm
WindSpeed9am : Speed of the wind at 9am
WindSpeed3pm : Speed of the wind at 3pm
Humidity9am : Relative humidity at 9 am
Humidity3pm : Relative humidity at 3 pm
Pressure9am : Atmospheric pressure reduced to mean sea level at 9 am
Pressure3pm : Atmospheric pressure reduced to mean sea level at 3 pm
Cloud9am : Fraction of sky obscured by cloud at 9 am
Cloud3pm : Fraction of sky obscured by cloud at 3 pm
Temp9am : Temperature recorded at 9 am
Temp3pm : Temperature recorded at 3 pm
RainToday : Precipitation (rainfall) today, recorded as Yes or No
RainTomorrow : Precipitation (rainfall) the next day, recorded as Yes or No; this is the prediction target
In this notebook we will train a model to predict whether it will rain the next day.
# reading data
data = pd.read_csv("weather_data.csv")
data.shape
(145460, 23)
data.head()
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustD
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WN
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WS
3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN N
4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN
5 rows × 23 columns
Exploration
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
None
data.describe()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGu
count 143975.000000 144199.000000 142199.000000 82670.000000 75625.000000 13519
mean 12.194034 23.221348 2.360918 5.468232 7.611178 4
std 6.398495 7.119049 8.478060 4.193704 3.785483 1
min -8.500000 -4.800000 0.000000 0.000000 0.000000
25% 7.600000 17.900000 0.000000 2.600000 4.800000 3
50% 12.000000 22.600000 0.000000 4.800000 8.400000 3
75% 16.900000 28.200000 0.800000 7.400000 10.600000 4
max 33.900000 48.100000 371.000000 145.000000 14.500000 13
data.describe(include=[object])
Date Location WindGustDir WindDir9am WindDir3pm RainToday RainTomorrow
count 145460 145460 135134 134894 141232 142199 142193
unique 3436 49 16 16 16 2 2
top 2013-11-12 Canberra W N SE No No
freq 49 3436 9915 11758 10838 110319 110316
data.RainTomorrow.unique()
array(['No', 'Yes', nan], dtype=object)
data.RainToday.unique()
array(['No', 'Yes', nan], dtype=object)
data.WindGustDir.unique()
array(['W', 'WNW', 'WSW', 'NE', 'NNW', 'N', 'NNE', 'SW', nan, 'ENE',
'SSE', 'S', 'NW', 'SE', 'ESE', 'E', 'SSW'], dtype=object)
data.WindGustDir.value_counts()
W 9915
SE 9418
N 9313
SSE 9216
E 9181
S 9168
WSW 9069
SW 8967
SSW 8736
WNW 8252
NW 8122
ENE 8104
ESE 7372
NE 7133
NNW 6620
NNE 6548
Name: WindGustDir, dtype: int64
categorical_features = [column_name for column_name in data.columns if data[column_name].dtype == 'O']

print("Number of Categorical Features: {}".format(len(categorical_features)))
print("Categorical Features: ",categorical_features)
Number of Categorical Features: 7

Categorical Features: ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']
numerical_features = [column_name for column_name in data.columns if data[column_name].dtype != 'O']

print("Number of Numerical Features: {}".format(len(numerical_features)))
print("Numerical Features: ",numerical_features)
Number of Numerical Features: 16

Numerical Features: ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'Win
data['RainTomorrow'].value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d94962b10>
sns.histplot(data['MaxTemp'], kde=True)  # distplot is deprecated in recent seaborn releases
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d9493be10>
sns.boxplot(x='RainTomorrow', y="Temp3pm", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d947cac50>
sns.boxplot(x='RainTomorrow', y="Temp9am", data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d947c0d90>
Create features
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')  # %Y is the four-digit year, as in '2008-12-01'
data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
data.drop('Date', axis = 1, inplace = True)

data.head()
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm ... Pre
0 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W WNW ...
1 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW WSW ...
2 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W WSW ...
3 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE E ...
4 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE NW ...
5 rows × 25 columns
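The date split above can be tried on a tiny frame before running it on the full dataset. The sketch below is only an illustration: `toy` is a hypothetical frame, and its three dates are copied from the `data.head()` output.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`.
toy = pd.DataFrame({"Date": ["2008-12-01", "2008-12-02", "2008-12-03"]})

# %Y matches a four-digit year; %y would expect a two-digit year such as "08".
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Split the timestamp into three integer columns, then drop the original.
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day
toy = toy.drop(columns="Date")

print(toy)
```

Replacing one column with three is consistent with the column count going from 23 to 25.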
Missing value treatment
categorical_features = [column_name for column_name in data.columns if data[column_name].dtype == 'O']

data[categorical_features].isnull().sum()
Location 0
WindGustDir 10326
WindDir9am 10566
WindDir3pm 4228
RainToday 3261
RainTomorrow 3267
dtype: int64
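# categorical feature mode imputation
# note: this list includes the target RainTomorrow, so missing labels are filled too
categorical_features_with_null = [feature for feature in categorical_features if data[feature].isnull().sum()]
for each_feature in categorical_features_with_null:
    mode_val = data[each_feature].mode()[0]
    # assign back instead of a chained inplace fillna, which newer pandas versions warn about
    data[each_feature] = data[each_feature].fillna(mode_val)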
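Mode imputation is easy to check on a small example. The sketch below uses a hypothetical `toy` frame. It also shows one possible variation that the notebook does not use: dropping rows whose target `RainTomorrow` is missing before imputing, so that no label is invented.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
toy = pd.DataFrame({
    "WindGustDir": ["W", "SE", None, "W", "W"],
    "RainTomorrow": ["No", "Yes", "No", None, "No"],
})

# Variation (an assumption, not the notebook's approach): remove rows with an
# unknown target instead of filling the label with the most frequent class.
toy = toy.dropna(subset=["RainTomorrow"]).copy()

# Fill the remaining categorical gaps with the most frequent value.
mode_val = toy["WindGustDir"].mode()[0]
toy["WindGustDir"] = toy["WindGustDir"].fillna(mode_val)

print(toy)
```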
numerical_features = [column_name for column_name in data.columns if data[column_name].dtype != 'O']

data[numerical_features].isnull().sum()
MinTemp 1485
MaxTemp 1261
Rainfall 3261
Evaporation 62790
Sunshine 69835
WindGustSpeed 10263
WindSpeed9am 1767
WindSpeed3pm 3062
Humidity9am 2654
Humidity3pm 4507
Pressure9am 15065
Pressure3pm 15028
Cloud9am 55888
Cloud3pm 59358
Temp9am 1767
Temp3pm 3609
year 0
month 0
day 0
dtype: int64
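The raw null counts are easier to judge as a share of all rows. The sketch below redoes that arithmetic for the four sparsest columns, using only numbers printed above (145460 rows from `data.shape` and the null counts from `isnull().sum()`). The variable names are illustrative.

```python
# Row count and null counts copied from the notebook output above.
n_rows = 145460
null_counts = {
    "Evaporation": 62790,
    "Sunshine": 69835,
    "Cloud9am": 55888,
    "Cloud3pm": 59358,
}

# Fraction of rows that are missing for each column.
missing_share = {name: count / n_rows for name, count in null_counts.items()}

# Print the sparsest column first.
for name, share in sorted(missing_share.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%} missing")
```

Each of these columns is missing in well over a third of the rows, so whatever fill value is chosen for them replaces a large part of the column.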
Outlier treatment
sns.boxplot(x='WindGustSpeed',data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d946d4490>
data.WindGustSpeed.mean()
40.03523007167319
features_with_outliers = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'WindGustSpeed','WindSpeed9am', 'WindSpeed3pm', '

for feature in features_with_outliers:
    q1 = data[feature].quantile(0.25)
    q3 = data[feature].quantile(0.75)
    IQR = q3-q1
    lower_limit = q1 - (IQR*1.5)
    upper_limit = q3 + (IQR*1.5)
    data.loc[data[feature]<lower_limit,feature] = lower_limit
    data.loc[data[feature]>upper_limit,feature] = upper_limit
data.WindGustSpeed.mean()
39.83779225870396
sns.boxplot(x='WindGustSpeed',data=data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d946cc110>
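The capping rule used in the loop can be followed by hand on a short series. The sketch below assumes a hypothetical `speeds` series with one extreme value. It uses `Series.clip`, which has the same effect as the two `.loc` assignments.

```python
import pandas as pd

# Hypothetical wind gust speeds with one extreme reading.
speeds = pd.Series([20.0, 30.0, 35.0, 40.0, 45.0, 50.0, 135.0])

q1 = speeds.quantile(0.25)   # 32.5 with pandas' default linear interpolation
q3 = speeds.quantile(0.75)   # 47.5
iqr = q3 - q1                # 15.0

lower_limit = q1 - 1.5 * iqr  # 10.0
upper_limit = q3 + 1.5 * iqr  # 70.0

# Values outside [lower_limit, upper_limit] are pulled back to the nearest limit.
capped = speeds.clip(lower=lower_limit, upper=upper_limit)
print(capped.tolist())
```

Only the reading of 135.0 changes, becoming 70.0. Capping lowers the mean a little, which fits the small drop in the `WindGustSpeed` mean reported above.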
## missing value imputation for numerical features

numerical_features_with_null = [feature for feature in numerical_features if data[feature].isnull().sum()]
for feature in numerical_features_with_null:
    mean_value = data[feature].mean()
    # assign back instead of a chained inplace fillna, which newer pandas versions warn about
    data[feature] = data[feature].fillna(mean_value)
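The loop above computes each mean over the whole dataset, before any train/test split. One possible variation, which the notebook does not use, is to learn the fill values from the training rows only and reuse them for the test rows. The sketch below shows this with scikit-learn's `SimpleImputer` on made-up arrays.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test arrays; np.nan marks a missing value.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

# Column means are learned from the training rows only ...
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# ... and the same means are reused to fill the test rows.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
print(X_test_filled)
```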
Feature engineering
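# map Yes/No to 1/0 and assign back (a chained inplace replace triggers warnings in newer pandas)
data['RainToday'] = data['RainToday'].map({'No': 0, 'Yes': 1})
data['RainTomorrow'] = data['RainTomorrow'].map({'No': 0, 'Yes': 1})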
data = pd.get_dummies(data, columns=['WindGustDir','WindDir9am','WindDir3pm','Location'])
data.head()
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm
0 13.4 22.9 0.6 5.318667 7.611178 44.0 20.0 24.0 71.0 22.0
1 7.4 25.1 0.0 5.318667 7.611178 44.0 4.0 22.0 44.0 25.0
2 12.9 25.7 0.0 5.318667 7.611178 46.0 19.0 26.0 38.0 30.0
3 9.2 28.0 0.0 5.318667 7.611178 24.0 11.0 9.0 45.0 16.0
4 17.5 32.3 1.0 5.318667 7.611178 41.0 7.0 20.0 82.0 33.0
5 rows × 118 columns
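The encoding step can be checked on a small frame, and the width of the encoded dataset can be checked by hand. In the sketch below, `toy` is a hypothetical frame. The category counts (16 wind directions per wind column, 49 locations) are taken from the `describe(include=[object])` output above.

```python
import pandas as pd

# Hypothetical toy frame with one binary column and one multi-category column.
toy = pd.DataFrame({
    "Location": ["Albury", "Canberra", "Albury"],
    "RainToday": ["No", "Yes", "No"],
})

# Binary Yes/No columns become 0/1 ...
toy["RainToday"] = toy["RainToday"].map({"No": 0, "Yes": 1})

# ... and each remaining category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["Location"], dtype=int)
print(encoded)

# Width check for the full dataset: 25 columns before encoding, 4 of them
# replaced by one indicator column per category.
columns_before = 25
encoded_columns = 4
new_indicator_columns = 16 + 16 + 16 + 49
expected_width = columns_before - encoded_columns + new_indicator_columns
print(expected_width)
```

The result of 118 agrees with the column count reported for the encoded frame.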
# plt.figure(figsize=(20,20))
# sns.heatmap(data.corr(), linewidths=0.5, annot=False, fmt=".2f", cmap = 'viridis')
X = data.drop(['RainTomorrow'],axis=1)
y = data['RainTomorrow']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
y_train.value_counts()
0 90857
1 25511
Name: RainTomorrow, dtype: int64
y_test.value_counts()
0 22726
1 6366
Name: RainTomorrow, dtype: int64
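The counts above show that rainy days make up about 22% of both splits, so the classes are imbalanced. The random split happened to keep the proportions close. They can also be enforced with the `stratify` argument of `train_test_split`, which the notebook does not use. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Share of rainy days in each split, from the value_counts() output above.
train_share = 25511 / (90857 + 25511)
test_share = 6366 / (22726 + 6366)
print(round(train_share, 3), round(test_share, 3))

# Synthetic stand-in data: 80 dry days (0) and 20 rainy days (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_train.sum(), y_test.sum())
```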
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
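The scaler is fitted on the training data and then only applied to the test data, so the test rows are standardised with the training mean and standard deviation. A small sketch with made-up one-column arrays shows this:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from train
X_test_scaled = scaler.transform(X_test)        # reuses them, learns nothing new

print(scaler.mean_, scaler.scale_)
print(X_test_scaled.ravel())
```

The test value of 20.0 maps to 0 because it equals the training mean. The value 40.0 lies outside the training range and maps to a z-score above 2.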
from sklearn.linear_model import LogisticRegression

classifier_logreg = LogisticRegression(solver='liblinear', random_state=0)
classifier_logreg.fit(X_train, y_train)
LogisticRegression(random_state=0, solver='liblinear')
y_pred_logreg_proba = classifier_logreg.predict_proba(X_test)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_logreg_proba[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr,tpr,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Logistic Regression Model')
plt.xlabel("False Positive Rate")
plt.ylabel('True Positive Rate')
plt.show()
pd.DataFrame({"fpr":fpr, "tpr":tpr, "threshold":thresholds})
fpr tpr threshold
0 0.000000 0.000000 1.997204
1 0.000000 0.000157 0.997204
2 0.000000 0.006755 0.986600
3 0.000044 0.006755 0.986260
4 0.000044 0.007069 0.985841
... ... ... ...
6417 0.974347 0.999686 0.005672
6418 0.974347 0.999843 0.005666
6419 0.992388 0.999843 0.003269
6420 0.992388 1.000000 0.003264
6421 1.000000 1.000000 0.000584
6422 rows × 3 columns
thresholds[np.argmax(tpr - fpr)]
0.18556315429146203
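The expression `thresholds[np.argmax(tpr - fpr)]` selects the cut-off that maximises TPR minus FPR, a quantity often called Youden's J statistic. The sketch below applies the same rule to a small set of made-up scores so the result can be followed by hand. The labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80])

# drop_intermediate=False keeps every candidate threshold in the output.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

# Youden's J = TPR - FPR; pick the threshold where it is largest.
j_scores = tpr - fpr
best_threshold = thresholds[np.argmax(j_scores)]

# roc_curve treats a sample as positive when its score is >= the threshold.
preds = (scores >= best_threshold).astype(int)
print(best_threshold, preds.tolist())
```

At a threshold of 0.30, all three positives are caught and one of the five negatives is misclassified, giving J = 1 - 0.2 = 0.8, the largest value among the candidates.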
pred_proba = y_pred_logreg_proba[:,1]
preds = np.where(pred_proba>0.18, 1, 0)
from sklearn.metrics import confusion_matrix,accuracy_score

cm = confusion_matrix(y_test, preds)
s = sns.heatmap(cm ,annot=True ,fmt='d')
s.set(xlabel='Predicted', ylabel='Actual')
print("Model accuracy:",accuracy_score(y_test, preds))
Model accuracy: 0.7684930565103809
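With imbalanced classes, accuracy is easier to read next to a trivial baseline. The sketch below uses only numbers printed above (the `y_test.value_counts()` output and the reported accuracy). It computes the accuracy of always predicting "no rain". The variable names are illustrative.

```python
# Test-set class counts from y_test.value_counts() above.
test_counts = {0: 22726, 1: 6366}
n_test = sum(test_counts.values())

# Accuracy of a model that always predicts the majority class (0, no rain).
baseline_accuracy = max(test_counts.values()) / n_test

# Accuracy reported for logistic regression at the 0.18 threshold.
logreg_accuracy = 0.7684930565103809

print(f"baseline: {baseline_accuracy:.4f}")
print(f"logistic regression at 0.18: {logreg_accuracy:.4f}")
```

The baseline comes out near 0.78, slightly above the logistic regression accuracy at this threshold. That is not unexpected: a low threshold flags many more days as rainy, which costs accuracy in return for a higher true positive rate. Accuracy alone may therefore understate what the model contributes.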
Decision trees
from sklearn.tree import DecisionTreeClassifier

# from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
classifier_dt = DecisionTreeClassifier(max_depth=8,random_state=0)
classifier_dt.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=8, random_state=0)
# Get probabilities of records belonging to each class

y_pred_logreg_proba_dt = classifier_dt.predict_proba(X_test)
y_pred_logreg_proba_dt
array([[0.95227766, 0.04772234],
[0.69072165, 0.30927835],
[0.90505079, 0.09494921],
...,
[0.62937063, 0.37062937],
[0.83574879, 0.16425121],
[0.96869176, 0.03130824]])
# Get values rather than probabilities

y_pred_logreg_val_dt = classifier_dt.predict(X_test)
y_pred_logreg_val_dt
array([0, 0, 0, ..., 0, 0, 0])

fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_pred_logreg_proba_dt[:,1])
plt.plot(fpr_dt,tpr_dt,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Decision Tree Model')
plt.show()
pred_proba_dt = y_pred_logreg_proba_dt[:,1]
preds_dt = np.where(pred_proba_dt>0.70, 1, 0)
cm_dt = confusion_matrix(y_test, preds_dt)

ConfusionMatrixDisplay(confusion_matrix=cm_dt).plot()
print("Model accuracy:",accuracy_score(y_test, preds_dt))
Model accuracy: 0.8321875429671387
Random Forest
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(n_estimators= 20 ,max_depth=8, random_state=0)

classifier_rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=8, n_estimators=20, random_state=0)
y_pred_logreg_proba_rf = classifier_rf.predict_proba(X_test)
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_pred_logreg_proba_rf[:,1])
plt.plot(fpr_rf,tpr_rf,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Random forest Model')
plt.show()
pred_proba_rf = y_pred_logreg_proba_rf[:,1]
preds_rf = np.where(pred_proba_rf>0.70, 1, 0)
cm_rf = confusion_matrix(y_test, preds_rf)

ConfusionMatrixDisplay(confusion_matrix=cm_rf).plot()
print("Model accuracy:",accuracy_score(y_test, preds_rf))
Model accuracy: 0.7959576515880654
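The three accuracy figures are computed at different thresholds (0.18 for logistic regression, 0.70 for the two tree models), so they are not directly comparable. One threshold-independent way to compare the models is the area under the ROC curve. The sketch below shows the idea on synthetic data from `make_classification`. It does not reproduce the notebook's dataset and no scores for the real data are claimed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 78% negatives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.78, 0.22], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Same model settings as in the notebook.
models = {
    "logreg": LogisticRegression(solver="liblinear", random_state=0),
    "tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "forest": RandomForestClassifier(n_estimators=20, max_depth=8, random_state=0),
}

# ROC AUC uses the predicted probabilities directly, so no threshold is needed.
auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, proba)

for name, auc in auc_by_model.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```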
