
21/11/2023, 00:01 Rainfall_prediction.ipynb - Colaboratory

Rainfall prediction

from google.colab import drive


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remoun

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')

Dataset contains about 10 years of daily weather observations from numerous weather stations across Australia.

Location : The common name of the location of the weather station


MinTemp : The minimum temperature in degrees Celsius
MaxTemp : The maximum temperature in degrees Celsius
Rainfall : The amount of rainfall recorded for the day in mm
Evaporation : The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine : The number of hours of bright sunshine in the day.
WindGustDir : The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed : The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am : Direction of the wind at 9am
WindDir3pm : Direction of the wind at 3pm
WindSpeed9am : Speed of the wind at 9am
WindSpeed3pm : Speed of the wind at 3pm
Humidity9am : Relative humidity at 9 am
Humidity3pm : Relative humidity at 3 pm
Pressure9am : Atmospheric pressure reduced to mean sea level at 9 am
Pressure3pm : Atmospheric pressure reduced to mean sea level at 3 pm
Cloud9am : Fraction of sky obscured by cloud at 9 am
Cloud3pm : Fraction of sky obscured by cloud at 3 pm
Temp9am : Temperature recorded at 9 am
Temp3pm : Temperature recorded at 3 pm
RainToday : Precipitation (rainfall) today, recorded as Yes/No
RainTomorrow : Precipitation (rainfall) the next day, recorded as Yes/No

In this notebook we will train models to predict whether it will rain the next day (RainTomorrow).

# reading data
data = pd.read_csv("weather_data.csv")

data.shape

(145460, 23)

data.head()

https://colab.research.google.com/drive/1XM89JsPOm5UTWy8aQCcyBupVlhFp3m3Y#printMode=true 1/10

Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustD

0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN

1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WN

2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WS

3 2008-12-04 Albury 9.2 28.0 0.0 NaN NaN N

4 2008-12-05 Albury 17.5 32.3 1.0 NaN NaN

5 rows × 23 columns

Exploration

print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 145460 non-null object
1 Location 145460 non-null object
2 MinTemp 143975 non-null float64
3 MaxTemp 144199 non-null float64
4 Rainfall 142199 non-null float64
5 Evaporation 82670 non-null float64
6 Sunshine 75625 non-null float64
7 WindGustDir 135134 non-null object
8 WindGustSpeed 135197 non-null float64
9 WindDir9am 134894 non-null object
10 WindDir3pm 141232 non-null object
11 WindSpeed9am 143693 non-null float64
12 WindSpeed3pm 142398 non-null float64
13 Humidity9am 142806 non-null float64
14 Humidity3pm 140953 non-null float64
15 Pressure9am 130395 non-null float64
16 Pressure3pm 130432 non-null float64
17 Cloud9am 89572 non-null float64
18 Cloud3pm 86102 non-null float64
19 Temp9am 143693 non-null float64
20 Temp3pm 141851 non-null float64
21 RainToday 142199 non-null object
22 RainTomorrow 142193 non-null object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB
None
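The non-null counts above translate directly into missing-value fractions. A minimal sketch, using a few counts copied from the `data.info()` output (the helper name `missing_fraction` is ours and is not part of the notebook):

```python
# Total number of rows, from "RangeIndex: 145460 entries".
TOTAL_ROWS = 145460

# Non-null counts copied from the data.info() output for the sparsest columns.
non_null_counts = {
    "Evaporation": 82670,
    "Sunshine": 75625,
    "Cloud9am": 89572,
    "Cloud3pm": 86102,
}


def missing_fraction(non_null, total=TOTAL_ROWS):
    """Return the share of rows in which a column has no value."""
    return 1 - non_null / total


missing = {name: missing_fraction(count) for name, count in non_null_counts.items()}

# Print the columns from most to least sparse.
for name, frac in sorted(missing.items(), key=lambda item: -item[1]):
    print(f"{name}: {frac:.1%} missing")
```

On the loaded frame, `data.isnull().mean()` should give the same fractions for every column at once.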

data.describe()

MinTemp MaxTemp Rainfall Evaporation Sunshine WindGu

count 143975.000000 144199.000000 142199.000000 82670.000000 75625.000000 13519

mean 12.194034 23.221348 2.360918 5.468232 7.611178 4

std 6.398495 7.119049 8.478060 4.193704 3.785483 1

min -8.500000 -4.800000 0.000000 0.000000 0.000000

25% 7.600000 17.900000 0.000000 2.600000 4.800000 3

50% 12.000000 22.600000 0.000000 4.800000 8.400000 3

75% 16.900000 28.200000 0.800000 7.400000 10.600000 4

max 33.900000 48.100000 371.000000 145.000000 14.500000 13

data.describe(include=[object])

Date Location WindGustDir WindDir9am WindDir3pm RainToday RainTomorrow

count 145460 145460 135134 134894 141232 142199 142193

unique 3436 49 16 16 16 2 2

top 2013-11-12 Canberra W N SE No No

freq 49 3436 9915 11758 10838 110319 110316

data.RainTomorrow.unique()

array(['No', 'Yes', nan], dtype=object)

data.RainToday.unique()

array(['No', 'Yes', nan], dtype=object)


data.WindGustDir.unique()

array(['W', 'WNW', 'WSW', 'NE', 'NNW', 'N', 'NNE', 'SW', nan, 'ENE',
'SSE', 'S', 'NW', 'SE', 'ESE', 'E', 'SSW'], dtype=object)

data.WindGustDir.value_counts()

W 9915
SE 9418
N 9313
SSE 9216
E 9181
S 9168
WSW 9069
SW 8967
SSW 8736
WNW 8252
NW 8122
ENE 8104
ESE 7372
NE 7133
NNW 6620
NNE 6548
Name: WindGustDir, dtype: int64

categorical_features = [column_name for column_name in data.columns if data[column_name].dtype == 'O']


print("Number of Categorical Features: {}".format(len(categorical_features)))
print("Categorical Features: ",categorical_features)

Number of Categorical Features: 7


Categorical Features: ['Date', 'Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday', 'RainTomorrow']

numerical_features = [column_name for column_name in data.columns if data[column_name].dtype != 'O']


print("Number of Numerical Features: {}".format(len(numerical_features)))
print("Numerical Features: ",numerical_features)

Number of Numerical Features: 16


Numerical Features: ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'Win

data['RainTomorrow'].value_counts().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d94962b10>

sns.distplot(data['MaxTemp'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d9493be10>

sns.boxplot(x='RainTomorrow', y="Temp3pm", data=data)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d947cac50>

sns.boxplot(x='RainTomorrow', y="Temp9am", data=data)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d947c0d90>

Create features

# Dates look like "2008-12-01", so the year needs %Y (four digits), not %y.
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')

data['year'] = data['Date'].dt.year
data['month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day

data.drop('Date', axis = 1, inplace = True)


data.head()

Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm ... Pre

0 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W WNW ...

1 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW WSW ...

2 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W WSW ...

3 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE E ...

4 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE NW ...

5 rows × 25 columns
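The date handling can be tried in isolation. The sketch below uses a small made-up frame (`toy` is our name, not the notebook's) with dates in the same `YYYY-MM-DD` layout as the head of the dataset. It assumes an explicit `%Y-%m-%d` format is appropriate for every row.

```python
import pandas as pd

# Toy frame with dates in the same ISO layout as the dataset ("2008-12-01").
toy = pd.DataFrame({"Date": ["2008-12-01", "2008-12-02", "2009-01-15"]})

# %Y is the four-digit year; %y would expect a two-digit year such as "08".
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# The .dt accessor exposes the calendar components as integer columns.
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month
toy["day"] = toy["Date"].dt.day

# The raw timestamp is dropped once its components have been extracted.
toy = toy.drop(columns="Date")
print(toy)
```

Replacing the single `Date` column with three derived columns is what takes the frame from 23 to 25 columns.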

Missing value treatment

categorical_features = [column_name for column_name in data.columns if data[column_name].dtype == 'O']


data[categorical_features].isnull().sum()

Location 0
WindGustDir 10326
WindDir9am 10566
WindDir3pm 4228
RainToday 3261
RainTomorrow 3267
dtype: int64

# categorical feature mode imputation
categorical_features_with_null = [feature for feature in categorical_features if data[feature].isnull().sum()]
for each_feature in categorical_features_with_null:
    mode_val = data[each_feature].mode()[0]
    data[each_feature] = data[each_feature].fillna(mode_val)
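Mode imputation is easy to check on a small example. This is a sketch on made-up values (the frame `toy` is hypothetical). It assigns the filled column back instead of filling in place, which behaves the same here but does not rely on in-place edits of a column selection.

```python
import numpy as np
import pandas as pd

# Toy categorical columns with one missing value each.
toy = pd.DataFrame({
    "WindGustDir": ["W", "W", np.nan, "SE", "W"],
    "RainToday": ["No", np.nan, "No", "Yes", "No"],
})

for column in toy.columns:
    # mode() ignores NaN and returns a Series because ties are possible;
    # [0] takes the first (smallest) of the most frequent values.
    mode_val = toy[column].mode()[0]
    toy[column] = toy[column].fillna(mode_val)

print(toy)
```

In the notebook, the same loop also fills the 3267 missing `RainTomorrow` labels with that column's mode, `No`.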

numerical_features = [column_name for column_name in data.columns if data[column_name].dtype != 'O']


data[numerical_features].isnull().sum()

MinTemp 1485
MaxTemp 1261
Rainfall 3261
Evaporation 62790
Sunshine 69835
WindGustSpeed 10263
WindSpeed9am 1767
WindSpeed3pm 3062
Humidity9am 2654
Humidity3pm 4507
Pressure9am 15065
Pressure3pm 15028
Cloud9am 55888
Cloud3pm 59358
Temp9am 1767
Temp3pm 3609
year 0
month 0
day 0
dtype: int64

Outlier treatment

sns.boxplot(x='WindGustSpeed',data=data)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d946d4490>

data.WindGustSpeed.mean()

40.03523007167319

features_with_outliers = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'WindGustSpeed','WindSpeed9am', 'WindSpeed3pm', '


for feature in features_with_outliers:
    q1 = data[feature].quantile(0.25)
    q3 = data[feature].quantile(0.75)
    IQR = q3-q1
    lower_limit = q1 - (IQR*1.5)
    upper_limit = q3 + (IQR*1.5)
    data.loc[data[feature]<lower_limit,feature] = lower_limit
    data.loc[data[feature]>upper_limit,feature] = upper_limit
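The loop above caps each feature at 1.5 × IQR beyond the quartiles. The same rule can be written with `Series.clip`. This is a sketch on made-up wind speeds, and `cap_outliers_iqr` is a hypothetical helper name rather than something the notebook defines.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Clip values outside [q1 - k*IQR, q3 + k*IQR] to those limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # clip leaves missing values untouched, like the .loc comparisons do.
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy wind-gust speeds with one extreme value.
speeds = pd.Series([30.0, 35.0, 40.0, 45.0, 50.0, 135.0])
capped = cap_outliers_iqr(speeds)
print(capped.tolist())
```

Only the extreme value moves: it is pulled down to the upper limit and the other observations are unchanged. This is why the mean of `WindGustSpeed` shifts only slightly after capping.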

data.WindGustSpeed.mean()

39.83779225870396

sns.boxplot(x='WindGustSpeed',data=data)


<matplotlib.axes._subplots.AxesSubplot at 0x7f4d946cc110>

## missing value imputation for numerical features


numerical_features_with_null = [feature for feature in numerical_features if data[feature].isnull().sum()]
for feature in numerical_features_with_null:
    mean_value = data[feature].mean()
    data[feature] = data[feature].fillna(mean_value)

Feature engineering

# Encode the Yes/No columns as 1/0; missing values were filled earlier.
data['RainToday'] = data['RainToday'].map({'No': 0, 'Yes': 1})

data['RainTomorrow'] = data['RainTomorrow'].map({'No': 0, 'Yes': 1})

data = pd.get_dummies(data, columns=['WindGustDir','WindDir9am','WindDir3pm','Location'])

data.head()

MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm

0 13.4 22.9 0.6 5.318667 7.611178 44.0 20.0 24.0 71.0 22.0

1 7.4 25.1 0.0 5.318667 7.611178 44.0 4.0 22.0 44.0 25.0

2 12.9 25.7 0.0 5.318667 7.611178 46.0 19.0 26.0 38.0 30.0

3 9.2 28.0 0.0 5.318667 7.611178 24.0 11.0 9.0 45.0 16.0

4 17.5 32.3 1.0 5.318667 7.611178 41.0 7.0 20.0 82.0 33.0

5 rows × 118 columns
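The jump to 118 columns follows from the category counts reported earlier: 16 directions in each wind column and 49 locations. A small sketch of the arithmetic, first on a made-up frame (`toy` is hypothetical) and then with the notebook's numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "WindGustDir": ["W", "WNW", "WSW", "W"],
    "Location": ["Albury", "Albury", "Canberra", "Canberra"],
})

encoded = pd.get_dummies(toy, columns=["WindGustDir", "Location"])

# Each encoded column is replaced by one indicator column per distinct value.
expected_columns = (
    toy.shape[1]
    - 2
    + toy["WindGustDir"].nunique()
    + toy["Location"].nunique()
)
print(encoded.shape[1], expected_columns)

# Same arithmetic with the notebook's numbers: 25 columns, 4 of them encoded,
# 16 directions in each of the 3 wind columns and 49 locations.
notebook_columns = 25 - 4 + 3 * 16 + 49
print(notebook_columns)
```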

# plt.figure(figsize=(20,20))
# sns.heatmap(data.corr(), linewidths=0.5, annot=False, fmt=".2f", cmap = 'viridis')

X = data.drop(['RainTomorrow'],axis=1)
y = data['RainTomorrow']

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)

y_train.value_counts()

0 90857
1 25511
Name: RainTomorrow, dtype: int64

y_test.value_counts()

0 22726
1 6366
Name: RainTomorrow, dtype: int64
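Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach. The sketch below uses only the test-set counts printed above. The "always predict the majority class" baseline is our addition for comparison and is not something the notebook computes.

```python
# Class counts copied from y_test.value_counts() above.
test_counts = {0: 22726, 1: 6366}

total = sum(test_counts.values())
majority_class = max(test_counts, key=test_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = test_counts[majority_class] / total
positive_rate = test_counts[1] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"share of rainy days: {positive_rate:.4f}")
```

Roughly 78% of test days are dry, so a reported accuracy is best judged against that figure.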

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
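The scaler is fitted on the training split only and then applied to both splits. The sketch below reproduces that fit/transform split by hand with NumPy on random toy matrices (all names are ours). It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy train/test matrices standing in for X_train and X_test.
X_train_toy = rng.normal(loc=20.0, scale=5.0, size=(100, 3))
X_test_toy = rng.normal(loc=20.0, scale=5.0, size=(20, 3))

# "fit": per-column mean and standard deviation from the training split only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # ddof=0

# "transform": both splits are scaled with the training statistics.
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1. The test columns are only close to that, since the test statistics never enter the fit.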

from sklearn.linear_model import LogisticRegression


classifier_logreg = LogisticRegression(solver='liblinear', random_state=0)
classifier_logreg.fit(X_train, y_train)

LogisticRegression(random_state=0, solver='liblinear')

y_pred_logreg_proba = classifier_logreg.predict_proba(X_test)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_logreg_proba[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr,tpr,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Logistic Regression Model')
plt.xlabel("False Positive Rate")
plt.ylabel('True Positive Rate')
plt.show()

pd.DataFrame({"fpr":fpr, "tpr":tpr, "threshold":thresholds})

fpr tpr threshold

0 0.000000 0.000000 1.997204

1 0.000000 0.000157 0.997204

2 0.000000 0.006755 0.986600

3 0.000044 0.006755 0.986260

4 0.000044 0.007069 0.985841

... ... ... ...

6417 0.974347 0.999686 0.005672

6418 0.974347 0.999843 0.005666

6419 0.992388 0.999843 0.003269

6420 0.992388 1.000000 0.003264

6421 1.000000 1.000000 0.000584

6422 rows × 3 columns

thresholds[np.argmax(tpr - fpr)]

0.18556315429146203
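The expression `thresholds[np.argmax(tpr - fpr)]` picks the cut-off that maximises TPR minus FPR, a quantity often called Youden's J statistic. A minimal sketch on made-up labels and scores (not the weather data) shows the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and scores (hypothetical, not from the weather data).
y_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# J = TPR - FPR; its maximum is the ROC point farthest above the diagonal.
j = tpr - fpr
best = np.argmax(j)
best_threshold = thresholds[best]

# roc_curve treats scores >= threshold as positive.
preds = (scores >= best_threshold).astype(int)
print(best_threshold, j[best])
```

In the notebook the selected value, about 0.1856, is rounded down to 0.18 before it is applied. Such a low cut-off favours catching rainy days at the cost of more false alarms.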

pred_proba = y_pred_logreg_proba[:,1]

preds = np.where(pred_proba>0.18, 1, 0)

from sklearn.metrics import confusion_matrix,accuracy_score


cm = confusion_matrix(y_test, preds)
s = sns.heatmap(cm ,annot=True ,fmt='d')
s.set(xlabel='Predicted', ylabel='Actual')
print("Model accuracy:",accuracy_score(y_test, preds))


Model accuracy: 0.7684930565103809
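Accuracy is a single number taken from the confusion matrix; precision and recall can be read off the same four cells. The sketch below assumes the layout returned by `confusion_matrix` (rows are actual classes, columns are predicted classes, negative class first). The counts are hypothetical, since the notebook's matrix appears only as a plot, and `summarize_confusion` is our helper name.

```python
import numpy as np


def summarize_confusion(cm):
    """Derive accuracy, precision and recall from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes, in the
    order [negative, positive].
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts for illustration; not the notebook's actual matrix.
toy_cm = [[50, 10], [5, 35]]
metrics = summarize_confusion(toy_cm)
print(metrics)
```

In the notebook, `summarize_confusion(cm)` could be called on the matrix computed above to see how much recall the 0.18 cut-off buys.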

Decision trees

from sklearn.tree import DecisionTreeClassifier


# from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

classifier_dt = DecisionTreeClassifier(max_depth=8,random_state=0)
classifier_dt.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=8, random_state=0)

# Get probabilities of records belonging to each class


y_pred_logreg_proba_dt = classifier_dt.predict_proba(X_test)
y_pred_logreg_proba_dt

array([[0.95227766, 0.04772234],
[0.69072165, 0.30927835],
[0.90505079, 0.09494921],
...,
[0.62937063, 0.37062937],
[0.83574879, 0.16425121],
[0.96869176, 0.03130824]])

# Get predicted class labels (default 0.5 cut-off) rather than probabilities


y_pred_logreg_val_dt = classifier_dt.predict(X_test)
y_pred_logreg_val_dt

array([0, 0, 0, ..., 0, 0, 0])

from sklearn.metrics import roc_curve


fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_pred_logreg_proba_dt[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr_dt,tpr_dt,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Decision Tree Model')
plt.xlabel("False Positive Rate")
plt.ylabel('True Positive Rate')
plt.show()

pred_proba_dt = y_pred_logreg_proba_dt[:,1]

preds_dt = np.where(pred_proba_dt>0.70, 1, 0)

cm_dt = confusion_matrix(y_test, preds_dt)


ConfusionMatrixDisplay(confusion_matrix=cm_dt).plot()
print("Model accuracy:",accuracy_score(y_test, preds_dt))


Model accuracy: 0.8321875429671387
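The tree is evaluated at a cut-off of 0.70, whereas the logistic regression used 0.18, so the two accuracies are not measured on the same footing. The sketch below uses made-up labels and probabilities to show how accuracy and recall move as the cut-off changes. The arrays and the helper `accuracy_at` are hypothetical.

```python
import numpy as np

# Hypothetical labels and positive-class probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.45, 0.75, 0.40, 0.55, 0.72, 0.90])


def accuracy_at(threshold):
    """Accuracy when probabilities above the threshold are labelled 1."""
    preds = np.where(proba > threshold, 1, 0)
    return float((preds == y_true).mean())


for threshold in (0.18, 0.50, 0.70):
    preds = np.where(proba > threshold, 1, 0)
    # Recall: share of actual positives that are predicted positive.
    recall = preds[y_true == 1].mean()
    print(f"threshold={threshold:.2f} "
          f"accuracy={accuracy_at(threshold):.2f} recall={recall:.2f}")
```

On this toy data a higher cut-off lowers recall while accuracy rises and then falls, so comparing models is cleaner when they share a cut-off or a threshold-free metric.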

Random Forest

from sklearn.ensemble import RandomForestClassifier

classifier_rf = RandomForestClassifier(n_estimators= 20 ,max_depth=8, random_state=0)


classifier_rf.fit(X_train, y_train)

RandomForestClassifier(max_depth=8, n_estimators=20, random_state=0)

y_pred_logreg_proba_rf = classifier_rf.predict_proba(X_test)
from sklearn.metrics import roc_curve
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_pred_logreg_proba_rf[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr_rf,tpr_rf,'-g',linewidth=1)
plt.plot([0,1], [0,1], 'k--' )
plt.title('ROC curve for Random forest Model')
plt.xlabel("False Positive Rate")
plt.ylabel('True Positive Rate')
plt.show()

pred_proba_rf = y_pred_logreg_proba_rf[:,1]

preds_rf = np.where(pred_proba_rf>0.70, 1, 0)

cm_rf = confusion_matrix(y_test, preds_rf)


ConfusionMatrixDisplay(confusion_matrix=cm_rf).plot()
print("Model accuracy:",accuracy_score(y_test, preds_rf))

Model accuracy: 0.7959576515880654
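The three ROC curves can each be summarised by the area under the curve, which does not depend on a chosen cut-off. This is a sketch using `roc_auc_score` on made-up scores; the model names and numbers are hypothetical, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical test labels and per-model positive-class probabilities.
y_true = np.array([0, 0, 1, 1])
model_scores = {
    "model_a": np.array([0.1, 0.4, 0.35, 0.8]),
    "model_b": np.array([0.2, 0.3, 0.6, 0.9]),
}

# AUC is the probability that a random positive is scored above a random negative.
auc = {name: roc_auc_score(y_true, s) for name, s in model_scores.items()}
for name, value in auc.items():
    print(f"{name}: AUC = {value:.2f}")
```

In the notebook the equivalent calls would be `roc_auc_score(y_test, y_pred_logreg_proba[:, 1])` and the same for the `_dt` and `_rf` probability arrays.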
