
RAINFALL PREDICTION USING

MACHINE LEARNING MODELS


REPORT OF MINI PROJECT-I

Submitted in partial fulfilment of the requirements for the degree of

BACHELOR OF TECHNOLOGY

In

CIVIL ENGINEERING
By

ALLEN VARGHESE PAUL (201CV103)


RAKSHITH SAJJAN (201CV142)
UJWAL B (201CV258)
SHANNON BRITNEY CARLO (201CV249)
Under the guidance of

Dr. Subrahmanya Kundapura

DEPARTMENT OF WATER RESOURCES AND OCEAN

ENGINEERING

NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA

SURATHKAL, MANGALORE-575025

NOVEMBER 2022
DECLARATION

We declare that the Report of the Mini project-I entitled “RAINFALL PREDICTION USING
MACHINE LEARNING”, which is being submitted to National Institute of Technology
Karnataka, Surathkal in partial fulfilment of requirements of the Degree of Bachelor of
Technology in Civil Engineering is a bona fide report of the project work carried out by us.
The material contained in this report has not been submitted to any university or Institution for
the award of any degree.

ALLEN VARGHESE PAUL (201CV103)


RAKSHITH SAJJAN (201CV142)

UJWAL B (201CV258)

SHANNON BRITNEY CARLO (201CV249)

Place: NITK, Surathkal


Date: 07 - 11 - 2022
Certificate

This is to certify that this report entitled RAINFALL PREDICTION USING MACHINE
LEARNING being submitted by ALLEN VARGHESE PAUL (201CV103), RAKSHITH
SAJJAN (201CV142), UJWAL B (201CV258) and SHANNON BRITNEY CARLO
(201CV249) is accepted as the record of work carried out by them as part of Mini Project-I
in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology
in Civil Engineering of the Department of Civil Engineering, National Institute of Technology
Karnataka, Surathkal, Mangaluru.

Dr. Subrahmanya Kundapura
Professor & Mini Project Supervisor
Department of Water Resources and Ocean Engineering
National Institute of Technology Karnataka, Surathkal

Dr. B. R. Jayalekshmi
Professor and Head & Chairman (DUGC)
Department of Civil Engineering
National Institute of Technology Karnataka, Surathkal
ABSTRACT

The study of precipitation and rainfall trends is critically important for a country like India
whose food security and economy are dependent on the timely availability of water. In this
work, rainfall trends have been studied from a daily data series using several machine
learning models, whose accuracies range from about 50% to 85%.
Previously, predicting rainfall was particularly difficult, relying on historical data and
empirical formulas. However, with the advent of machine learning, rainfall can be predicted
much more accurately. Factors like humidity, wind speed, sunshine hours, wind direction,
minimum and maximum temperatures during the day are considered while predicting the
rainfall of a certain area. Models like Linear Regression, KNN, SVM, etc., have been used to
predict the output with the given data. Artificial Intelligence tools can replace the simulation
models by using input and output data sets without considering some of the complex relations
of the system to be modeled. The aim of the study is to predict the rainfall in Surathkal,
Mangalore, India.
The data used are from the Department of Water Resources and Ocean Engineering.
Keywords: Rainfall, Parameters, Artificial Intelligence, Machine Learning
TABLE OF CONTENTS

Abstract
Contents
List of Figures
List of Tables
List of Abbreviations

1. Introduction
1.1. General
1.2. Methods of measurement of Rainfall
1.3. Factors affecting Rainfall analysis
1.4. Artificial Intelligence
1.5. Machine Learning
1.6. Difference between AI and ML
1.7. Advantages of AI over Numerical Modelling

2. Literature Review
2.1. General
2.2. Rainfall studies
2.3. Models used for rainfall prediction
2.4. Concepts of Machine Learning

3. Data
4. Study Area

5. Methodology
5.1. Data pre-processing and exploratory data analysis
5.2. Model training and validation
5.3. Model performance evaluation
5.4. Selection of best model and result analysis

6. Result
7. Observation
8. Conclusion
9. References
LIST OF TABLES

Table No.   Description
1.1         Difference between AI and ML
3.1         Data and Sources of Data


LIST OF FIGURES

Fig. No.    Description
1.1         Standard rain gauge
1.2         Hydrometer
1.3         Relation between AI and its sub-fields
1.4         Machine Learning and its types
1.5         Relation between AI, ML and DL
4.1         Location map of study area
5.1         Building Machine Learning model
5.2         Heatmap
5.3         Scatterplot between maximum temperature and minimum temperature
5.4         Scatterplot between evaporation and humidity
5.5         Outlier detection for maximum temperature
5.6         Outlier detection for minimum temperature
5.7         Outlier detection for wind gust speed
5.8         Outlier detection for sunshine hours
5.9         Outlier detection for humidity
5.10        Outlier detection for evaporation
5.11        Outlier detection for rainfall
5.12        Multivariate analysis
5.13        Positive correlation graph
5.14        Negative correlation graph
5.15        No correlation graph
6.1         Trial 1
6.2         Trial 2
6.3         Trial 3
ABBREVIATIONS

AI Artificial Intelligence

ML Machine Learning

DL Deep Learning

KNN K-Nearest Neighbours

SVM Support Vector Machine

SVR Support Vector Regression

RE Relative Error

RMSE Root Mean Square Error

EDA Exploratory Data Analysis


1. INTRODUCTION

1.1. General

Rainfall is the major product of the condensation of atmospheric water vapor that falls under
gravity from clouds. It occurs when a portion of the atmosphere becomes saturated
with water vapor (reaching 100% relative humidity), so that the water condenses and
"precipitates" or falls. Rainfall (including drizzle and rain) is usually measured using a rain
gauge and expressed in units of millimeters (mm) of height or depth. Rainfall is the predominant
form of precipitation causing streamflow, especially the flood flow in a majority of rivers in India.
The magnitude of rainfall varies with time and space. Differences in the magnitude of rainfall
in various parts of the country at a given time and variation of rainfall at a place in the
various seasons of the year are obvious and need no elaboration. Rainfall can be classified
based on the rate of precipitation as follows: –

1. Light rainfall – less than 2.5 mm/h

2. Moderate rainfall – 2.5 mm/h to 7.5 mm/h

3. Heavy rainfall – greater than 7.5 mm/h

Based on the seasons the different amounts of rainfall are given below: -

1. South-west monsoon (June – September) – The south-west monsoon is the principal
rainy season of India, accounting for about 75% of the annual rainfall. Precipitation of
about 100–200 mm per day can occur.

2. Transition-1, post-monsoon (October – November) – The retreating air mass strikes the
east coast of the southern peninsula and causes rainfall. The cyclones that form in the Bay
of Bengal are about twice as many as those in the Arabian Sea.

3. Winter season (December – February) – By about mid-December, disturbances of extra-
tropical origin travel eastward and cause moderate to heavy rainfall and snowfall.

4. Transition-2, summer (March – May) – There is very little rainfall in India in this
season.

India has 4 % of the world’s freshwater which must cater to 17% of the world’s population.
As per NITI Aayog report released in June 2019, India is facing the worst-ever water crisis in
history. Approximately 600 million people or roughly around 45% of the population in India
is facing high to severe water stress. As per the report, 21 Indian cities will run out of their
main source of water i.e., groundwater by 2020. The report goes on to say that nearly 40% of
the population will have absolutely no access to drinking water by 2030 and 6% of India’s
GDP will be lost by 2050 due to the water crisis. The water footprint network has developed
an interactive tool to calculate and map the water footprint by different users, assess its
sustainability, and identify strategic interventions for improving water use. Hence, to develop
these efficient systems, we can use Artificial Intelligence and Machine Learning.

1.2. Methods for measurement of rainfall

The standard way of measuring rainfall or snowfall is the standard rain gauge, which can be
found in 100 mm plastic and 200 mm metal varieties. The inner cylinder is filled by 25 mm
of rain, with overflow flowing into the outer cylinder. Other types of gauges include the
popular wedge gauge, the tipping bucket rain gauge, and the weighing rain gauge.

Fig 1.1 Standard rain gauge

Fig 1.2 Hydrometer

A concept used in precipitation measurement is the hydrometeor. Hydrometeorology is a
branch of meteorology and hydrology that studies the transfer of water and energy between
the land surface and the lower atmosphere. Hydrologists often use data provided by
meteorologists. As an example, a meteorologist might forecast 2–3 inches (51–76 mm)
of rain in a specific area, and a hydrologist might then forecast what the specific impact of
that rain would be on the local terrain.

Satellite sensors work by remotely sensing precipitation—recording various parts of
the electromagnetic spectrum that theory and practice show are related to the occurrence and
intensity of precipitation. Satellite sensors now in practical use for precipitation fall into two
categories: thermal infrared (IR) sensors and microwave sensors.
Since the late 1990s, several algorithms have been developed to combine precipitation data
from multiple satellites' sensors, seeking to emphasize the strengths and minimize the
weaknesses of the individual input data sets. The best analyses of gauge data take two months
or more after the observation time to undergo the necessary transmission, assembly,
processing, and quality control. Thus, precipitation estimates that include gauge data tend to
be produced further after the observation time than the no-gauge estimates. As a result, while
estimates that include gauge data may provide a more accurate depiction of the "true"
precipitation, they are generally not suited for real- or near-real-time applications.

1.3. Factors affecting rainfall

1. Temperature

Temperature affects how much water evaporates from the surface of the ground. If the
temperature is high, more moisture is lost to evaporation; if the temperature is low, less
moisture is lost.

2. Humidity

When humidity is high, the air is close to saturation and cannot hold much additional water
vapor. In contrast, when humidity is low, the air can take up considerably more water vapor
than it currently holds.

3. Wind

Wind speed and direction affect the movement of air across the earth’s surface. Strong winds
cause the air to move rapidly over the land, causing evaporation to occur at a faster rate; as a
result, the air becomes drier. Conversely, calm conditions allow the air to remain still, so
moisture can accumulate and the air becomes more humid.

4. Evaporation

Evaporation is the transition of the liquid particles into the gaseous phase. Rainfall is affected
by the rate of evaporation as it is the amount of water entering the atmosphere from the
surface of the Earth.

5. Cloud Cover

Cloud cover is the percentage of sky covered by clouds. Clouds reflect sunlight back into
space, thereby cooling the planet. On average, cloud cover increases precipitation.

1.4. Artificial Intelligence

AI refers to the development of computer systems able to perform tasks that normally require
human intelligence, such as visual perception, speech recognition, decision-making, and
translation between languages. Artificial intelligence was founded as an academic discipline
in 1956, and in the years since has experienced several waves of optimism. Some AI
applications include advanced web search engines, recommendation systems (used
by YouTube and Amazon), understanding human speech (such as Siri and Alexa), and self-
driving cars (e.g., Tesla). AI researchers have adapted and integrated a wide range of
problem-solving techniques – including search and mathematical optimization, formal
logic, artificial neural networks, and methods based on statistics, probability and economics.

Fig. 1.3 Relation between AI and its sub-fields

1.5. Machine Learning

Machine learning is a part of artificial intelligence. Machine learning algorithms build a
model based on sample data, known as training data, in order to make predictions or
decisions without being explicitly programmed to do so. Machine learning algorithms are
used in a wide variety of applications, such as in medicine, email filtering, speech
recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop
conventional algorithms to perform the needed tasks. Machine learning (ML), reorganized as
a separate field, started to flourish in the 1990s. The field changed its goal from achieving
artificial intelligence to tackling solvable problems of a practical nature. The difference
between ML and AI is frequently misunderstood. ML learns and predicts based on passive
observations, whereas AI implies an agent interacting with the environment to learn and take
actions that maximize its chance of successfully achieving its goals. Machine learning can be
divided into 3 types:

1. Supervised Learning
Supervised learning algorithms build a mathematical model of a set of data that contains both
the inputs and the desired outputs. The data is known as training data, and consists of a set of
training examples. Each training example has one or more inputs and the desired output, also
known as a supervisory signal. In the mathematical model, each training example is
represented by an array or vector, sometimes called a feature vector, and the training data is
represented by a matrix. Types of supervised-learning algorithms include some major
algorithms like active learning, classification and regression.

2. Unsupervised learning

Unsupervised learning algorithms take a set of data that contains only inputs, and find
structure in the data, like grouping or clustering of data points. The algorithms, therefore,
learn from test data that has not been labelled, classified, or categorized. Instead of
responding to feedback, unsupervised learning algorithms identify commonalities in the data
and react based on the presence or absence of such commonalities in each new piece of data.

3. Reinforcement learning

Reinforcement learning is an area of machine learning concerned with how software
agents ought to take actions in an environment so as to maximize some notion of cumulative
reward. Due to its generality, the field is studied in many other disciplines, such as game
theory, control theory, operations research, information theory, simulation-based
optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms.
Many reinforcement learning algorithms use dynamic programming techniques.

Fig. 1.4 Machine Learning and its types


1.6. Differences between AI and Machine Learning

Machine learning refers to a subset of artificial intelligence that allows machines to learn and
improve automatically based on past data without the need for explicit programming.
Artificial intelligence aims at producing smart computer systems that can solve complex
human problems faster than humans can. In the case of ML, we essentially teach machines,
using data, to come up with accurate results by performing a task on their own, whereas in AI
we try to develop a system that can perform the task the way a human being would.

Artificial Intelligence vs. Machine Learning:

1. AI stands for Artificial Intelligence and is a superset of ML; ML stands for Machine
Learning and is a subset of AI.
2. The main aim of AI is to increase the chances of success, not accuracy; the main aim of
ML is to increase accuracy, regardless of outcome.
3. AI simulates natural intelligence to solve complex problems; ML learns from pre-existing
data and uses it to maximise performance.
4. AI has a very broad variety of applications; the scope of Machine Learning is limited.
5. AI is basically decision making; ML allows the system to learn from the data.

Table 1.1 Difference between AI and ML

Fig 1.5 Relation between AI, ML and DL


1.7. Advantages of AI over Numerical Modelling

Artificial intelligence is preferred over numerical modelling for many problems because of the
errors and the high computational complexity involved in the simulation process. Artificial
Intelligence tools can replace simulation models and decrease computational effort by using
input and output data sets without considering the complex internal relations of the system to
be modelled.

The four basic types of rainfall models that have been the focus of most of the recent research
on rainfall modelling are:

1. Generalized linear models (GLMs)

2. Hidden Markov models (HMMs)

3. Nonparametric models

4. “Mechanistic” models.

No matter what type of model is fit, a common goal is to simulate rainfall from the fitted
model. There are two sources of variation: the variation built into the model, and the
variation associated with the uncertainty with which the parameters of the model are
estimated during the training phase of the data analysis. This second source of variation is
often overlooked.

In the case of rainfall models, there are variations in parameters such as wind speed that
cannot be simulated exactly with high accuracy. Hence, numerical modelling is not well suited
to the rainfall prediction problem. Instead, it is preferable to feed the data collected through
observation into a model that can learn from the data and predict rainfall accurately.
2. LITERATURE REVIEW

2.1 General

The review of literature consists of the following sections: (a) rainfall studies and models,
(b) models used for rainfall prediction, and (c) concepts of machine learning.

2.2 Rainfall studies and models

Researchers are working to detect patterns in climate change because it affects the economy in
areas ranging from production to infrastructure. Likewise, predicting rainfall with a good
accuracy rate is a challenging task. Reliable rainfall prediction cannot be achieved by
traditional methods alone, so scientists are using machine learning and deep learning to find
patterns for rainfall prediction. Techniques used for the prediction of rainfall include regression
analysis, clustering, and Artificial Neural Networks (ANN). Fundamentally, two approaches
are used for predicting rainfall: one is the empirical approach and the other is the dynamical
approach. The empirical approach is based on
an analysis of historical data of the rainfall and its relationship to a variety of atmospheric and
oceanic variables over different parts of the world. The most widely used empirical approaches,
which are used for climate prediction, are regression, artificial neural network, fuzzy logic, and
group method of data handling. On the other hand, in a dynamical approach, predictions are
generated by physical models based on systems of equations that predict the evolution of the
global climate system in response to initial atmospheric conditions.

Different rainfall estimation models were developed by Ozlem Terzi using the monthly
rainfall data of the Isparta, Senirkent, Uluborlu, Egirdir, and Yalvac stations of Turkey. Rainfall
estimation models were built using Decision Table, KNN, Multilinear Regression, M5Rules,
Multilayer Perceptron, RBF Network, Random Subspace, and Simple Linear Regression
algorithms, and the quality of these models was tested using the coefficient of
determination (R2) and the root-mean-squared error (RMSE), which are among the best known
and most commonly used performance criteria. Using different combinations of inputs to the
developed models, he found that the MLR model gives the best results for estimating rainfall
over the Isparta region. J. M. Spate et al. prepared a model to estimate streamflow from
measured and estimated/interpolated rainfall. The k-medoid clustering algorithm was discussed
for clustering shapes/peaks, and the paper discussed various classification and association rule
extraction methods. They selected all those catchments in their region of interest where
high-intensity rainfall data exist for at least some temporal interval. They then applied some
simple criteria to the high-intensity data; for example, a certain amount of rain must fall within
a small time interval on a given day for that fall to be flagged as an intense event. Having
generated a Boolean series with 1s on every day with an intense event and 0s elsewhere, they
used data mining to automatically extract those combinations of daily data characteristics that
tend to occur on a day with a 1 in the Boolean series.
Pratap Singh Solanki et al. reviewed studies related to the use of data mining techniques in
the water resources sector for water management. Water resources management has become a
most challenging, interesting, and fascinating domain around the world over the last many
years. Scientists have tried to predict rainfall, flood warnings, water inflow, water availability
and requirements, etc., based on the huge available metadata using various methods. In this
article, they surveyed the use of data mining techniques for predicting inflow, drought
possibility, weather, rainfall, evaporation, temperature, wind speed, etc. The paper provides a
survey of the literature and work done by researchers using various algorithms and modelling
methods, viz. association rules, classification, clustering, decision trees, and artificial neural
networks.

In her project, Pinky Saikia Dutta implemented rainfall prediction using an empirical
statistical technique. She used six years (2007–2012) of data, including minimum
temperature, maximum temperature, pressure, wind direction, relative humidity, etc., and
performed prediction of rainfall using Multiple Linear Regression (MLR). The model
forecasts the monthly rainfall amount in the summer monsoon season (in mm). Regression is
a statistical empirical technique that utilizes the relation between two or more quantitative
variables in an observational database so that the outcome variable can be predicted from the
others. One of the purposes of a regression model is to find out to what extent the outcome
(dependent variable) can be predicted from the independent variables. The predictors selected
for the model are minimum temperature, maximum temperature, mean sea level pressure, wind
speed, and rainfall.

Jyothis Joseph described an empirical technique belonging to the clustering and
classification approach, implemented using ANNs. The inputs used were relative humidity,
pressure, temperature, precipitable water, and wind speed. In this paper, subtractive
clustering is used. Subtractive clustering is a fast, one-pass algorithm for estimating the number
of clusters and the cluster centres in a set of data. Applying subtractive clustering, the optimum
number of clusters is obtained. The rainfall values are categorized as low, medium and heavy.
The classifier model was evaluated against a confusion matrix and the results were obtained.
The paper applies a neural network to rainfall prediction and implements two methods,
classification and clustering. Bayesian regularization was applied in training the neural
network.

K. Poorani and K. Brindha used the Principal Component Analysis (PCA) method for
forecasting rainfall. The proposed PCA method is used when there is a significant inter-
correlation between the predictors. The PCA model avoids the inter-correlation and helps to
reduce the degrees of freedom by controlling the number of predictors. Their experimental
studies therefore suggest that PCA has some benefits over ANN in analysing climatic time
series such as rainfall, particularly with regard to the interpretability of the extracted signals.

2.3 Models and methods used for rainfall prediction

1. Linear Regression:
Linear regression is the simplest machine learning model, in which we try to predict one output
variable using one or more input variables. The representation of linear regression is a linear
equation that combines a set of input values (x) to give a predicted output (y) for that set of
input values. It is represented in the form of a line.

2. Decision Tree:

Decision trees are the popular machine learning models that can be used for both regression
and classification problems.

A decision tree uses a tree-like structure of decisions along with their possible consequences
and outcomes. Each internal node represents a test on an attribute, and each branch represents
the outcome of that test. The more nodes a decision tree has, the more accurate the result can
be. Decision trees are intuitive and easy to implement, but they can lack accuracy.

3. Random Forest:

Random Forest is an ensemble learning method which consists of many decision trees. Each
decision tree in a random forest predicts an outcome, and the prediction with the most votes is
considered as the final outcome.

A random forest model can be used for both regression and classification problems.

4. SVM:

Support Vector Machine (SVM) is a relatively simple supervised machine learning algorithm
used for classification and/or regression. It is generally preferred for classification but is
sometimes very useful for regression as well. An SVM maps the data so that the margin
between the two classes is as wide as possible. SVMs are used in text categorization, image
classification, handwriting recognition and in the sciences.

5. KNN:

The k-nearest neighbours algorithm, also known as KNN or k-NN, is a non-parametric,
supervised learning classifier which uses proximity to make classifications or predictions
about the grouping of an individual data point. For prediction with continuous values, the
average of the k nearest neighbours is taken.

6. Gradient Boosting:
Gradient boosting is a technique used in creating models for prediction. The technique is mostly
used in regression. Gradient boosting presents model building in stages, just like other boosting
methods, while allowing the generalization and optimization of differentiable loss functions.

7. ADA Boosting:

It is a technique used as an ensemble method in machine learning. Boosting is used to reduce
bias as well as variance in supervised learning. It works on the principle of learners growing
sequentially: 'n' decision trees are made during the data training period. As the first decision
tree/model is made, the records incorrectly classified by the first model are given priority, and
only these records are sent as input to the second model. The process goes on until the
specified number of base learners has been created.
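As an illustration of how these model families might be set up in code, the sketch below instantiates each of them with the scikit-learn library. The library choice and the hyperparameter values are assumptions for illustration, not taken from the project code; regression variants are shown because rainfall depth is a continuous quantity.

```python
# Illustrative sketch only: the models described above, instantiated in scikit-learn.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor)
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5),
    "Random Forest": RandomForestRegressor(n_estimators=100),
    "SVM (SVR)": SVR(kernel="rbf"),
    "KNN": KNeighborsRegressor(n_neighbors=5),   # averages the k nearest neighbours
    "Gradient Boosting": GradientBoostingRegressor(),
    "AdaBoost": AdaBoostRegressor(n_estimators=50),
}
```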

2.4 Concepts of Artificial Intelligence and Machine Learning

Artificial Intelligence:

Several definitions of artificial intelligence (AI) have surfaced over the last few decades. John
McCarthy offers the following definition in his 2004 paper: "It is the science and
engineering of making intelligent machines, especially intelligent computer programs. It is
related to the similar task of using computers to understand human intelligence, but AI does
not have to confine itself to methods that are biologically observable."

However, decades before this definition, the artificial intelligence conversation began with
Alan Turing's 1950 work "Computing Machinery and Intelligence". In this paper, Turing,
often referred to as the "father of computer science", asks the following question: "Can
machines think?" From there, he offers a test, now famously known as the "Turing Test",
where a human interrogator would try to distinguish between a computer and human text
response. While this test has undergone much scrutiny since its publication, it remains an
important part of the history of AI.

One of the leading AI textbooks is “Artificial Intelligence: A Modern Approach” by Stuart
Russell and Peter Norvig. In the book, they delve into four potential goals or definitions of
AI, which differentiate computer systems as follows:

Human approach:

1. Systems that think like humans


2. Systems that act like humans

Ideal approach:

1. Systems that think rationally


2. Systems that act rationally
Machine Learning:

Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy. Machine learning is an important component of the growing field of
data science.

Using statistical methods, algorithms are trained to make classifications or predictions, and to
uncover key insights in data mining projects. These insights subsequently drive decision
making within applications and businesses, ideally impacting key growth metrics. As big data
continues to expand and grow, the market demand for data scientists will increase. They will
be required to help identify the most relevant business questions and the data to answer them.

Machine learning algorithms are typically created using frameworks that accelerate solution
development, such as TensorFlow and PyTorch.

The learning system of a machine learning algorithm can be broken into three main parts.

1. A Decision Process: In general, machine learning algorithms are used to make a
prediction or classification. Based on some input data, which can be labelled or
unlabelled, your algorithm will produce an estimate about a pattern in the data.
2. An Error Function: An error function evaluates the prediction of the model. If there
are known examples, an error function can make a comparison to assess the
accuracy of the model.

3. A Model Optimization Process: If the model can fit better to the data points in the
training set, then weights are adjusted to reduce the discrepancy between the known
example and the model estimate. The algorithm will repeat this “evaluate and
optimize” process, updating weights autonomously until a threshold of accuracy has
been met.
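To make the "evaluate and optimize" loop concrete, the minimal sketch below fits a one-variable linear model by gradient descent: the decision process is the prediction w*x + b, the error function is the mean squared error, and the optimization step adjusts the weights to reduce it. The data, learning rate and iteration count are made up purely for illustration and are not part of the project.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2x + 1 with some noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

w, b, lr = 0.0, 0.0, 0.01              # initial weights and learning rate
for _ in range(2000):
    y_pred = w * x + b                 # 1. decision process: produce an estimate
    error = y_pred - y
    mse = np.mean(error ** 2)          # 2. error function: evaluate the prediction
    w -= lr * np.mean(2 * error * x)   # 3. optimization: adjust the weights to
    b -= lr * np.mean(2 * error)       #    reduce the discrepancy
print(f"w = {w:.2f}, b = {b:.2f}, final MSE = {mse:.3f}")
```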

While a lot of public perception of artificial intelligence centres around job losses, this
concern should probably be reframed. With every disruptive, new technology, we see that the
market demand for specific job roles shifts. For example, when we look at the automotive
industry, many manufacturers, like GM, are shifting to focus on electric vehicle production to
align with green initiatives. The energy industry isn’t going away, but the source of energy is
shifting from a fuel economy to an electric one.

In a similar way, artificial intelligence will shift the demand for jobs to other areas. There will
need to be individuals to help manage AI systems. The biggest challenge with artificial
intelligence and its effect on the job market will be helping people to transition to new roles
that are in demand.
3. DATA

The data used consists of the various parameters which affect rainfall. Humidity affects
rainfall because the air becomes saturated and cannot hold any additional water molecules.
Wind direction and speed change with time and drastically affect the pattern of rainfall.
High wind speeds cause the air to move rapidly, which makes evaporation occur at a
faster rate. The rate of evaporation is the amount of water being converted to vapour, which
will later condense to water at high altitudes. The amount of sunshine is another factor to
consider. The data are daily records from the years 2001 – 2008. The various climate
changes in Surathkal are noted and taken into consideration while recording the data. The
table below (Table 3.1) shows the data as well as the source of the data collected.

Data                   Source

Rainfall parameters    NITK Surathkal, Water Resources and Ocean Engineering Dept. data reports

Rainfall               NITK Surathkal, Water Resources and Ocean Engineering Dept. data reports

Table 3.1 Data and Sources of Data
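A sketch of how such a daily dataset might be loaded and inspected is given below. The file name and column names are assumptions for illustration only, not the actual file supplied by the department.

```python
import pandas as pd

# Hypothetical file and column names, assumed for illustration only.
df = pd.read_csv("surathkal_daily_weather_2001_2008.csv", parse_dates=["Date"])

print(df.shape)             # number of daily records and parameters
print(df.columns.tolist())  # e.g. MinTemp, MaxTemp, Humidity, WindGustSpeed,
                            #      Sunshine, Evaporation, Rainfall
print(df.isnull().sum())    # missing values per parameter
print(df.describe())        # basic statistics for each parameter
```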

4. STUDY AREA

Surathkal is one of the major localities in the northern part of Mangalore city, located on NH-
66 in the Dakshina Kannada district, Karnataka. Surathkal is located at 12°58'60" N, 74°46'60" E.
The maximum and minimum temperatures in a year vary between 37 °C and 25 °C, though the
ambient temperature has occasionally touched 40 °C during the summer season (usually March,
April and May) in the 21st century.

Mangalore is located on the western coast of India at 12.87°N 74.88°E in Dakshina Kannada
district, Karnataka state. It has an average elevation of 22 m (72 ft) above mean sea level.
Mangalore has a tropical monsoon climate and is under the direct influence of the Arabian
Sea branch of the southwest monsoon. It receives about 95 percent of its total annual rainfall
between May and September but remains extremely dry from December to March. Humidity is
approximately 75 percent on average and peaks during June, July and August. During this
time of year temperatures during the day stay below 34 °C (93 °F) and drop to about 19 °C
(66 °F) at night.
Fig 4.1 Location map of study area
5. METHODOLOGY

The methodology for developing the model follows the classical approach and consists of data
cleaning, building the model and testing it. This increases the efficiency of the process and
reduces the time and other resources spent.

The overview of the proposed Machine Learning model for predicting rainfall from the
parameters gathered in the dataset is shown in the figure below.

The methodology can be divided into four divisions:

1. Data pre-processing and Exploratory Data Analysis


2. Model training and validation
3. Model performance evaluation
4. Selection of best model and result analysis

Fig 5.1 Building Machine Learning model

5.1. Data pre-processing and Exploratory Data Analysis

Data pre-processing is an integral step in Machine Learning as the quality of data and the
useful information that can be derived from it directly affects the ability of our model to learn;
therefore, it is extremely important that we pre-process our data before feeding it into our
model. EDA is basically an approach to analysing datasets in order to summarize their main
characteristics. Some of the common EDA techniques are multivariate analysis, outlier
detection, and feature scaling. The steps in data pre-processing, illustrated in the sketch that
follows the list, are:
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding missing data
5. Encoding categorical data
6. Feature scaling
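The following sketch illustrates how steps 4–6 and the basic EDA plots might look in code. The file name, the column names (Rainfall, WindDir, MaxTemp) and the choice of scikit-learn and seaborn utilities are assumptions for illustration, not taken from the project code.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Load the (hypothetical) daily weather file; column names are assumed.
df = pd.read_csv("surathkal_daily_weather_2001_2008.csv")

# 4. Finding missing data: fill numeric gaps with each column's median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 5. Encoding categorical data, e.g. a wind-direction column, if one is present.
if "WindDir" in df.columns:
    df = pd.get_dummies(df, columns=["WindDir"], drop_first=True)

# 6. Feature scaling: bring all numeric predictors to a common 0-1 range.
predictors = df[numeric_cols].drop(columns=["Rainfall"])
X = MinMaxScaler().fit_transform(predictors)
y = df["Rainfall"].values

# Basic EDA plots of the kind shown in the figures below.
sns.heatmap(df[numeric_cols].corr(), annot=True)   # correlation heatmap
plt.show()
sns.boxplot(x=df["MaxTemp"])                       # outlier detection for one parameter
plt.show()
```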

Below are some of the diagrams taken from the code showing the multivariate analysis as well
as scatterplots and outlier detection.

Fig 5.2 Heatmap


Fig 5.3 Scatterplot between maximum temperature and minimum temperature

Fig.5.4 Scatterplot between evaporation and humidity


Fig.5.5 Outlier detection for maximum temperature

Fig.5.6 Outlier detection for minimum temperature


Fig.5.7 Outlier detection for wind gust speed

Fig.5.8 Outlier detection for sunshine hours


Fig.5.9 Outlier detection for humidity

Fig.5.10 Outlier detection for evaporation


Fig.5.11 Outlier detection for rainfall

Fig.5.12 Multivariate Analysis


5.2. Model training and validation

Model training is at the heart of the data science development lifecycle, where the data
science team works to fit the best weights and biases to an algorithm in order to minimize the
loss function over the prediction range. When a supervised learning technique is used,
model training creates a mathematical representation of the relationship between the
data features and a target label; in unsupervised learning, it creates a mathematical
representation of the relationships among the data features themselves. Model training is the
primary step in machine learning, resulting in a working model that can then be validated,
tested, and deployed. The steps in model training are:

1. Splitting the dataset


2. Selecting Algorithms to test
3. Hyperparameter tuning
4. Fit and tune models

The training model is used to run the input data through the algorithm to correlate the
processed output against the sample output. Later model validation is carried out.
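A sketch of the splitting and hyperparameter-tuning steps is given below, assuming scikit-learn and the X and y arrays prepared in the pre-processing sketch. The tuned model (KNN) and the parameter grid are only examples, not the project's final settings.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# 1. Splitting the dataset into training and testing portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2-4. Select an algorithm, tune its hyperparameters with cross-validation, and fit.
param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("Best k:", search.best_params_, "validation R2:", search.best_score_)
print("Test-set R2:", best_model.score(X_test, y_test))
```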

5.3. Model performance evaluation

In Machine Learning, models are only as useful as their quality of predictions; hence,
fundamentally our goal is not to create models but to create high-quality models with
promising predictive power. The performance is measured here by the Accuracy, the Root Mean
Squared Error (RMSE), the Relative Error (RE) and the Coefficient of Correlation.

1. Accuracy

Accuracy is, simply put, the total proportion of observations that have been correctly
predicted.

• TP represents the number of True Positives. This refers to the total number of
observations that belong to the positive class and have been predicted correctly.
• TN represents the number of True Negatives. This is the total number of observations
that belong to the negative class and have been predicted correctly.
• FP is the number of False Positives. It is also known as a Type 1 Error. This is the
total number of observations that have been predicted to belong to the positive class,
but instead belong to the negative class.
• FN is the number of False Negatives. It may be
referred to as a Type 2 Error. This is the total number of observations that
have been predicted to be a part of the negative class but instead belong to
the positive class.
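In terms of these counts, accuracy = (TP + TN) / (TP + TN + FP + FN). A brief sketch of how it could be computed for a rain / no-rain classification, assuming scikit-learn and example labels, is given below.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Example rain / no-rain labels (1 = rain, 0 = no rain); values are illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# confusion_matrix returns counts in the order TN, FP, FN, TP for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy, accuracy_score(y_true, y_pred))   # both give the same value
```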

2. Root Mean Squared Error (RMSE)

The Root-Mean-Square Error (RMSE) is a frequently used measure of the differences
between values (sample and population values) predicted by a model and the values
observed. In data science, RMSE has a double purpose:

1. To serve as a heuristic for training models

2. To evaluate trained models for usefulness / accuracy
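A minimal sketch of the RMSE computation, assuming NumPy, is given below; the formula is RMSE = sqrt(mean((observed - predicted)^2)), and the values passed in are examples only.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted rainfall values."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return np.sqrt(np.mean((observed - predicted) ** 2))

print(rmse([10.0, 0.0, 25.5], [8.0, 1.5, 30.0]))   # example values only
```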

3. Relative Error (RE)

Relative Error can be defined as the average value of the relative differences between the
observed and predicted values, taken with respect to the observed values. Using this measure,
we can judge the magnitude of the absolute error in terms of the actual size of the quantity
being measured. If the true value is not known, the relative error can be found using the
measured value. The relative error gives an indication of how good a measurement is relative
to the size of the quantity being measured.
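Following that description, a sketch of the mean relative error computation is given below, assuming NumPy and that no observed value is zero; the numbers are examples only.

```python
import numpy as np

def relative_error(observed, predicted):
    """Mean relative error, assuming no observed value is zero."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    return np.mean(np.abs(observed - predicted) / np.abs(observed))

print(relative_error([10.0, 20.0, 5.0], [9.0, 22.0, 4.5]))   # example values only
```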

4. Coefficient of correlation

The coefficient of correlation is a statistical concept which helps in establishing a relation
between the predicted and actual values obtained in a statistical experiment. The calculated
value of the correlation coefficient explains the closeness of the predicted values to the actual
values. It is the sum of the products of the deviations of each quantity from its respective mean,
divided by the product of the number of values in the set and the two standard deviations.
The numerical value of the correlation coefficient is a real number between -1 and +1. A
negative value of the coefficient indicates a negative correlation, and as 'r' approaches -1 the
negative relationship becomes stronger. As 'r' approaches +1, the relationship is strong and
positive, with +1 indicating a perfect positive correlation.
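The Pearson correlation coefficient described above can be computed directly; a sketch assuming NumPy and example values is shown below.

```python
import numpy as np

def correlation_coefficient(actual, predicted):
    """Pearson r: sum of products of deviations from the means, normalised so
    that the result always lies between -1 and +1."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    dev_a = actual - actual.mean()
    dev_p = predicted - predicted.mean()
    return np.sum(dev_a * dev_p) / np.sqrt(np.sum(dev_a ** 2) * np.sum(dev_p ** 2))

print(correlation_coefficient([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.9]))   # close to +1
```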

Fig.5.13 Positive correlation graph

Fig.5.14 Negative correlation graph

Fig.5.15 No correlation graph


5.4. Selection of best model and result analysis

Model selection is the process of choosing one among many candidate models for a
predictive modelling problem. There may be many competing concerns when performing
model selection beyond model performance, such as complexity, maintainability, and
available resources. Model selection is a process that can be applied both across different
types of models (e.g., logistic regression, SVM, KNN, etc.) and across models of the same
type configured with different model hyperparameters.

All models have some predictive error, given the statistical noise in the data, the
incompleteness of the data sample, and the limitations of each different model type. Using the
best-trained model with the selected hyperparameters and important variables, we can predict
the rainfall with the given parameters in the dataset.
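One common way to carry out this comparison is k-fold cross-validation over the candidate models, keeping the one with the best average score. A sketch is shown below, assuming scikit-learn together with the X, y arrays and the models dictionary introduced in the earlier sketches; the fold count and scoring metric are examples only.

```python
from sklearn.model_selection import cross_val_score

# Compare every candidate model with 10-fold cross-validation and keep the best.
scores = {
    name: cross_val_score(model, X, y, cv=10, scoring="r2").mean()
    for name, model in models.items()
}
best_name = max(scores, key=scores.get)
print(scores)
print("Selected model:", best_name)
```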

6. Results

After testing various models such as KNN, Linear Regression, SVM, etc., together with their
respective K-fold cross-validations, the accuracy of each model, along with other error
measures, has been calculated. Three trials have been carried out and the differences in the
accuracies of the models have been noted. Different K-values have also been used, and the
contribution of the K-value has been taken into consideration.
Below are the tables obtained after running the code successfully.

Fig.6.1 Trial 1
Fig.6.2 Trial 2

Fig.6.3 Trial 3

7. Observation

It is observed that the KNN K-fold and the AdaBoost Regressor K-fold models have the
highest accuracies. The KNN K-fold model has an accuracy of around 84.9% in every trial,
and the AdaBoost regressor has an accuracy of around 74%; both could be improved further
by hyperparameter tuning. The remaining models show lower accuracies, largely because of
overfitting. If hyperparameter tuning is done, the accuracies of all models can be brought
above 80%.
8. Conclusion

The KNN K-fold model has the highest accuracy and can be adopted for rainfall prediction
with some hyperparameter tuning. The parameters used for predicting rainfall are appropriate,
but overfitting needs to be accounted for.
9. References
