
INDUSTRIAL TRAINING REPORT

Prediction Of Covid-19 Cases Using Machine Learning

Submitted in partial fulfillment of the


Requirements for the award of
Degree of Bachelor of Technology
in
Computer Science & Engineering

Submitted By

Name: Sheetal Gupta


University Reg No. : 11018210053

SUBMITTED TO:

Department of Computer Science & Engineering


SRM UNIVERSITY DELHI-NCR, SONIPAT
HARYANA-131029
DECLARATION

I hereby declare that the Industrial Training Report entitled "Prediction Of Covid-19
Cases Using Machine Learning" is an authentic record of my own work carried out as a requirement of
Industrial Training during the period from 1st May 2020 to 1st July 2020 for the award of
the degree of B.Tech. (Computer Science & Engineering), SRM University,
Delhi-NCR, Haryana, under the guidance of the Verzeo team and mentor Smith Shah.

Date: 4 October 2020 Sheetal Gupta


11018210053
Acknowledgement

We are greatly indebted to the authorities of SRM University, Delhi-NCR, Sonepat,


Haryana for providing us all the guidance to successfully carry out this minor project
work titled “Prediction of Covid-19 Cases Using Machine Learning”.

We would like to express our great gratitude towards our supervisor Mr. N Ganesh
Kumar, who has given us support and suggestions. Without his help we could not have
presented this dissertation up to the present standard. We also take this opportunity to
thank all others who supported us in the project or in other aspects of our
study.

Finally, I would like to express my heartfelt thanks to my parents who were very
supportive both financially and mentally and for their encouragement to achieve my set
goals.

Mr. N. Ganesh Kumar


Assistant Professor
Computer Science and Engineering

About The Company

Verzeo is one of the leading online platforms which provide internships and certification
to the students. Verzeo is an AI-Based Online learning platform that provides students
with a holistic learning experience to help make them industry-ready. With access to the
Industry Experts, Online Courses and blended learning, it allows students to Learn Here
and Lead Anywhere.

Degrees tend to teach theoretical concepts in classrooms. Students graduate with blind
spots and absolutely no practical exposure to the job environment. We, at Verzeo, bridge
the gap between classroom and workplace with our flagship Internship Programs.

We help students achieve a more holistic education and prepare them for better career
opportunities. Verzeo acts as an invisible mentor to students, creating channels to unleash
their learning potential. It provides access to a wide variety of training programs,
hackathons, and projects. These programs are interactive, collaborative and give access to
mentors and experts. With Verzeo, you can find Internships and Job Opportunities
seamlessly.

Verzeo has collaborated with technical moguls to create an immersive platform. With AI-
based software at its core, it offers a connected ecosystem accessible from anywhere and
by anyone.

Learning through Verzeo is fun, interactive and practical, empowering students to Lead
Anywhere, Anyplace and Everywhere.

 VISION: Help students, globally, realize their full potential.


 MISSION: To provide AI-enabled real-time insights, with the aim of
reaching a broad community of individuals and helping them acquire the skills they want.
 GOAL: Enabling and facilitating Artificial Intelligence in the academic space.
Increase outreach to a diverse student community.

About The Internship

The whole program was a training-cum-internship program. The whole online program was
spread over two months. In the first four weeks I underwent training in which I was
trained in Advanced Python, Exploratory Data Analysis, Statistics, various
algorithms of supervised learning and unsupervised learning, and the basics of Natural
Language Processing. We were provided with a mentor to train us in the various fields
of machine learning.

The second month was the internship period. In this period I was assigned two projects, a minor
and a major, on which I needed to work. In the first week I was given the
minor project, i.e. “To Predict the Quality of Red Wine”.

After the completion of the minor project I was assigned the major project, which is
discussed further in this report with all its major and minor aspects. Both projects
were done individually, and this really gave me an industry-like experience.

This complete internship-cum-training program gave me a lot of experience and I
definitely polished my skills in the field of machine learning. The whole project gave
me a complete understanding of a real-world problem scenario.

Table Of Contents

1. Introduction
   1.1 Overview
   1.2 Purpose
   1.3 Problem Statement
   1.4 About the Dataset
   1.5 Tasks Assigned
       1.5.1 Loading Dataset
       1.5.2 Subset the Data
       1.5.3 Univariate Analysis
       1.5.4 Bivariate Analysis
       1.5.5 Handle Missing Values
       1.5.6 Handle Datetime Column
       1.5.7 Dropping Columns
       1.5.8 Target Variable
       1.5.9 Modeling
       1.5.10 Accuracy

2. Modeling
   2.1 Exploratory Data Analysis
   2.2 Data Cleaning
       2.2.1 Dropping Unwanted Columns
       2.2.2 Missing Values
   2.3 Handling Outliers
   2.4 Data Modeling
   2.5 Accuracy

3. Tools And Technology Used
   3.1 Hardware
   3.2 Software
       3.2.1 Jupyter Notebook
       3.2.2 Python 3
             a. NumPy
             b. Pandas
             c. Matplotlib
             d. Pyplot
             e. Seaborn
             f. Scikit-Learn
       3.2.3 Machine Learning
       3.2.4 Supervised Learning
             a. Linear Regression
             b. Random Forest Regressor

4. Snapshots

5. Results And Discussions
   5.1 Challenges Faced

6. Conclusion

7. Future Scope

8. References
List Of Tables

Table 1.1 Name and Descriptions of Columns

Table 2.1 correlation matrix values for dependent columns

List Of Figures

Fig 2.1 Distplot for female_smoker

Fig 2.2 Distplot for diabetes_prevalence

Fig 2.3 Distplot for population

Fig 2.4 Distplot for new_deaths

Fig 2.5 jointplot for date and total_cases column

Fig 2.6 jointplot for new_tests and total_cases column

Fig 2.7 jointplot for new_deaths and 65_aged_older column

Fig 2.8 Number of null values in each column

Fig 2.9 before outliers are dropped

Fig 2.10 after outliers are dropped

Fig 2.11 median age not having outlier

Fig 2.12 before outliers are dropped

Fig 2.13 after outliers are dropped

Fig 2.14 Modeling

Fig 3.1 Jupyter Notebook interface


Fig 3.2 Jupyter Notebook Dashboard
Fig 3.3 Linear Regression
Fig 3.4 Random Forest structure
Fig 6.1 workflow of machine learning
Introduction To Project

In this section everything about the project is explained in detail: how I carried out
the project and what tasks were assigned.

1.1 Overview
The major project was the main part of the whole internship. The project was based on
supervised machine learning. The topic provided for the project was
“Prediction of Covid-19 Cases Using Machine Learning”. The project was done
individually, under the guidance of a mentor when necessary.
The project was divided into various parts, starting from loading the data set and ending
with finding the accuracy of the whole model.

I was asked to build two models, i.e. Random Forest Regressor and Linear Regression,
and to choose the one with the higher accuracy.

The project included Exploratory Data Analysis, which was further divided into Bivariate
and Univariate Analysis. The analysis was done by plotting different graphs and
retrieving the important information from the dataset. Handling missing values and
outliers was another part which took a lot of hard work.
After this whole process I moved on to the modeling part, where the actual
machine learning plays its role. Modeling was done by splitting the data set and fitting the models.

Once the model was built, I was supposed to calculate the accuracy using the proper
metrics. In between, this included the target variable and training variables on which I
worked.
Based on the accuracy, one model was chosen and I needed to make a prediction for one test case.

As the data set provided was a real-time data set, it was difficult to
structure it properly and retrieve the proper relationships between the different variables.
1.2 Purpose

The purpose of the prediction model is to predict Covid-19 cases based on the
previous data fed into the machine. The machine will find the accurate patterns and
relationships within the data and will provide predictions of the future with good
accuracy. The predictions will be based purely upon the information that has been
given to the model.

1.3 Problem Statement

The problem statement was to feed the model with the data set to predict Covid-19 cases.
The problem statement also included handling the missing values and doing proper analysis on
the data so that it becomes clean and structured.
Measuring the accuracy using two models and then increasing the accuracy so that no
overfitting takes place while the machine learns was also a part of the problem
statement that was given.

1.4 About the dataset

The data set was based on the previous data which has been recorded by governments
for Covid-19. The data set can be found using the link below.
The data set has 34 columns and 29591 rows.

Dataset source: https://covid.ourworldindata.org/data/owid-covid-data.csv

Table 1.1 Columns and description

Column Name - Description

iso_code - ISO 3166-1 alpha-3 three-letter country code
continent - Continent of the geographical location
location - Geographical location
date - Date of observation
total_cases - Total confirmed cases of COVID-19
new_cases - New confirmed cases of COVID-19
gdp_per_capita - Gross domestic product at purchasing power parity (constant 2011 international dollars), most recent year available
total_deaths - Total deaths attributed to COVID-19
new_deaths - New deaths attributed to COVID-19
extreme_poverty - Share of the population living in extreme poverty, most recent year available since 2010
total_cases_per_million - Total confirmed cases of COVID-19 per 1,000,000 people
new_cases_per_million - New confirmed cases of COVID-19 per 1,000,000 people
cvd_death_rate - Death rate from cardiovascular disease in 2017 (annual number of deaths per 100,000 people)
total_deaths_per_million - Total deaths attributed to COVID-19 per 1,000,000 people
new_deaths_per_million - New deaths attributed to COVID-19 per 1,000,000 people
diabetes_prevalence - Diabetes prevalence (% of population aged 20 to 79) in 2017
total_tests - Total tests for COVID-19
new_tests - New tests for COVID-19
new_tests_smoothed - New tests for COVID-19 (7-day smoothed)
total_tests_per_thousand - Total tests for COVID-19 per 1,000 people
new_tests_per_thousand - New tests for COVID-19 per 1,000 people
new_tests_smoothed_per_thousand - New tests for COVID-19 (7-day smoothed) per 1,000 people
tests_units - Units used by the location to report its testing data
stringency_index - Government Response Stringency Index: composite measure based on 9 response indicators including school closures, workplace closures, and travel bans, rescaled to a value from 0 to 100 (100 = strictest response)
population - Population in 2020
population_density - Number of people divided by land area, measured in square kilometers, most recent year available
median_age - Median age of the population, UN projection for 2020
aged_65_older - Share of the population that is 65 years and older, most recent year available
aged_70_older - Share of the population that is 70 years and older in 2015
female_smokers - Share of women who smoke, most recent year available
male_smokers - Share of men who smoke, most recent year available
handwashing_facilities - Share of the population with basic handwashing facilities on premises, most recent year available
hospital_beds_per_thousand - Hospital beds per 1,000 people, most recent year available since 2010
life_expectancy - Life expectancy at birth in 2019

1.5 Tasks Assigned

1.5.1 Loading dataset


Loading the dataset is the first and foremost thing which a machine learning engineer
does.
Without a dataset we cannot perform any operations and we cannot move ahead. The dataset was
downloaded from the given source and was then read using a pandas function.
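A minimal sketch of this step, assuming the CSV has been saved locally under the name owid-covid-data.csv (it can equally be read directly from the source URL given above):

import pandas as pd

# Read the downloaded CSV into a DataFrame.
df = pd.read_csv("owid-covid-data.csv")

# Quick sanity checks: shape and the first few rows.
print(df.shape)   # roughly (29591, 34) for the snapshot used in this report
print(df.head())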

1.5.2 Subset the data


This is an optional part and is done if needed by the engineer. This part includes
subsetting the data which can be used further for predictions. I was supposed to subset only
those rows that have “India” in the “location” column.
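A sketch of this subsetting, assuming the DataFrame df loaded earlier:

# Keep only the observations whose location is India.
india_df = df[df["location"] == "India"].copy()
print(india_df.shape)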

1.5.3 Univariate Analysis


Univariate analysis is a part of data analysis.
a. Draw histograms of each numerical variable.
b. Find mean, median and mode of each column.
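A possible sketch of these two tasks, assuming the india_df subset from the previous step:

import matplotlib.pyplot as plt

# Histograms of every numerical column.
numeric_cols = india_df.select_dtypes(include="number").columns
india_df[numeric_cols].hist(figsize=(20, 15), bins=30)
plt.tight_layout()
plt.show()

# Mean, median and mode of each numerical column.
print(india_df[numeric_cols].mean())
print(india_df[numeric_cols].median())
print(india_df[numeric_cols].mode().iloc[0])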

1.5.4 Bivariate Analysis


a. Draw scatter plots of each numerical column versus one another
b. Draw line plots of each numerical column versus one another
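A sketch of the scatter-plot part of this task; seaborn's pairplot is one convenient way to plot numerical columns against one another (the selection of columns below is an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots for a few numerical columns.
cols = ["total_cases", "new_cases", "new_tests", "total_deaths"]
sns.pairplot(india_df[cols].dropna())
plt.show()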

1.5.5 Handle Missing Values


a. If there are null values in a numerical column, replace the null values with the mean of
that column.
b. If there are null values in a categorical column, replace the null values with the mode of
that column.
c. If more than 50% of the values in a column are null, then drop that entire column.
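A sketch of these three rules, assuming the india_df DataFrame used so far:

import pandas as pd

# Rule c: drop columns that are more than 50% null.
null_share = india_df.isnull().mean()
india_df = india_df.drop(columns=null_share[null_share > 0.5].index)

# Rules a and b: fill numerical nulls with the mean, categorical nulls with the mode.
for col in india_df.columns:
    if india_df[col].isnull().any():
        if pd.api.types.is_numeric_dtype(india_df[col]):
            india_df[col] = india_df[col].fillna(india_df[col].mean())
        else:
            india_df[col] = india_df[col].fillna(india_df[col].mode()[0])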

4
1.5.6 Handle Datetime Column
Convert the date column to ordinal. A datetime column is a special kind of column and cannot
be handled as it is; that is why it is necessary to convert the datetime column into ordinal
form.
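A sketch of this conversion, assuming the date column of india_df still holds date strings:

import pandas as pd

# Parse the strings as datetimes, then map each date to its ordinal number.
india_df["date"] = pd.to_datetime(india_df["date"])
india_df["date"] = india_df["date"].map(pd.Timestamp.toordinal)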

1.5.7 Dropping Columns


Drop all categorical columns.
Categorical columns usually have less importance, but this is completely dependent on the
data set. Here the categorical columns which had no importance and which would
not have any effect on the target column were dropped.
The categorical columns which are of use can be converted to numeric columns using the map
and astype functions.
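A sketch of both options; the mapping shown in the comment is purely illustrative:

# Drop every remaining categorical (object-typed) column.
categorical_cols = india_df.select_dtypes(include="object").columns
india_df = india_df.drop(columns=categorical_cols)

# Alternatively, a useful categorical column could be encoded instead, e.g.:
# india_df["tests_units"] = india_df["tests_units"].map(
#     {"people tested": 0, "samples tested": 1}).astype(float)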

1.5.8 Target Variable


The target column is the column that needs to be predicted. All the columns besides the target
column are features. The target column depends on the feature columns. In this project I
was supposed to predict the number of cases, so I chose the total_cases column as the
target variable.
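A sketch of separating the target from the features, assuming the cleaned india_df:

# total_cases is the target; every other column is treated as a feature.
X = india_df.drop(columns=["total_cases"])
y = india_df["total_cases"]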

1.5.9 Modeling
The first step in modeling is to split the data into a train set and a test set. The training set is used to
train the model and make it learn, and the test set is used to make predictions with the model.
After the splitting, the model is built using an algorithm. The algorithms used here are:
a. Linear Regression
b. Random Forest Regressor
Both algorithms are discussed further with proper mathematical formulation; a short sketch of this step follows.
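A minimal sketch, assuming the X and y defined above (the 80/20 split and the random_state value are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit both models on the training data.
lin_reg = LinearRegression().fit(X_train, y_train)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)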

1.5.10 Accuracy
The accuracy of the model tells us how efficient our model is and how
good its predictions are.
Accuracy is often measured using an accuracy score, but there are also other methods and metrics.
Here I have used R-square for Linear Regression and the root mean squared error metric for
the Random Forest Regressor. Both metrics are explained in detail in the next chapter.
By measuring the accuracy we can also conclude whether our model is underfitted or
overfitted.
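A sketch of computing the two metrics, assuming the fitted models and the split from the previous sketch:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# R-square for the linear regression predictions.
r2 = r2_score(y_test, lin_reg.predict(X_test))

# Root mean squared error for the random forest predictions.
rmse = np.sqrt(mean_squared_error(y_test, rf_reg.predict(X_test)))

print("Linear Regression R^2:", r2)
print("Random Forest RMSE:", rmse)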

Machine Learning Modeling

A machine learning model requires a sequence of steps to be performed to make it efficient.


In this chapter the proper steps that are needed to build a good model are explained, and
all the functions that are used in my project are properly explained with their
mathematical formulation.

2.1 Exploratory Data Analysis

Approach for EDA


Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical) to

1. maximize insight into a data set;


2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.

Philosophy for EDA


EDA is not identical to statistical graphics although the two terms are used almost
interchangeably. Statistical graphics is a collection of techniques--all graphically based
and all focusing on one data characterization aspect. EDA encompasses a larger venue;
EDA is an approach to data analysis that postpones the usual assumptions about what
kind of model the data follow with the more direct approach of allowing the data itself to
reveal its underlying structure and model. EDA is not a mere collection of techniques;
EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and
how we interpret. It is true that EDA heavily uses the collection of techniques that we call
"statistical graphics", but it is not identical to statistical graphics per se.

Techniques for EDA


Most EDA techniques are graphical in nature with a few quantitative techniques. The
reason for the heavy reliance on graphics is that by its very nature the main role of EDA
is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so,
enticing the data to reveal its structural secrets, and being always ready to gain some new,
often unsuspected, insight into the data. In combination with the natural pattern-
recognition capabilities that we all possess, graphics provides, of course, unparalleled
power to carry this out.
The particular graphical techniques employed in EDA are often quite simple, consisting
of various techniques of:

1. Plotting the raw data (such as data traces, histograms, probability plots, lag
plots, block plots, etc.).
2. Plotting simple statistics such as mean plots, standard deviation plots, box plots,
and main effects plots of the raw data.
3. Positioning such plots so as to maximize our natural pattern-recognition abilities,
such as using multiple plots per page.

There are some plots which I have plotted in the project to get some information out of
the data.

They are as follows:

Fig 2.1 Distplot for female_smoker Fig 2.2 Distplot for diabetes_prevalence

Shown above are the distplots for two columns.

The distplot represents the univariate distribution of data, i.e. the distribution of a
variable plotted against its density. The seaborn distplot() function accepts the data
variable as an argument and returns the plot with the density distribution.
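A sketch of how such a plot can be drawn, assuming the DataFrame df loaded earlier (in recent seaborn versions histplot with kde=True plays the role of the older distplot used for the figures in this report):

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of female_smokers with a kernel density estimate overlaid.
sns.histplot(df["female_smokers"].dropna(), kde=True)
plt.show()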

Fig 2.3 Distplot for population Fig 2.4 Distplot for new_deaths

The above graphs show which values have the highest density in each of the two columns.

A jointplot is seaborn-specific and can be used to quickly visualize and analyze
the relationship between two variables and describe their individual distributions on the
same plot. Apart from this, jointplot can also be used to plot a 'kde', 'hex plot', or
'residual plot'.
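A sketch of the joint plot between date and total_cases shown below (it assumes the date column has already been converted to a numeric, e.g. ordinal, form):

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of total_cases against date, with the marginal distributions on the sides.
sns.jointplot(x="date", y="total_cases", data=india_df)
plt.show()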

Fig 2.5 jointplot for date and total_cases column

From the above joint plot we can conclude that as the date increases, the number of total cases
also increases at a rapid rate. The individual distributions can also be seen in the bars
present at the edges of the plot.

Fig 2.6 jointplot for new_tests and total_cases column

From the above plot we can conclude that the total number of cases is increasing at a higher
rate, while new_tests are increasing at a lower rate.

Fig 2.7 jointplot for new_deaths and 65_aged_older column

From the above graph we can conclude that as age increases the death rate increases, i.e. older
people have a higher death rate than younger people. The point shown below the main cluster is an
outlier, which is dealt with in the outliers section.

2.2 Data Cleaning


Data cleaning is the process of preparing data for analysis by removing or modifying data
that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted. This data is
usually not necessary or helpful when it comes to analyzing data because it may hinder
the process or provide inaccurate results. There are several methods for cleaning data
depending on how it is stored along with the answers being sought. Data cleaning is not
simply about erasing information to make space for new data, but rather finding a way to
maximize a data set’s accuracy without necessarily deleting information. For one, data
cleaning includes more actions than removing data, such as fixing spelling and syntax
errors, standardizing data sets, and correcting mistakes such as empty fields, missing
codes, and identifying duplicate data points. Data cleaning is considered a foundational
element of the data science basics, as it plays an important role in the analytical process
and uncovering reliable answers. Most importantly, the goal of data cleaning is to create
data sets that are standardized and uniform to allow business intelligence and data
analytics tools to easily access and find the right data for each query.

2.2.1 Removing Unwanted rows and columns

In this project the data is cleaned in the form of rows and columns.


The columns which do not have an impact on the target column are removed, and the
columns which have more than 50% null values are also removed because they are of no
use for the analysis.
The decision to drop a column is based on two factors:
a. The column should not have a strong effect on the target column.
b. According to the correlation matrix, if two columns have a strong positive
relationship, then one of the columns should be removed.
Table 2.1 correlation matrix values for dependent columns

Column 1                    Column 2         Correlation value

total_cases                 total_deaths     0.98
new_cases                   total_cases      0.95
new_cases                   total_deaths     0.94
total_tests_per_thousand    total_cases      0.95
new_tests_smoothed          total_cases      0.94
total_cases                 new_tests        0.98
new_tests_smoothed          new_tests        0.98
aged_65_older               aged_70_older    0.97

A correlation matrix is a table showing correlation coefficients between variables.


Each cell in the table shows the correlation between two variables. A correlation
matrix is used to summarize data, as an input into a more advanced analysis, and as a
diagnostic for advanced analyses.
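A sketch of how such a matrix can be computed with pandas (restricted here to the numeric columns of the DataFrame used in the project):

# Correlation matrix of all numeric columns.
corr = india_df.select_dtypes(include="number").corr()

# Columns most strongly correlated with the target column.
print(corr["total_cases"].abs().sort_values(ascending=False).head(10))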

List Of Dropped Columns

 continent
 iso_code
 total_deaths
 total_cases_per_million
 new_cases_per_million
 total_deaths_per_million
 total_tests
 new_deaths_per_million
 new_tests_smoothed
 new_tests_smoothed_per_thousand
 aged_70_older
 cvd_death_rate
 hospital_beds_per_thousand

The categorical columns are also dropped because they did not have any effect on the
target column.

2.2.2 Handling Missing Values

Missing values are the most important part of data cleaning; until and unless the
missing values are filled, the data cannot be fitted into the machine.

In statistics, missing data, or missing values, occur when no data value is stored for the
variable in an observation. Missing data are a common occurrence and can have a
significant effect on the conclusions that can be drawn from the data. Missing data can
be handled similarly to censored data.

a. For the categorical columns:


For categorical columns having classes the missing values are to be filled by
using their mode.
b. For numerical columns (continuous data):
For the columns having the continuous data missing values can be filled using
their mean.

Below are the numbers of null values which were present in each column of the data set
after dropping the unnecessary columns.

Fig 2.8 Number of null values in each column

2.3 Handling Outliers


An outlier is an observation that lies an abnormal distance from other values in a random
sample from a population. In a sense, this definition leaves it up to the analyst (or a
consensus process) to decide what will be considered abnormal. Before abnormal
observations can be singled out, it is necessary to characterize normal observations.

In this project the outliers are detected using box plots, handled using the interquartile
range (IQR), and then removed.

The box plot is a useful graphical display for describing the behavior of the data in the
middle as well as at the ends of the distributions. The box plot uses the median and the
lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile
is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile
range, or IQR.

A box plot is constructed by drawing a box between the upper and lower quartiles with a
solid line drawn across the box to locate the median. The following quantities
(called fences) are needed for identifying extreme values in the tails of the distribution:

1. lower inner fence: Q1 - 1.5*IQR

2. upper inner fence: Q3 + 1.5*IQR
3. lower outer fence: Q1 - 3*IQR
4. upper outer fence: Q3 + 3*IQR
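A sketch of applying the inner fences to one column; population_density is used here only as an example, and the same code can be repeated for the other columns treated this way:

# Interquartile range of the column.
q1 = india_df["population_density"].quantile(0.25)
q3 = india_df["population_density"].quantile(0.75)
iqr = q3 - q1

# Keep only the rows that fall inside the inner fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
india_df = india_df[(india_df["population_density"] >= lower) &
                    (india_df["population_density"] <= upper)]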

Below are the box plots for some columns before and after the outliers are dropped.

Fig 2.9 before outliers are dropped Fig 2.10 after outliers are dropped

The above box plots show the outliers. The outliers remaining in the second figure are
continuous and large in number, so according to the IQR rule we do not consider them to be outliers.

Fig 2.11 median age not having outlier


As we can see, the median_age column does not have any outliers, and therefore there
is no need to handle them.

Fig 2.12 before outliers are dropped Fig 2.13 after outliers are dropped

As we can see, population_density had many outliers, and they could overfit our
model, so we removed them; the second figure shows the cleaned column of the data.

2.4 Data Modeling

The outputs of prediction and feature engineering are a set of label times, historical
examples of what we want to predict, and features, predictor variables used to train a
model to predict the label. The process of modeling means training a machine learning
algorithm to predict the labels from the features, tuning it for the business need, and
validating it on holdout data.

Fig 2.14 Modeling

The output from modeling is a trained model that can be used for inference, making
predictions on new data points. Similar to feature engineering, modeling is independent of
the previous steps in the machine learning process and has standardized inputs, which
means we can alter the prediction problem without needing to rewrite all our code. If the
business requirements change, we can generate new label times, build corresponding
features, and input them into the model.

Modeling is divided into 3 parts.


a. Splitting
Split your dataset into train and test datasets using scikit-learn.
test_size: this parameter decides the size of the data that has to be split off as
the test dataset.
train_size: you have to specify this parameter only if you are not specifying the
test_size.
random_state: here you pass an integer, which will act as the seed for the
random number generator during the split.

b. Training
Training a model simply means learning (determining) good values for all the
weights and the bias from labeled examples. In supervised learning, a machine
learning algorithm builds a model by examining many examples and attempting
to find a model that minimizes loss; this process is called empirical risk
minimization.

c. Testing
In testing we measure the accuracy of the model and find out how efficient our
model is.

2.5 Accuracy
The accuracy of a machine learning algorithm is one way to measure how often the
algorithm classifies a data point correctly. Accuracy is the number of correctly predicted
data points out of all the data points.
Accuracy gives us a measure of how efficient our model is.

The metrics used for the accuracy check in this project are:

For linear regression – R square

R-squared is a statistical measure that represents the goodness of fit of a regression


model. The ideal value for R-square is 1. The closer the value of R-square is to 1, the better
the model fits.
R-square is a comparison of residual sum of squares (SSres) with total sum of
squares(SStot). Total sum of squares is calculated by summation of squares of
perpendicular distance between data points and the average line.

The residual sum of squares is calculated by the summation of squares of the perpendicular


distance between data points and the best fitted line.

R-square is calculated by using the following formula:

R² = 1 - (SSres / SStot)

where SSres is the residual sum of squares and SStot is the total sum of squares.
The goodness of fit of regression models can be analyzed on the basis of the R-square
method: the closer the value of R-square is to 1, the better the model.

Limitation of using R-square method –


 The value of r-square always increases or remains same as new variables are added
to the model, without detecting the significance of this newly added variable (i.e
value of r-square never decreases on addition of new attributes to the model). As a
result, non-significant attributes can also be added to the model with an increase in
r-square value.
 This is because SStot is always constant and regression model tries to decrease the
value of SSres by finding some correlation with this new attribute and hence the
overall value of r-square increases, which can lead to a poor regression model.

For Random Forest Regressor – Root mean squared error

Root mean squared error (RMSE) is the square root of the mean of the squares of all the
errors. The use of RMSE is very common, and it is considered an excellent general-
purpose error metric for numerical predictions.
RMSE = sqrt( (1/n) * Σ_{i=1..n} (Si - Oi)² )
where Oi are the observations, Si predicted values of a variable, and n the number of
observations available for analysis. RMSE is a good measure of accuracy, but only to
compare prediction errors of different models or model configurations for a particular
variable and not between variables, as it is scale-dependent.

This tells us heuristically that RMSE can be thought of as some kind of (normalized)
distance between the vector of predicted values and the vector of observed values.
But why are we dividing by n under the square root here? If we keep n (the number of
observations) fixed, all it does is rescale the Euclidean distance by a factor of √(1/n). It’s a
bit tricky to see why this is the right thing to do, so let’s delve in a bit deeper.
Imagine that our observed values are determined by adding random “errors” to each of the
predicted values, as follows:

yᵢ = ŷᵢ + εᵢ,   for i = 1, …, n

These errors, thought of as random variables, might have Gaussian distribution with mean
μ and standard deviation σ, but any other distribution with a square-integrable PDF
(probability density function) would also work. We want to think of ŷᵢ as an underlying
physical quantity, such as the exact distance from Mars to the Sun at a particular point in
time. Our observed quantity yᵢ would then be the distance from Mars to the Sun as we
measure it, with some errors coming from mis-calibration of our telescopes and
measurement noise from atmospheric interference.

Tools and Technology Used

3.1 Hardware

Since the computational aspect of the project is of importance to the model, it is


important to know the hardware that was used in the evaluation process. The training and
evaluation of the model was done on a Windows 10 computer using a quad-core
CPU at 3.4 GHz. Let us also note that the final experimental setup will be generating
events at frequencies of 2 or 20 MHz, which is relevant in terms of the number of data points to
process per second.

3.2 Software
3.2.1 Jupyter Notebook

Project Jupyter started as a spin-off from the IPython project in 2014. IPython’s language-
agnostic features were moved under the name Jupyter. The name is a reference to the core
programming languages supported by Jupyter, which are Julia, Python and R. Products
under the Jupyter project are intended to support interactive data science and scientific
computing.

The Jupyter project consists of various products, described below:

 IPykernel − This is a package that provides IPython kernel to Jupyter.

 Jupyter client − This package contains the reference implementation of the


Jupyter protocol. It is also a client library for starting, managing and
communicating with Jupyter kernels.

 Jupyter notebook − This was earlier known as IPython notebook. This is a web
based interface to IPython kernel and kernels of many other programming
languages.

 Jupyter kernels − Kernel is the execution environment of a programming


language for Jupyter products.
The IPython notebook was developed by Fernando Perez as a web-based front end to the IPython
kernel. In an effort to make an integrated interactive computing environment for
multiple languages, the Notebook project was moved under Project Jupyter, providing a front
end for the programming environments Julia and R in addition to Python.

A notebook document consists of rich text elements with HTML formatted text, figures,
mathematical equations, etc. The notebook is also an executable document consisting of
code blocks in Python or other supported languages.

Jupyter notebook is a client-server application. The application starts the server on the local
machine and opens the notebook interface in a web browser, where it can be edited and run.
The notebook is saved as an ipynb file and can be exported as HTML, PDF and LaTeX
files.

Fig 3.1 Jupyter Notebook interface

You can easily install Jupyter notebook application using pip package manager.

pip3 install jupyter

To start the application, use the following command in the command prompt window.

c:\python36>jupyter notebook

The server application starts running at the default port number 8888, and a browser window

opens to show the notebook dashboard.

Fig 3.2 Jupyter Notebook Dashboard

Observe that the dashboard shows a dropdown near the right border of the browser with an
arrow beside the New button. It contains the currently available notebook kernels. Now
choose Python 3, and a new notebook opens in a new tab. An input cell similar to
that of the IPython console is displayed.

You can execute any Python expression in it. The result will be displayed in the Out
cell.

Jupyter notebooks have three particularly strong benefits:

 They’re great for showcasing your work. You can see both the code and the
results. The notebooks at Kaggle are a particularly great example of this.

 It’s easy to use other people’s work as a starting point. You can run cell by cell to
get a better understanding of what the code does.
 They’re very easy to host server-side, which is useful for security purposes. A lot of data is
sensitive and should be protected, and one of the steps toward that is that no data is
stored on local machines. A server-side Jupyter Notebook setup gives you that for
free.

When prototyping, the cell-based approach of Jupyter notebooks is great. But you quickly
end up programming several steps — instead of looking at object-oriented programming.

Downsides of Jupyter notebooks

When we’re writing code in cells instead of functions/classes/objects, you quickly end up
with duplicate code that does the same thing, which is very hard to maintain.

You also don’t get the support of a powerful IDE.

Consequences of duplicate code:

 It’s hard to actually collaborate on code with Jupyter — as we’re copying snippets
from each other it’s very easy to get out of sync

 Hard to maintain one version of the truth. Which one of these notebooks has the one
true solution to the number of xyz?

3.2.2 Python3

Python is a high-level, interpreted, interactive and object-oriented scripting language.


Python is designed to be highly readable. It uses English keywords frequently, whereas
other languages use punctuation, and it has fewer syntactical constructions than other
languages.

Python is a MUST for students and working professionals to become a great Software
Engineer especially when they are working in Web Development Domain. I will list
down some of the key advantages of learning Python:

 Python is Interpreted − Python is processed at runtime by the interpreter. You


do not need to compile your program before executing it. This is similar to PERL
and PHP.

 Python is Interactive − You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.

 Python is Object-Oriented − Python supports Object-Oriented style or


technique of programming that encapsulates code within objects.

 Python is a Beginner's Language − Python is a great language for the beginner-


level programmers and supports the development of a wide range of applications
from simple text processing to WWW browsers to games.

Characteristics of Python

Following are important characteristics of python −

 It supports functional and structured programming methods as well as OOP.

 It can be used as a scripting language or can be compiled to byte-code for


building large applications.

 It provides very high-level dynamic data types and supports dynamic type
checking.

 It supports automatic garbage collection.

 It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

Applications of Python

As mentioned before, Python is one of the most widely used languages on the web. I'm
going to list a few of them here:

 Easy-to-learn − Python has few keywords, simple structure, and a clearly


defined syntax. This allows the student to pick up the language quickly.

 Easy-to-read − Python code is more clearly defined and visible to the eyes.

 Easy-to-maintain − Python's source code is fairly easy-to-maintain.

 A broad standard library − Python's bulk of the library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.

 Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

 Portable − Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.

 Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.

 Databases − Python provides interfaces to all major commercial databases.

 GUI Programming − Python supports GUI applications that can be created and
ported to many system calls, libraries and windows systems, such as Windows
MFC, Macintosh, and the X Window system of Unix.

 Scalable − Python provides a better structure and support for large programs than
shell scripting.

A. Pandas
In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series. It is free
software released under the three-clause BSD license. The name is derived from the term
"panel data", an econometrics term for data sets that include observations over multiple
time periods for the same individuals. Its name is also a play on the phrase "Python data
analysis" itself.

B. NumPy
NumPy is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-
level mathematical functions to operate on these arrays. The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from several other
developers. In 2005, Travis Oliphant created NumPy by incorporating features of the
competing Numarray into Numeric, with extensive modifications. NumPy is open-source
software and has many contributors.

C. Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots
into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt,
or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is
discouraged. SciPy makes use of Matplotlib.

Matplotlib was originally written by John D. Hunter; since then it has had an active
development community, and it is distributed under a BSD-style license. Michael
Droettboom was nominated as matplotlib's lead developer shortly before John Hunter's
death in August 2012, and was later joined by Thomas Caswell.

Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started with
Matplotlib 1.2. Matplotlib 1.4 is the last version to support Python 2.6. Matplotlib has
pledged not to support Python 2 past 2020 by signing the Python 3 Statement.

D. Seaborn

Seaborn is a library for making statistical graphics in Python. It builds on top


of matplotlib and integrates closely with pandas data structures.

Seaborn helps you explore and understand your data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its dataset-
oriented, declarative API lets you focus on what the different elements of your plots
mean, rather than on the details of how to draw them.

E. Scikit Learn

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine
learning library for the Python programming language. It features
various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.

3.2.3 Machine Learning

Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is the first-class ticket to most interesting careers in data analytics today.
As data sources proliferate along with the computing power to process them, going
straight to the data is one of the most straightforward ways to quickly gain insights and
make predictions. Machine Learning can be thought of as the study of a list of sub-problems,
viz: decision making, clustering, classification, forecasting, deep learning,
inductive logic programming, support vector machines, reinforcement learning, similarity
and metric learning, genetic algorithms, sparse dictionary learning, etc.

3.2.4 Supervised Learning


Supervised learning, or classification, is the machine learning task of inferring a function
from labeled data. In supervised learning, we have a training set and a test set. The
training and test sets consist of examples made up of input and output vectors,
and the goal of the supervised learning algorithm is to infer a function that maps the input
vector to the output vector with minimal error. In an optimal scenario, a model trained on
a set of examples will classify an unseen example correctly, which requires the
model to generalize from the training set in a reasonable way.
There is no single algorithm that works for all cases, as stated by the No Free Lunch
theorem. In this project, we try to find patterns in the dataset of recorded COVID-19
data, and attempt to throw various intelligently-picked algorithms at the data and see what sticks.

Problems and Issues in Supervised learning:


1. Heterogeneity of Data: Many algorithms like neural networks and support vector
machines prefer their feature vectors to be homogeneous, numeric and normalized. The
algorithms that employ distance metrics are very sensitive to this, and hence if the data is
heterogeneous, these methods should be an afterthought. Decision Trees can handle
heterogeneous data very easily.

2. Redundancy of Data: If the data contains redundant information, i.e. contain highly
correlated values, then it’s useless to use distance based methods because of numerical
instability. In this case, some sort of Regularization can be employed to the data to
prevent this situation.

3. Dependent Features: If there is some dependence between the feature vectors, then
algorithms that monitor complex interactions like Neural Networks and Decision Trees
fare better than other algorithms.

4. Bias-Variance Tradeoff: A learning algorithm is biased for a particular input x if, when
trained on each of these data sets, it is systematically incorrect when predicting the
correct output for x, whereas a learning algorithm has high variance for a particular input
x if it predicts different output values when trained on different training sets. The
prediction error of a learned classifier can be related to the sum of bias and variance of
the learning algorithm, and neither can be high as they will make the prediction error to
be high. A key feature of machine learning algorithms is that they are able to tune the
balance between bias and variance automatically, or by manual tuning using bias
parameters, and using such algorithms will resolve this situation.

5. Overfitting: The programmer should know that there is a possibility that the output
values may contain inherent noise which is the result of human or sensor errors.
In this case, the algorithm must not attempt to infer the function that exactly matches all
the data. Being too careful in fitting the data can cause overfitting, after which the model
will answer perfectly for all training examples but will have a very high error for unseen
samples. A practical way of preventing this is stopping the learning process prematurely,
as well as applying filters to the data in the pre-learning phase to remove noises. Only
after considering all these factors can we pick a supervised learning algorithm that works
for the dataset we are working on. For example, if we were working with a dataset
consisting of heterogeneous data, then decision trees would fare better than other
algorithms. If the input space of the dataset we were working on had 1000 dimensions,
then it’s better to first perform PCA on the data before using a supervised learning
algorithm on it.

a. Linear regression

In linear regression, the relationships are modeled using linear predictor
functions whose unknown model parameters are estimated from the data. Such
models are called linear models. Most commonly, the conditional mean of the
response given the values of the explanatory variables (or predictors) is assumed
to be an affine function of those values; less commonly, the conditional median or
some other quantile is used. Like all forms of regression analysis, linear
regression focuses on the conditional probability distribution of the response
given the values of the predictors, rather than on the joint probability
distribution of all of these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously,
and to be used extensively in practical applications. This is because models which
depend linearly on their unknown parameters are easier to fit than models which
are non-linearly related to their parameters and because the statistical properties
of the resulting estimators are easier to determine.

Fig 3.3 Linear Regression

Y = a + bX, where Y is the dependent variable (the variable that goes on the Y
axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope of the
line and a is the y-intercept.
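A minimal sketch of fitting such a line with scikit-learn; X_train, y_train and the test sets are assumed to come from the split described earlier, and intercept_ and coef_ correspond to a and b:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

print("intercept a:", model.intercept_)
print("coefficients b:", model.coef_)
print("R^2 on the test set:", model.score(X_test, y_test))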

b. Random Forest Regressor

Random forest is a supervised learning algorithm which uses an ensemble learning
method for classification and regression. Random forest is a bagging technique
and not a boosting technique. The trees in a random forest are built in parallel; there is
no interaction between these trees while they are being built.

Fig 3.4 Random Forest structure

It operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean prediction (regression) of the
individual trees.
A random forest is a meta-estimator (i.e. it combines the results of multiple predictions)
which aggregates many decision trees, with some helpful modifications:
1. The number of features that can be split on at each node is limited to some percentage
of the total. This ensures that the ensemble model does not rely too heavily on any
individual feature, and makes fair use of all potentially predictive features.
2. Each tree draws a random sample from the original data set when generating its splits,
adding a further element of randomness that prevents overfitting.
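A sketch of the Random Forest Regressor used in this project; the hyperparameter values shown (number of trees, max_features) are assumptions rather than the exact settings used:

from sklearn.ensemble import RandomForestRegressor

# max_features limits how many features each split may consider (modification 1 above);
# bootstrap sampling of the rows is the default and corresponds to modification 2.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

predictions = rf.predict(X_test)
print("R^2 on the test set:", rf.score(X_test, y_test))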

Feature and Advantages of Random Forest :

1. It is one of the most accurate learning algorithms available. For many data sets, it

produces a highly accurate classifier.

2. It runs efficiently on large databases.

3. It can handle thousands of input variables without variable deletion.

4. It gives estimates of which variables are important in the classification.

5. It generates an internal unbiased estimate of the generalization error as the forest

building progresses.

6. It has an effective method for estimating missing data and maintains accuracy when a

large proportion of the data are missing.

Disadvantages of Random Forest :

1. Random forests have been observed to overfit for some datasets with noisy

classification/regression tasks.

2. For data including categorical variables with different numbers of levels, random

forests are biased in favor of those attributes with more levels. Therefore, the variable

importance scores from random forest are not reliable for this type of data.

Snapshots

1. Data set reading

2. Subsetting data having INDIA location


3. Handling date time column

4. Accuracy using Random forest regressor

5. Accuracy using Linear Regression

Results and Discussions

We have made predictions with both Linear Regression and the Random Forest Regressor.


The accuracy we got for linear regression is 40 percent, whereas for the random forest
regressor we got an accuracy of nearly 98 percent.

As we can clearly see, the random forest is working far better than linear regression.
Random forests have proven themselves to be both reliable and effective, and are now part of any
modern predictive modeler’s toolkit.

Random forests very often outperform linear regression. In fact, almost always. I’d
reframe the question the other way around: When is a linear regression better than a
random forest?

1. when the underlying function is truly linear


2. when there are a very large number of features, especially with a very low signal-
to-noise ratio. RFs have a little trouble modeling linear combinations of a large
number of features.
3. when covariate shift is likely
The point is: there are probably only a few cases in which LM is better than RF; in
general, you should expect it to be the other way around.

Yes, random forests fit data better from the get-go without transforms.

They’re more forgiving in almost every way. You don’t need to scale your data, you
don’t need to do any monotonic transformations (log etc). You often don’t even need to
remove outliers.

You can throw in categorical features, and it’ll automatically partition the data if it aids
the fit.

You don’t have to spend any time generating interaction terms.

And perhaps most important: in most cases, it’ll probably be notably more accurate.
That’s what you’d expect to happen a majority of the time.

If that sort of regular boost in quality and reduction in setup time appeals to you: then try
a Random Forest first next time you find yourself confronted with a new modeling task.

5.1 Challenges Faced

The first and foremost challenge was that the data was very unclean, and it took a lot of
visualization and data cleaning.

As the problem was based on a real-time problem, it made the project more challenging.

As there were so many features and outliers, it was difficult to handle the outliers, and
overfitting was again the most challenging part. A data set with more outliers usually
has a higher chance of overfitting.

Besides all the challenges, it was fun and I learned a lot while doing the whole project. I am
still working on it to make it more efficient. Also, keeping the model up to date with new
data is another challenge, but that comes with a wider scope, and for now I am not working
at such a wide scope.

Conclusion

We can conclude that regression is an important part of learning, and our dataset also
calls for regression: whenever we have continuous data rather than discrete data, we
need to use a regression model.

We can also say that for a smaller dataset supervised learning works better than
unsupervised learning.

Fig 6.1 workflow of machine learning

The complete workflow can be seen in the above diagram: the data is split into training and
testing datasets.

The training set is used to make the model learn and the testing data is used to evaluate the
model. Accuracy is checked using the predictions on the testing set, and the model is finalized.
The model is then used for predictions.

The Random Forest Regressor proved to be more accurate than linear regression because it
takes the average of all the tree predictions, and it turned out to be more efficient.
Future Scope

1. I am working on Natural Language Processing, and with this I will build a chatbot.
I will integrate that chatbot with the model so that it can work as a proper chatbot
which can reply to all the queries related to cases.

2. I will also work at a wider scope and keep the model up to date with the upcoming
data. As this data covers a particular time period, keeping the model up to date with
the incoming data will make it more efficient.

3. I will try to get into more depth of the mathematical formulation of the algorithms
so that I can hybridize two or more algorithms to keep the model efficient.
References

1. Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction: With 200 Full-color
Illustrations. New York: Springer, 2001.
2. Baldi, P. and Brunak, S. (2002). Bioinformatics: A Machine Learning Approach.
Cambridge, MA: MIT Press.
3. Tutorials Point website: https://www.tutorialspoint.com/index.htm
4. GeeksforGeeks website: https://www.geeksforgeeks.org
5. Website: http://www.it.uu.se/edu/course/homepage/projektTDB/ht17/project01/
6. https://www.researchgate.net/publication/303326261_Machine_Learning_Project
7. Witten, I. H., and Eibe Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Amsterdam: Morgan Kaufmann, 2005.
8. Google Scholar research papers: https://www.scholar.google.com
Datasheet
Certificate from Verzeo
