I hereby declare that the Industrial Training Report entitled "Prediction of Covid-19 Cases Using Machine Learning" is an authentic record of my own work, carried out as a requirement of Industrial Training during the period from 1st May 2020 to 1st July 2020 for the award of the degree of B.Tech. (Computer Science & Engineering), SRM University, Delhi-NCR, Haryana, under the guidance of the Verzeo team and mentor Smith Shah.
We would like to express our deep gratitude to our supervisor, Mr. N Ganesh Kumar, for his support and suggestions. Without his help we could not have brought this dissertation up to its present standard. We also take this opportunity to thank everyone else who supported us in the project "Prediction of Covid-19 Cases Using Machine Learning" or in other aspects of our study.
Finally, I would like to express my heartfelt thanks to my parents, who supported me both financially and mentally and encouraged me to achieve my goals.
About The Company
Verzeo is one of the leading online platforms that provides internships and certifications to students. It is an AI-based online learning platform that provides students
with a holistic learning experience to help make them industry-ready. With access to the
Industry Experts, Online Courses and blended learning, it allows students to Learn Here
and Lead Anywhere.
Degrees tend to teach theoretical concepts in classrooms. Students graduate with blind
spots and absolutely no practical exposure to the job environment. We, at Verzeo, bridge
the gap between classroom and workplace with our flagship Internship Programs.
We help students achieve a more holistic education and prepare them for better career opportunities. Verzeo acts as an invisible mentor to students, creating channels to unleash
their learning potential. It provides access to a wide variety of training programs,
hackathons, and projects. These programs are interactive, collaborative and give access to
mentors and experts. With Verzeo, you can find Internships and Job Opportunities
seamlessly.
Verzeo has collaborated with technical moguls to create an immersive platform. With AI-
based software at its core, it offers a connected ecosystem accessible from anywhere and
by anyone.
Learning through Verzeo is fun, interactive and practical, empowering students to Lead Anywhere, Anyplace and Everywhere.
About The Internship
The whole program was a training-cum-internship program spread over two months. In the first four weeks I underwent training in Advanced Python, Exploratory Data Analysis, Statistics, various algorithms of supervised and unsupervised learning, and the basics of Natural Language Processing. We were provided with a mentor to train us in the various fields of machine learning.

The second month was the internship period. In this period I was assigned two projects, a minor and a major one, on which I needed to work. In the first week I was given the minor project, "To Predict the Quality of Red Wine".

After completing the minor project I was assigned the major project, which is discussed further in this report with all its major and minor aspects. Both projects were done individually, and this gave me an industry-like experience.

This complete internship-cum-training program gave me a lot of experience, and I definitely polished my skills in the field of machine learning. The whole project gave me a complete understanding of a real-world problem scenario.
Table Of Contents
1. Introduction
   1.1 Overview
   1.2 Purpose
   1.3 Problem Statement
   1.4 About the Dataset
   1.5 Tasks Assigned
       1.5.1 Loading the Dataset
       1.5.2 Subsetting the Data
       1.5.3 Univariate Analysis
       1.5.4 Bivariate Analysis
       1.5.5 Handle Missing Values
       1.5.6 Handle Datetime Column
       1.5.7 Dropping Columns
       1.5.8 Target Variable
       1.5.9 Modeling
       1.5.10 Accuracy
2. Modeling
   2.1 Exploratory Data Analysis
   2.2 Data Cleaning
       2.2.1 Dropping Unwanted Columns
       2.2.2 Missing Values
   2.3 Handling Outliers
   2.4 Data Modeling
   2.5 Accuracy
3. Tools and Technology Used
   3.1 Hardware
   3.2 Software
       3.2.1 Jupyter Notebook
       3.2.2 Python 3
           a. NumPy
           b. Pandas
           c. Matplotlib
           d. Pyplot
           e. Seaborn
           f. Scikit-learn
       3.2.3 Machine Learning
       3.2.4 Supervised Learning
           a. Linear Regression
           b. Random Forest Regressor
4. Snapshots
5. Results and Discussions
6. Conclusion
7. Future Scope
8. References
List Of Tables
List Of Figures
1. Introduction
In this section everything about the project is explained in detail: how I carried out the project and what tasks were assigned.
1.1 Overview
The major project was the main part of the whole internship. The project was based on supervised machine learning, and the topic provided was "Prediction of Covid-19 Cases using Machine Learning". The project was done individually, under the guidance of a mentor where necessary.

The project was divided into several parts, starting from loading the data set and ending with finding the accuracy of the whole model.

I was asked to build two models, a Random Forest Regressor and a Linear Regression, and to choose the one with the higher accuracy.

The project included Exploratory Data Analysis, which was further divided into univariate and bivariate analysis. The analysis was done by plotting different graphs and retrieving the important information from the dataset. Handling missing values and outliers was another part which took a lot of hard work.

After this whole process I landed in the modeling part, where the actual machine learning plays its role. Modeling was done by splitting the data set and fitting the models. Once a model was built, I was supposed to calculate its accuracy using the proper metrics. This also involved choosing the target variable and the training variables I worked with.

Based on the accuracy, one model was chosen, with which I needed to predict one test case.

As the data set provided was a real-world, time-based data set, it was difficult to structure it properly and retrieve the proper relationships between the different variables.
1.2 Purpose
The purpose of the prediction model is to predict Covid-19 cases based on previous data fed into the machine. The machine finds patterns and relationships in the data and provides predictions of the future with good accuracy. The predictions are based purely upon the information that has been given to the model.
1.3 Problem Statement
The problem statement was to feed the model with the data set and predict Covid-19 cases. It also included handling missing values and doing proper analysis of the data so that it becomes clean and structured.
Evaluating the accuracy of the two models and then improving it so that no overfitting takes place while the machine learns was also part of the problem statement.
1.4 About the Dataset
The data set is based on previous data recorded by governments for Covid-19. The data set can be found using the link below.
The data set has 34 columns and 29,591 rows.
Table 1.1 Columns and description
1.5 Tasks Assigned
1.5.6 Handle Datetime Column
Convert the date column to ordinal. A datetime column is a special kind of column and cannot be handled as it is; that is why it needs to be converted into ordinal form.
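A minimal sketch of this conversion with pandas (the column name "date" is an assumption; the real data set may use a different name):

import pandas as pd

# toy example: a date column stored as text
df = pd.DataFrame({"date": ["2020-05-01", "2020-05-02", "2020-05-03"]})

# parse to datetime, then map each date to its ordinal (days since year 1)
df["date"] = pd.to_datetime(df["date"])
df["date"] = df["date"].map(pd.Timestamp.toordinal)

print(df["date"].head())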
1.5.10 Modeling
First step in modeling is to perform splitting in train and test set .Training set is used to
train and make the model learn and test set is used to do predictions on the model.
After the splitting the model is build using algorithm the algorithm here used are
a. Linear regression
b. Random forest regressor
Both the algorithms are discussed further with proper mathematical formulation.
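A minimal sketch of this step with scikit-learn; the toy X and y below stand in for the prepared Covid-19 features and target, and the 80/20 split ratio is an assumption:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# toy stand-in for the cleaned feature matrix X and target y
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)

# split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit both candidate models on the training set
lin_reg = LinearRegression().fit(X_train, y_train)
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# predictions on the held-out test set are used later for the accuracy check
lin_pred = lin_reg.predict(X_test)
rf_pred = rf_reg.predict(X_test)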
1.5.10 Accuracy
The accuracy of the model tells us how efficient our model is and how good its predictions are.
Accuracy is often measured with an accuracy score, but there are other methods and metrics as well. Here I have used R-squared for linear regression and the root mean squared error metric for the random forest regressor. Both metrics are explained in detail in the next chapter.
By checking the accuracy we can also conclude whether our model is underfitted or overfitted.
Machine Learning Modeling
2.1 Exploratory Data Analysis
Exploratory data analysis relies heavily on simple graphical techniques, such as:
1. Plotting the raw data (such as data traces, histograms, probability plots, lag plots, block plots, etc.).
2. Plotting simple statistics such as mean plots, standard deviation plots, box plots,
and main effects plots of the raw data.
3. Positioning such plots so as to maximize our natural pattern-recognition abilities,
such as using multiple plots per page.
Below are some of the plots created in the project to extract information from the data.
Fig 2.1 Distplot for female_smoker Fig 2.2 Distplot for diabetes_prevalence
The distplot represents the univariate distribution of data, i.e. the distribution of a variable plotted against its density. The seaborn.distplot() function accepts the data variable as an argument and returns the plot with the density distribution.
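A rough sketch of such a plot, using toy data in place of the real column (newer seaborn versions replace distplot with histplot/displot):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# toy stand-in for one numeric column of the data set
df = pd.DataFrame({"female_smokers": np.random.default_rng(0).normal(10, 3, 500)})

# univariate distribution of the column together with its density estimate
sns.distplot(df["female_smokers"])   # sns.histplot(df["female_smokers"], kde=True) in newer seaborn
plt.show()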
Fig 2.3 Distplot for population Fig 2.4 Distplot for new_deaths
The above graphs show which value has the highest density in each of the two columns.
A jointplot is specific to the seaborn library and can be used to quickly visualize and analyze the relationship between two variables and describe their individual distributions on the same plot. Apart from this, jointplot can also be used to plot a KDE, a hex plot, or a residual plot.
From the above joint plot we can conclude that as the date increases, the number of total cases also increases at a rapid rate. The individual distributions can also be seen in the bars at the edges of the plot.
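A rough sketch of the joint plot described above, assuming the date column has already been converted to ordinal values as in section 1.5.6:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# toy stand-in: ordinal day numbers against a rising cumulative case count
days = np.arange(737500, 737560)
cases = np.cumsum(np.random.default_rng(0).integers(0, 500, size=60))
df = pd.DataFrame({"date": days, "total_cases": cases})

# scatter of the two variables plus their individual distributions on the margins
sns.jointplot(x="date", y="total_cases", data=df, kind="scatter")
plt.show()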
Fig 2.6 jointplot for new_tests and total_cases column
From the above plot we can conclude that the total number of cases is increasing at a higher rate, while new_tests are increasing at a lower rate.
From the above graph we can conclude that the death rate increases with age, i.e. older people have a higher death rate than younger people. The point shown below the main trend is an outlier, which is dealt with in the outliers section.
2.2 Data Cleaning
2.2.1 Dropping unwanted columns
The following columns were dropped from the data set:
continent
iso_code
total_deaths
total_cases_per_million
new_cases_per_million
total_deaths_per_million
total_tests
new_deaths_per_million
new_tests_smoothed
new_tests_smoothed_per_thousand
aged_70_older
cvd_death_rate
hospital_beds_per_thousand
The categorical columns were also dropped because they had no effect on the target column.
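A minimal sketch of this dropping step with pandas, assuming the raw DataFrame is called df (the list simply mirrors the columns named above):

# columns judged not useful for predicting the target
cols_to_drop = [
    "continent", "iso_code", "total_deaths", "total_cases_per_million",
    "new_cases_per_million", "total_deaths_per_million", "total_tests",
    "new_deaths_per_million", "new_tests_smoothed",
    "new_tests_smoothed_per_thousand", "aged_70_older",
    "cvd_death_rate", "hospital_beds_per_thousand",
]

# errors="ignore" keeps the call safe if a name differs slightly in the file
df = df.drop(columns=cols_to_drop, errors="ignore")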
2.2.2 Missing values
Handling missing values is a crucial part of data cleaning: until the missing values are filled, the data cannot be fitted into the machine.
In statistics, missing data, or missing values, occur when no data value is stored for a variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can be handled similarly to censored data.
Below are the numbers of null values present in each column of the data set after dropping the unnecessary columns.
Fig 2.8 Number of null values in each column
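A minimal sketch of how such counts are obtained and how the gaps can then be filled; filling numeric columns with their median is an assumption for illustration, not necessarily the exact strategy used in the project:

# count missing values per column (this is what the figure above reports)
print(df.isnull().sum())

# simple strategy: fill numeric gaps with each column's median
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df.isnull().sum())   # remaining nulls, ideally zero for numeric columns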
2.3 Handling Outliers
In this project the outliers were detected using box plots, handled using the interquartile range, and then removed.
The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distributions. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 − Q1) is called the interquartile range, or IQR.
A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution: the lower inner fence Q1 − 1.5·IQR, the upper inner fence Q3 + 1.5·IQR, the lower outer fence Q1 − 3·IQR, and the upper outer fence Q3 + 3·IQR. Points beyond the inner fences are flagged as possible outliers.
Below are the boxplots for some columns before and after the outliers are dropped.
Fig 2.9 before outliers are dropped Fig 2.10 after outliers are dropped
The above box plots show the outliers. The points beyond the whiskers in the second figure are continuous and huge in number, so according to the IQR we do not consider them outliers.
As we can see, the median_age column does not have any outliers, and therefore there is no need to handle them.
Fig 2.12 before outliers are dropped Fig 2.13 after outliers are dropped
As we can see, population_density had many outliers, and they could overfit our model, so we removed them; the second figure shows the cleaned column of the data.
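A minimal sketch of IQR-based filtering for a single column, assuming a pandas DataFrame df and a numeric column name col:

def drop_iqr_outliers(df, col):
    # keep only rows whose value in col lies inside the inner fences
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# example call (hypothetical column name):
# df = drop_iqr_outliers(df, "population_density")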
2.4 Data Modeling
The outputs of prediction and feature engineering are a set of label times, historical
examples of what we want to predict, and features, predictor variables used to train a
model to predict the label. The process of modeling means training a machine learning
algorithm to predict the labels from the features, tuning it for the business need, and
validating it on holdout data.
The output from modeling is a trained model that can be used for inference, making
predictions on new data points. Similar to feature engineering, modeling is independent of
the previous steps in the machine learning process and has standardized inputs, which means we can alter the prediction problem without needing to rewrite all our code. If the
business requirements change, we can generate new label times, build corresponding
features, and input them into the model.
b. Training
Training a model simply means learning (determining) good values for all the
weights and the bias from labeled examples. In supervised learning, a machine
learning algorithm builds a model by examining many examples and attempting
to find a model that minimizes loss; this process is called empirical risk
minimization.
c. Testing
In testing we measure the accuracy of the model and find out how efficient our model is.
2.5 Accuracy
The accuracy of a machine learning algorithm is one way to measure how often the
algorithm classifies a data point correctly. Accuracy is the number of correctly predicted
data points out of all the data points.
Accuracy gives us a measure of how efficient our model is.
The metrics used for the accuracy check in this project are:
For Linear Regression – R-squared
R² = 1 − (SSres / SStot)
where SSres is the residual sum of squares and SStot is the total sum of squares.
The goodness of fit of regression models can be analyzed on the basis of the R-squared method. The closer the value of R-squared is to 1, the better the model.
For Random Forest Regressor – Root mean squared error
Root mean squared error (RMSE) is the square root of the mean of the squares of all of the errors. The use of RMSE is very common, and it is considered an excellent general-purpose error metric for numerical predictions.
RMSE = √( (1/n) Σᵢ₌₁ⁿ (Sᵢ − Oᵢ)² )
where Oi are the observed values, Si the predicted values of a variable, and n the number of
observations available for analysis. RMSE is a good measure of accuracy, but only to
compare prediction errors of different models or model configurations for a particular
variable and not between variables, as it is scale-dependent.
This tells us heuristically that RMSE can be thought of as some kind of (normalized)
distance between the vector of predicted values and the vector of observed values.
But why are we dividing by n under the square root here? If we keep n (the number of
observations) fixed, all it does is rescale the Euclidean distance by a factor of √(1/n). It’s a
bit tricky to see why this is the right thing to do, so let’s delve in a bit deeper.
Imagine that our observed values are determined by adding random "errors" to each of the predicted values, as follows: yᵢ = ŷᵢ + εᵢ.
These errors, thought of as random variables, might have Gaussian distribution with mean
μ and standard deviation σ, but any other distribution with a square-integrable PDF
(probability density function) would also work. We want to think of ŷᵢ as an underlying
physical quantity, such as the exact distance from Mars to the Sun at a particular point in
time. Our observed quantity yᵢ would then be the distance from Mars to the Sun as we
measure it, with some errors coming from mis-calibration of our telescopes and
measurement noise from atmospheric interference.
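As a rough sketch of how both checks can be computed with scikit-learn, assuming y_test and the two models' test-set predictions (lin_pred, rf_pred) from the modeling step are available:

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# R-squared for the linear regression predictions
r2 = r2_score(y_test, lin_pred)

# RMSE for the random forest predictions (square root of the mean squared error)
rmse = np.sqrt(mean_squared_error(y_test, rf_pred))

print("Linear regression R^2:", r2)
print("Random forest RMSE  :", rmse)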
Tools and Technology Used
3.1 Hardware
3.2 Software
3.2.1 Jupyter Notebook
Project Jupyter started as a spin-off from the IPython project in 2014. IPython's language-agnostic features were moved under the name Jupyter. The name is a reference to the core programming languages supported by Jupyter, which are Julia, Python and R. Products under the Jupyter project are intended to support interactive data science and scientific computing.
Jupyter notebook − This was earlier known as IPython notebook. It is a web-based interface to the IPython kernel and to the kernels of many other programming languages.
A notebook document consists of rich text elements with HTML formatted text, figures,
mathematical equations etc. The notebook is also an executable document consisting of
code blocks in Python or other supporting languages.
Jupyter notebook is a client-server application. The application starts the server on the local machine and opens the notebook interface in a web browser, where notebooks can be edited and run. A notebook is saved as an .ipynb file and can be exported as HTML, PDF and LaTeX files.
You can easily install the Jupyter notebook application using the pip package manager.
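For example (assuming pip is already available on the PATH):
c:\python36>pip install jupyter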
To start the application, use the following command in the command prompt window.
c:\python36>jupyter notebook
The server application starts running at the default port number 8888, and the notebook dashboard opens in a browser window. Observe that the dashboard shows a dropdown near the right border of the browser, with an arrow beside the New button. It contains the currently available notebook kernels. Now choose Python 3, and a new notebook opens in a new tab. An input cell similar to that of the IPython console is displayed.
You can execute any Python expression in it. The result will be displayed in the Out
cell.
Jupyter notebooks have three particularly strong benefits:
They’re great for showcasing your work. You can see both the code and the results. The notebooks on Kaggle are a particularly great example of this.
It’s easy to use other people’s work as a starting point. You can run cell by cell to
better get an understanding of what the code does.
Very easy to host server side, this is useful for security purposes. A lot of data is
sensitive and should be protected, and one of the steps toward that is no data is
stored on local machines. A server-side Jupyter Notebook setup gives you that for
free.
When prototyping, the cell-based approach of Jupyter notebooks is great. But you quickly
end up programming several steps — instead of looking at object-oriented programming.
When we’re writing code in cells instead of functions/classes/objects, you quickly end up
with duplicate code that does the same thing, which is very hard to maintain.
It’s hard to actually collaborate on code with Jupyter — as we’re copying snippets
from each other it’s very easy to get out of sync
Hard to maintain one version of the truth. Which one of these notebooks has the one
true solution to the number of xyz?
3.2.2 Python 3
Python is a MUST for students and working professionals to become a great Software
Engineer especially when they are working in Web Development Domain. I will list
down some of the key advantages of learning Python:
Python is Interactive − You can actually sit at a Python prompt and interact with
the interpreter directly to write your programs.
Characteristics of Python
It provides very high-level dynamic data types and supports dynamic type
checking.
It supports automatic garbage collection.
It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
Applications of Python
As mentioned before, Python is one of the most widely used languages over the web. I'm going to list a few of them here:
Easy-to-read − Python code is more clearly defined and visible to the eyes.
A broad standard library − Python's bulk of the library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode − Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.
Portable − Python can run on a wide variety of hardware platforms and has the
same interface on all platforms.
Extendable − You can add low-level modules to the Python interpreter. These
modules enable programmers to add to or customize their tools to be more
efficient.
GUI Programming − Python supports GUI applications that can be created and
ported to many system calls, libraries and windows systems, such as Windows
MFC, Macintosh, and the X Window system of Unix.
Scalable − Python provides a better structure and support for large programs than
shell scripting.
A. Pandas
In computer programming, pandas is a software library written for the Python
programming language for data manipulation and analysis. In particular, it offers data
structures and operations for manipulating numerical tables and time series. It is free
software released under the three-clause BSD license. The name is derived from the term
"panel data", an econometrics term for data sets that include observations over multiple
time periods for the same individuals.[3] Its name is a play on the phrase "Python data analysis" itself.
B. NumPy
It is a library for the Python programming language, adding support for large, multi-
dimensional arrays and matrices, along with a large collection of high-
level mathematical functions to operate on these arrays.[5] The ancestor of NumPy,
Numeric, was originally created by Jim Hugunin with contributions from several other
developers. In 2005, Travis Oliphant created NumPy by incorporating features of the
competing Numarray into Numeric, with extensive modifications. NumPy is open-source
software and has many contributors.
C. Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots
into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt,
or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is
discouraged. SciPy makes use of Matplotlib.
Matplotlib was originally written by John D. Hunter; since then it has had an active development community.
Matplotlib 2.0.x supports Python versions 2.7 through 3.6. Python 3 support started with
Matplotlib 1.2. Matplotlib 1.4 is the last version to support Python 2.6. Matplotlib has
pledged to not support Python 2 past 2020 by signing the Python 3 Statement
D. Seaborn
Seaborn helps you explore and understand your data. Its plotting functions operate on
dataframes and arrays containing whole datasets and internally perform the necessary
semantic mapping and statistical aggregation to produce informative plots. Its dataset-
oriented, declarative API lets you focus on what the different elements of your plots
mean, rather than on the details of how to draw them.
E. Scikit Learn
Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine
learning library for the Python programming language. It features
various classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
3.2.3 Machine Learning
Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is the first-class ticket to most interesting careers in data analytics today.
As data sources proliferate along with the computing power to process them, going
straight to the data is one of the most straightforward ways to quickly gain insights and
make predictions. Machine Learning can be thought of as the study of a list of sub-
problems, viz: decision making, clustering, classification, forecasting, deep-learning,
inductive logic programming, support vector machines, reinforcement learning, similarity
and metric learning, genetic algorithms, sparse dictionary learning, etc.
3.2.4 Supervised Learning
Several properties of the data affect which supervised learning algorithm will work best:
2. Redundancy of Data: If the data contains redundant information, i.e. contains highly correlated values, then it's useless to use distance-based methods because of numerical
instability. In this case, some sort of Regularization can be employed to the data to
prevent this situation.
3. Dependent Features: If there is some dependence between the feature vectors, then
algorithms that monitor complex interactions like Neural Networks and Decision Trees
fare better than other algorithms.
4. Bias-Variance Tradeoff: A learning algorithm is biased for a particular input x if, when
trained on each of these data sets, it is systematically incorrect when predicting the
correct output for x, whereas a learning algorithm has high variance for a particular input
x if it predicts different output values when trained on different training sets. The
prediction error of a learned classifier can be related to the sum of bias and variance of
the learning algorithm, and neither should be high, as either would make the prediction error high. A key feature of machine learning algorithms is that they are able to tune the
balance between bias and variance automatically, or by manual tuning using bias
parameters, and using such algorithms will resolve this situation.
5. Overfitting: The programmer should know that there is a possibility that the output values may contain inherent noise, which is the result of human or sensor errors.
In this case, the algorithm must not attempt to infer the function that exactly matches all
the data. Being too careful in fitting the data can cause overfitting, after which the model
will answer perfectly for all training examples but will have a very high error for unseen
samples. A practical way of preventing this is stopping the learning process prematurely,
as well as applying filters to the data in the pre-learning phase to remove noises. Only
after considering all these factors can we pick a supervised learning algorithm that works
for the dataset we are working on. For example, if we were working with a dataset
consisting of heterogeneous data, then decision trees would fare better than other
algorithms. If the input space of the dataset we were working on had 1000 dimensions,
then it’s better to first perform PCA on the data before using a supervised learning
algorithm on it.
a. Linear regression
A simple linear regression fits a straight line Y = a + bX, where Y is the dependent variable (the variable that goes on the Y axis), X is the independent variable (plotted on the X axis), b is the slope of the line and a is the Y-intercept.
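As a tiny worked sketch (toy numbers, not from the project data), the intercept a and slope b can be recovered with NumPy and then used for a prediction:

import numpy as np

# toy data roughly following Y = 2 + 3X plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.0, 7.9, 11.2, 13.8])

# least-squares fit of a straight line; polyfit returns [slope b, intercept a]
b, a = np.polyfit(x, y, deg=1)
print("intercept a:", round(a, 2), "slope b:", round(b, 2))

# prediction for a new X value using Y = a + bX
print("predicted Y at X = 5:", round(a + b * 5, 2))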
b. Random Forest Regressor
It operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or mean prediction (regression) of the
individual trees.
A random forest is a meta-estimator (i.e. it combines the result of multiple predictions)
which aggregates many decision trees, with some helpful modifications:
1. The number of features that can be split on at each node is limited to some percentage of the total. This ensures that the ensemble model does not rely too heavily on any individual feature, and makes fair use of all potentially predictive features.
2. Each tree draws a random sample from the original data set when generating its splits,
adding a further element of randomness that prevents overfitting.
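A short sketch of these two ideas expressed as scikit-learn parameters; the specific values are assumptions for illustration, not the project's actual settings:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,      # number of decision trees in the forest
    max_features="sqrt",   # limit the features considered at each split (point 1 above)
    bootstrap=True,        # each tree sees a random sample of the rows (point 2 above)
    random_state=42,
)
# rf.fit(X_train, y_train) would then train the ensemble as in the modeling section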
Advantages of the random forest:
1. It is one of the most accurate learning algorithms available; for many data sets it produces a highly accurate model.
6. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
Limitations of the random forest:
1. Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
2. For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.
Snapshots
4. Accuracy using Random forest regressor
Results and Discussions
As we can clearly see, the random forest works far better than linear regression. Random forests have proven themselves to be both reliable and effective, and are now part of any modern predictive modeler's toolkit.
Random forests very often outperform linear regression. In fact, almost always. I’d
reframe the question the other way around: When is a linear regression better than a
random forest?
Yes, random forests fit data better from the get-go without transforms.
They’re more forgiving in almost every way. You don’t need to scale your data, you
don’t need to do any monotonic transformations (log etc). You often don’t even need to
remove outliers.
You can throw in categorical features, and it’ll automatically partition the data if it aids
the fit.
And perhaps most important: in most cases, it’ll probably be notably more accurate.
That’s what you’d expect to happen a majority of the time.
If that sort of regular boost in quality and reduction in setup time appeals to you: then try
a Random Forest first next time you find yourself confronted with a new modeling task.
The first and foremost challenge was that the data was quite unclean, and it took a lot of visualization and data cleaning.
As the problem was based on a real-world problem, the project was more challenging. With so many features and outliers, it was difficult to handle the outliers, and overfitting was again the most challenging part. A data set with more outliers usually has more chances of overfitting.
Despite all the challenges it was fun, and I learned a lot while doing the whole project. I am still working on it to make it more efficient. Keeping the model up to date with new data is another challenge, but that comes with a wider scope, so for now I am not working at that scale.
Conclusion
We can conclude that regression is an important part of machine learning and that our problem is also a regression problem: whenever we have continuous data rather than discrete data, we need to use a regression model.
We can also say that for smaller datasets supervised learning works better than unsupervised learning.
The complete workflow can be seen in the diagram above: the data is split into training and testing sets.
The training set is used to make the model learn and the testing set is used to evaluate the model. Accuracy is checked using the predictions on the testing set, and the model is finalized. The model is then used for predictions.
The random forest regressor proved to be more accurate than linear regression because it takes the average of the predictions of many trees, which made it more efficient.
Future Scope
1. I am working on Natural Language Processing, and with it I will build a chatbot. I will integrate that chatbot with the model so that it can reply to all the queries related to cases.
2. I will also work at a wider scope and keep the machine up to date with upcoming data. As the current data covers only a particular time period, keeping the model updated with new data will make it more useful.
3. I will try to go deeper into the mathematical formulation to make the model more efficient.
References