on
BACHELOR OF TECHNOLOGY
Session 2021-22
in
Information Technology
By
JATIN CHAUDHARY
2000321540032
AATIF JAMSHED
ASSISTANT PROFESSOR
I hereby declare that the work being presented in this report entitled
“COVID-19 PREDICTION” is an authentic record of my own work
carried out under the supervision of Dr./Mr./Ms. AATIF JAMSHED,
Assistant Professor, Information Technology.
Date: 01/07/22
Signature of student
(Name: Jatin Chaudhary)
(Roll No. 2000321540032)
Department: csds
This is to certify that the above statement made by the candidate(s) is correct to
the best of my knowledge.
Date:………….
Table of Contents
Clustering does this by finding similar patterns in the unlabelled dataset, such as
shape, size, color, and behavior, and dividing the data points according to the
presence or absence of those patterns.
After this clustering technique is applied, each cluster or group is given a
cluster ID. An ML system can use this ID to simplify the processing of large and
complex datasets.
The diagram below shows how a clustering algorithm works: different fruits are
divided into several groups with similar properties.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs
to only one group) and soft clustering (a data point can also belong to other
groups). Various other approaches to clustering also exist. The main clustering
methods used in machine learning are:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
Partitioning clustering divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of K groups, where K is the number
of groups and is fixed in advance. The cluster centers are chosen so that each data
point is closer to the centroid of its own cluster than to the centroid of any other
cluster.
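The assign-and-update idea described above can be sketched in a few lines of plain
Python. This is only an illustrative sketch of Lloyd's K-Means iteration, not the code
used in this project: the function name kmeans, the sample points, and the choice of
the first k points as starting centroids are all assumptions made for the example.

```python
import math

def kmeans(points, k, iterations=20):
    """Illustrative K-Means (Lloyd's iteration) on a list of 2-D tuples."""
    # Assumption: the first k points serve as the initial centroids
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its member points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical sample data: two visibly separate groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

In practice the starting centroids are usually picked at random or with a seeding
scheme, and the loop stops once the labels no longer change.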
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and
arbitrarily shaped clusters can form as long as the dense regions can be connected.
The algorithm identifies the dense regions in the dataset and joins neighbouring
high-density areas into clusters. The dense areas in the data space are separated
from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has
varying densities or high dimensionality.
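One well-known density-based algorithm is DBSCAN. The sketch below is a minimal
plain-Python version written for illustration only; the function name dbscan, the
parameter values (eps, min_pts), and the sample points are assumptions, not part of
this project. A point with at least min_pts points within distance eps (counting
itself) is treated as dense, and dense neighbours are chained into one cluster.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point, -1 meaning noise."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # provisional noise; may become a border point
            continue
        cluster += 1                   # point i is dense: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reachable from a dense point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:   # j is dense too, so keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical data: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point (5, 5) has no neighbours within eps, so it sits in a sparse area and
is reported as noise rather than forced into a cluster.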
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
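To make the idea of "probability of belonging to a distribution" concrete, the sketch
below computes, for a single one-dimensional value, the probability that it came from
each of two Gaussian components. The component weights, means, and standard deviations
are hypothetical values chosen for the example; estimating them from data (for example
with the EM algorithm) is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def membership_probabilities(x, components):
    """components: list of (weight, mean, std) tuples.
    Returns the probability that x belongs to each component (the values sum to 1)."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture: two equally weighted Gaussians centred at 0 and 5
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
probs = membership_probabilities(1.0, components)
midpoint = membership_probabilities(2.5, components)
print([round(p, 3) for p in probs])     # the first component dominates
print([round(p, 3) for p in midpoint])  # [0.5, 0.5]
```

A value close to one mean is assigned almost entirely to that component, while a value
halfway between the two means is shared equally.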
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering
because there is no requirement to pre-specify the number of clusters to be created.
In this technique, the dataset is divided into clusters to create a tree-like
structure, which is also called a dendrogram. Any number of clusters can be selected
by cutting the tree at the appropriate level. The most common example of this method
is the Agglomerative Hierarchical algorithm.
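The bottom-up (agglomerative) procedure can be sketched as follows. This is a small
illustration with assumed names and data, not the project's code: every point starts
as its own cluster, the two closest clusters are merged repeatedly, and stopping at
n_clusters plays the role of cutting the dendrogram at a chosen level. Single linkage
(the distance between the closest members) is assumed here; other linkage rules exist.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]   # every point starts alone

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]
    return clusters

# Hypothetical data: two tight pairs and one far-away point
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```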
Fuzzy Clustering
Fuzzy clustering is a soft method in which a data object may belong to more than one
group or cluster. Each data object has a set of membership coefficients, which
express its degree of membership in each cluster. The Fuzzy C-means algorithm is an
example of this type of clustering; it is sometimes also known as the Fuzzy k-means
algorithm.
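The membership coefficients can be illustrated with the standard Fuzzy C-means
membership formula, in which the membership of a point in cluster i is
1 / sum over k of (d_i / d_k)^(2 / (m - 1)), where d_i is the distance to center i and
m > 1 is the fuzziness parameter. The sketch below covers only this membership step
with fixed, hypothetical cluster centers; the full algorithm alternates it with
re-estimating the centers. The function name and sample values are assumptions.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Membership coefficients of one point with respect to fixed cluster centers."""
    dists = [math.dist(point, c) for c in centers]
    zeros = dists.count(0.0)
    if zeros:
        # The point coincides with one or more centers: share full membership among them
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Hypothetical centers; the point lies much closer to the first one
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships((1.0, 0.0), centers)
print([round(v, 3) for v in u])  # [0.9, 0.1]
```

The coefficients always sum to 1, so a point near a cluster boundary ends up with
comparable memberships in both neighbouring clusters.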
Clustering Algorithms
Clustering algorithms can be grouped according to the models explained above. Many
clustering algorithms have been published, but only a few are commonly used. The
choice of algorithm depends on the kind of data being used. For example, some
algorithms need the number of clusters in the given dataset to be guessed in advance,
whereas others need to find the minimum distance between the observations of the
dataset.
The popular clustering algorithms that are widely used in machine learning are
discussed here:
Problem Statement:
Before we start with code, we need to import all the required libraries in
Python.
Imports required:-
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
CODE:-
import os
# Kaggle starter snippet that lists the files under a directory.
# Note: os.walk() expects a directory; given a file path such as this CSV, it yields nothing.
for dirname, _, filenames in os.walk('COVID-19_coronavirus.csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
In [7]:
cas=pd.read_csv('COVID-19_coronavirus.csv') #reads the file
print(cas.shape,'\n')
print(cas.dtypes,'\n')
print(cas.columns,'\n')
cas.head()
(225, 10)

Country                     object
Other names                 object
ISO 3166-1 alpha-3 CODE     object
Population                   int64
Continent                   object
Total Cases                  int64
Total Deaths                 int64
Tot Cases//1M pop            int64
Tot Deaths/1M pop            int64
Death percentage           float64
dtype: object

   Country  Other names  ISO 3166-1 alpha-3 CODE  Population  Continent  Total Cases  Total Deaths  Tot Cases//1M pop  Tot Deaths/1M pop  Death percentage
1  Albania  Albania      ALB                      2872296     Europe     273870       3492          95349              1216               1.275058
3  Andorra  Andorra      AND                      77481       Europe     40024        153           516565             1975               0.382271
In [8]:
cas.columns = [col.replace('Tot Cases//1M pop','casmill').lower() for col in cas.columns] #renames the cases-per-million column to casmill
cas.columns = [col.replace(' ','_').lower() for col in cas.columns] #replaces all the spaces with _ and changes capital letters to lowercase
deaths = cas['total_deaths'].astype(float) #converts the data for total deaths into a float
deaths.head()
deaths.dtypes
Out[8]:
dtype('float64')
In [9]:
cases = cas['total_cases'].astype(float) #converts the data for the total cases into a float
cases.head()
cases.dtypes
Out[9]:
dtype('float64')
In [10]:
print('mean deaths:',cas.total_deaths.mean()) #calculates the mean number of deaths
print('sum deaths:',cas.total_deaths.sum()) #calculates the total number of deaths
mean deaths: 27448.12888888889
sum deaths: 6175829
In [11]:
print('mean cases:',cas.total_cases.mean()) #calculates the mean number of cases
print('sum cases:',cas.total_cases.sum()) #calculates sum of number of cases
mean cases: 2184781.453333333
sum cases: 491575827
In [12]:
import matplotlib.pyplot as plt
ax = deaths.plot.box(grid=True) #plots the total deaths in box plot format and keeps the returned axes
ax.set_ylim(0,20000) #limits the y axis on the same axes, so no second plt.subplot() call is needed
Out[12]:
(0.0, 20000.0)
In [13]:
import matplotlib.pyplot as plt
ax = cases.plot.box(grid=True) #plots the total cases in box plot format and keeps the returned axes
ax.set_ylim(0,2000000) #limits the y axis on the same axes, so no second plt.subplot() call is needed
Out[13]:
(0.0, 2000000.0)
In [14]:
x = cas['total_cases'] #x is a pandas Series of total cases
y = cas['total_deaths'] #y is a pandas Series of total deaths
In [15]:
cas4=cas.loc[deaths > 50000] #limits the data to countries with more than 50000 deaths
In [16]:
cas4.plot(x="country",y=["total_deaths"],kind='bar',color='red') #plots the country deaths in a bar graph
plt.xlabel('Country',color='red')
plt.ylabel('Total Deaths (millions)',color='red')
Out[16]:
Text(0, 0.5, 'Total Deaths (millions)')
In [17]:
Out[17]:
Text(0, 0.5, 'Population (billions)')
In [18]:
cas5=cas.loc[deaths > 500000] #limits the data to countries with more than 500000 deaths
cas5.groupby(['country']).sum().plot(kind='pie', y='total_deaths',autopct='%1.0f%%') #graphs the death number of the countries in a pie chart
plt.legend()
plt.show()
In [19]:
print(cas5['country']) #prints the countries with more than 500000 deaths
print(cas5['total_deaths']) #prints the number of deaths for those countries
26 Brazil
92 India
214 USA
Name: country, dtype: object
26 660269
92 521388
214 1008222
Name: total_deaths, dtype: int64
In [20]:
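from scipy.optimize import curve_fit #curve_fit is not among the imports listed earlier
x = cas['total_cases']
y = cas['total_deaths']
# func is the model function to fit, called as func(x, *params); it must be defined before this call
popt, pcov = curve_fit(func,x, y) # popt is the set of fitted coefficients of the curve, and pcov is their covariance matrix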
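The model function func is not shown at this point in the report, so the following is
only a sketch under an assumed model. If func were a straight line, y = a*x + b, the
least-squares coefficients that a fitting routine searches for can be written in
closed form. The function name fit_line and the sample numbers are hypothetical and
are not taken from the COVID-19 dataset.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of the straight line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b

# Hypothetical sample values (not from the dataset)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 1.94 1.15
```

For a straight-line model this gives the coefficients that minimise the sum of squared
residuals; a non-linear func has no such closed form, which is where an iterative
routine such as curve_fit is needed.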