
Data Science Project Training Report

on

Machine Learning Domain Projects for Regression,


Classification and Clustering using Various
Datasets

BACHELOR OF TECHNOLOGY

Session 2021-22
in

Information Technology

By
JATIN CHAUDHARY
2000321540032

AATIF JAMSHED
ASSISTANT PROFESSOR

DEPARTMENT OF INFORMATION TECHNOLOGY


ABES ENGINEERING COLLEGE, GHAZIABAD
Student’s Declaration

I hereby declare that the work being presented in this report entitled
“COVID-19 PREDICTION” is an authentic record of my own work
carried out under the supervision of Dr. /Mr. /Ms. AATIF JAMSHED,
Assistant Professor, Information Technology.

Date: 01/07/22

Signature of student
(Name: Jatin Chaudhary)
(Roll No. 2000321540032)
Department: CSDS

This is to certify that the above statement made by the candidate(s) is correct to
the best of my knowledge.

Signature of HOD Signature of Teacher


Dr. Amit Sinha Aatif Jamshed

Information Technology Assistant Professor


Information Technology

Date:………….
Table of Contents

S. No. Contents Page No.


Student’s Declaration i
Chapter 1 : Clustering 1
1.1 : Dataset 2-3
1.2 : Project 4-6
PROJECT – COVID-19 CORONAVIRUS (Clustering)

Clustering:- Clustering, or cluster analysis, is a machine learning technique that
groups an unlabelled dataset. It can be defined as "A way of grouping the data
points into different clusters, consisting of similar data points. The objects with
the possible similarities remain in a group that has less or no similarities with
another group."

It does this by finding similar patterns in the unlabelled dataset, such as shape,
size, color, or behavior, and divides the data points according to the presence or
absence of those patterns.

It is an unsupervised learning method: no supervision is provided to the
algorithm, and it works on an unlabelled dataset.

After the clustering technique is applied, each cluster or group is given a
cluster ID. An ML system can use this ID to simplify the processing of large and complex
datasets.

The diagram below illustrates how a clustering algorithm works: different fruits
are divided into several groups with similar properties.
Types of Clustering Methods
Clustering methods are broadly divided into hard clustering (each data point belongs
to exactly one group) and soft clustering (a data point can belong to more than one
group). Several other approaches to clustering also exist. Below are the
main clustering methods used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of K groups, where K is the
pre-defined number of groups. The cluster centers are placed so that
each data point is closer to the centroid of its own cluster than to
the centroid of any other cluster.
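
The assign-and-update idea behind centroid-based clustering can be sketched as follows. This is a minimal illustration using NumPy on synthetic points, not the code used in this project; the function name kmeans, the caller-supplied initial_centroids argument, and the two synthetic blobs are assumptions made for the example. Library implementations normally choose the initial centroids themselves.

```python
import numpy as np

def kmeans(points, initial_centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and updating centroids."""
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data, not the COVID-19 dataset).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Assumption for the demo: start from one point of each blob.
labels, centroids = kmeans(points, initial_centroids=points[[0, 50]])
print(centroids)
```

With this starting choice the first 50 points end up in cluster 0 and the last 50 in cluster 1, and each centroid settles near the center of its blob.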
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped clusters are formed as long as the dense regions can
be connected. The algorithm identifies regions of high density in the dataset
and joins them into clusters; the dense areas in the data space
are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has
varying densities or high dimensionality.
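
A simplified sketch of the density-based idea is given here, in the style of DBSCAN. It is only an illustration under assumptions: the function name dbscan, the parameters eps (neighbourhood radius) and min_pts (minimum neighbours for a dense point), and the nine sample points are all chosen for the example and do not come from the project.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # neighbors[i] holds the indices within eps of point i (the point itself included).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # already handled, or not a dense (core) point
        # Grow a new cluster outward from core point i.
        visited[i] = True
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # the point joins the current cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])  # only core points expand further
        cluster_id += 1
    return labels

# Two small dense squares and one isolated point (illustrative data).
points = np.array([[0, 0], [0, 0.5], [0.5, 0], [0.5, 0.5],
                   [5, 5], [5, 5.5], [5.5, 5], [5.5, 5.5],
                   [10, 0]], dtype=float)
labels = dbscan(points, eps=0.8, min_pts=3)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point has no dense neighbourhood, so it is left as noise instead of being forced into a cluster.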
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
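
To make the Expectation-Maximization idea concrete, here is a minimal sketch for a one-dimensional mixture of two Gaussians. The function name gmm_em_1d, the crude starting values, and the synthetic sample are assumptions for illustration only; a library implementation would handle more dimensions, more components, and numerical edge cases.

```python
import numpy as np

def gmm_em_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture."""
    # Crude starting guesses: means at the extremes, shared variance, equal weights.
    means = np.array([x.min(), x.max()], dtype=float)
    variances = np.array([x.var(), x.var()], dtype=float)
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point, shape (n, 2).
        dens = np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
        dens /= np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted points.
        totals = resp.sum(axis=0)
        weights = totals / len(x)
        means = (resp * x[:, None]).sum(axis=0) / totals
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / totals
    return weights, means, variances, resp

# Synthetic sample drawn from two Gaussians centered at 0 and 10 (illustrative data).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])
weights, means, variances, resp = gmm_em_1d(x)
print(weights, means)
```

Each row of resp holds the probabilities that one point belongs to each component, which is what makes this a soft assignment.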

Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as
there is no requirement to pre-specify the number of clusters to be created. In
this technique, the dataset is divided into clusters to create a tree-like structure,
which is also called a dendrogram. Any number of clusters can
be obtained by cutting the tree at the appropriate level. The most common example of
this method is the Agglomerative Hierarchical algorithm.
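
The bottom-up merging can be sketched as follows. This is a minimal single-linkage illustration, in which the distance between two clusters is the distance between their closest members. The function name agglomerative, the n_clusters stopping rule (which plays the role of cutting the tree), and the five sample points are assumptions for the example.

```python
import numpy as np

def agglomerative(points, n_clusters):
    """Bottom-up single-linkage clustering; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dists[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest clusters
        del clusters[b]
    return clusters

points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]], dtype=float)
clusters = agglomerative(points, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Recording the order and distance of each merge would give the information needed to draw the dendrogram.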
Fuzzy Clustering
Fuzzy clustering is a soft method in which a data object may belong to more
than one group or cluster. Each data point has a set of membership coefficients, which
express its degree of membership in each cluster. The Fuzzy C-means
algorithm is an example of this type of clustering; it is sometimes also known as the
Fuzzy k-means algorithm.
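
The membership coefficients can be illustrated with the membership-update step of Fuzzy C-means, where the membership of point i in cluster j is proportional to d_ij^(-2/(m-1)). The sketch covers only this one step with fixed cluster centers. The function name fuzzy_memberships, the fuzzifier value m = 2, and the sample points are assumptions for the example.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0):
    """Membership u[i, j] of point i in cluster j, for fuzzifier m > 1."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dists = np.maximum(dists, 1e-12)  # avoid division by zero when a point sits on a center
    inv = dists ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)  # each row sums to 1

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
points = np.array([[0.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
u = fuzzy_memberships(points, centers)
print(u)
```

The point halfway between the centers gets memberships 0.5 and 0.5, while the point at (3, 0) gets 0.1 and 0.9, since it is three times closer to the second center.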

Clustering Algorithms
Clustering algorithms can be divided based on the models explained
above. Many clustering algorithms have been published, but only a few
are commonly used. The choice of algorithm depends on the kind of data that we
are using. For example, some algorithms need the number of clusters in the
given dataset to be guessed in advance, whereas others need to find the minimum
distance between the observations of the dataset.

Here we discuss the popular clustering algorithms that are widely used in
machine learning:

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset by dividing the samples into different clusters of
equal variance. The number of clusters must be specified in advance. It is fast and
needs few computations: for a fixed number of clusters and features, each iteration
is linear in the number of samples, O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model that works
by updating the candidates for centroids to be the mean of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is a density-based model similar to
mean-shift, but with some notable advantages. In this algorithm, the areas of high
density are separated by areas of low density. Because of this, clusters of
any arbitrary shape can be found.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm, or in cases where k-means may
fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. Each data point is treated as a
single cluster at the outset, and clusters are then successively merged. The cluster
hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not
require the number of clusters to be specified. Messages are exchanged
between pairs of data points until convergence. It has O(N²T) time complexity,
where N is the number of samples and T is the number of iterations,
which is the main drawback of this algorithm.

PROJECT BASED LEARNING AND IMPLEMENTATION

Problem Statement:
Before we start with the code, we need to import all the required libraries in
Python.

I follow a convention of dedicating one cell in the Notebook to
imports only. This is useful when we want to add further import
statements: we just need to rerun the cell that holds the imports, and it will not
affect the remaining code.
 

Imports required:-

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
CODE:-

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #graphing tool

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('.'): # walks the working directory, which holds COVID-19_coronavirus.csv (os.walk on a file path lists nothing)
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [7]:
cas=pd.read_csv('COVID-19_coronavirus.csv') #reads the file
print(cas.shape,'\n')
print(cas.dtypes,'\n')
print(cas.columns,'\n')
cas.head()
(225, 10)

Country object
Other names object
ISO 3166-1 alpha-3 CODE object
Population int64
Continent object
Total Cases int64
Total Deaths int64
Tot Cases//1M pop int64
Tot Deaths/1M pop int64
Death percentage float64
dtype: object

Index(['Country', 'Other names', 'ISO 3166-1 alpha-3 CODE', 'Population',
       'Continent', 'Total Cases', 'Total Deaths', 'Tot Cases//1M pop',
       'Tot Deaths/1M pop', 'Death percentage'],
      dtype='object')
Out[7]:
   Country      Other names  ISO 3166-1 alpha-3 CODE  Population  Continent  Total Cases  Total Deaths  Tot Cases//1M pop  Tot Deaths/1M pop  Death percentage
0  Afghanistan  Afghanistan  AFG                      40462186    Asia       177827       7671          4395               190                4.313743
1  Albania      Albania      ALB                      2872296     Europe     273870       3492          95349              1216               1.275058
2  Algeria      Algeria      DZA                      45236699    Africa     265691       6874          5873               152                2.587216
3  Andorra      Andorra      AND                      77481       Europe     40024        153           516565             1975               0.382271
4  Angola       Angola       AGO                      34654212    Africa     99194        1900          2862               55                 1.915438

In [8]:
cas.columns = [col.replace('Tot Cases//1M pop','casmill').lower() for col in cas.columns]
cas.columns = [col.replace(' ','_').lower() for col in cas.columns] #replaces all the spaces with _ and changes capital letters to lowercase

deaths = cas['total_deaths'].astype(float) #converts the data for total deaths into a float
deaths.head()
deaths.dtypes
Out[8]:
dtype('float64')
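
The effect of the two renaming steps can be checked on the column names alone, without loading the CSV file. The sketch below applies the same list comprehensions to a plain Python list holding the column names printed earlier; the variable name columns is introduced only for this illustration.

```python
columns = ['Country', 'Other names', 'ISO 3166-1 alpha-3 CODE', 'Population',
           'Continent', 'Total Cases', 'Total Deaths', 'Tot Cases//1M pop',
           'Tot Deaths/1M pop', 'Death percentage']

# Step 1: give the cases-per-million column a short name, then lowercase every name.
columns = [col.replace('Tot Cases//1M pop', 'casmill').lower() for col in columns]
# Step 2: replace spaces with underscores.
columns = [col.replace(' ', '_').lower() for col in columns]

print(columns)
# ['country', 'other_names', 'iso_3166-1_alpha-3_code', 'population', 'continent',
#  'total_cases', 'total_deaths', 'casmill', 'tot_deaths/1m_pop', 'death_percentage']
```

Names such as total_deaths can then be used with attribute access (cas.total_deaths), while names that still contain '-' or '/' need bracket access.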

In [9]:

cases = cas['total_cases'].astype(float) #converts the data for the total cases into a float

cases.head()
cases.dtypes

Out[9]:
dtype('float64')

In [10]:
print('mean deaths:',cas.total_deaths.mean()) #calculates the mean number of deaths
print('sum deaths:',cas.total_deaths.sum()) #calculates the total number of deaths
mean deaths: 27448.12888888889
sum deaths: 6175829

In [11]:
print('mean cases:',cas.total_cases.mean()) #calculates the mean number of cases
print('sum cases:',cas.total_cases.sum()) #calculates sum of number of cases
mean cases: 2184781.453333333
sum cases: 491575827
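
As a quick consistency check, the printed means can be reproduced from the printed sums and the 225 rows reported by cas.shape. This assumes no missing values in these columns, which is consistent with their int64 dtype. The variable names below are introduced only for this check.

```python
n_rows = 225             # number of rows, from cas.shape
sum_deaths = 6175829     # printed sum of total deaths
sum_cases = 491575827    # printed sum of total cases

mean_deaths = sum_deaths / n_rows
mean_cases = sum_cases / n_rows
overall_pct = 100 * sum_deaths / sum_cases  # deaths per 100 reported cases, all countries pooled

print(mean_deaths)   # about 27448.13
print(mean_cases)    # about 2184781.45
print(overall_pct)   # about 1.26
```

The pooled figure of roughly 1.26 deaths per 100 cases is a different quantity from the average of the per-country 'Death percentage' column, because large countries dominate the pooled ratio.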

In [12]:
import matplotlib.pyplot as plt
deaths.plot.box(grid=True) #plots the total deaths in box plot format

ax = plt.gca() #gets the axes the box plot was drawn on, without creating a new one
ax.set_ylim(0,20000) #limits the y axis
Out[12]:
(0.0, 20000.0)

In [13]:
import matplotlib.pyplot as plt
cases.plot.box(grid=True) #plots the total cases in box plot format

ax = plt.gca() #gets the axes the box plot was drawn on, without creating a new one
ax.set_ylim(0,2000000) #limits the y axis
Out[13]:
(0.0, 2000000.0)

In [14]:
x = cas['total_cases'] #x is a pandas Series of total cases
y = cas['total_deaths'] #y is a pandas Series of total deaths

import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

plt.plot(x,y,'ro',label="original data") #plots total deaths against total cases

ax = plt.gca() #gets the axes the points were drawn on, without creating a new one
ax.set_xlim(0,30000000) #limits the x axis

plt.xlabel('Total Cases (ten millions)',color='red') #labels x axis
plt.ylabel('Total Deaths (millions)',color='red') #labels y axis

a, b = np.polyfit(x, y, 1) #fits the line of best fit: a is the slope, b is the intercept
plt.plot(x, a*x+b,color='blue') #plots the line of best fit
plt.text(10000000,600000,'y = ' + '{:.2f}'.format(b) + ' + {:.2f}'.format(a) + 'x', size=14,color='blue')
Out[14]:
Text(10000000, 600000, 'y = 819.33 + 0.01x')
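
The call np.polyfit(x, y, 1) returns the coefficients with the highest degree first, so the slope comes before the intercept. The sketch below shows this on synthetic points that lie exactly on a known line, and repeats the fit as an ordinary least-squares problem. The sample points and variable names are assumptions for illustration and are not taken from the COVID-19 data.

```python
import numpy as np

# Synthetic points that lie exactly on y = 2x + 1 (illustrative data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
print(slope, intercept)

# The same fit written as a least-squares problem: minimize ||A @ [slope, intercept] - y||.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)
```

In the plot label of the project code the intercept is printed first and the slope second, which is why the text reads y = 819.33 + 0.01x.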

In [15]:

cas4=cas.loc[deaths> 50000] #limits the data to countries with more than 50000 deaths (the extra "& cas['country']" term was always true and has been dropped)
In [16]:
cas4.plot(x="country",y=["total_deaths"],kind='bar',color='red') #plots the country deaths in a bar graph
plt.xlabel('Country',color='red')
plt.ylabel('Total Deaths (millions)',color='red')

Out[16]:
Text(0, 0.5, 'Total Deaths (millions)')
In [17]:

cas4.plot(x="country",y=["population"],kind='bar') #graphs the population of the countries in a bar graph


plt.xlabel('Country',color='darkblue')
plt.ylabel('Population (billions)',color='darkblue')

Out[17]:
Text(0, 0.5, 'Population (billions)')
In [18]:
cas5=cas.loc[deaths> 500000] #limits the data to countries with more than 500000 deaths
cas5.groupby(['country']).sum().plot(kind='pie', y='total_deaths',autopct='%1.0f%%') #graphs each country's share of the deaths in a pie chart
plt.legend()
plt.show()
In [19]:
print(cas5['country']) #prints the countries with more than 500000 deaths
print(cas5['total_deaths']) #prints the number of deaths for those countries
26 Brazil
92 India
214 USA
Name: country, dtype: object
26 660269
92 521388
214 1008222
Name: total_deaths, dtype: int64
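
The percentages drawn on the pie chart can be reproduced from the three printed death counts. The sketch below uses plain Python with the same rounding as the autopct='%1.0f%%' format; the dictionary name deaths_by_country is introduced only for this illustration.

```python
deaths_by_country = {'Brazil': 660269, 'India': 521388, 'USA': 1008222}

total = sum(deaths_by_country.values())
shares = {country: 100 * d / total for country, d in deaths_by_country.items()}

for country, share in shares.items():
    print('%s: %1.0f%%' % (country, share))
# Brazil: 30%
# India: 24%
# USA: 46%
```

These shares are relative to the three countries combined (2189879 deaths), not to the worldwide total.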
In [20]:
x = cas['total_cases']
y = cas['total_deaths']

def func(x, a, m): # a = intercept, m = slope
    return a + m * x # linear function

popt, pcov = curve_fit(func,x, y) # popt is the set of coefficients of the curve, and pcov is the covariance matrix

print("a = %s, m = %s" % (popt[0], popt[1]))

plt.plot(x, y, 'ro', label = "Original Data")
ax = plt.gca() # gets the axes the points were drawn on, without creating a new one
ax.set_xlim(0,30000000)
plt.plot(x, func(x, *popt), label = "Fitted Curve")
plt.xlabel('total_cases')
plt.ylabel('total_deaths')
plt.legend()
plt.show()
a = 819.3260060413504, m = 0.012188314230475505
In [21]:
print(pcov)
w, v = np.linalg.eig(pcov)
print('eigenvalues \n', w, '\n eigenvectors \n', v)
[[ 7.41805772e+06 -2.81969619e-01]
[-2.81969619e-01 1.29060742e-07]]
eigenvalues
[7.41805772e+06 1.18342726e-07]
eigenvectors
[[ 1.00000000e+00 3.80112463e-08]
[-3.80112463e-08 1.00000000e+00]]
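
A common way to read the covariance matrix returned by curve_fit is to take the square roots of its diagonal as one-standard-deviation uncertainties of the fitted parameters. The sketch below does this for the matrix printed above, entered by hand; the names perr and corr are introduced for the illustration, and the interpretation holds only under the usual least-squares assumptions.

```python
import numpy as np

# Covariance matrix printed above (parameter order: a = intercept, m = slope).
pcov = np.array([[7.41805772e+06, -2.81969619e-01],
                 [-2.81969619e-01, 1.29060742e-07]])

perr = np.sqrt(np.diag(pcov))            # one-sigma uncertainties of a and m
corr = pcov[0, 1] / (perr[0] * perr[1])  # correlation between the two estimates

print(perr)  # roughly [2.72e+03, 3.59e-04]
print(corr)  # roughly -0.29
```

Read this way, the intercept a = 819 has an uncertainty of about 2724, so it is not clearly different from zero, while the slope m = 0.0122 has an uncertainty of about 0.0004.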
