Machine Learning Clustering Report

Data Science Project Training Report
on
Machine Learning Domain Projects for Regression,
Classification and Clustering using Various
Datasets
BACHELOR OF TECHNOLOGY
Session 2021-22
in
Information Technology
By
JATIN CHAUDHARY
2000321540032
AATIF JAMSHED
ASSISTANT PROFESSOR
DEPARTMENT OF INFORMATION TECHNOLOGY

ABES ENGINEERING COLLEGE, GHAZIABAD
Student’s Declaration
I hereby declare that the work being presented in this report entitled
“COVID-19 PREDICTION” is an authentic record of my own work
carried out under the supervision of AATIF JAMSHED,
Assistant Professor, Information Technology.
Date: 01/07/22
Signature of student
(Name: Jatin Chaudhary)
(Roll No. 2000321540032)
Department: CSDS
This is to certify that the above statement made by the candidate(s) is correct to
the best of my knowledge.
Signature of HOD: Dr. Amit Sinha, Information Technology
Signature of Teacher: Aatif Jamshed, Assistant Professor, Information Technology
Date:………….
Table of Contents
Student’s Declaration
Chapter 1 : Clustering
1.1 : Dataset
1.2 : Project
PROJECT – COVID-19 CORONAVIRUS (Clustering)
Clustering: Clustering, or cluster analysis, is a machine learning technique that
groups an unlabelled dataset. It can be defined as "A way of grouping the data
points into different clusters, consisting of similar data points. The objects with
the possible similarities remain in a group that has less or no similarities with
another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape,
size, color, or behavior, and dividing the data points according to the presence or
absence of those patterns.
It is an unsupervised learning method: no supervision is provided to the
algorithm, and it works on an unlabelled dataset.
After clustering is applied, each cluster or group is assigned a cluster ID. An ML
system can use this ID to simplify the processing of large and complex datasets.
The diagram below illustrates how a clustering algorithm works: different fruits are
divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into hard clustering (each data point
belongs to only one group) and soft clustering (a data point can belong to more than
one group). Various other approaches to clustering also exist. Below are the
main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k is the pre-defined
number of groups. The cluster centers are chosen so that each data point is closer
to the centroid of its own cluster than to the centroid of any other cluster.
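The assign-then-update loop behind centroid-based clustering can be sketched in a few lines of NumPy. This is a minimal illustration and not the code used in the project: the function name `kmeans`, the toy points, and the random initialisation are assumptions made only for this example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumed init scheme).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of the points assigned to it.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two well-separated groups in 2-D.
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
centroids, labels = kmeans(toy, k=2)
print(labels)
print(centroids)
```

The first three points end up in one cluster and the last three in the other; which cluster receives label 0 depends on the random initialisation.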
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters,
and arbitrarily shaped clusters can form as long as the dense regions can be
connected. The algorithm identifies dense regions in the dataset and joins
neighbouring dense regions into clusters. The dense areas in data space are
separated from each other by sparser areas.
These algorithms can have difficulty clustering the data points if the dataset has
varying densities or high dimensionality.
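As a small illustration of density-based grouping, the sketch below runs scikit-learn's `DBSCAN` on hypothetical points: two dense patches and one isolated point. The values of `eps` and `min_samples` are assumptions chosen to suit this toy data, not tuned settings from the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense patches and one far-away point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],   # dense region A
    [3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1],   # dense region B
    [10.0, 10.0],                                      # isolated point
])

# eps is the neighbourhood radius; min_samples is the number of points
# (including the point itself) needed for a neighbourhood to count as dense.
db = DBSCAN(eps=0.3, min_samples=3).fit(points)
db_labels = db.labels_
print(db_labels)  # points labelled -1 are treated as noise
```

The number of clusters is never passed in: it follows from how the dense regions connect, and the isolated point is reported as noise rather than forced into a cluster.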
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming some distribution, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering
algorithm, which uses Gaussian Mixture Models (GMM).
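A minimal sketch of distribution-based clustering, assuming scikit-learn's `GaussianMixture` (which is fitted with Expectation-Maximization) and synthetic data drawn from two Gaussians. The sample sizes, means, and spreads are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: 50 points around (0, 0) and 50 points around (5, 5).
rng = np.random.default_rng(0)
samples = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(samples)
resp = gmm.predict_proba(samples)   # probability of each component for each point
hard = resp.argmax(axis=1)          # most probable component per point
print(np.round(resp[:3], 3))
```

Each row of `resp` sums to 1, so the model gives a probability for every component instead of a single hard label.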
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as
there is no requirement to pre-specify the number of clusters to be created. In
this technique, the dataset is divided into clusters to create a tree-like structure,
which is also called a dendrogram. Any number of clusters can be obtained by
cutting the tree at the appropriate level. The most common example of
this method is the Agglomerative Hierarchical algorithm.
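Cutting the tree at different levels can be sketched with SciPy's hierarchy module. The toy points and the choice of Ward linkage are assumptions for this example; the same merge table is cut once into two clusters and once into three.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two groups of three points.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# Bottom-up merging: each row of `merges` records one merge of the dendrogram.
merges = linkage(points, method="ward")

# Cut the same tree at two different levels.
two_clusters = fcluster(merges, t=2, criterion="maxclust")
three_clusters = fcluster(merges, t=3, criterion="maxclust")
print(two_clusters)
print(three_clusters)
```

The tree is built once and the number of clusters is decided afterwards, which is the main practical difference from k-means.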
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more
than one group or cluster. Each data object has a set of membership coefficients,
one per cluster, which express its degree of membership in that cluster. The Fuzzy
C-means algorithm is an example of this type of clustering; it is sometimes also
known as the Fuzzy k-means algorithm.
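The membership coefficients can be made concrete with a short NumPy sketch of the standard fuzzy c-means updates. The function name `fuzzy_c_means`, the fuzzifier value m = 2, and the toy data (including a point placed between the two groups) are assumptions for illustration only.

```python
import numpy as np

def fuzzy_c_means(points, c, m=2.0, n_iter=200, seed=0):
    """Fuzzy c-means sketch: returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalised so that each row sums to 1.
    u = rng.random((len(points), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        w = u ** m
        # Centers are membership-weighted means of the data points.
        centers = (w.T @ points) / w.sum(axis=0)[:, None]
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)           # guard against division by zero
        # Closer centers receive larger membership; rows are renormalised to 1.
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

# Hypothetical data: two groups plus one point roughly halfway between them.
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1],
                [2.5, 2.5]])
centers, memberships = fuzzy_c_means(toy, c=2)
print(np.round(memberships, 2))
```

Points inside a group get a membership close to 1 for that group, while the in-between point gets memberships near one half for both, which is the behaviour that hard clustering cannot express.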
Clustering Algorithms
The clustering algorithms can be divided based on the models explained
above. Many clustering algorithms have been published, but only a few
are commonly used. The choice of algorithm depends on the kind of data that we
are using. For example, some algorithms need the number of clusters in the
given dataset to be specified in advance, whereas others work from the minimum
distance between the observations of the dataset.
Here we discuss the most popular clustering algorithms that are widely used in
machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It partitions the dataset by dividing the samples into clusters of
equal variance. The number of clusters must be specified in advance. It is fast and
requires relatively few computations: for a fixed number of clusters, the cost of
each iteration is linear in the number of samples, O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model that works
by updating candidate centroids to be the mean of the points within a given
region.
3. DBSCAN algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to
mean-shift, but with some remarkable advantages. In this algorithm, the areas of high
density are separated by areas of low density. Because of this, clusters of
arbitrary shape can be found.
4. Expectation-Maximization clustering using GMM: This algorithm can be used as
an alternative to the k-means algorithm, or in cases where k-means may
fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm
performs bottom-up hierarchical clustering. Each data point is treated as a
single cluster at the outset, and clusters are then successively merged. The cluster
hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not
require the number of clusters to be specified. Messages are exchanged between
pairs of data points until convergence. It has O(N²T) time complexity, where N is
the number of samples and T is the number of iterations, which is the main drawback
of this algorithm.
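Of the algorithms listed above, mean-shift is perhaps the easiest to sketch by hand. The NumPy example below uses a flat kernel: every candidate starts at a data point and is repeatedly moved to the mean of the data points within a fixed radius. The function name `mean_shift`, the bandwidth, and the toy data are assumptions for illustration.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50):
    """Flat-kernel mean-shift sketch: returns the final position of each candidate."""
    modes = points.copy()                      # one candidate centroid per data point
    for _ in range(n_iter):
        for i in range(len(modes)):
            # Data points inside the window centred on the current candidate.
            inside = np.linalg.norm(points - modes[i], axis=1) <= bandwidth
            # Move the candidate to the mean of the points in its window.
            modes[i] = points[inside].mean(axis=0)
    return modes

# Hypothetical toy data: two groups of three points.
toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
modes = mean_shift(toy, bandwidth=1.0)

# Candidates that settle on the same position belong to the same cluster.
unique_modes = np.unique(np.round(modes, 3), axis=0)
print(unique_modes)
```

The number of clusters is not an input here either: it is simply the number of distinct positions on which the candidates settle.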
PROJECT BASED LEARNING AND IMPLEMENTATION
Problem Statement:
Before we start with the code, we need to import all the required libraries in
Python.
I follow a convention of dedicating one cell in the notebook to imports
only. This is helpful when we want to add import statements later: we just
need to re-run the cell that contains the imports, and it will not affect the
remaining code.

Imports required:-
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(color_codes=True)
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
CODE:-
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #graphing tool
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
# Note: os.walk expects a directory; when given a file name, as here, it yields nothing
for dirname, _, filenames in os.walk('COVID-19_coronavirus.csv'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
In [7]:
cas=pd.read_csv('COVID-19_coronavirus.csv') #reads the file
print(cas.shape,'\n')
print(cas.dtypes,'\n')
print(cas.columns,'\n')
cas.head()
(225, 10)
Country object
Other names object
ISO 3166-1 alpha-3 CODE object
Population int64
Continent object
Total Cases int64
Total Deaths int64
Tot Cases//1M pop int64
Tot Deaths/1M pop int64
Death percentage float64
dtype: object
Index(['Country', 'Other names', 'ISO 3166-1 alpha-3 CODE', 'Population',
       'Continent', 'Total Cases', 'Total Deaths', 'Tot Cases//1M pop',
       'Tot Deaths/1M pop', 'Death percentage'],
      dtype='object')
Out[7]:
       Country  Other names  ISO 3166-1 alpha-3 CODE  Population  Continent  Total Cases  Total Deaths  Tot Cases//1M pop  Tot Deaths/1M pop  Death percentage
0  Afghanistan  Afghanistan                      AFG    40462186       Asia       177827          7671               4395                190          4.313743
1      Albania      Albania                      ALB     2872296     Europe       273870          3492              95349               1216          1.275058
2      Algeria      Algeria                      DZA    45236699     Africa       265691          6874               5873                152          2.587216
3      Andorra      Andorra                      AND       77481     Europe        40024           153             516565               1975          0.382271
4       Angola       Angola                      AGO    34654212     Africa        99194          1900               2862                 55          1.915438
In [8]:
cas.columns = [col.replace('Tot Cases//1M pop','casmill').lower() for col in cas.columns] #renames the cases-per-million column and lowercases all names
cas.columns = [col.replace(' ','_').lower() for col in cas.columns] #replaces all the spaces with _ and changes capital letters to lowercase
deaths = cas['total_deaths'].astype(float) #converts the data for total deaths into a float
deaths.head()
deaths.dtypes
Out[8]:
dtype('float64')
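The effect of the two renaming steps is easiest to see on a small frame. The sketch below rebuilds only a few columns of the first two rows shown in Out[7]; the name `demo` is an assumption used for this illustration.

```python
import pandas as pd

# A two-row stand-in for the dataset, using values shown in Out[7].
demo = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Total Cases": [177827, 273870],
    "Total Deaths": [7671, 3492],
    "Tot Cases//1M pop": [4395, 95349],
    "Tot Deaths/1M pop": [190, 1216],
})

# Step 1: rename the awkward cases-per-million column and lowercase every name.
demo.columns = [col.replace("Tot Cases//1M pop", "casmill").lower() for col in demo.columns]
# Step 2: replace spaces with underscores so columns work with attribute access.
demo.columns = [col.replace(" ", "_") for col in demo.columns]

new_names = list(demo.columns)
deaths_demo = demo["total_deaths"].astype(float)   # same int-to-float conversion as above
print(new_names)
print(deaths_demo.dtype)
```

After these steps a column such as `total_deaths` can be reached as `demo.total_deaths`, which is how the later cells access `cas.total_deaths` and `cas.total_cases`.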
In [9]:
cases = cas['total_cases'].astype(float) #converts the data for the total cases into a float
cases.head()
cases.dtypes
Out[9]:
dtype('float64')
In [10]:
print('mean deaths:',cas.total_deaths.mean()) #calculates the mean number of deaths
print('sum deaths:',cas.total_deaths.sum()) #calculates the total number of deaths
mean deaths: 27448.12888888889
sum deaths: 6175829
In [11]:
print('mean cases:',cas.total_cases.mean()) #calculates the mean number of cases
print('sum cases:',cas.total_cases.sum()) #calculates sum of number of cases
mean cases: 2184781.453333333
sum cases: 491575827
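The printed means can be cross-checked from the printed sums, since the dataset has 225 rows (see `cas.shape`). The short calculation below uses only the numbers reported above; `overall_ratio` is an extra quantity added for illustration and is not computed in the notebook.

```python
# Figures reported by the notebook.
n_rows = 225                 # number of countries, from cas.shape
sum_deaths = 6175829
sum_cases = 491575827

# mean = sum / number of rows
mean_deaths = sum_deaths / n_rows
mean_cases = sum_cases / n_rows

# Deaths as a fraction of cases over the whole dataset. This is not the same
# as averaging the per-country 'Death percentage' column.
overall_ratio = sum_deaths / sum_cases

print("mean deaths:", mean_deaths)
print("mean cases:", mean_cases)
print("overall deaths / cases:", overall_ratio)
```

Both means agree with the values printed by the notebook, and the overall ratio comes out at roughly 1.26 %.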
In [12]:
ax = deaths.plot.box(grid=True) #plots the total deaths in box plot format and keeps the returned axes
ax.set_ylim(0,20000) #limits the y axis
Out[12]:
(0.0, 20000.0)
In [13]:
ax = cases.plot.box(grid=True) #plots the total cases in box plot format and keeps the returned axes
ax.set_ylim(0,2000000) #limits the y axis
Out[13]:
(0.0, 2000000.0)
In [14]:
x = cas['total_cases'] #x is a pandas Series
y = cas['total_deaths'] #y is a pandas Series
from scipy.optimize import curve_fit
plt.plot(x,y,'ro',label="original data") #plots the total deaths against the total cases
ax = plt.gca() #gets the current axes instead of creating a new one
ax.set_xlim(0,30000000)
plt.xlabel('Total Cases (ten millions)',color='red') #labels x axis
plt.ylabel('Total Deaths (millions)',color='red') #labels y axis
a, b = np.polyfit(x, y, 1) #fits the line of best fit: a is the slope, b is the intercept
plt.plot(x, a*x+b,color='blue') #plots the line of best fit
plt.text(10000000,600000,'y = ' + '{:.2f}'.format(b) + ' + {:.2f}'.format(a) + 'x', size=14,color='blue')
Out[14]:
Text(10000000, 600000, 'y = 819.33 + 0.01x')
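The order of the values returned by `np.polyfit` is easy to mix up, so here is a small check on made-up data that lies exactly on a known line. The arrays `x_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np

# Made-up points lying exactly on y = 3 + 0.5x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 3.0 + 0.5 * x_demo

# For degree 1, polyfit returns the highest power first: (slope, intercept).
slope, intercept = np.polyfit(x_demo, y_demo, 1)

# Same label format as the plot annotation: intercept first, then slope.
equation = "y = {:.2f} + {:.2f}x".format(intercept, slope)
print(equation)
```

In the notebook, `a` is therefore the slope (about 0.0122 deaths per case, shown rounded to 0.01 in the label) and `b` is the intercept (about 819).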
In [15]:
cas4=cas.loc[deaths > 50000] #limits the data to countries with more than 50000 deaths
In [16]:
cas4.plot(x="country",y=["total_deaths"],kind='bar',color='red') #plots the country deaths in a bar graph
plt.xlabel('Country',color='red')
plt.ylabel('Total Deaths (millions)',color='red')
Out[16]:
Text(0, 0.5, 'Total Deaths (millions)')
In [17]:
cas4.plot(x="country",y=["population"],kind='bar') #graphs the population of the countries in a bar graph

plt.xlabel('Country',color='darkblue')
plt.ylabel('Population (billions)',color='darkblue')
Out[17]:
Text(0, 0.5, 'Population (billions)')
In [18]:
cas5=cas.loc[deaths > 500000] #limits the data to countries with more than 500000 deaths
cas5.groupby(['country']).sum().plot(kind='pie', y='total_deaths',autopct='%1.0f%%') #graphs each country's share of the deaths in a pie chart
plt.legend()
plt.show()
In [19]:
print(cas5['country']) #prints the countries with more than 500000 deaths
print(cas5['total_deaths']) #prints the number of deaths for those countries
26 Brazil
92 India
214 USA
Name: country, dtype: object
26 660269
92 521388
214 1008222
Name: total_deaths, dtype: int64
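The row selection and the percentages behind the pie chart can be reproduced on a small frame. The sketch below uses the three death counts printed above plus Afghanistan's count from Out[7] as a row that should be filtered out; the names `demo`, `mask`, `high`, and `share` are assumptions for this example.

```python
import pandas as pd

# Death counts taken from the notebook output.
demo = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "India", "USA"],
    "total_deaths": [7671, 660269, 521388, 1008222],
})

# A boolean mask selects the rows; no condition on the country column is needed.
mask = demo["total_deaths"] > 500000
high = demo.loc[mask]

# Each country's share of the deaths among the selected countries, in percent.
share = high["total_deaths"] / high["total_deaths"].sum() * 100
print(high["country"].tolist())
print(share.round(1).tolist())
```

These shares are the values the pie chart rounds to whole percentages through `autopct='%1.0f%%'`.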
In [20]:
x = cas['total_cases']
y = cas['total_deaths']
def func(x, a, m): # a = intercept, m = slope
    return a + m * x # linear function
popt, pcov = curve_fit(func,x, y) # popt is the set of coefficients of the curve, and pcov is the covariance matrix
print("a = %s, m = %s" % (popt[0], popt[1]))
plt.plot(x, y, 'ro', label = "Original Data")
ax = plt.gca() # gets the current axes instead of creating a new one
ax.set_xlim(0,30000000)
plt.plot(x, func(x, *popt), label = "Fitted Curve")
plt.xlabel('total_cases')
plt.ylabel('total_deaths')
plt.legend()
plt.show()
a = 819.3260060413504, m = 0.012188314230475505
In [21]:
print(pcov)
w, v = np.linalg.eig(pcov)
print('eigenvalues \n', w, '\n eigenvectors \n', v)
[[ 7.41805772e+06 -2.81969619e-01]
[-2.81969619e-01 1.29060742e-07]]
eigenvalues
[7.41805772e+06 1.18342726e-07]
eigenvectors
[[ 1.00000000e+00 3.80112463e-08]
[-3.80112463e-08 1.00000000e+00]]
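One common way to read the covariance matrix returned by `curve_fit` is through the standard errors of the parameters, which are the square roots of its diagonal. The sketch below applies this to the matrix printed above; the names `pcov_reported`, `std_err`, and `corr` are assumptions for this illustration, and the interpretation assumes the linear model is adequate for this data.

```python
import numpy as np

# Covariance matrix of (a, m) as printed by the notebook.
pcov_reported = np.array([[7.41805772e+06, -2.81969619e-01],
                          [-2.81969619e-01, 1.29060742e-07]])

# One-standard-deviation uncertainties of the intercept a and the slope m.
std_err = np.sqrt(np.diag(pcov_reported))

# Correlation between the two parameter estimates.
corr = pcov_reported[0, 1] / (std_err[0] * std_err[1])

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(pcov_reported)

print("std error of a:", std_err[0])
print("std error of m:", std_err[1])
print("correlation:", corr)
```

Under that reading, the intercept (about 819) is smaller than its own standard error (about 2724), so it is not clearly different from zero, while the slope (about 0.0122) is many times larger than its standard error (about 0.00036). The eigenvectors printed above are almost aligned with the parameter axes because the two variances differ by roughly 13 orders of magnitude.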
