
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

20PAIE51J - MACHINE LEARNING (UNSUPERVISED MODEL) - Group 7

REPORT BATCH 7
a. Import required Libraries
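A minimal sketch of the imports this report relies on; the exact set in the original notebook is not shown, so this list is an assumption based on the steps described below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.cluster.hierarchy import linkage, dendrogram, cophenet, fcluster
from scipy.spatial.distance import pdist
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.metrics import silhouette_score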

b. Read the dataset (tab, csv, xls, txt, inbuilt dataset).
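A sketch of the loading step, assuming the data comes as a CSV file named 'forestfires.csv' (the actual file name and path are not shown in the report).

df = pd.read_csv('forestfires.csv')   # read the comma-separated dataset into a dataframe
df.head()                             # preview the first 5 records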


c. Perform exploratory data analysis on the dataset.
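A sketch of the basic structure checks behind the inferences below.

df.shape    # (517, 13): observations x features
df.info()   # month and day are object (categorical); the rest are numeric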

Inferences:

1. The given dataset has 517 observations and 13 features.

2. There are 2 categorical variables (month, day) and 11 numerical variables.
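The summary statistics discussed next come from pandas' describe; a one-line sketch:

df.describe()   # count, mean, std, min, 25%, 50% (median), 75%, max for each numerical feature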

Inferences:

The describe function provides the five-number summary of the numerical features: min, 25th
percentile, median, 75th percentile and max.

From the result, we could suspect the presence of outliers by comparing the min with the 25th
percentile and the max with the 75th percentile.
Inferences:

X and Y are spatial coordinates, and we are not considering geographical coordinates for
clustering as they are not very helpful here. Hence we drop the X and Y features from the dataframe.
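A sketch of the drop step:

df = df.drop(columns=['X', 'Y'])   # discard the spatial grid coordinates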

Inferences:

From the above results, we could see there are no standard missing values. There are as many
as 509 zeros in the rain column and 247 zeros in the area column, but these appear to be valid
observations (zero rainfall, zero burnt forest area). With the available information, there is no
reason to believe these are caused by errors.
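A sketch of the checks behind these counts:

df.isnull().sum()         # no standard missing values in any column
(df['rain'] == 0).sum()   # 509 observations with zero rainfall
(df['area'] == 0).sum()   # 247 observations with zero burnt area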
Checking for the outliers
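A sketch of the box-plot check, assuming pandas' built-in plotting (the original plotting code is not shown in the report):

num_cols = df.select_dtypes(include='number').columns
df[num_cols].plot(kind='box', subplots=True, layout=(3, 3), figsize=(14, 8))
plt.tight_layout()
plt.show()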

Inferences:

The box plot is used to visualize all the numerical features, and we can see that outliers are
present. These outliers should be handled using techniques such as capping, trimming and
transformations.

Some variables, such as rain, area, FFMC and DC, are heavily skewed. Before clustering, we
need to reduce their skew.
Using Power Transformations to reduce the outliers
* Power transformations can be used on all these fields.
* Box-Cox can be applied only to strictly positive data.
* Yeo-Johnson is used here on all the columns, as it estimates the optimal transformation
from a family of power transformations to reduce the skew of the data (see the sketch below).
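A sketch of the transformation step, reusing num_cols from the outlier check above; standardize=False is an assumption here so that scaling remains a separate step, as described in the SCALING section below.

pt = PowerTransformer(method='yeo-johnson', standardize=False)
df[num_cols] = pt.fit_transform(df[num_cols])   # reduce skew in all numerical columns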

Inferences:

The extreme values are handled using transformation. One of the power transformations,
Yeo-Johnson, is applied to the numerical data, and the numerical features are then visualized
again using box plots. We can see a decrease in the outliers after applying the transformation,
and the skewness is also reduced.
SCALING
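A sketch of the scaling step using scikit-learn's StandardScaler:

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])   # zero mean, unit variance per feature
df.head()                                           # display the first 5 records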
Inferences:

Hierarchical clustering uses a distance measure, and hence it is necessary to bring all feature
values onto the same scale. So the standard scaling technique is applied to the numerical
features, and the first 5 records are displayed.

ENCODING
Inferences:

The categorical variables month and day preserve some order, and hence ordinal encoding
is applied to them. The encoded values are stored in the new features 'month_encoded' and
'day_encoded', and the original columns 'month' and 'day' are removed.
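A sketch of the encoding step. The calendar orderings below are assumptions based on the description above, using the lowercase three-letter abbreviations the dataset stores; mapping by explicit order preserves the calendar sequence, which a lexicographic encoder would not.

month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
day_order = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
df['month_encoded'] = df['month'].map({m: i for i, m in enumerate(month_order)})
df['day_encoded'] = df['day'].map({d: i for i, d in enumerate(day_order)})
df = df.drop(columns=['month', 'day'])   # drop the original categorical columns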

d. Plot the datapoints using Scatter Plot.
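A sketch of the pair plot using seaborn:

sns.pairplot(df)   # scatter plots for every feature pair; distributions on the diagonal
plt.show()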

Inferences:

The pair plot is used to visualize scatter plots for all the numerical features (the non-diagonal
plots are scatter plots), and the scatter plot is also used to visualize the spread of the data
in the numerical variables.

Before applying clustering techniques to the data points, we should know how well spread
the data is. From the plots, we can see that the feature data points are well spread, and we
can apply clustering techniques such as k-means, hierarchical and DBSCAN based on the
business needs.
e. Compute the linkage matrix for the five linkage methods.

Inferences:

1. The linkage matrix represents the distance between the clusters based on the given
linkage method.
2. The linkage methods are single, complete, average, centroid and ward.
3. The linkage matrix is calculated for each of these methods and the matrix is printed.
4. Each linkage method gives a different dendrogram representation.

The linkage matrix has 4 columns. The values in the first two columns are the indices of the
observations (or previously formed clusters) that are merged in pairs to form a new cluster.
The third column represents the cophenetic distance between them, and the fourth column
gives the number of observations in the newly formed cluster.
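A sketch of computing the linkage matrices for all five methods with scipy:

X = df.to_numpy()
methods = ['single', 'complete', 'average', 'centroid', 'ward']
linkages = {m: linkage(X, method=m) for m in methods}
for m in methods:
    print(m)
    print(linkages[m][:5])   # columns: index 1, index 2, distance, cluster size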

f. Draw dendrogram for the above five clustering methods.
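A sketch of the dendrogram plots, reusing the linkages computed above; truncating to the last 30 merges is an assumption made here for readability.

fig, axes = plt.subplots(5, 1, figsize=(10, 22))
for ax, m in zip(axes, methods):
    dendrogram(linkages[m], ax=ax, truncate_mode='lastp', p=30)
    ax.set_title(f'{m} linkage')
plt.tight_layout()
plt.show()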


Inferences:

1. The dendrogram is visualized for each linkage method. It plays a major role in finding
the optimal number of clusters.
2. From the plots, we can see that observations linked at a low height are similar, while
dissimilar observations fuse at a higher level.
3. The x-axis represents the data points and the y-axis represents the distance.
4. The optimal number of clusters is the count that remains constant over the largest
distance on the y-axis, and hence we can conclude that the optimal number of clusters is 2.

g. Calculate the Cophenetic Correlation coefficient for the above five methods.
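A sketch of the cophenetic correlation computation, reusing X and the linkages from above:

for m in methods:
    c, _ = cophenet(linkages[m], pdist(X))   # correlation between cophenetic and pairwise distances
    print(f'{m}: cophenetic correlation = {c:.3f}')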

h. Plot the best method's labels using the scatter plot.
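A sketch of extracting 2 cluster labels from the best method per the cophenetic comparison ('average') and plotting them; the pair of features shown here is an illustrative assumption, as the report does not state which axes were plotted.

labels = fcluster(linkages['average'], t=2, criterion='maxclust')
plt.scatter(df['temp'], df['area'], c=labels, cmap='viridis')
plt.xlabel('temp')
plt.ylabel('area')
plt.title('Average linkage, 2 clusters')
plt.show()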


Inferences:

The best linkage method as determined by comparing cophenetic coefficients, 'average',
does not represent the best cluster formation when inspected visually. This can be traced to
Ward being the best method under a heavy presence of outliers, even though we have treated them.

We confirmed this by dropping the rain column; in that case, average clustering also
performed well visually. We infer that average clustering has limitations when it comes to
heavily skewed columns (other linkage methods have this issue too).

We also understand that validating clustering output through visualization and domain
sense is important, and that indices such as the cophenetic coefficient have limitations in
capturing clustering performance and guiding the final choice of clustering method.

We can also infer that the 'rain' column doesn't contribute much to clustering performance.

We have also produced silhouette scores for all methods below to assess clustering quality and
to validate the optimum number of clusters as 2.
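A sketch of the silhouette computation for all five methods at k = 2, reusing X and the linkages from above:

for m in methods:
    labels = fcluster(linkages[m], t=2, criterion='maxclust')   # cut each tree into 2 clusters
    print(f'{m}: silhouette = {silhouette_score(X, labels):.3f}')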
