
Course Code: MGNM801 Course Title: Business Analytics - I

Course Instructor: Dr. Avinash Rana

Academic Task No.: 2 Academic Task Title: CA-2

Date of Allotment: 09/25/2022 Date of submission: 10/08/2022

Student’s Roll no: RQ1969B37 Student’s Reg. no: 11903267


Evaluation Parameters:

Learning Outcomes: (Student to write briefly about learnings obtained from the academic tasks)

Declaration:
I declare that this Assignment is my individual work. I have not copied it from any other student’s work
or from any other source except where due acknowledgement is made explicitly in the text, nor has any
part been written for me by any other person.
Student’s Signature:

Evaluator’s comments (For Instructor’s use only)

General Observations Suggestions for Improvement Best part of assignment

Evaluator’s Signature and Date:


Marks Obtained: _______________ Max. Marks: ______________
KNN - Rscript
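library(caret)
library(e1071)
library(class)  # provides knn()

# `glass` is assumed to be already loaded in the session
str(glass)

# Convert the class label to a factor and keep the 9 predictors
glasstype = as.factor(glass$`Type of glass`)
glassdata = data.frame(glasstype,
  glass$RI, glass$Na, glass$Mg, glass$Al, glass$Si, glass$K, glass$Ca, glass$Ba, glass$Fe)
str(glassdata)
set.seed(9999)
glassdata

# Draw 80% of the row indices at random for training
ran <- sample(1:nrow(glassdata), 0.8 * nrow(glassdata))

summary(glassdata)

# Min-max normalisation: rescale each predictor to the range [0, 1]
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }

glassdata_norm <- as.data.frame(lapply(glassdata[, c(2:10)], nor))

summary(glassdata_norm)

# Split the predictors and the class labels into training and test sets
glassdata_train <- glassdata_norm[ran, ]
glassdata_test <- glassdata_norm[-ran, ]
glassdata_target_category <- glassdata[ran, 1]
glassdata_test_category <- glassdata[-ran, 1]

# Classify each test row by its single nearest neighbour (k = 1)
kr <- knn(glassdata_train, glassdata_test, cl = glassdata_target_category, k = 1)

# Confusion table: rows are predicted types, columns are actual types
tab <- table(kr, glassdata_test_category)
tab

# Accuracy (%) = correctly classified entries (diagonal) / all entries
accuracy <- function(x) { sum(diag(x)) / sum(x) * 100 }
accuracy(tab)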
KNN – Output
Interpretation:
A k-nearest neighbours (KNN) model is a supervised machine learning method that classifies
the entries of a dataset into certain categories, which are represented by a factor vector. In
this KNN script, we are classifying the type of glass based on the following properties:
1. Refractive index
2. Percentage of the following elements:
a. Sodium (Na)
b. Magnesium (Mg)
c. Aluminium (Al)
d. Silicon (Si)
e. Potassium(K)
f. Calcium (Ca)
g. Barium (Ba)
h. Iron (Fe)
In this data set, there are 7 glass types. This variable is stored as a character vector, so it
is converted into a factor. This allows us to classify the entries of the data set
into glass types. There are 11 variables and 214 observations. Of these 214 observations,
80% (171) are used for training, and the remaining 20%, i.e. 43 observations, are used for
testing on the basis of 9 variables: the refractive index and the eight elements present in the
glass.
From the table produced by
tab <- table(kr,glassdata_test_category)
tab
we can see the classification. The rows show the predicted glass type and the columns show
the actual glass type. There are 9 entries that are confirmed as glass type 1. However,
1 entry is confused between glass types 1 and 2. Similarly, in the second column
of the output table, 3 entries are confused between glass types 1 and 2. The same applies to
the rest of the entries. Overall, out of 43 test entries, 15 are incorrectly classified and 28 are
correctly classified. This gives an accuracy of about 65% (28/43), which has been
obtained with the accuracy function.
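The normalisation, split size and accuracy figures above can be checked with a small base-R sketch. The vector `ri` and the table `toy_tab` are made-up values for illustration only; the 214 observations, the 80% split and the 15 misclassified entries are the figures from the text.

```r
# Min-max normalisation, as used by nor() in the script
nor <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical values, for illustration only
ri <- c(1.51, 1.52, 1.53)
ri_norm <- nor(ri)   # smallest value maps to 0, largest to 1

# Size of the test set for an 80/20 split of 214 rows
n_total <- 214
n_train <- floor(0.8 * n_total)
n_test  <- n_total - n_train

# Accuracy implied by 15 misclassified test entries
acc_reported <- (n_test - 15) / n_test * 100

# Accuracy from a confusion table: correct predictions lie on the diagonal
accuracy <- function(x) sum(diag(x)) / sum(x) * 100
toy_tab <- matrix(c(9, 1,
                    3, 7), nrow = 2, byrow = TRUE)   # hypothetical counts
acc_toy <- accuracy(toy_tab)   # 16 correct out of 20
```

With these numbers, `n_test` is 43 and `acc_reported` rounds to 65.1, which matches the accuracy reported above.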
Naive Bayes - Rscript

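library(e1071)    # naiveBayes()
library(caTools)  # sample.split()
library(caret)    # confusionMatrix()

# Set the seed before the random split so that the split is reproducible
set.seed(9999)

# sample.split() expects the vector of class labels, not the whole data frame
split <- sample.split(glassdata$glasstype, SplitRatio = 0.95)

train_cl <- subset(glassdata, split == TRUE)
test_cl <- subset(glassdata, split == FALSE)

# Standardised copies of the predictors (not used by the classifier below)
train_scale <- scale(train_cl[,2:10])
test_scale <- scale(test_cl[,2:10])

# Fit the model; printing it shows the class priors and conditional distributions
classifier_cl <- naiveBayes(glasstype ~ ., data = train_cl)
classifier_cl

# Predict the glass type of the held-out rows
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion matrix: rows are actual types, columns are predicted types
cm <- table(test_cl$glasstype, y_pred)
cm

confusionMatrix(cm)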
Naïve Bayes - Output
Interpretation:

In the Naive Bayes model, the data is split into 2 sets: a training set and a testing set. The
Naive Bayes model estimates the conditional probabilities of the predictor variables, such as
refractive index, amount of sodium, magnesium and the other elements, given each glass
type. In this case they are shown by classifier_cl. These conditional probabilities are used to
predict the glass type of each test entry, and the predictions are compared with the actual
types in the confusion matrix.

According to the output table, the prediction for glass type 1 is correct for 5 values, whereas
4 values have been wrongly predicted as glass type 2 and 1 value has been wrongly predicted
as glass type 3. Similarly, there is 1 wrong value for glass type 1 and 2 correct
values for glass type 2. Glass types 3, 5, 6 and 7 have 1, 0, 1 and 0 correct predictions
respectively, and 1, 0, 6 and 0 incorrect predictions respectively. This leaves
the output at 40.9% accuracy.
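As a quick check, the reported accuracy can be recomputed from the counts listed above. This is only a sketch: it assumes the correct and incorrect counts are read per glass type exactly as stated in the interpretation, and the variable names are illustrative.

```r
# Correct and incorrect prediction counts per glass type, as listed in the text
glass_types <- c("1", "2", "3", "5", "6", "7")
correct     <- c(5, 2, 1, 0, 1, 0)
incorrect   <- c(4 + 1, 1, 1, 0, 6, 0)

n_test   <- sum(correct) + sum(incorrect)   # total number of test entries
accuracy <- sum(correct) / n_test * 100     # share of correct predictions, in %
round(accuracy, 1)
```

Under this reading there are 9 correct predictions out of 22 test entries, which gives the 40.9% stated above.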
Decision Tree – Rscript

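library(ISLR)
library(rpart)       # rpart(), prune()
library(rpart.plot)  # prp()

str(glassdata)

# Grow a classification tree on all predictors
tree <- rpart(glasstype ~., data=glassdata)

# Pick the complexity parameter with the lowest cross-validated error
best <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]

pruned_tree <- prune(tree, cp=best)

# Plot the pruned tree; extra=1 shows the number of observations per class in each node
prp(pruned_tree,
    faclen=0,
    extra=1,
    roundint=FALSE,
    digits=5)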
Decision Tree – Output

Interpretation:

The decision tree classifies the glass types based on properties such as refractive index and
the elements in the glass. The first split is made on the amount of Barium.
Observations with more than 0.335% Barium are classified as glass type
7; however, this node also contains 3 observations that belong to glass
types 1, 2 and 5 (one each). Observations with less Barium are classified further on the basis
of Aluminium. The Aluminium split in turn requires further splits
on Calcium and Magnesium.
In the Magnesium branch, if the magnesium content is more than or
equal to 2.26%, the observation is classified as glass type 2. But here, 6, 4 and 1 observations
of glass types 1, 2 and 6 respectively are present, which are incorrectly classified. If the
magnesium content is less than 2.26%, there is a further split on the sodium content. If it is
more than 13.4%, the observation is considered glass type 6, which has 3 wrong entries. Glass
type 5 has 11 entries with sodium content less than 13.4%, and there is 1 incorrect entry here,
which belongs to glass type 2.

When classifying on the basis of Calcium content, observations with calcium content
more than 10.4% are categorised as glass type 2. Glass type 2 therefore appears in two
nodes, which differ in aluminium content. If the calcium content is less than
10.4%, further classification is done on the basis of refractive index. If the refractive
index is less than 1.5, the observation is classified as glass type 3, where there are 3 incorrect
classifications of glass type 1, 3 of glass type 2, and 1 each of glass types 6 and 7. Finally,
observations with refractive index more than or equal to 1.5 are classified further on the
basis of Magnesium once again: glass type 1 has magnesium content below 3.8%,
and glass type 3 has magnesium content above 3.8%.
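The magnesium and sodium branch described above can be read as a set of nested rules. The sketch below covers that branch only; the thresholds 2.26 and 13.4 come from the text, while the function name `classify_mg_branch` and the handling of a sodium value exactly equal to 13.4 are assumptions made for illustration.

```r
# Nested rules for the magnesium/sodium branch (illustrative sketch only)
classify_mg_branch <- function(mg, na) {
  if (mg >= 2.26) {
    "2"            # magnesium at or above 2.26%: glass type 2
  } else if (na >= 13.4) {
    "6"            # low magnesium, higher sodium: glass type 6 (boundary case assumed)
  } else {
    "5"            # low magnesium, lower sodium: glass type 5
  }
}

classify_mg_branch(mg = 3.0, na = 13.0)
classify_mg_branch(mg = 1.0, na = 14.0)
classify_mg_branch(mg = 1.0, na = 12.5)
```

The three calls return "2", "6" and "5" respectively, one for each leaf of this branch.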
Kmeans – Rscript
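library(dplyr)       # %>%, select(), group_by(), summarise()
library(factoextra)  # fviz_nbclust(), fviz_cluster()
library(ggplot2)

# Keep only the 9 numeric variables
Glasses = Glass_Types %>% select(RI,Na,Mg,Al,Si,K,Ca,Ba,Fe)
Glasses
str(Glasses)

# Elbow plot of the within-cluster sum of squares (WSS)
fviz_nbclust(x = Glasses, kmeans, method = "wss")

set.seed(123)
km.res = kmeans(Glasses,6,nstart = 25)
km.res
fviz_cluster(km.res,data = Glasses)
km.res$cluster

# Attach the cluster labels and compute the mean of every variable per cluster
clusterdata = cbind(Glasses,cluster = km.res$cluster)
a=clusterdata %>% group_by(cluster) %>%
  summarise(mean(RI),mean(Na),mean(Mg),mean(Al),mean(Si),mean(K),mean(Ca),mean(Ba),mean(Fe))
a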
Kmeans - Output
Interpretation:
According to the WSS plot, the optimum number of clusters is where the curve
starts to flatten. In this case, it is 6. Hence, k has been set to 6 in the R script. There
are 214 observations in the data set, which have been grouped into 6 clusters as shown in the
image just above. However, there is some inaccuracy, as these clusters
overlap to some extent. The largest overlap is seen between clusters 1 and 3, followed by
the overlaps between clusters 2 and 5 and between clusters 3 and 4. According to the table
given by km.res$centers, the mean values of refractive index, sodium, magnesium and so on
are 1.52, 13.89, 3.34 and so on for cluster 1. The same applies to the other clusters. Because
the values of the centres are so close to one another, the clusters overlap. These
centres are the classifying criteria: each observation is assigned to the cluster whose centre
is nearest to it.
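The elbow reasoning above can be sketched in base R without factoextra. The built-in iris measurements are used here as stand-in data (an assumption for illustration, not the glass data), and the choice of 3 clusters in the second half is likewise arbitrary.

```r
# Stand-in data: the four numeric columns of the built-in iris data set
x <- iris[, 1:4]

set.seed(123)
# Total within-cluster sum of squares (WSS) for k = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
round(wss, 1)   # the "elbow" is where these values stop dropping sharply

# The cluster centres are the per-cluster means of each variable
km <- kmeans(x, centers = 3, nstart = 25)
cluster_means <- aggregate(x, by = list(cluster = km$cluster), FUN = mean)
cluster_means
```

Plotting `wss` against k gives the same kind of curve as the WSS plot, and `cluster_means` should agree with `km$centers`, which is why the centres can be read as the average profile of each cluster.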
Hierarchical Unsupervised – Rscript

library(tidyverse)
library(cluster)
library(factoextra)
library(dendextend)

# Work with the first 50 observations only
Glasses1 = Glasses %>% slice(1:50)
data1= Glasses1
data1=na.omit(data1)
d = dist(data1, method = "euclidean")
hc1 = hclust(d, method = "complete" )
plot(hc1, cex = 0.6, hang = -1)
hc2 = agnes(data1, method = "complete")
plot(hc2, cex=0.7)

# Agglomerative coefficient

hc2$ac
hc3 = agnes(data1, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of agnes")
hc4 = diana(data1)
hc4$dc
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of diana")
hc5 = hclust(d, method = "ward.D2" )
sub_grp = cutree(hc5, k = 6)
table(sub_grp)
Glasses1 %>%
mutate(cluster = sub_grp) %>%
head
plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 6, border = 2:5)
fviz_cluster(list(data = data1, cluster = sub_grp))
hc_a = agnes(data1, method = "ward")
cutree(as.hclust(hc_a), k = 6)
hc_d <- diana(data1)
cutree(as.hclust(hc_d), k = 6)
res.dist = dist(data1, method = "euclidean")
hc1 = hclust(res.dist, method = "complete")
hc2 = hclust(res.dist, method = "ward.D2")
dend1 = as.dendrogram (hc1)
dend2 = as.dendrogram (hc2)

tanglegram(dend1, dend2)
Hierarchical Unsupervised – Output
Interpretation:
Hierarchical clustering groups the first 50 glass observations into clusters using dendrograms.
The first dendrogram is built from the Euclidean distance of each observation from
another, using complete linkage, and the observations are grouped into clusters. Then we
have another dendrogram named “Dendrogram of agnes”, built with Ward's method, which
also groups the observations into several clusters. There is another dendrogram named
“Dendrogram of diana”, built by divisive clustering. Finally, we have a cluster dendrogram
(Ward's method, cut into 6 groups) which shows the clusters,
marked by coloured rectangles. Further, these clusters have been represented in a cluster plot.

At the very end of the program, we see a tanglegram, which places the complete-linkage and
Ward dendrograms side by side and connects the same observations in both.
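The core steps of the script (distance matrix, Ward linkage, cutting the tree into 6 groups) can be sketched with base R alone. The first 50 rows of the built-in iris data are used as stand-in data, which is an assumption for illustration only.

```r
# Stand-in data: first 50 rows of the numeric iris columns
x <- iris[1:50, 1:4]

d  <- dist(x, method = "euclidean")    # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")    # agglomerative clustering, Ward's method
groups <- cutree(hc, k = 6)            # cut the dendrogram into 6 clusters

table(groups)                          # number of observations in each cluster
```

Each observation receives exactly one cluster label, so the counts in `table(groups)` add up to the 50 observations that were clustered.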
