
Course Code: MGN619 Course Title: Business Analytics

Course Instructor: Dr Vishal Soodan

Academic Task No.: CA2 Academic Task: Machine learning models

Date of Allotment: 11/09/2019 Date of submission: 20/10/2019

Student’s Roll no: RQ1843A02 Student’s Reg. no: 11803152


Evaluation Parameters: (Parameters on which student is to be evaluated- To be mentioned by students as
specified at the time of assigning the task by the instructor)

Learning Outcomes: (Student to write briefly about learnings obtained from the academic tasks)
To understand the concept of Machine learning.
Declaration:

I declare that this Assignment is my individual work. I have not copied it from any other
student's work or from any other source except where due acknowledgement is made explicitly
in the text, nor has any part been written for me by any other person.
Student’s
Signature: Vishal Jaiswal
Evaluator's comments (For Instructor's use only)

General Observations Suggestions for Improvement Best part of assignment

Evaluator's Signature and Date:

Marks Obtained: Max. Marks: …………………………


REGRESSION: Multiple Regression
Multiple regression is a statistical tool used to estimate the value of a criterion (dependent) variable from several independent, or predictor, variables. It combines multiple factors simultaneously to assess how, and to what extent, they affect a particular outcome. The technique breaks down when the factors themselves are unmeasurable or purely random. We have used the School grades data set and are trying to find how the number of failures depends on several variables, namely sex, age, reason, and internet.

Dataset: School grades dataset

data = school_grades_dataset                                       # load the School grades data
str(data)                                                          # inspect variable types
abc = as.numeric(data$failures)                                    # dependent variable: number of failures
hh = step(lm(abc ~ sex + age + reason + internet, data = data))    # stepwise selection by AIC
summary(hh)                                                        # coefficients and significance codes
c = round(predict(hh), 1)                                          # fitted values, rounded to 1 decimal
d = data.frame(c, abc)                                             # predicted vs actual values side by side
plot(abc, type = "l", col = "red")                                 # actual failures
lines(c, col = "blue")                                             # overlay the fitted values
INTERPRETATION

The data I have taken show how various factors affect the school grades of a particular student.

Taking failures as the dependent variable and combining it with the major independent factors sex, age, reason, and internet, the results are obtained.

The significance code '***' (three stars) shows that the corresponding predictors are highly significant.
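To see exactly which retained predictors carry those stars, the coefficient table of the stepwise model can be inspected directly; a minimal sketch, assuming the object hh fitted above:

summary(hh)$coefficients             # estimate, standard error, t value and p-value for each retained predictor
round(summary(hh)$coefficients, 3)   # the same table, rounded for easier reading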
CLUSTERING
A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

Clustering is an unsupervised machine learning method that attempts to uncover the natural groupings and statistical distributions of data. There are multiple clustering methods, such as K-means and hierarchical clustering. Often, a measure of point-to-point distance is used to decide which group a point should belong to, as in K-means. Hierarchical clustering builds up or breaks down sets of clusters based on the input data, which lets the user choose the set of clusters that best serves their purpose. The algorithm will not name the groups it creates, but it will show where they are, and they can then be labelled as needed.
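As a complement to the K-means analysis below, a minimal hierarchical clustering sketch is shown here on R's built-in USArrests data (an illustrative choice, not part of the assignment data), using hclust and cutree:

d = dist(scale(USArrests))             # Euclidean distance matrix on scaled data
hc = hclust(d, method = "complete")    # agglomerative (bottom-up) clustering
plot(hc)                               # dendrogram showing how clusters are built up
groups = cutree(hc, k = 4)             # cut the tree into 4 clusters
table(groups)                          # number of states in each cluster

Cutting the dendrogram at a different height (or a different k) yields a different set of clusters, which is what allows the user to pick the grouping that best serves their purpose.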

Dataset: Wine Quality

wine_quality                                   # load / print the Wine Quality data
head(wine_quality)                             # first few rows
plot(wine_quality$Quality ~ wine_quality$pH)   # quality against pH
s1 = wine_quality[, -1]                        # drop the first column before clustering
head(s1)
results.s1 = kmeans(s1, centers = 5)           # k-means with 5 clusters
results.s1
results.s1$cluster                             # cluster assigned to each wine
results.s1$size                                # number of wines in each cluster

INTERPRETATION
The data are about wine quality, which depends on many factors. The quality has been
rated from a minimum of 3 to a maximum of 9.
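As a quick check of how the clusters relate to the quality ratings, the cluster labels can be cross-tabulated against the Quality column (column names assumed from the plot above):

table(results.s1$cluster, wine_quality$Quality)   # number of wines per cluster at each quality rating
results.s1$centers                                # average value of each attribute within each cluster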

Clustering Distance Measures:

The classification of observations into groups requires a method for computing the
distance or the (dis)similarity between each pair of observations. The result of this computation
is known as a dissimilarity or distance matrix.

Euclidean distance: d_euc(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )

Where, x and y are two vectors of length n.
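A small sketch of this formula in R, on two hypothetical vectors of length 3, confirms that the built-in dist() function gives the same value:

x = c(1, 2, 3)
y = c(4, 6, 8)
sqrt(sum((x - y)^2))     # Euclidean distance from the formula: sqrt(9 + 16 + 25) ≈ 7.07
dist(rbind(x, y))        # the same value from R's dist(), which defaults to the Euclidean method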


As we are assessing wine quality, the cluster analysis shows the grouping of the observations,
where the densest clustering of the wine data can be found over the pH scale for the various
categories of wine.

CLASSIFICATION: KNN Model


The KNN algorithm is one of the simplest classification algorithms and one of the most widely
used learning algorithms.

It is a very simple classification and regression algorithm.

In the case of classification, new data points are assigned to a class on the basis of voting
among their nearest neighbours.

In the case of regression, new data points are labelled with the average of the nearest values.

It is a lazy learner because it does not learn much from the training data.

It is a supervised learning algorithm.

The default distance measure is Euclidean distance (the shortest distance between two points,
√((X1 − X2)² + (Y1 − Y2)²)), used for continuous variables; for discrete variables, such as in
text classification, the overlap metric (Hamming distance) would be employed.
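Before moving to the credit data, a minimal sketch of the voting idea using the class package and R's built-in iris data (an illustrative choice, not the assignment data) might look like this:

library(class)                                  # provides the basic knn() function
set.seed(123)                                   # reproducible train/test split
idx = sample(1:nrow(iris), 100)                 # 100 training rows, 50 test rows
pred = knn(train = iris[idx, 1:4],
           test  = iris[-idx, 1:4],
           cl    = iris$Species[idx],
           k     = 5)                           # each test flower takes the majority vote of its 5 nearest neighbours
table(pred, iris$Species[-idx])                 # confusion table of predicted vs actual species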
Dataset: Credit cards

library(caret)                                            # needed for createDataPartition, train and confusionMatrix
data = CREDIT_DATASET
View(CREDIT_DATASET)
CREDIT_DATASET$class = as.factor(CREDIT_DATASET$class)    # target must be a factor for classification
str(CREDIT_DATASET)
set.seed(123)                                             # reproducible train/test split
index1 = createDataPartition(CREDIT_DATASET$class, p = 0.8, list = FALSE)
traindata1 = CREDIT_DATASET[index1, ]                     # 80% training data
testdata1 = CREDIT_DATASET[-index1, ]                     # 20% test data
modelknn = train(class ~ ., method = "knn", data = traindata1)   # fit KNN via caret
modelknn
prediction1 = predict(modelknn, testdata1)                # predict on the held-out data
conmatrix = confusionMatrix(prediction1, testdata1$class)
conmatrix

INTERPRETATION

K-NN (K-nearest neighbours) is also a supervised machine learning model. It can be used for
both classification and regression. In both cases the input consists of the k closest training
examples in the feature space. On the credit data it has given an accuracy of 69.5%, as shown
in the confusion matrix below.

The algorithm is largely unbiased in nature and makes no prior assumption about the underlying
data. We have found the frequency of the classes in the data. We split the data into test data
and train data so that we could run the algorithm and obtain the desired results. First we
collected the data, then we created the test and train sets for a better assessment of the
model on the data set.
Confusion Matrix and Statistics

          Reference
Prediction bad good
      bad   12  13
      good  48 127

               Accuracy : 0.695
                 95% CI : (0.6261, 0.758)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.5953

                  Kappa : 0.1286

 Mcnemar's Test P-Value : 1.341e-05

            Sensitivity : 0.2000
            Specificity : 0.9071
         Pos Pred Value : 0.4800
         Neg Pred Value : 0.7257
             Prevalence : 0.3000
         Detection Rate : 0.0600
   Detection Prevalence : 0.1250
      Balanced Accuracy : 0.5536

       'Positive' Class : bad
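Individual statistics can also be pulled out of the caret confusionMatrix object directly rather than read off the printed output; for example:

conmatrix$overall["Accuracy"]        # 0.695, the overall accuracy reported above
conmatrix$byClass["Sensitivity"]     # 0.20, the sensitivity for the 'bad' (positive) class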

SIMPLE LINEAR REGRESSION


Simple linear regression predicts a variable (y) that depends on a second variable (x), using
the regression equation fitted to a given set of data.

The simple linear regression model serves two purposes:

1. It describes the linear dependence of one variable on another.

2. It can predict values of one variable from values of the other, based on the historical
relationship between the independent and dependent variables.

Data used: Data from Kaggle for analysing a YouTube channel's rating based on its number of
views and subscribers. This will help us see how many people subscribe after watching a video,
i.e. whether the number of views leads to subscriptions. The link is given at the end in the
references.

Code:
library(caret)
reg                                                 # the YouTube channels data
model = lm(Video.views ~ Subscribers, data = reg)   # simple linear regression
model
### predict and interpret the model
predict(model)                                      # fitted values
options(scipen = 999)                               # turn off scientific notation
summary(model)                                      # coefficients, R-squared and F-statistic
### round the fitted values to remove decimals
round(predict(model), 0)
pred = round(predict(model), 1)
class(pred)
plot(reg$Video.views, type = "l", col = "blue")     # actual views
lines(pred, col = "green")                          # overlay fitted values

INTERPRETATION
The data I have taken show how ranks are given to the different channels based on their views,
number of videos uploaded, and subscribers. The first step in interpreting the regression
analysis is to examine the F-statistic and the associated p-value at the bottom of the model
summary. In our example the F-statistic is 10.64 and its p-value is very small, so the model is
highly significant. This means that at least one of the predictor variables is significantly
related to the outcome variable. When the adjusted R-squared falls well below the multiple
R-squared, there is a risk of overfitting; in my example they are nearly at the same level, which
shows there is no overfitting problem and also indicates that these variables are required. The
significance code '***' (three stars) shows that the predictor is highly significant.
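The quantities discussed above can be read off the fitted object programmatically; a small sketch, assuming the model object from the code above:

s = summary(model)
s$fstatistic        # the F-statistic and its degrees of freedom
s$r.squared         # Multiple R-squared
s$adj.r.squared     # Adjusted R-squared; being close to r.squared suggests no overfitting
coef(s)             # coefficient estimates, standard errors, t values and p-values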
REFERENCES:

• Kaggle.com/datasets
• Uci.com/datasets
• Mldata.io
