
Sardar Patel Institute of Technology, Mumbai

Department of Electronics and Telecommunication Engineering

T.E. EXTC (2018-2019)

ETL54-Statistical Computational Laboratory


Lab-4: Regression Analysis: Logistic & Multinomial Logistic Regression

NAME: Vishal Ramina | Batch: D | UID: 2017120049

Objective:

(i) To implement Logistic Regression for University Admission Dataset


(ii) To implement Multinomial Logistic Regression for Iris Dataset

Outcomes:
• To carry out logistic regression
• To analyze various parameters like Accuracy, Misclassification Rate, True Positive Rate, False Positive Rate, True Negative Rate, Precision, Prevalence, Null Error Rate, Cohen's Kappa, F Score, etc.
• To understand how to predict probabilities in a multinomial logistic regression model.

Introduction to Logistic Regression

Logistic regression is a predictive modeling algorithm that is used when the Y variable is binary categorical, that is, when it can take only two values such as 1 or 0. The goal is to determine a mathematical equation that can be used to predict the probability of event 1. Once the equation is established, it can be used to predict Y when only the X's are known. Logistic regression can therefore be used to model and solve binary classification problems such as the admission-prediction task in this lab.
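The equation in question is the logistic (sigmoid) function, which maps a linear combination of the predictors onto a probability between 0 and 1:

\[
P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}
\]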

Limitations of Linear Regression

When the response variable has only 2 possible values, it is desirable to have a model that predicts the value either as 0 or 1, or as a probability score ranging between 0 and 1. Linear regression does not have this capability: if you use linear regression to model a binary response variable, the resulting model may not restrict the predicted Y values to the interval [0, 1].
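As a quick illustration (a minimal sketch on simulated data, not the admission dataset), a linear model fit to a 0/1 response can produce predictions below 0 and above 1:

> set.seed(1)
> x<-rnorm(100)
> y<-rbinom(100,1,plogis(2*x))                  # simulated binary response
> linmod<-lm(y~x)                               # linear regression on a 0/1 outcome
> range(predict(linmod,data.frame(x=c(-3,3))))  # predictions typically fall below 0 and above 1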

Building a Logistic Regression Model in R

Now let's see how to implement logistic regression using the given University Admission dataset in csv format. The goal here is to model and predict whether a given application (a row in the dataset) is admitted or rejected, based on 3 other features. So, let's load the data and keep only the complete cases.

> mydata<-read.csv("~/Downloads/binary.csv",header=T)
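The complete-cases step is not shown in the transcript; a line like the following (an assumed step) would drop any rows containing missing values:

> mydata<-mydata[complete.cases(mydata),]   # keep only rows with no NA values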

Structure of this dataset:

> str(mydata)
'data.frame': 400 obs. of 4 variables:
$ admit: int 0 1 1 1 0 1 1 0 1 0 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 ...
$ rank : int 3 3 1 4 4 2 1 2 3 2 ...

The dataset has 400 observations and 4 columns. The admit column is the response (dependent) variable, and it tells whether a given student is accepted or rejected. The rank and admit columns are numeric (int) data types; let us convert them into factors.

> mydata$admit=as.factor(mydata$admit)
> mydata$rank=as.factor(mydata$rank)
> str(mydata)
'data.frame': 400 obs. of 4 variables:

$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...


$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...

> xtabs(~admit+rank,data=mydata)
rank
admit 1 2 3 4
0 28 97 93 55
1 33 54 28 12

Partitioning the dataset

We partition the given dataset into training and test data. Before building the logistic
regressor, you need to randomly split the data into training and test samples.

> set.seed(123)
> ind<-sample(2,nrow(mydata),replace=T,prob=c(0.8,0.2))
> ind
(output omitted: a vector of 400 values, each 1 or 2, assigning every row to the training (1) or test (2) sample)

> traindata<-mydata[ind==1,]
> testdata<-mydata[ind==2,]
> str(traindata)

'data.frame': 325 obs. of 4 variables:


$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 1 1 2 1 ...
$ gre : int 380 660 800 760 560 540 700 440 760 700 ...
$ gpa : num 3.61 3.67 4 3 2.98 3.39 3.92 3.22 4 3.08 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 2 1 3 2 1 1 2 ...

> str(testdata)
'data.frame': 75 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
$ gre : int 640 520 400 800 480 540 500 680 540 760 ...
$ gpa : num 3.19 2.93 3.08 4 3.44 3.81 3.17 3.19 3.78 3.35 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 4 4 2 4 3 1 3 4 4 3 ...

Generating the Model

The syntax to build a logit model is very similar to the lm function you saw in linear
regression. You only need to set the family='binomial' for glm to build a logistic
regression model.

glm stands for generalised linear models and it is capable of building many types of
regression models besides linear and logistic regression.

> model<-glm(formula=admit~gre+gpa+rank,data=traindata,family=binomial)

In the above model, admit is modeled as a function of gre, gpa and rank.

Summarizing the model

> summary(model)

Call:
glm(formula = admit ~ gre + gpa + rank, family = binomial, data = traindata)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6388 -0.8867 -0.6350  1.1109  2.0515

Coefficients:

Estimate Std. Error z value Pr(>|z|)


(Intercept) -4.382508 1.258545 -3.482 0.000497 ***
gre 0.001853 0.001213 1.527 0.126687
gpa 1.031450 0.370665 2.783 0.005391 **
rank2 -0.807749 0.359133 -2.249 0.024502 *
rank3 -1.319560 0.385300 -3.425 0.000615 ***
rank4 -1.523491 0.458172 -3.325 0.000884 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 414.65 on 324 degrees of freedom


Residual deviance: 377.36 on 319 degrees of freedom
AIC: 389.36

Number of Fisher Scoring iterations: 4

(From the summary, gre is the predictor that is not significant, since its p-value (0.127) is greater than 0.05; gpa (p ≈ 0.005) and the rank dummies are significant. The refit below nevertheless drops gpa and keeps gre:)

> model<-glm(formula=admit~gre+rank,data=traindata,family=binomial)
> summary(model)

Call:
glm(formula = admit ~ gre + rank, family = binomial, data = traindata)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6054 -0.8949 -0.6930  1.1789  2.0388

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.596438 0.731463 -2.183 0.029071 *
gre 0.003203 0.001114 2.876 0.004023 **
rank2 -0.905751 0.352906 -2.567 0.010272 *
rank3 -1.333377 0.379481 -3.514 0.000442 ***
rank4 -1.629553 0.452996 -3.597 0.000322 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 414.65 on 324 degrees of freedom
Residual deviance: 385.41 on 320 degrees of freedom
AIC: 395.41
Number of Fisher Scoring iterations: 4
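Because logistic regression coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are easier to interpret; a short sketch:

> exp(coef(model))   # odds ratios: exp(0.003203) ≈ 1.0032, exp(-0.905751) ≈ 0.40, ...

So each additional GRE point multiplies the odds of admission by roughly 1.003, and rank-2 applicants have about 0.40 times the odds of rank-1 applicants, holding gre fixed.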

Utilizing Test Dataset

The model is now built. You can now use it to predict the response on test
data. You need to set type='response' in order to compute the prediction
probabilities.

> pred<-predict(model,testdata,type="response")
> head(pred)
4 5 8 11 16 20
0.2357941 0.1736073 0.2277837 0.3399874 0.1990518 0.5332868

The function used here is the logistic (sigmoid) function, so the predictions are probabilities. The common practice is to take the probability cutoff as 0.5: if the predicted probability of Y is greater than 0.5, it is classified as an event.
So if the prediction is greater than 0.5, the application is classified as admitted; otherwise it is rejected.
> p1<-ifelse(pred>0.5,1,0)

> head(p1)

4 5 8 11 16 20

0 0 0 0 0 1

Generating the confusion matrix

> table(p1)

p1

0 1

63 12

> table(testdata$admit)

0 1

57 18

> table(predicted=p1,actual=testdata$admit)

actual

predicted  0  1
        0 51 12   (TN  FN)
        1  6  6   (FP  TP)

Evaluation of performance of a model (based on standard parameters)

(Values computed from the confusion matrix above, with TN = 51, FN = 12, FP = 6, TP = 6.)

Accuracy: 0.76

Misclassification Rate: 0.24

True Positive Rate (Sensitivity): 0.3333

False Positive Rate: 0.1053

True Negative Rate (Specificity): 0.8947

Precision: 0.5

Prevalence: 0.24

Null Error Rate: 0.24

Cohen's Kappa: 0.2574

F1 Score: 0.4
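A sketch of how these quantities may be computed in R from the confusion matrix:

> tab<-table(predicted=p1,actual=testdata$admit)
> TN<-tab["0","0"]; FN<-tab["0","1"]; FP<-tab["1","0"]; TP<-tab["1","1"]
> (TP+TN)/sum(tab)      # accuracy: 57/75 = 0.76
> TP/(TP+FN)            # true positive rate: 6/18 = 0.3333
> FP/(FP+TN)            # false positive rate: 6/57 = 0.1053
> TN/(TN+FP)            # true negative rate: 51/57 = 0.8947
> TP/(TP+FP)            # precision: 6/12 = 0.5
> 2*TP/(2*TP+FP+FN)     # F1 score: 12/30 = 0.4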

Multinomial Regression

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.
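Concretely, if the outcome has K categories and one of them serves as the reference level (multinom uses the first factor level, here setosa), the model fits K − 1 log-odds equations of the form:

\[
\log\frac{P(Y = k)}{P(Y = \text{ref})} = \beta_{k0} + \beta_{k1} x_1 + \cdots + \beta_{kp} x_p
\]

This is why the coefficient table later in this section has one row each for versicolor and virginica, both relative to the setosa baseline.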

Description of the data

For our data analysis example, we will be using the built-in iris data set.

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The data set contains measurements on 150 flowers. The outcome variable is Species, a factor with three categories.

Below we use the multinom function from the nnet package to estimate a multinomial logistic regression model.

> install.packages("nnet")

> library(nnet)

Partitioning the dataset

We partition the given dataset into training and test data. Before building the logistic
regressor, you need to randomly split the data into training and test samples.

> data<-iris
> set.seed(1234)
> ind<-sample(2,nrow(data),replace=T,prob=c(0.8,0.2))
> traindata<-data[ind==1,]
> testdata<-data[ind==2,]

Training and Summarizing

The model is trained using multinom function from the nnet package.

> model<-multinom(formula=Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=traindata)
# weights: 18 (10 variable)
initial value 129.636250
iter 10 value 10.683012
iter 20 value 5.933903

iter 30 value 5.873500
iter 40 value 5.866866
iter 50 value 5.861992
iter 60 value 5.860395
iter 70 value 5.859634
iter 80 value 5.859340
iter 90 value 5.859208
iter 100 value 5.859118
final value 5.859118
stopped after 100 iterations
> summary(model)
Call:
multinom(formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = traindata)

Coefficients:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 20.95306 -2.019734 -12.17769 10.47244 -2.626553
virginica -18.78599 -4.541125 -18.31369 19.53561 14.264642

Std. Errors:
(Intercept) Sepal.Length Sepal.Width Petal.Length Petal.Width
versicolor 41.61007 134.8689 191.1050 76.27252 17.95107
virginica 42.22799 134.8825 191.1707 76.48059 19.08432

Residual Deviance: 11.71824


AIC: 31.71824

Predict on Test Data

> pred<-predict(model,testdata,type="class")
> head(pred)
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
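To inspect the fitted class probabilities rather than hard class labels (the second outcome listed in the objectives), predict.multinom also accepts type="probs":

> head(predict(model,testdata,type="probs"))   # one probability column per species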

Confusion Matrix

> tab<-table(pred,testdata$Species)
> tab
pred         setosa versicolor virginica
  setosa         11          0         0
  versicolor      0          6         0
  virginica       0          0        15

We obtain an accuracy of 100%, as can clearly be concluded from the above confusion matrix: all 32 test observations lie on the diagonal.
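This can be verified directly from the table:

> sum(diag(tab))/sum(tab)   # (11+6+15)/32 = 1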

Conclusion:

• Statistical analysis is the first step to machine learning.

• Logistic regression bridges the gap by assuming that the dependent variable is a stochastic event and describing the outcome of that event with a density function (a function of cumulated probabilities ranging from 0 to 1).

• Multinomial Logistic Regression (MLR) is an extension of binary logistic regression used when the dependent variable is nominal with more than two levels. It is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level (interval or ratio scale) independent variables. It estimates a separate binary logistic regression model for each non-reference category; thus, with M categories, the result is M − 1 binary logistic regression models.
