
A mathematical essay on Logistic Regression

Awik Dhar
Indian Institute of Technology, Madras
Ahmedabad, India
ch18b041@smail.iitm.ac.in

Abstract—This article aims to give an overview of the Logistic Regression algorithm and describes how it has been used to model and predict, with high performance, whether a given RMS Titanic passenger would survive. An analysis of the distribution of the passengers has also been made.

I. INTRODUCTION
Classification is a common Machine Learning problem, and
has a variety of available techniques for different sorts of
features and data.
Logistic Regression is a classification algorithm which
predicts the target class based on the output of the logistic
function and a threshold. If the output of the logistic function
is above the threshold, the class prediction is taken as positive
or 1.
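This decision rule can be sketched in a few lines; a minimal illustration with a hypothetical threshold of 0.5 and made-up inputs, not a reference to any particular implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Positive class (1) when the logistic output exceeds the threshold."""
    return 1 if sigmoid(z) > threshold else 0

# A positive net input gives a logistic output above 0.5, hence class 1.
print(predict_class(2.0))   # -> 1
print(predict_class(-2.0))  # -> 0
```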
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the death of 1502 of the 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
The associated challenge asks us to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (name, age, gender, socioeconomic class, etc.).
This paper aims to analyze the available data, understand the relationships between the various features, and use the Logistic Regression algorithm to model the survivability of a given passenger.

II. LOGISTIC REGRESSION CLASSIFIER

Logistic regression is named for the function used at the core of the method, the logistic function. The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment. It is an S-shaped curve that can take any real-valued number and map it to a value between 0 and 1, but never exactly at those limits. The logistic function has the following form:

$$y = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} w_i x_i + b\right)}}$$

Fig. 1. Logistic function

where y is the predicted output, b is the bias or intercept term, and the $w_i$'s are the coefficients for the features $x_i$. Input values (x) are combined linearly using weights or coefficients (referred to in statistics by the Greek letter beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value.
Logistic regression models the probability of the default class. The function outputs a value greater than 0.5 for net positive inputs, which may be considered a positive prediction, and vice versa. The actual threshold chosen depends on what is more important to capture: positives, negatives, or the trade-off between precision and recall. To learn the logistic regression model (that is, to learn its coefficients), two methods exist: the maximum likelihood method, and applying gradient descent to a cost function.
Maximum-likelihood estimation is a common learning algorithm used by a variety of machine learning algorithms, although it does make assumptions about the distribution of your data. The best coefficients would result in a model that predicts a value very close to 1 for the default class and a value very close to 0 for the other class. The intuition for maximum likelihood in logistic regression is that a search procedure seeks values for the coefficients that minimize the error between the probabilities predicted by the model and those in the data.
This paper uses gradient descent to minimize the binary cross-entropy cost function, given by

$$\text{Cost} = -\sum_{i=1}^{m} \left[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\right]$$

where m is the number of examples in the training data, $p_i$ is the probability output by the logistic function, and $y_i$ is the ground-truth label.
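The gradient-descent procedure can be sketched in plain Python; a minimal one-feature implementation of the formulas above, trained on a tiny hypothetical dataset rather than the Titanic data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Learn weight w and bias b by gradient descent on binary cross-entropy."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Gradient of the cost w.r.t. w is sum of (p_i - y_i) * x_i;
        # w.r.t. b it is sum of (p_i - y_i).
        dw = db = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            dw += (p - y) * x
            db += (p - y)
        w -= lr * dw / m
        b -= lr * db / m
    return w, b

# Hypothetical, linearly separable data: label is 1 exactly when x > 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(preds)  # recovers the training labels on this toy data
```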
III. DATA AND THE PROBLEM

A. Overview
We are given a dataset of 891 records with features such as passenger class, fare, sex, age, and so forth, along with the outcome label for each passenger: 1 for survived, 0 for deceased. The dataset does not have a strong imbalance, with 62% of records falling in the deceased class and 38% in the survived class.

Fig. 2. Size of classes

Categorical values like embarked status and sex were converted to numerical values. Correlation heat maps were obtained, from which an estimate of the important features can already be made.

Fig. 3. Heat map of features. Color intensity indicates widespread correlations among several features

B. Approach
After converting the data to strictly numerical values, the dataset is ready for use with no missing values. 20% of the data was held out as test data. A logistic regression model was fit to the training data with L2 regularization and the inverse regularization-strength parameter C set to 1. The observed train accuracy was 80.3% and test accuracy was 80%. F1 scores were 0.84 and 0.75 for the deceased and survived classes respectively.

Fig. 4. Learning curve and rising cross-val score

Fig. 5. Confusion Matrix

Fig. 6. Classification Report
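Per-class F1 scores of this kind are derived from the confusion matrix; the following is a minimal sketch of the computation, using short hypothetical label/prediction lists rather than the actual model outputs.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels (1 = survived, 0 = deceased) and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
_, _, f1_survived = binary_metrics(y_true, y_pred, positive=1)
_, _, f1_deceased = binary_metrics(y_true, y_pred, positive=0)
print(accuracy, round(f1_survived, 2), round(f1_deceased, 2))  # -> 0.8 0.75 0.83
```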
C. Observations
The mean age of the survivors is one year less than that of the deceased, with a spike in the 0-10 years group, as children were prioritized in efforts to save people. The average age of survivors was 28 and the average age of the deceased was 29.

Fig. 7. Age contrast of survivors and deceased

First class ticket passengers were much more likely to survive, with a high likelihood of death for third class ticket passengers. Second class ticket passengers were modestly better off. In other words, socioeconomic status made survival proportionally more or less likely.

Fig. 8. Ticket class of the deceased

Fig. 9. Ticket class of survivors

Further observations:
1) A greater proportion of the female passengers survived.
2) The deceased passengers had typically paid a lower fare, with survivors being more evenly spread out.
3) A greater proportion of people without parents/children died, an effect which evens out with an increasing number of parents/children.

Fig. 10. Sex ratio among survivors

Fig. 11. Fare distribution of survivors and deceased

Fig. 12. Contrasting survival across number of parents and children

IV. CONCLUSION
Logistic Regression is a simple machine learning algorithm that provides good accuracy on many simple data sets, and it performs well when the dataset is linearly separable. It is easy to implement and interpret, is very efficient to train, and makes no assumptions about the distributions of the classes in feature space. This paper was able to use logistic regression to learn a probabilistic model of Titanic passenger survival, and the model gave well-generalising results, given that there is also noise (luck, or randomness) associated with whether a passenger survives. While more complex models would require extending logistic regression into neural networks, the obtained model works well and is interpretable.
