Problem 2: Logistic Regression and LDA

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and some
didn't. You have to help the company in predicting whether an employee will opt for the package or not
on the basis of the information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

Solution –

First, the required packages are loaded and the working directory is set. The CSV file is then loaded
into a dataframe. To validate that the data has loaded, fetch the top 5 or 10 rows of the data set.
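The loading step can be sketched as follows. This is only an illustration: the rows are placeholders rather than the real data, and the column names (holiday_package, salary, age, educ, foreign) are assumptions, since the report does not list them.

```python
import io

import pandas as pd

# Placeholder rows standing in for the real CSV file.
# The values and the column names are assumptions for illustration only.
csv_text = """holiday_package,salary,age,educ,foreign
no,40000,30,8,no
yes,35000,45,8,no
no,50000,46,9,no
no,60000,31,11,no
yes,55000,44,12,no
yes,52000,42,12,yes
"""

# With the real file this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(csv_text))

# Fetch the first five rows to confirm that the data has loaded.
print(df.head())
```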

Head of the Dataframe

Shape of the Data Set - There are 872 rows and 8 attributes. There are integer and object data types in
the data. No null values are present in the data.
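A minimal sketch of how the shape, data types and null counts can be checked with pandas. The small dataframe and its column names are hypothetical stand-ins for the real data set.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data set.
df = pd.DataFrame({
    "holiday_package": ["no", "yes", "no", "yes"],
    "salary": [40000, 35000, 50000, 60000],
    "age": [30, 45, 46, 31],
    "foreign": ["no", "no", "yes", "no"],
})

# Number of rows and attributes.
n_rows, n_cols = df.shape

# Data type of each column (integer and text columns here).
dtypes = df.dtypes

# Count of missing values per column; all zeros means no nulls.
null_counts = df.isnull().sum()

print(n_rows, n_cols)
print(dtypes)
print(null_counts)
```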

Description/Summary of all attributes

Except for the ‘salary’ column, no high variation (which would indicate the presence of outliers) can be
seen in the data set. We will resolve these outliers before building any models.

We will now check how the data is distributed across the various attributes, and check for duplicates (if any).

From the above we see that 54% of the employees did not opt for the holiday package, while 45% did.
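The duplicate check and the class proportions could be computed along these lines. The dataframe is a hypothetical ten-row stand-in and the column name holiday_package is an assumption, so the printed shares will not match the figures quoted in the report.

```python
import pandas as pd

# Hypothetical stand-in: six employees who did not opt, four who did.
df = pd.DataFrame({
    "holiday_package": ["no"] * 6 + ["yes"] * 4,
    "age": list(range(30, 40)),
})

# Number of fully duplicated rows.
n_duplicates = int(df.duplicated().sum())

# Share of each class in the target column.
class_share = df["holiday_package"].value_counts(normalize=True)

print(n_duplicates)
print(class_share)
```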

Data pattern across various attributes

Univariate analysis of the salary attribute shows that it is right-skewed –

Performing bivariate analysis on attributes such as age and salary to see how the pattern varies -

There is a cluster of people whose salary is around 50k and who have opted for a package, while the density
seems to reduce as the salary goes up. Across salaries, opting for the package is seen more in the
mid-level ages than in the 20s or the 60s.

Pairplot for all the data columns


From the above 2 plots we can see that there is no major correlation among the data variables.

Before proceeding with the train-test split, we first treat the outliers so that extreme values do not
distort the models –
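The report does not state which outlier treatment was used. One common option, shown here purely as an assumption, is to cap values lying more than 1.5 times the interquartile range beyond the quartiles. The helper name cap_outliers_iqr and the sample salaries are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical salaries with one extreme value.
salary = pd.Series(
    [30000, 40000, 45000, 50000, 55000, 60000, 250000], dtype=float
)
capped = cap_outliers_iqr(salary)

print(capped.tolist())
```

Capping keeps every row in the data set, which matters when there are only a few hundred observations. Dropping the rows instead is an alternative.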

After encoding the categorical variables, the data set looks as below –
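The encoding scheme is not specified in the report. A minimal sketch, assuming a yes/no target and one yes/no feature (both column names are hypothetical), maps the target to 0/1 and dummy-encodes the feature:

```python
import pandas as pd

# Hypothetical stand-in with one categorical feature and a yes/no target.
df = pd.DataFrame({
    "holiday_package": ["no", "yes", "no", "yes"],
    "salary": [40000, 35000, 50000, 60000],
    "foreign": ["no", "no", "yes", "no"],
})

encoded = df.copy()

# Map the target to 0 (did not opt) and 1 (opted).
encoded["holiday_package"] = encoded["holiday_package"].map({"no": 0, "yes": 1})

# One-hot encode the feature; drop_first keeps a single 0/1 column.
encoded = pd.get_dummies(encoded, columns=["foreign"], drop_first=True, dtype=int)

print(encoded)
```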


Since the variables are now encoded, we can split the data into train and test sets (70:30), and
fit the logistic regression model on the split data.
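The split and the model fit can be sketched with scikit-learn as below. The data is synthetic (generated with make_classification, with 872 rows only to mirror the size of the data set), and the random_state and max_iter values are assumptions, so the accuracies it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the 0/1 target.
X, y = make_classification(
    n_samples=872, n_features=6, n_informative=3, random_state=1
)

# 70:30 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit the logistic regression model on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the training and test sets.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(train_acc, test_acc)
```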

Logistic Regression -

Test Data Predictions –


Training Data Predictions -

Modelling with Linear Discriminant Analysis –

The accuracy of both models, on the training set as well as the test set, is just above 55%, which is roughly
the proportion of class 0 observations in the dataset. In other words, neither model does much better than
always predicting the majority class.
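One way to make this comparison explicit is to score an LDA model next to a baseline that always predicts the most frequent class. The sketch below uses synthetic data with an assumed 54/46 class balance, so its numbers are illustrative only and are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly a 54/46 class balance (an assumption).
X, y = make_classification(
    n_samples=872, n_features=6, n_informative=3,
    weights=[0.54, 0.46], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Linear discriminant analysis model.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Baseline that always predicts the majority class of the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

lda_acc = lda.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)

print(lda_acc, baseline_acc)
```

If the model accuracy is close to the baseline accuracy, the model has learned little beyond the class proportions.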

Inference –

Logistic regression and linear discriminant analyses are multivariate statistical methods which can
be used for the evaluation of the associations between various covariates and a categorical outcome.

LDA can also be applied to small data sets. Hence, even though both models yield the same
accuracy in this case, we can use the LDA model.

Per the problem, we need to identify whether an employee would opt for a holiday package or not. Using
attributes such as salary, age and education (which show considerable variation), we have observed
that people in a certain age group whose salary is around 50k tend to opt for the package.

As a recommendation, the holiday packages should also be made appealing to the older age
group. To that end, additional comfort features can be added, or a package can be simplified
instead of being made more sport/adventure oriented.

For the higher salary group, private stays or more sophisticated vacation options can be provided.
Making any of these alterations while keeping the packages customizable to an extent can be of benefit.
