
Problem Statement –

X Education sells online courses to industry professionals. On any given day, many professionals who
are interested in the courses land on their website and browse for courses. Although X Education
gets a lot of leads, its lead conversion rate is very poor: for example, of every 100 leads acquired in
a day, only about 30 are converted. To make this process more efficient, the company
wishes to identify the most promising leads, also known as ‘Hot Leads’. If they successfully identify
this set of leads, the lead conversion rate should go up, as the sales team will focus on
communicating with the potential leads rather than making calls to everyone.

Data preparation –
We started by analysing the data and understanding it. First, we identified the categorical columns
with null values. If the percentage of null values in a column was 40% or more (or very close to
40%), we deleted that column.

Next, for each categorical column with less than 40% null values, we checked the total count of
nulls. If the null count was greater than the frequency of the column's mode, we dropped the
column: replacing that many nulls with the mode would have made the column unbalanced.

For the categorical columns with very few null values, we replaced the nulls with the column
mode.
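The null-handling rules above can be sketched with pandas on a toy frame (the column names and values here are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy frame standing in for the leads data (column names illustrative)
df = pd.DataFrame({
    "Lead Source": ["Google", None, "Google", None, None, None, "Direct", None],
    "Country": ["India", "India", None, "India", "USA", "India", "India", "India"],
})

# Drop categorical columns with roughly 40% or more null values
null_pct = df.isnull().mean() * 100
df = df.drop(columns=null_pct[null_pct >= 40].index)

# Impute the remaining sparse nulls with the column mode
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
```

Here "Lead Source" (over 60% null) is dropped, while "Country" keeps its rows and has its single null imputed with the mode.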

We also dropped the ID columns, Prospect ID and Lead Number, as they play no role in labelling a
customer as converted or not converted.

After completing our analysis of the categorical variables, we moved on to the continuous variables
and started checking those columns for outliers.

In the columns where we found outliers, we dropped values above the 95th percentile and below
the 5th percentile.
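The percentile-based outlier treatment can be sketched as follows (the values are made up for illustration):

```python
import pandas as pd

# Illustrative continuous column with one extreme outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

lo, hi = s.quantile(0.05), s.quantile(0.95)
# Keep only values inside the 5th-95th percentile band
trimmed = s[(s >= lo) & (s <= hi)]
```

The extreme value of 1000 (and the lowest value) fall outside the band and are dropped.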

We then found many columns with Yes/No values and replaced Yes with 1 and No with 0. After
that, we found six columns with multiple category levels, so we created dummy variables for them
and replaced the original columns.
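The binary mapping and dummy-variable step might look like this in pandas (column names assumed for illustration):

```python
import pandas as pd

# Assumed column names, standing in for the real binary and multi-level columns
df = pd.DataFrame({
    "Do Not Email": ["Yes", "No", "No"],
    "Lead Origin": ["API", "Landing Page Submission", "API"],
})

# Binary Yes/No columns become 1/0
df["Do Not Email"] = df["Do Not Email"].map({"Yes": 1, "No": 0})

# Multi-level categoricals become dummy variables (first level dropped)
df = pd.get_dummies(df, columns=["Lead Origin"], drop_first=True)
```

`drop_first=True` avoids the redundant dummy that would make the columns perfectly collinear.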

We ended up with a total of 78 columns.

Train – Test Split

We assigned the Converted variable as the target variable.

We divided the data into a train set and a test set: the train set has 70% of the data and the test set has 30%.
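A sketch of the 70/30 split, assuming scikit-learn's `train_test_split` (the stand-in data and random seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and 'Converted' target (100 rows)
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 70% train / 30% test; random_state is an assumed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100)
```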

Feature Scaling –

We had three continuous columns with values on different scales. We scaled these columns for
consistency in the range of values and better interpretability.
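The report does not name the scaler used; assuming standardisation (zero mean, unit variance) and illustrative column names, the step could look like:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed names and toy values for the three continuous columns
df = pd.DataFrame({
    "TotalVisits": [0.0, 5.0, 10.0],
    "Total Time Spent on Website": [0.0, 600.0, 1200.0],
    "Page Views Per Visit": [0.0, 2.5, 5.0],
})

# Standardise each column to zero mean and unit variance
scaler = StandardScaler()
df[df.columns] = scaler.fit_transform(df)
```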

Model Building –

We built the model on the train data. We used the RFE technique to select 30 columns, then
rebuilt the model using those 30 columns. We also assumed that any customer with a conversion
probability of 50% or more would be classified as Converted.
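A sketch of the RFE selection and refit, on synthetic stand-in data (the real model was built on the 78 prepared columns; everything below is generated for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 78-column training matrix
X, y = make_classification(n_samples=300, n_features=78, random_state=100)

# RFE recursively eliminates features until 30 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=30)
rfe.fit(X, y)
X_rfe = X[:, rfe.support_]

# Refit on the selected columns; a 0.5 probability cutoff labels a lead Converted
model = LogisticRegression(max_iter=1000).fit(X_rfe, y)
pred = (model.predict_proba(X_rfe)[:, 1] >= 0.5).astype(int)
```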
We identified the variables with high p-values and dropped them as statistically insignificant. We
also found that a few variables were highly correlated with each other.

After deleting the columns with high p-values and high VIF values (which indicate high correlation
among variables), we rebuilt the model. This time we achieved an accuracy of 80.97%.

The model has a sensitivity of 68.25% and a specificity of 88.70%.
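Sensitivity and specificity come from the confusion matrix; a minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate: converted leads caught
specificity = tn / (tn + fp)   # true negative rate: non-converted leads correctly left alone
```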

When we plotted sensitivity and specificity against the cutoff probability (alongside the ROC curve),
we found that a cutoff probability of 30% would be optimal.
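The cutoff search can be sketched as a sweep over candidate thresholds, taking the point where sensitivity and specificity are closest (the probabilities and labels below are made up so that the sweep lands on 0.3):

```python
import numpy as np

# Made-up predicted probabilities and true labels
probs = np.array([0.05, 0.15, 0.20, 0.28, 0.32, 0.55, 0.60, 0.75, 0.85, 0.95])
y     = np.array([0,    0,    0,    0,    1,    1,    1,    1,    1,    1])

best = None
for cutoff in np.round(np.arange(0.1, 1.0, 0.1), 1):
    pred = (probs >= cutoff).astype(int)
    tp = int(np.sum((pred == 1) & (y == 1)))
    tn = int(np.sum((pred == 0) & (y == 0)))
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    # Optimal cutoff: where sensitivity and specificity are closest
    gap = abs(sens - spec)
    if best is None or gap < best[0]:
        best = (gap, float(cutoff))

optimal_cutoff = best[1]
```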

We made the final prediction on the train set with a cutoff probability of 0.3, achieving an accuracy
of 80.97%, a sensitivity of 82.99%, and a specificity of 72.65%.

Prediction on Test set

We used the model to predict on the test set and found it to have an accuracy of 81.69%.
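Accuracy on the test set is simply the fraction of predictions that match the true labels; a toy check:

```python
import numpy as np

# Toy true labels vs model predictions on a held-out set
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = proportion of matching entries
accuracy = float(np.mean(y_true == y_pred))
```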
