Learn to build an end-to-end data science project

Appreciating the process you must work through for any Data Science project
is valuable before you land your first job in this field. With a well-honed
strategy, such as the one outlined in this example project, you will remain
productive and consistently deliver valuable machine learning models.

A Data Scientist is someone who is the best programmer among the statisticians and the best statistician among the programmers. Every Data Scientist needs an efficient strategy to solve data science problems. Data Science positions vary widely across the country, so we can try to predict the salary of a data science position based on Job Title, Company, Geography, and so on. Here I have built a project where any user can plug in this information and get back an estimated salary range, which makes it a pretty useful tool for anyone negotiating an offer.

Data Science Workflow

1. Business Understanding
This stage is significant because it clarifies the customer's goal. The success of any project depends on the quality of the questions asked. If you understand the business requirement correctly, it helps you collect the right data, and asking the right questions narrows down the data acquisition effort.

2. Analytic Approach
This is the stage where, once the business problem has been clearly stated, the data scientist defines the analytic approach to solve it. This step means framing the problem in terms of statistical and machine-learning techniques, and it is important because it determines what kind of patterns are needed to address the issue most efficiently. If the goal is to estimate the probability of something, a predictive model might be used; if the goal is to show relationships, a descriptive approach may be required; and if the problem calls for counts, statistical analysis is the best way to tackle it. For each type of approach, we can use different algorithms.

3. Data Requirements
We identify the data content, formats, and sources needed for the initial data collection; this data then feeds into the algorithm of the approach we chose. Data reveals impact, and with data, you can bring more science to your decisions.

4. Data Collection
We identify the available data sources relevant to the problem domain. To retrieve the data, we can scrape a related website, or we can use a repository with premade datasets that are ready to use. Whether you collect data from a website or a repository, the Pandas library is a very useful tool for loading, converting, and modifying datasets.
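For example, assuming the scraped postings were saved to a CSV file (the file name here is just an illustration), loading them with Pandas takes only a few lines:

import pandas as pd

# Load the scraped job postings; the file name is an assumption for illustration.
df = pd.read_csv("glassdoor_jobs.csv")

print(df.shape)    # number of postings and attributes
print(df.columns)  # Job Title, Salary Estimate, Rating, Location, ...
print(df.head())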

So for this purpose, I have tweaked the web scraper to scrape 1000 job
postings from glassdoor.com. With each job, we get the following: Job
title, Salary Estimate, Job Description, Rating, Company, Location,
Company Headquarters, Company Size, Company Founded Date, Type of
Ownership, Industry, Sector, Revenue, Competitors. So these are the
various attributes for determining the salary of a person working in the
Data Science field.
To check the Web Scraper Article, click here.
To check the Web Scraper Github code, click here.
You can have data without information, but you cannot have
information without data.

5. Data Understanding
Data scientists try to understand more about the data collected in the previous step. We have to check the type of each column and learn more about the attributes and their names.

A few salary estimates contain -1, and those values are of no use to us, so let's remove them. Since the Salary Estimate column is currently stored as a string, we have to filter on '-1' in string format.

We can now see that the number of rows has come down to 742. We also observe that most of our variables are categorical rather than numerical: the dataset comprises 2 numerical and 12 categorical variables. In reality, though, our dependent variable, Salary Estimate, has to be numerical, so we need to convert it into a numerical variable.
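A minimal sketch of that filtering step, assuming the column is literally named 'Salary Estimate' and continuing with the DataFrame loaded earlier:

# The Salary Estimate column is stored as a string, so the missing marker is the string '-1'.
df = df[df["Salary Estimate"] != "-1"]

print(df.shape)    # 742 rows remain after dropping the -1 entries
print(df.dtypes)   # most columns are object (categorical); only a couple are numeric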

6. Data Preparation
Data can arrive in any format. To analyze it, you need the data in a consistent format. Data scientists have to prepare the data for modeling, which is one of the most crucial steps because the data has to be clean and should not contain errors or null values.

In real-world scenarios, data scientists spend 80% of their time cleaning
the data and only spend 20% of their time giving insights and
conclusions.

This is a pretty messy process, so that's something you should be prepared for.

After scraping the data, we needed to clean it up so that it was usable for our model. I made a few changes and created new variables.

When we split on the left parenthesis, the text to the left and right of '(' in each row goes into two separate elements, which is why we index with [0] to keep the salary part. After obtaining the salaries, we replace 'K' and '$' with an empty string. In a few entries the salary is given as 'employer provided' or 'per hour', so these inconsistencies have to be taken care of.

Now we shouldn't have 'employer provided' or 'per hour' left in the salary strings. We then split each cleaned string into two columns holding the minimum and maximum salary of each entry, and their average becomes our final dependent variable (the average salary we want to predict).
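Here is a sketch of the whole cleaning step, assuming the Glassdoor-style 'Salary Estimate' strings described above and continuing with the same DataFrame; the intermediate variable and column names are my own choices:

# Keep only the text before '('; element [0] of the split is the salary range itself.
salary = df["Salary Estimate"].apply(lambda x: x.split("(")[0])

# Remove the 'K' and '$' symbols so only digits and the dash remain.
minus_kd = salary.apply(lambda x: x.replace("K", "").replace("$", ""))

# Flag the inconsistent entries before stripping their text out.
df["hourly"] = df["Salary Estimate"].apply(lambda x: 1 if "per hour" in x.lower() else 0)
df["employer_provided"] = df["Salary Estimate"].apply(
    lambda x: 1 if "employer provided salary" in x.lower() else 0
)
min_hr = minus_kd.apply(
    lambda x: x.lower().replace("per hour", "").replace("employer provided salary:", "")
)

# Split the remaining 'low-high' string into numeric bounds and average them.
df["min_salary"] = min_hr.apply(lambda x: int(x.split("-")[0]))
df["max_salary"] = min_hr.apply(lambda x: int(x.split("-")[1]))
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2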

7. Exploratory Data Analysis
EDA plays a very important role at this stage, as summarizing the clean data helps identify its structure, outliers, anomalies, and patterns. These insights can help us build the model. I am going to discuss EDA in detail in a separate article, which you can find on my Medium profile.

8. Model Building
This is where the data scientist gets to see whether the work is ready to go or needs review. Modeling focuses on developing models that are either descriptive or predictive. Here we perform predictive modeling, a process that uses data mining and probability to forecast outcomes. For predictive modeling, data scientists use a training set, which is a set of historical data in which the outcomes are already known. This step can be repeated several times until the model captures the relationship between the inputs and the answer well enough.

If we have categorical data, we need to create dummy variables, so I transformed the categorical variables into dummy variables. I also split the data into train and test sets with a test size of 20%. I tried three different models and evaluated them using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret and outliers aren't particularly punishing for this type of model.

After this conversion, the number of columns in our dataset has increased from 14 to 178!
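A sketch of the encoding and the split, assuming the engineered columns live in a DataFrame called df_model (a name I'm using for illustration) and avg_salary is the target:

import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode every categorical column; this is what inflates 14 columns into ~178.
df_dum = pd.get_dummies(df_model)

X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values

# Hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)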

I have implemented three different models:

● Multiple Linear Regression — the baseline model.
● Lasso Regression — because of the sparse data from the many categorical variables, I thought a regularized regression like lasso would be effective.
● Random Forest — again, given the sparsity of the data, I thought this would be a good fit.
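A sketch of how the three baselines can be compared with cross-validated MAE, reusing the X_train and y_train from the split above (sklearn reports negative MAE, so values closer to zero are better):

import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Random Forest": RandomForestRegressor(random_state=42),
}

for name, model in models.items():
    # neg_mean_absolute_error: closer to 0 means a smaller average error in $K.
    scores = cross_val_score(
        model, X_train, y_train, scoring="neg_mean_absolute_error", cv=3
    )
    print(name, np.mean(scores))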

With its default settings, the Lasso actually starts out a little worse than the baseline, so we try to find the optimal value of alpha for which the error is smallest. I first tried alpha values in steps of i/10, but the error was still high, so I reduced the alpha values further.
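A sketch of that search, reusing the training split from above; stepping alpha by i/100 instead of i/10 is my own choice of granularity for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

alphas, errors = [], []
for i in range(1, 100):
    alpha = i / 100  # finer steps than i/10, which was too coarse
    lasso = Lasso(alpha=alpha)
    score = np.mean(cross_val_score(
        lasso, X_train, y_train, scoring="neg_mean_absolute_error", cv=3
    ))
    alphas.append(alpha)
    errors.append(score)

plt.plot(alphas, errors)   # the peak of this curve marks the best alpha
plt.xlabel("alpha")
plt.ylabel("negative MAE")
plt.show()

best_alpha = alphas[int(np.argmax(errors))]
print(best_alpha)          # 0.13 in the run described here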

After plotting the graph and checking the alpha values, we see that an alpha of 0.13 gives the best error term. Our error has now dropped from 21.09 to 19.25 (that is, about 19.25K dollars). We can also improve the model by tuning it with GridSearch.

GridSearch is the process of performing hyperparameter tuning to determine the optimal values for a given model. With GridSearchCV, you basically specify all the parameter values you want to try, and it fits a model for every combination and picks the one with the best results.

We can even use Support Vector Regression, XGBoost, or any other models.

Random Forest Regression is a tree-based ensemble method, and because our dataset is now full of 0/1 dummy variables, we expect it to handle this kind of data well. That is why I preferred Random Forest Regression here.
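A sketch of tuning the random forest with GridSearchCV, reusing the training split from above; the parameter grid here is a small illustrative one, not necessarily the exact grid used in the project:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

parameters = {
    "n_estimators": range(10, 110, 10),
    "criterion": ("squared_error", "absolute_error"),
    "max_features": ("sqrt", "log2"),
}

gs = GridSearchCV(
    RandomForestRegressor(random_state=42),
    parameters,
    scoring="neg_mean_absolute_error",
    cv=3,
)
gs.fit(X_train, y_train)

print(gs.best_score_)      # best cross-validated negative MAE
print(gs.best_estimator_)  # the tuned forest, reused for predictions below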

Here we get a smaller error than with the previous models, so the Random Forest model is the best of the three. I then combined the Random Forest model with the Linear Regression model to make a prediction by taking the average of the two, which means giving 50% weight to each model.
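A sketch of that 50/50 blend on the test set, assuming lm is the fitted linear regression and gs is the fitted grid search from the sketches above:

from sklearn.metrics import mean_absolute_error

# Predictions from the fitted linear model and the tuned random forest.
tpred_lm = lm.predict(X_test)
tpred_rf = gs.best_estimator_.predict(X_test)

print(mean_absolute_error(y_test, tpred_lm))                   # linear regression alone
print(mean_absolute_error(y_test, tpred_rf))                   # random forest alone
print(mean_absolute_error(y_test, (tpred_lm + tpred_rf) / 2))  # 50/50 blend

# A weighted blend, e.g. 90% forest and 10% linear model, is just as easy to test.
print(mean_absolute_error(y_test, 0.9 * tpred_rf + 0.1 * tpred_lm))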

Most of the time it's better to combine different models and then make predictions, because there is a good chance of improving our accuracy. These are called ensemble models, and they are widely used. The error may or may not improve, though, because one of the models might be overfitting.

The tuned Random Forest model is the best here because it has the lowest error compared to Lasso and Linear regression. So instead of taking a plain average of the two, we can also blend 90% of the random forest model with 10% of another model and test the accuracy/performance. Generally, these types of ensemble models work better for classification problems.

The project should not be about trying every model; it should be about choosing the most effective ones and being able to tell a story about why we picked them. Usually Lasso regression should outperform plain linear regression, since it adds regularization and we have a sparse matrix, but here the Lasso performed worse than the linear regression. So it varies from case to case, and we cannot generalize.

9. Model Evaluation
Data scientists can evaluate a model in two ways: Hold-Out and Cross-Validation. In the Hold-Out method, the dataset is divided into three subsets: a training set; a validation set, which is used to assess the performance of the model built in the training phase; and a test set, which is used to estimate the likely future performance of the model. In most cases the training:validation:test ratio is 3:1:1, which means 60% of the data goes to the training set, 20% to the validation set, and 20% to the test set.
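A sketch of a 60/20/20 hold-out split, done with two calls to train_test_split:

from sklearn.model_selection import train_test_split

# First carve off 20% of the data as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80% into 60% training and 20% validation (0.25 of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)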

10. Model Deployment
I have created a basic webpage so that it's simple to understand. Given the details of the employee and the company, the model predicts the expected salary for that employee.

I have deployed my Machine Learning model on Heroku using Flask. I trained the deployed model using Linear regression (because it's easy to understand), but you can always train your model using any other Machine Learning model, or you can even use an ensemble model, as they provide good accuracy.
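A minimal sketch of such a Flask endpoint, assuming the trained model was pickled to a file and the client sends an already-encoded feature vector; the file name, route, and field names are illustrative, not the exact ones from the deployed app:

import pickle

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained regression model once at startup (file name is an assumption).
with open("model_file.p", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"input": [0, 1, 0, ...]} matching the training columns.
    data = request.get_json()
    features = np.array(data["input"]).reshape(1, -1)
    prediction = model.predict(features)[0]
    return jsonify({"predicted_salary_k": float(prediction)})


if __name__ == "__main__":
    app.run(debug=True)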

Material Sources:
https://www.kdnuggets.com/2020/11/build-data-science-project.html
https://blog.devgenius.io/learn-to-build-an-end-to-end-data-science-project-c9f79692191

Other Reading Sources:

1. Python Best Practice Example
https://www.kaggle.com/kabure/predicting-credit-risk-model-pipeline/notebook
2. R Best Practice Example
https://www.kaggle.com/raenish/home-credit-default-risk-r
https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification
3. Python Code Style Guide
https://peps.python.org/pep-0008/
https://realpython.com/python-pep8/
4. R Code Style Guide
http://adv-r.had.co.nz/Style.html
https://www.r-bloggers.com/2014/07/consistent-naming-conventions-in-r/
