Professional Documents
Culture Documents
Appreciating the process you must work through for any Data Science project
is valuable before you land your first job in this field. With a well-honed
strategy, such as the one outlined in this example project, you will remain
productive and consistently deliver valuable machine learning models.
A Data Scientist is one who is the best programmer among all the statisticians
and the best statistician among all the programmers. Every Data Scientist
needs an efficient strategy to solve data science problems. Data Science
positions are unique across the country so we can try and predict the salary of
data science positions based on Job Title, Company, Geography, etc. Here I
have built a project where any user can plug in the information, and it splits up
into a range of salaries, so if anyone is trying to negotiate, then this is a pretty
cool tool for them to use.
1
1. Business Understanding
This stage is significant because it helps clarify the customer’s target.
The success of any project depends on the quality of the questions
asked. If you understand the business requirement correctly, then it
helps you collect the right data. Asking the right questions will help you
narrow down the data acquisition part.
2. Analytic Approach
This is the stage where, once the business problem has been clearly
stated, the data scientist can define the analytic approach to solve the
problem. This step includes explaining the problem in the sense of
statistical and machine-learning techniques, and it is important as it
helps to determine what kind of trends are required to solve the issue in
the most efficient way possible. If the issue is to determine the
probabilities of something, then a predictive model might be used; if the
question is to show relationships, a descriptive approach may be
required, and if our problem requires counts, then statistical analysis is
the best way to solve it. For each type of approach, we can use different
algorithms.
3. Data Requirements
We find out the necessary data content, formats, and sources for initial
data collection, and we use this data inside the algorithm of the
approach we chose. Data reveals impact, and with data, you can bring
more science to your decisions.
4. Data Collection
We identify the available data resources relevant to the problem
domain. To retrieve the data, we can apply web scraping on a related
website, or we can use a repository with premade datasets that are
ready to use. If you want to collect data from any website or repository,
2
use the Pandas library, which is a very useful tool to download, convert,
and modify datasets.
So for this purpose, I have tweaked the web scraper to scrape 1000 job
postings from glassdoor.com. With each job, we get the following: Job
title, Salary Estimate, Job Description, Rating, Company, Location,
Company Headquarters, Company Size, Company Founded Date, Type of
Ownership, Industry, Sector, Revenue, Competitors. So these are the
various attributes for determining the salary of a person working in the
Data Science field.
To check the Web Scraper Article, click here.
To check the Web Scraper Github code, click here.
You can have data without information, but you cannot have
information without data.
5. Data Understanding
Data scientists try to understand more about the data collected before.
We have to check the type of each data and have to learn more about
the attributes and their names.
3
Few salaries contain -1, so those values are not of much importance to
us so let’s remove them. As the salary estimate column is in a string
right now, so we need to give -1 in the string format.
So now we can see that the number of rows has come down to 742. We
observe that most of our variables are categorical and not numerical.
This dataset comprises 2 numerical and 12 categorical variables. But in
reality, our dependent variable, Salary Estimate, has to be numerical. So
we need to convert that into a numerical variable.
6. Data Preparation
Data can be in any format. To analyze it, you need to have data in a
certain format. Data scientists have to prepare data for modeling, which
is one of the most crucial steps because the model has to be clean and
should not contain any errors or null values.
4
In real-world scenarios, data scientists spend 80% of their time cleaning
the data and only spend 20% of their time giving insights and
conclusions.
When we split on the left parenthesis, what happens is, the left and
right sides of ‘(‘ of all the rows go into 2 different lists. That’s why we
need to include [0] to get the salaries. After obtaining the salaries,
replace ‘K’,’$’ with an empty string. In a few entries, the salary is given as
‘employer provided’ and ‘per hour’, so these are inconsistent and should
be looked after.
5
Now we shouldn’t have employer provided or per hour in salaries. So we
return 2 lists which contain the minimum and maximum salaries of
each entry. This becomes our final dependent variable(to predict the
average salary of a person)
8. Model Building
The data scientist has the chance to understand if his work is ready to
go or if it needs review. Modeling focuses on developing models that are
either descriptive or predictive. So here, we perform Predictive
modeling, which is a process that uses data mining and probability to
forecast outcomes. For predictive modeling, data scientists use a
training set that is a set of historical data in which the outcomes are
already known. This step can be repeated more times until the model
understands the question and answer to it.
6
If we have categorical data, then we need to create dummy variables, so
that's why I transformed the categorical variables into dummy variables.
I also split the data into train and test sets with a test size of 20%. I tried
three different models and evaluated them using Mean Absolute Error. I
chose MAE because it is relatively easy to interpret, and outliers aren’t
particularly bad for this type of model.
7
Now when it starts, it’s a little worse. So we try to find the optimal value
of alpha for which the error is the least.
Here I have chosen i/10 as well, but the error was still high, so that’s why
I have reduced the values of alpha.
8
After plotting the graph and checking the value of alpha, we see that an
alpha value of 0.13 gives the best error term. Now our error has reduced
from 21.09 to 19.25 (which means 19.25K dollars). We can also improve
the model tuning the GridSearch.
9
Random Forest Regression is a tree-based decision process, and also,
there are many 0s, 1s in our dataset, so we expect it to be a better
model. So that’s why I have preferred Random Forest Regression here.
So here we are getting a smaller value of error than the previous ones,
so the Random Forest model is better than the previous models. I have
combined the Random Forest model with the Linear Regression model
to make a prediction. So I have taken the average of both, which means
that I have given 50% weightage to each of the models.
Most of the time, it’s better to combine different models and then make
predictions because there are very good chances of increasing our
accuracy. These types of models are called ensemble models, and they
are widely used. The error may or may not increase because one model
might be overtraining.
The tuned Random Forest model is the best here because it has the
least error when compared to Lasso and Linear regression. So instead of
10
taking the average of both, we can even merge 90% of the random
forest model with 10% of any other models and test the
accuracy/performance. Generally, these types of ensemble models are
better for classification problems.
The project should not be about trying all the models, but it should be to
choose the most effective models and should be able to tell a story as to
why we have chosen those specific ones. Usually, Lasso regression
should have more effect than linear regression as it has the
normalization effect, and we have a sparse matrix, but here the Lasso
performed worse than the linear regression. Hence it depends model to
model, and we cannot generalize anything.
9. Model Evaluation
Data scientists can evaluate the model in two ways: Hold-Out and
Cross-Validation. In the Hold-Out method, the dataset is divided into
three subsets: a training set, a validation set that is a subset that is used
to assess the performance of the model built in the training phase, and
a test set is a subset to test the likely future performance of a model. In
most of the cases, the training:validation:test set ratios will be 3:1:1,
which means 60% of the data to the training set, 20% of the data to the
validation set, and 20% of the data to the test set.
11
Machine Learning model, or you can even use ensemble models as they
provide good accuracy.
Material Sources:
https://www.kdnuggets.com/2020/11/build-data-science-project.html
https://blog.devgenius.io/learn-to-build-an-end-to-end-data-science-
project-c9f79692191
12