Salary Prediction
Fall 2022
Group 6
Web Mining Final Project Report
Contents
Introduction:
Research Questions:
Data Analysis:
Feature Extraction:
Exploratory Data Analysis
Model Selection:
Model Comparison:
Result Analysis:
1. Sentiment Analysis
2. Salary Prediction
Limitations:
Future Work:
Introduction:
In response to the current state of the global economy, many companies have
sharply reduced their hiring. The recruiting process is difficult for everyone,
but in today's economy it is particularly difficult for recent graduates who
are looking for their first job.
When someone is searching for their first job, the employment websites they use
are not always as useful and versatile as they could be. When we look up a
particular position on these websites, we find that the requirements and
specifications for that role differ from listing to listing, even when the job
title is the same. This confuses students and can cause problems throughout the
interview process.
Because the salaries offered by different companies vary so widely, people are
often unsure what pay to discuss, particularly when they are looking for their
first job. Our project focuses on addressing issues of this kind.
Our project works with one such job listing website, Glassdoor. It is a website
where companies publish job openings along with the essential information,
users share their opinions of a specific position or company, and job seekers
read reviews and search for jobs. Using this data, we can determine which
companies are advertising which job titles, the salaries associated with those
job titles, and the geographic distribution of demand across the country.
Research Questions:
The following areas are researched as part of our project to help us understand
the diversity of jobs and corresponding variables and details:
1. Annual Salary based on Job Title
2. Sentiment Analysis on Reviews
3. Salary Distribution across different states
4. Variation of job salaries based on different factors (company rating,
seniority, skills required, etc.)
5. Word cloud on job description
Data Analysis:
Feature Extraction:
We scraped job listings from the Glassdoor website using Beautiful Soup. We
also used the Python libraries Plotly, Seaborn, NumPy, pandas, spaCy, requests,
time, and re. The extracted data required extensive cleaning. As expected, not
every company follows the posting guidelines or provides complete data:
sometimes the city, or even the state, was missing from the company's location.
Most companies disclose pay in one of four ways: an annual salary ($52K), an
annual salary range ($52K - $73K), an hourly rate ($25), or an hourly rate
range ($25 - $35). During data processing, we normalized all of these formats
for consistency. The extracted data contains job listings from companies such
as Accenture, American Express, Apple, Barclays, Cisco Systems, Citi, Deloitte,
EY, Goldman Sachs, Google, IBM, and Morgan Stanley.
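The salary normalization described above can be sketched roughly as follows.
This is a minimal illustration, not the project's actual cleaning code. It
assumes that a "K" suffix marks an annual figure in thousands, that a figure
without "K" is an hourly rate, that a full-time year is 2,080 hours, and that a
range is represented by its midpoint. The function name annual_salary is
hypothetical.

```python
import re

HOURS_PER_YEAR = 40 * 52  # assumption: full-time work, 2,080 hours per year


def annual_salary(text):
    """Return an estimated annual salary in dollars, or None if no amount is found."""
    # Find every dollar amount and note whether it carries a "K" suffix.
    amounts = re.findall(r"\$\s*(\d+(?:\.\d+)?)\s*([Kk]?)", text)
    if not amounts:
        return None
    values = []
    for number, suffix in amounts:
        value = float(number)
        if suffix:
            value *= 1000            # "$52K" is read as 52,000 per year
        else:
            value *= HOURS_PER_YEAR  # assumption: no "K" means an hourly rate
        values.append(value)
    # Assumption: a range such as "$52K - $73K" is reduced to its midpoint.
    return sum(values) / len(values)
```

For example, annual_salary("$52K - $73K") and annual_salary("$25 - $35") both
return a single annual figure, so the four posting formats become directly
comparable.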
Exploratory Data Analysis
Average salary distribution across US states
Virginia has the highest average annual salary. The states with the lowest
average annual salaries are Tennessee and North Carolina.
The above figure shows annual salaries by job title. The job title Senior
Blockchain Data Analyst, based in Washington, D.C., has the highest annual
salary, at more than $275K.
Annual Salaries of the bottom 10 jobs
This graph shows the 10 job titles with the lowest annual salaries. Finance
Analyst has the lowest salary, below $10K.
This plot shows the cities with the highest-paying job titles. We can conclude
that Washington, D.C. has the highest-paying Analyst job titles.
Annual Salary Distribution
This figure shows that most salaries fall between $50K and $100K, and that only
one salary is above $250K.
Annual salaries with respect to US states
This histogram shows job availability along with the salary distribution for
the 17 US states in our data.
Word Cloud
The word cloud shows the words that occur most frequently in the job
descriptions. Words like Business, Data, and Technical have the highest
frequency.
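A word cloud sizes each word by how often it occurs, so the underlying step is
a word frequency count. The sketch below shows that count using only the
standard library. The stop-word list and the sample descriptions are made up
for illustration, and word_frequencies is a hypothetical helper, not the
project's code.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the report's list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def word_frequencies(descriptions):
    """Count how often each non-stop word appears across job descriptions."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Made-up job descriptions, for illustration only.
sample_descriptions = [
    "Analyze business data and build technical reports",
    "Work with business teams to deliver data products",
]
frequencies = word_frequencies(sample_descriptions)
top_words = frequencies.most_common(2)
```

The resulting counts could then be passed to a plotting library to draw the
cloud itself.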
Pie Chart
The pie chart shows each state's share of the job listings we scraped from
Glassdoor across 17 US states, which makes it easy to see which states have the
most and the fewest available jobs.
Polarity Plot
From the polarity plot, we can determine that most of the reviews are positive.
Subjectivity Plot
From the subjectivity plot, we can determine that the majority of the reviews
are fairly subjective.
Model Selection:
Model Comparison:
We used three different models for salary prediction:
1. Lasso Regression
2. Linear Regression
3. Random Forest Regressor
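One common way to compare candidate regressors is to score each of them on the
same held-out listings and pick the one with the lowest error. The report does
not state which metric was used, so mean absolute error is an assumption here.
The data and the two predictors below are made-up stand-ins, not the Lasso,
linear, or random forest models themselves; in a real run, each entry would be
the predict call of a fitted model.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted salaries."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def compare_models(models, features, actual):
    """Score every model on the same held-out rows; lower error is better."""
    scores = {}
    for name, predict in models.items():
        predictions = [predict(row) for row in features]
        scores[name] = mean_absolute_error(actual, predictions)
    return scores


# Made-up held-out rows: (company rating, seniority flag) and salary in $K.
held_out_features = [(3.5, 0), (4.0, 1), (4.5, 1)]
held_out_salaries = [60.0, 95.0, 110.0]

# Stand-in predictors (hypothetical), used only to exercise the comparison.
candidate_models = {
    "constant_baseline": lambda row: 80.0,
    "seniority_rule": lambda row: 60.0 + 40.0 * row[1],
}

scores = compare_models(candidate_models, held_out_features, held_out_salaries)
best_model = min(scores, key=scores.get)
```

Scoring every model on identical held-out rows keeps the comparison fair,
whichever metric is chosen.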
Result Analysis:
1. Sentiment Analysis
This histogram compares the review polarity of two companies, Apple and
American Express (blue: American Express; red: Apple). Apple's reviews have
higher positive polarity than those of American Express. Thus, we can say that
Apple has better reviews than American Express from the employee perspective.
With our model implemented, we can identify which companies have the most
satisfied employees and a good work environment, as well as which companies
have the least satisfied employees.
Of all the companies, Apple has the highest sentiment score, 0.33, which
suggests that Apple employees are the most satisfied with their company. EY has
the lowest sentiment score, 0.23, which suggests that EY employees are the
least satisfied with their company.
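A per-company sentiment score like the ones above can be obtained by
aggregating the polarity of each company's reviews. The sketch below assumes
the company score is the mean polarity of its reviews, which the report does
not state explicitly. The company names and polarity values are made up, and
company_sentiment is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def company_sentiment(scored_reviews):
    """Average the polarity scores of each company's reviews."""
    by_company = defaultdict(list)
    for company, polarity in scored_reviews:
        by_company[company].append(polarity)
    return {company: mean(values) for company, values in by_company.items()}


# Made-up (company, polarity) pairs; polarity is assumed to lie in [-1, 1].
sample_reviews = [
    ("Company A", 0.5), ("Company A", 0.25),
    ("Company B", 0.25), ("Company B", 0.0),
]

averages = company_sentiment(sample_reviews)
most_satisfied = max(averages, key=averages.get)
least_satisfied = min(averages, key=averages.get)
```

Ranking companies by this average gives the highest and lowest sentiment
directly.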
2. Salary Prediction
As seen in the code above, all the user has to do is enter their job
requirements in the highlighted part:
1. The industry they want to work in
2. The type of company ownership (Public, Private, NGO, Government, etc.)
3. The job title they want
4. The level of seniority of the job
5. The skills and tools they want to work with as part of the job
Once the user has specified their requirements, our model shows the annual
salary the user should expect, based on all the job listings that we extracted
and used to train the model.
As seen in the code above, our model gives the estimated salary in the
highlighted part.
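Before a regression model can use requirements like these, they have to be
turned into numbers. The sketch below shows one common approach, one-hot
encoding, which is an assumption; the report does not say how the inputs were
encoded. The vocabularies, field names, and the encode_request helper are all
hypothetical.

```python
def encode_request(request, categories, known_skills):
    """Turn a user's job requirements into a flat numeric feature row."""
    row = []
    # One-hot encode each categorical field against its known values.
    for field, values in categories.items():
        row.extend(1 if request.get(field) == value else 0 for value in values)
    # Add one 0/1 flag per skill the user wants to work with.
    wanted = {skill.lower() for skill in request.get("skills", [])}
    row.extend(1 if skill in wanted else 0 for skill in known_skills)
    return row


# Hypothetical vocabularies; real ones would come from the scraped listings.
categories = {
    "industry": ["Finance", "Technology"],
    "ownership": ["Public", "Private"],
    "job_title": ["Data Analyst", "Business Analyst"],
    "seniority": ["junior", "senior"],
}
known_skills = ["python", "sql", "excel"]

request = {
    "industry": "Technology",
    "ownership": "Public",
    "job_title": "Data Analyst",
    "seniority": "junior",
    "skills": ["Python", "SQL"],
}
features = encode_request(request, categories, known_skills)
```

The resulting row has the same layout as the training data, so a trained model
could take it as input and return the estimated salary.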
The standard Glassdoor website may show the salary for an individual job. Our
model goes further by letting the user estimate their expected salary based on
several input criteria.
Recent graduates often worry during interviews because they do not know what
salary to expect. In some cases they may be underpaid because of their
inexperience. Our model can help them in these situations.
Limitations:
We trained our model so that it is useful to the user before they go to an
interview. Inputs such as the required skills depend entirely on the user, who
can choose whichever skills they want to work with. However, we do not check
whether the user is actually proficient in the skills they search for.
For example, if the user enters Python as a skill requirement in their job
search, we do not verify whether they are comfortable working in Python. To
include that check, we would need to analyze the user's resume as well.
Additionally, the Glassdoor website is updated frequently. Extracting data in
real time is challenging because the HTML class names that the scraper relies
on change often. Each time data needs to be extracted, all the classes on the
website have to be checked before running the code, which is inconvenient.
Future Work:
Our next step will be to address the resume limitation described above. To do
this, we will begin by gathering resume datasets. After each resume has been
cleaned and processed, we will compare it against the job description. With
this functionality added to our model, we would also be able to advise the user
on the skills they should use in their job search to increase their chances of
landing a position.