
Market Segmentation - Job Market in India

Finding the companies most likely to hire an ML Engineer/Data Scientist applicant with respect to his/her skillset

Submitted By

Kiran Kunwar Chouhan (Lead)


Amena Zoha
Sai Deeraj.D
Pavan Kumar Yadav Dukka
Mohammed Hamza Malik
Abstract
Machine Learning Engineer and Data Scientist are two of the hottest jobs in the industry right
now, and for good reason. With 2.5 quintillion bytes of data generated every day, a professional
who can organize this humongous data to provide business solutions is the hero! The competition
between Machine Learning Engineers and Data Scientists is increasing, and the line between them
is diminishing.
The mix of personality traits, experience, and analytic skills required is considered difficult to find.
Thus, the demand for qualified Data Scientists and Machine Learning Engineers has exceeded the
supply in recent years. This report therefore analyzes the job market in India using market
segmentation, since the job market categories such as Machine Learning Engineer, Data
Scientist, Data Analyst, and Data Engineer are in-demand roles with respect to company size
and growth. The main components taken into consideration are job position, skillset, company
location, company size, experience, and salary. A detailed investigation of the ML Engineer/Data
Scientist job market has been carried out through various machine learning techniques such as
K-Means clustering and a linear regression model.

Data Collection
The data has been collected manually, and the sources used for this process are listed below.
• https://www.naukri.com/data-scientist-ml-engineer-jobs?k=data%20scientist%2Fml%20en
• https://www.naukri.com/machine-learning-jobs?k=machine%20learning&functionAreaIdGid=3&functionAreaIdGid=5
• https://www.glassdoor.co.in/Jobs/Glassdoor-Jobs-E100431.htm
• https://www.linkedin.com/jobs/search/?geoId=102713980&keywords=data%20scientist&location=India

Market Segmentation
Target Market:
The target market for job market segmentation can be categorized using Geographic, Socio-Demographic, Behavioral, and Psychographic segmentation.
Behavioral segmentation: searches directly for similarities in behavior or reported behavior.
Example: prior experience with the product, amount spent on the purchase, etc.
Advantage: the very behavior of interest is used as the basis of segment extraction.
Disadvantage: behavioral data is not always readily available.
In job market segmentation, this segmentation is based on employees' experience, for example
whether a former employee reports a good experience with the company. A company should be
flexible with an employee and pay the right salary according to the work contributed to the
company; doing this increases the chances that applicants will apply to that company on the
strength of a former employee's review.
Psychographic segmentation: grouped based on beliefs, interests, preferences, aspirations, or
benefits sought when purchasing a product. Suitable for lifestyle segmentation. Involves many
segmentation variables. Advantage: generally more reflective of the underlying reasons for
differences in consumer behavior. Disadvantage: increased complexity of determining segment
memberships for consumers.

Socio-Demographic segmentation: includes age, gender, income, and education. Useful in some
industries.
Advantage: segment membership can easily be determined for every customer.
Disadvantage: if these criteria are not the cause of customers' product preferences, they do not
provide sufficient market insight for optimal segmentation decisions.
Geographic segmentation: the original criterion used for market segmentation. Typically this
consists of using the consumer's location of residence to form market segments.
Example: place, culture, nationality, etc.
Advantage: every consumer can easily be assigned to a geographic unit, which makes communication
within that unit easier.
Disadvantage: people living in the same geographic unit do not necessarily share the characteristics
that are relevant to marketers.
In this report, classifying companies based on their location corresponds to geographic segmentation.

At the most fundamental level, the difference between conventional market segmentation methods
and the jobs-based segmentation approach is the primary unit of analysis for grouping customers.
For conventional segmentation, the primary unit of analysis is the attributes of customers
themselves. The primary unit of analysis for jobs-based segmentation is a job that customers are
trying to get done.

However, the aim of both approaches is the same—to create customer segments that share a
uniform set of needs that can be profitably satisfied by products and services. Let’s use this
common premise as the basis for further contrasting jobs-based segmentation and conventional
segmentation methods.

One key difference between these two approaches is how a “market” is defined. The conventional
definition of a market is based on the product and service categories defined by solution providers.
Jobs Theory, on the other hand, defines a market as an aggregation of all available solutions, both
provider and non-provider, that customers regard as being able to satisfy their needs with respect
to getting a job done. For this reason, the term customer-defined solution market or simply
solution market is used rather than just the term “market” to emphasize this distinction.

From a Jobs Theory perspective, customers segment themselves by the jobs they’re trying to get
done and the needs they’re trying to satisfy with respect to getting those jobs done well. For this
reason, customers do not constrain their search for solutions based on industry-defined product
and service categories.

This is an important distinction because to create and maintain the best value solutions, it’s
imperative to know what other alternatives your products and services are competing with from
the customers’ perspective. If this is not known, it’ll be unclear what differential value must be
generated to keep a company’s offerings positioned as the best value among competing solutions.

Yet another key difference is the primary criteria used to define customer needs. Conventional
segmentation approaches group customers according to similar characteristics and behaviors—
collectively called attributes. Customer attributes often include a combination of demographic,
psychographic, lifestyle and behavioral data, and business classifications, among others.

The assumption is that individuals and organizations that share similar attributes will strongly
correlate with a uniform set of needs for that group. The problem, however, is that correlation can’t
really predict the value that customers want from solutions because their buying behaviors are
often driven by more fundamental causal factors.

Using correlations as the primary basis for segmentation can result in customer segments that have
significant variation around a set of needs. Such variation makes it very difficult for companies to
create solutions that can profitably satisfy all the needs of customers in that group.
In contrast, the primary criteria used to define customer needs for jobs-based segmentation are the
jobs that customers are trying to get done, not the attributes of the customers themselves. For any
target job, moments of struggle and the circumstance causing that struggle are the primary basis
for grouping customers into segments. Customer attributes are then used as secondary criteria to
create job executor personas so that customers can be identified out in the world based on
who they are and/or what they do.

Once customers are grouped according to these criteria for a target job, all the value targets
associated with those particular job executors are a complete and precise set of needs for the
segment. As such, those customers have a high degree of uniformity around the value they want
from solutions to get a target job done better. Because innovation teams know in advance the value
that these customers want, they can consistently create and maintain the best value solutions at the
lowest cost to the company.

Again, the common goal of segmentation is to identify groups of customers that have a high degree
of uniformity around a set of needs. Therefore, the efficacy of any segmentation method relies on
completely and precisely defining those needs. Yet, needs are often defined in ambiguous and
incomplete ways.

They are defined as wants, benefits, motivations, a bundle of satisfactions, requirements, problems
to be solved, state of dissatisfactions, desire sets, preferences, desired outcomes, product attributes,
functional goals, functional tasks, critical-to-quality specifications, among other definitions.
Further, it’s often said that customers have articulated and unarticulated needs.

This uncertainty is a big problem because it’s been well established that ambiguity around
customer needs is the root cause of most innovation failures.
To summarize, conventional segmentation methods group customers together that share similar
attributes. For each segment, the customers’ needs are then defined using a combination of
methods like the voice of the customer, ethnography, lead user analysis, conjoint analysis, and
focus groups, to mention a few. The aim is then to satisfy those needs better than competing
solutions where “competing solutions” are typically limited to product and service categories
defined by companies.
Regardless of which of these methods are used, conventional segmentation is inherently based on
the assumptions of correlation because needs are defined after customers are grouped according
to similar attributes.
The Jobs-based segmentation method, on the other hand, starts with identifying an important job
that’s not getting done well with products and services. Competing solutions are those defined by
the customers’ solution market. Using the Jobs-to-be-Done Framework, a complete set of needs is
first captured for all customers trying to get that job done.
Customers are then asked to prioritize that set of needs which produces an exhaustive set of value
targets. The moments of struggle indicated by undershot value targets and the circumstance
causing that struggle are then used as the primary criteria to group job executors into a job
segment(s). The other associated value targets for the segment(s) fall into place. Now the customer
segment has a high degree of uniformity around the value those customers want from solutions to
get that particular job done better.
Jobs-based segmentation is inherently predictive because causal factors, rather than assumptions of
correlation, are used as the basis for grouping customers. This gives a company a significant
advantage over competitors because it can anticipate the value that customers want, even
before customers are aware of certain needs.
A company can quickly and efficiently enhance its existing offerings and create new offerings
that satisfy customer needs better than competitive alternatives at the lowest possible cost.

Data Cleaning
Information on the role, the company, the company's location, the required experience and skills,
the company size, the enrollment type, and the salary range was collected from various job search
websites such as Naukri.com and Glassdoor, with specific parameters such as location and role used
as a base for the collection. The collected data is compact and is used partly for visualization
and partly for clustering. Python libraries such as NumPy, Pandas, Scikit-Learn, and SciPy are
used for the workflow, and the results obtained are ensured to be reproducible.
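A minimal sketch of this cleaning step in Pandas is given below; the file name and column names are illustrative assumptions, since the dataset itself was compiled manually.

import pandas as pd

# Hypothetical file and column names; the actual dataset was compiled by hand
df = pd.read_csv("job_listings.csv")
df = df.drop_duplicates()  # drop postings collected twice from different sites

# Split an "Experience" column of the form "3-8 Yrs" into numeric min/max values
exp = df["Experience"].str.extract(r"(\d+)\s*-\s*(\d+)").astype(float)
df["min_experience"], df["max_experience"] = exp[0], exp[1]
df["avg_experience"] = (df["min_experience"] + df["max_experience"]) / 2

# Fill missing company size and salary entries so later grouping does not fail
df["Company Size"] = df["Company Size"].fillna("0")
df["Salary"] = df["Salary"].fillna("0")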

EDA
We start the Exploratory Data Analysis with some analysis drawn from the combined dataset,
both without and with Principal Component Analysis (PCA). PCA is a statistical process that converts
observations of correlated features into a set of linearly uncorrelated features with the help of
an orthogonal transformation. These new transformed features are called the principal components.
The process helps reduce the dimensionality of the data, which makes classification, regression,
or any other form of machine learning more cost-effective.
Fig 2: above data transformed into principal components

Fig 3: PC1 vs PC2
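A minimal sketch of the PCA step, assuming the cleaned data frame df from the previous section and illustrative numeric column names:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Numeric columns used for PCA (names are illustrative assumptions)
numeric_cols = ["min_experience", "max_experience", "avg_experience", "min_salary", "max_salary"]
X = StandardScaler().fit_transform(df[numeric_cols])

pca = PCA()                              # keep all components for now
pcs = pca.fit_transform(X)               # principal component scores per job posting
print(pca.explained_variance_ratio_)     # share of variance explained by each PC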


Correlation matrix: A correlation matrix is simply a table that displays the correlation coefficients
for different variables. It is best used for variables that demonstrate a linear relationship with each
other. The matrix depicts the correlation between all possible pairs of variables through the heatmap
in the figure below. The relationship between two variables is usually considered strong when their
correlation coefficient is larger than 0.7. According to the heatmap below, the following variables
are strongly correlated:
• Eligibility criteria is strongly negatively correlated with enrollment type, maximum experience, and average experience.
• Enrollment type is strongly positively correlated with average experience.
Fig 4: Correlation matrix for the dataset
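A heatmap like Fig 4 can be produced with a short Seaborn sketch along the following lines, reusing the assumed numeric columns from the PCA step:

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric features, rendered as a heatmap (cf. Fig 4)
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix for the dataset")
plt.show()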
Irrespective of the salary a company offers, the city with the most job openings is Bengaluru,
with 40% of the vacancies. This gives a person a good chance of starting a career there,
particularly a fresher. A person with 3 to 8 years of experience can expect to be hired quickly
with good pay; this does not mean that others will not be well paid or hired, only that companies
have to think harder about those candidates. Likewise, many conclusions can be drawn from the EDA
performed on the dataset. Since we know the location with the most in-demand job opportunities and
the experience levels most companies are asking for, a new job segment, company, or market can be
set up in this place, with the required experience, knowledge, and skills defined with respect to
the company.

The company Analytics Vidhya Educon Pvt. Ltd. has the highest number of open roles. From an
employee's perspective, if their requirements are met, they may apply to this company directly
instead of taking the long road and eventually returning to it. That process takes a long time, and
based on the roles and positions this company is hiring for, a person has a higher probability of
getting their dream job here than at any of the companies above. Likewise, a two-way analysis can be
performed from both the employee and employer perspectives, widening the job market within the
location of the employee.

Strip plot
The above strip plot depicts min_salary scattered across the categorized min_experience values.

Point plot
The above point plot represents an estimate of the central tendency of min_salary for each
min_experience value and provides some indication of the uncertainty around that estimate
using error bars.
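A sketch of how both plots can be produced with Seaborn, assuming the column names min_experience and min_salary from the cleaned data frame:

import matplotlib.pyplot as plt
import seaborn as sns

# Strip plot: raw min_salary values for each categorized min_experience value
sns.stripplot(x="min_experience", y="min_salary", data=df)
plt.show()

# Point plot: mean min_salary per min_experience with error bars for the uncertainty
sns.pointplot(x="min_experience", y="min_salary", data=df)
plt.show()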

The above chart depicts that there is a high demand for Data Scientist roles in the job market.
The above bar chart depicts the availability of jobs based on Role/Position, Location and
Company.

The above bar chart depicts the number of jobs available for specific qualifications in Data
Scientist, Machine Learning Engineer and Data Engineer roles respectively.

Scree plot is a common method for determining the number of PCs to be retained via graphical
representation. It is a simple line segment plot that shows the eigenvalues for each individual PC.
It shows the eigenvalues on the y-axis and the number of factors on the x-axis. It always displays a
downward curve. Most scree plots look broadly similar in shape, starting high on the left, falling
rather quickly, and then flattening out at some point. This is because the first component usually
explains much of the variability, the next few components explain a moderate amount, and the latter
components only explain a small fraction of the overall variability. The scree plot criterion looks
for the “elbow” in the curve and selects all components just before the line flattens out. The
proportion of variance plot: The selected PCs should be able to describe at least 80% of the variance.
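A sketch of the scree plot and the proportion-of-variance check, reusing the fitted pca object from the PCA step above:

import numpy as np
import matplotlib.pyplot as plt

# Scree plot: eigenvalue (explained variance) of each principal component
plt.plot(np.arange(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()

# Cumulative proportion of variance; keep enough PCs to reach roughly 80%
print(np.cumsum(pca.explained_variance_ratio_))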

Extracting Segments
Dendrogram
This technique is specific to the agglomerative hierarchical method of clustering. The
agglomerative hierarchical method of clustering starts by considering each point as a separate
cluster and starts joining points to clusters in a hierarchical fashion based on their distances. To get
the optimal number of clusters for hierarchical clustering, we make use of a dendrogram which is a
tree-like chart that shows the sequences of merges or splits of clusters. If two clusters are merged,
the dendrogram will join them in a graph and the height of the join will be the distance between
those clusters.

As shown in the figure, we can choose the optimal number of clusters based on the hierarchical
structure of the dendrogram. As highlighted by the other cluster validation metrics, four to five
clusters can be considered for the agglomerative hierarchical method as well.
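A sketch of this step with SciPy, assuming X is the scaled feature matrix used for PCA (Ward linkage is one common choice):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage on the scaled feature matrix X, then the dendrogram
Z = linkage(X, method="ward")
dendrogram(Z, truncate_mode="level", p=5)   # show only the top five merge levels
plt.xlabel("Observations / merged clusters")
plt.ylabel("Distance")
plt.show()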
Elbow Method
The Elbow method is a popular method for determining the optimal number of clusters. The
method is based on calculating the Within-Cluster-Sum of Squared Errors (WSS) for a different
number of clusters (k) and selecting the k for which change in WSS first starts to diminish. The
idea behind the elbow method is that the explained variation changes rapidly for a small number
of clusters and then it slows down leading to an elbow formation in the curve. The elbow point is
the number of clusters we can use for our clustering algorithm.

The KElbowVisualizer function fits the KMeans model for a range of cluster values from 2 to 8.
As shown in the figure, the elbow point is achieved with 4 clusters, which is highlighted by the
function itself. The function also shows, via the green line, how much time was needed to fit the
model for each number of clusters.
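A sketch of this step, assuming the Yellowbrick library and the scaled feature matrix X from the earlier steps:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Fit KMeans for candidate k values from 2 to 8 and let the visualizer mark the elbow
visualizer = KElbowVisualizer(KMeans(random_state=42, n_init=10), k=(2, 9))
visualizer.fit(X)
visualizer.show()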
K-means clustering is one of the simplest unsupervised machine learning algorithms. It is an
iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that
each data point belongs to only one group of points with similar properties. The k-means clustering
algorithm performs the following tasks:
(i) Specify the number of clusters K.
(ii) Initialize centroids by first shuffling the dataset and then randomly selecting K data points for
the centroids without replacement.
(iii) Compute the sum of the squared distances between data points and all centroids.
(iv) Assign each data point to the closest cluster (centroid).
(v) Compute the centroids of the clusters by taking the average of all data points that belong
to each cluster.
(vi) Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters is not changing.
According to the Elbow method, we take K = 4 clusters to train the KMeans model. The derived
clusters are shown in the accompanying figure.
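A minimal sketch of training the final model, again assuming X is the scaled feature matrix:

from sklearn.cluster import KMeans

# Train the final model with the k suggested by the elbow method
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X)
print(df["cluster"].value_counts())   # size of each derived segment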
Prediction of average salary:
Linear regression is a machine learning algorithm based on supervised learning. It performs a
regression task: a regression model predicts a target value based on independent variables. It is
mostly used for finding the relationship between variables and for forecasting.
Here we use a linear regression model to predict the average salary of different job positions in
companies. X contains the independent variables and y is the dependent average salary that is to
be predicted. We train our model with a 4:6 split of the data, i.e. 40% of the data is used to train
the model. The LinearRegression().fit(X_train, y_train) command is used to fit the training set to
the model. The values of the intercept, the coefficients, and the cumulative distribution function
(CDF) are described in the figure.
After training the model, we test it on the remaining 60% of the data. The results are checked with
a scatter plot between the predicted values and the original test values of the dependent variable;
the points lie approximately along a straight line, as shown in the figure, and the density function
is also approximately normally distributed.
The metrics of the algorithm, the mean absolute error, the mean squared error, and the root mean
squared error, are reported in the accompanying figure.
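A sketch of this training and evaluation procedure, where X_features and y are assumed names for the independent variables and the average salary:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# 40% of the data is used for training, as described above
X_train, X_test, y_train, y_test = train_test_split(X_features, y, train_size=0.4, random_state=42)

lm = LinearRegression().fit(X_train, y_train)   # fit on the training split
predictions = lm.predict(X_test)                # predict on the remaining 60%

print("MAE :", metrics.mean_absolute_error(y_test, predictions))
print("MSE :", metrics.mean_squared_error(y_test, predictions))
print("RMSE:", metrics.mean_squared_error(y_test, predictions) ** 0.5)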

Profiling and describing the potential segments :


Sorting the companies by number of vacancies and salary range and calling head() lets us view the
top 20 companies hiring.
Pie chart :

Companies are hiring in different cities, and Bengaluru has more openings than the other cities.
Using matplotlib.pyplot, visualisation is done with code along the lines of the sketch below for:
1. Companies.
2. Cities.
3. Experience.
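A sketch of the pie chart for cities (the "Location" column name is an assumption); the companies and experience charts follow the same pattern:

import matplotlib.pyplot as plt

# Share of openings per city; Bengaluru dominates
city_counts = df["Location"].value_counts()
city_counts.plot.pie(autopct="%1.0f%%", figsize=(6, 6))
plt.ylabel("")
plt.title("Job openings by city")
plt.show()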
A new described potential segment is then created, which includes all segments available in the
dataset.
Describing the potential segments such as company size and salary:
df["Company Size"].fillna('0', inplace=True); df["Salary"].fillna('0', inplace=True)
Renaming the columns of the experience data frame to min and max so that they are easier to use for
the segmentation:
experience.rename(columns={0: "min_experience", 1: "max_experience"}, inplace=True)
The new segments for the table are then described with df.head().

Target Segments:
So from the analysis we can see that the optimum targeted segment should be belonging to the
following categories:
1) Geographic: There are many companies and vacancies in Bangalore/Bengaluru.

2) Demographic:
a. Role: With a large area of application, the Data Scientist role has high demand with a limited
skillset, which makes it optimal to target companies offering this role.
b. Company Size: From the correlation matrix, it is clear that companies with a small employee
strength require more skills. Hence, it is optimal to target companies with an employee strength
greater than 1000.
c. Eligibility criteria: As eligibility criteria are strongly negatively correlated with enrollment
type, maximum experience, and average experience, it is optimal to target companies which
require undergraduation as the criterion.

3) Psychographic:
a. Salary: From the above analysis, the companies offering a salary in the range of 7,00,000
rupees to 12,00,000 rupees can be targeted.

Finally, our target segment should consist of companies with more than 1000 employees, offering
jobs for candidates with an undergraduate degree, offering Data Scientist roles, offering a salary
in the range of 7 to 12 lakh rupees, and situated in Bangalore.

Customizing the Marketing Mix:

The marketing mix refers to the set of actions, or tactics, that a company uses to promote its brand
or product in the market. The 4Ps make up a typical marketing mix: Price, Product, Promotion, and
Place.
• Price: refers to the value that is put on a product. It depends on the segment targeted, the
ability of the companies to pay, the ability of customers to pay, supply and demand, and a host of
other direct and indirect factors.
• Product: refers to the product actually being sold; in this case, the service. The product
must deliver a minimum level of performance; otherwise even the best work on the other
elements of the marketing mix won't do any good.
• Place: refers to the point of sale. In every industry, catching the eye of the consumer and
making it easy for them to buy is the main aim of a good distribution or 'place' strategy.
Retailers pay a premium for the right location. In fact, the mantra of a successful retail
business is 'location, location, location'. In our case the location should be Bangalore.
• Promotion: refers to all the activities undertaken to make the product or service known
to the user and the trade. This can include advertising, word of mouth, press reports, incentives,
commissions, and awards to the trade. It can also include consumer schemes, direct
marketing, contests, and prizes.
All the elements of the marketing mix influence each other. Together they make up the business
plan for a company and, handled right, can give it great success. Getting the marketing mix right
requires a lot of understanding, market research, and consultation with several people, from users
to the trade to manufacturing and others.

GitHub repository link: https://github.com/kiranchouhan/FeynnLabs_Team_Project.git
