Abhijit-Bag

Customer Segmentation &
Opportunity Analysis
(using R programming language)
A Project work as per the Dissertation (MCS491) paper of final
semester to obtain the degree of
Master of Science (M.Sc.) in Computer Science
Submitted by
Mr. Abhijit Bag
(Reg.No:161541810001 & Roll No:15499016029)
On 7th May, 2018
under supervision of
Mr. Subhajit Adhikari
Dinabandhu Andrews Institute of Technology & Management

(Maulana Abul Kalam Azad University of Technology)
Customer Segmentation & Opportunity Analysis 2018
Declaration Of Originality And Compliance
Of Academic Ethics
I hereby declare that this thesis contains original research work done
by me, as part of my Master of Computer Science studies. All information
in this document has been obtained and presented in accordance
with academic rules and ethical conduct.
I also declare that, as required by these rules and conduct, I have
fully cited and referenced all the materials used.
Name:- Abhijit Bag

Roll No:- 15499016029
Reg. No:- 161541810001
Project Title :- “Customer Segmentation & Opportunity Analysis”
…………………………………….
Signature & Date
Acknowledgement
I would like to express my sincere, heartfelt gratitude to my
respected guide, Assistant Professor Mr. Subhajit Adhikari,
Department of Computer Science, DAITM under MAKAUT, for his
unfailing guidance, prolific encouragement, constructive suggestions
and continuous involvement during each and every phase of this
work.
I am also thankful to Dr. Sanjukta Nandy, Principal, DAITM, and Ms.
Paramita Roy, HOD, Department of Computer Science, and all
other faculty members and staff for providing me with all the facilities
and support needed to complete these activities.
I would like to express my gratitude to my family and parents for their
belief, mental support and guidance.
Last but not least, I would like to thank all my classmates of the
M.Sc. (CS) batch 2016-18 for their friendly co-operation and support.
Certificate of Approval
This is to certify that the work entitled “Customer Segmentation &
Opportunity Analysis” has been satisfactorily completed by Abhijit
Bag (Roll No-15499016029; Reg. No-161541810001). It is a
bona fide work carried out under my supervision at DAITM, Kolkata
for fulfilment of MSc in Computer Science during the academic year
2016-2018.
It is understood that by this approval the undersigned do not
necessarily endorse or approve any statement made, opinion
expressed or conclusion drawn therein but approve this project only
for the purpose for which it has been submitted.
…………………………………………….
Signature of examiner
Date:
To Whom It May Concern
This is to certify that the work entitled “Customer Segmentation &
Opportunity Analysis” has been satisfactorily completed by Mr.
Abhijit Bag (Roll No-15499016029, Reg. No-161541810001). It is a
bona fide work carried out under my supervision at DAITM Kolkata for
partial fulfilment of MSc in Computer Science during the academic
year 2017-18.
……………………………………………………..
Project Guide
Mr. Subhajit Adhikari
Assistant Professor,
Dinabandhu Andrews Institute Of Technology & Management,
Kolkata
……………………………………………………..
Forwarded by
Ms. Paramita Ray
HOD, Dept. of Computer Science,
Dinabandhu Andrews Institute Of Technology & Management
Kolkata.
Customer Segmentation & Why it Matters

At its most basic, customer segmentation (also known as market
segmentation) is the division of potential customers in a given market into
discrete groups. That division is based on variables and descriptors of those
customers, who must be similar enough in their:
1. Needs, i.e., so that a single whole product can satisfy them.
2. Buying characteristics, i.e., responses to messaging, marketing
channels, and sales channels, so that a single go-to-market approach
can be used to sell to them competitively and economically.
More details are given below:
There are three main approaches to market segmentation:
• A priori segmentation, the simplest approach, uses a classification
scheme based on publicly available characteristics (such as
industry and company size) to create distinct groups of customers
within a market. However, a priori market segmentation may not
always be valid, since companies in the same industry and of the
same size may have very different needs.
• Needs-based segmentation is based on differentiated, validated
drivers (needs) that customers express for a specific product or
service being offered. The needs are discovered and verified through
primary market research, and segments are demarcated based on
those different needs rather than characteristics such as industry or
company size.
• Value-based segmentation differentiates customers by their
economic value, grouping customers with the same value level into
individual segments that can be distinctly targeted.
Benefits of Customer Segmentation

At the expansion stage, executing a marketing strategy without any
knowledge of how your target market is segmented is akin to firing shots at a
target 100 feet away — while blindfolded. The likelihood of hitting the target is
a matter of luck more than anything else.
Without a deep understanding of how a company’s best current customers
are segmented, a business often lacks the market focus needed to allocate
and spend its precious human and capital resources efficiently. Furthermore,
a lack of best current customer segment focus can cause diffused go-to-
market and product development strategies that hamper a company’s ability
to fully engage with its target segments. Together, all of those factors can
ultimately impede a company’s growth.
If best current customer segmentation is done right, however, the business
benefits are numerous. For example, a best current customer segmentation
exercise can tangibly impact your operating results by:
1. Improving your whole product: Having a clear idea of who wants to buy
your product and what they need it for will help you differentiate your
company as the best solution for their individual needs. The result will be
increased satisfaction and better performance against competitors. The
benefits also extend beyond your core product offering, since any insights
into your best customers will allow your organization to offer better
customer support, professional services, and any other offerings that
make up their whole product experience.
2. Focusing your marketing message: In parallel with improvements to the
product, conducting a customer segmentation project can help you
develop more focused marketing messages that are customized to each of
your best segments, resulting in higher quality inbound interest in your
product.
3. Allowing your sales organization to pursue higher percentage
opportunities: By spending less time on less lucrative opportunities and
more on your most successful segments, your sales team will be able to
increase its win rate, cover more ground, and ultimately increase
revenues.
4. Getting higher quality revenues: Not all revenue dollars are created
equal. Sales into the wrong segment can be more expensive to sell and
maintain, and may have a higher churn rate or lower upsell potential after
the initial purchase has been made. Staying away from these types of
customers and focusing on better ones will increase your margins and
promote the stability of your customer base.
Conducting best current customer segmentation research can have
numerous other ancillary benefits, of course, but this guide will focus primarily
on how it can impact the four cited above. The bottom line is that if you are
able to sell more of your product to your most profitable customers, then you
will be able to scale the business more efficiently and ensure that everything
you do — from lead generation to new product development — revolves
around the right things.
Customer Segmentation Using Cluster Analysis
In brief, cluster analysis uses a mathematical model to discover groups of
similar customers based on finding the smallest variations among customers
within each group. The process is not based on any predetermined
thresholds or rules (as are most simple segmentation methods), but rather
the data itself generates the customer prototypes that inherently exist within
the population of customers.
The two main advantages of cluster analysis over simple threshold/rule-based
segmentation are:
• practicality – it would be practically impossible to use predetermined
rules to segment customers over many dimensions, and
• homogeneity – variances within each resulting group are very small in
cluster analysis, whereas rule-based segmentation typically groups
customers who are actually very different from one another.
The customer segmentation process can be performed with various
clustering algorithms. This project focuses on k-means clustering in R. While the
algorithm is quite simple to implement, half the battle is getting the data into
the correct format and interpreting the results. The steps are: formatting the
order data, running the kmeans() function to cluster the data for several
candidate values of k, using silhouette() from the cluster package to
determine the optimal number of clusters k, and interpreting the results by
inspection of the k-means centroids.
How K-Means Algorithm Works

The k-means clustering algorithm works by finding like groups based on
Euclidean distance, a measure of distance or similarity. The practitioner
selects the number of groups k, and the algorithm finds the best centroids for
the k groups. The practitioner can then use those groups to determine which
factors the members of each group have in common. For customers, these
would be their buying preferences.
K-Means clustering intends to partition n objects into k clusters in which each
object belongs to the cluster with the nearest mean. This method produces
exactly k different clusters of greatest possible distinction. The best number of
clusters k leading to the greatest separation (distance) is not known a priori
and must be computed from the data. The objective of K-Means clustering is
to minimize the total intra-cluster variance, also called the squared error function.
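In its usual textbook form (an assumption here, with notation chosen for illustration: x_i is an object, S_j is the set of objects assigned to cluster j, and c_j is the centroid of cluster j), the squared error function can be written as:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - c_j \rVert^{2},
\qquad
c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i
```

Each inner sum measures how far the members of one cluster lie from their own centroid, so a smaller J means tighter clusters.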
Algorithm
1. Choose the number of clusters k in advance.
2. Select k points at random as the initial cluster centres.
3. Assign objects to their closest cluster centre according to the Euclidean
distance function.
4. Recalculate the centroid (mean) of all objects in each cluster.
5. Repeat steps 3 and 4 until the same points are assigned to each
cluster in consecutive rounds.
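The steps above can be sketched in a few lines of base R. This is a minimal illustration only, not the implementation behind R's built-in kmeans(); the function name simple_kmeans and the toy points are hypothetical.

```r
# Minimal k-means sketch following the listed steps (illustrative only).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Select k distinct rows at random as the initial cluster centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every object to every centre (n x k)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centres[j, ])^2))
    # Assign each object to its closest centre
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when the assignments are the same in consecutive rounds
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recalculate each centre as the mean of its members
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centres)
}

# Toy data: two well-separated groups of ten points each (made up)
set.seed(42)
pts <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

In practice the built-in kmeans() function is preferable, since it offers better algorithms and multiple random starts.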
K-Means is a relatively efficient method. However, the number of clusters
must be specified in advance, and the final result is sensitive to
initialization and often terminates at a local optimum. Unfortunately there is no
general theoretical method to find the optimal number of clusters. A practical
approach is to compare the outcomes of multiple runs with different k and
choose the best one based on a predefined criterion. In general, a large k
decreases the error but increases the risk of overfitting.
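As a rough illustration of comparing runs with different k, the sketch below fits kmeans() for several values of k on made-up data and records the total within-cluster sum of squares. The data and the range of k are assumptions for the example; nstart requests several random initializations per k to reduce the chance of stopping at a poor local optimum.

```r
# Made-up data: three groups of 30 two-dimensional points
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

ks <- 2:6
# Total within-cluster sum of squares for each candidate k,
# keeping the best of 25 random starts per k
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
data.frame(k = ks, total_within_ss = round(wss, 1))
```

The error keeps shrinking as k grows, which is why a separate criterion (such as the silhouette width used later) is needed to pick k rather than the error alone.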
Getting Started With Data

To start, we’ll need some orders to evaluate. If you’d like to follow along,
we will be using the bikes data set, which has already been retrieved. We’ll
load the data first using the xlsx package for reading Excel files.
Next, we’ll get the data into a usable format, typical of an SQL query from
an ERP database. The following code merges the customers, products and
orders data frames using the dplyr package.
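A minimal sketch of such a merge is given here, using tiny stand-in data frames so that it runs on its own. The column names (bikeshop.id, product.id, customer.id and so on) and the result name orders.extended are assumptions for illustration, and base R's merge() is used in place of dplyr joins to keep the example dependency-free.

```r
# Hypothetical stand-ins for the customers, products and orders tables
customers <- data.frame(bikeshop.id = 1:3,
                        bikeshop.name = c("ShopA", "ShopB", "ShopC"))
products <- data.frame(product.id = 1:4,
                       model = c("M1", "M2", "R1", "R2"),
                       category1 = c("Mountain", "Mountain", "Road", "Road"),
                       price = c(6000, 1200, 5500, 900))
orders <- data.frame(order.id = c(1, 1, 2, 3, 3),
                     customer.id = c(1, 1, 2, 3, 3),
                     product.id = c(1, 2, 3, 4, 1),
                     quantity = c(2, 5, 1, 10, 1))

# Attach customer details, then product details, to every order line
orders.extended <- merge(orders, customers,
                         by.x = "customer.id", by.y = "bikeshop.id")
orders.extended <- merge(orders.extended, products, by = "product.id")

# Extended price = unit price x quantity for each order line
orders.extended$price.extended <- orders.extended$price * orders.extended$quantity
orders.extended
```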
Developing A Hypothesis For Customer Trends
Developing a hypothesis is necessary as the hypothesis will guide our
decisions on how to formulate the data in such a way to cluster customers.
For the Cannondale orders, our hypothesis is that bike shops purchase
Cannondale bike models based on features such as Mountain or Road Bikes
and price tier (high/premium or low/affordable). Although we will use bike
model to cluster on, the bike model features (e.g. price, category, etc) will be
used for assessing the preferences of the customer clusters (more on this
later).
To start, we’ll need a unit of measure to cluster on. We can select quantity
purchased or total value of purchases. We’ll select quantity purchased
because total value can be skewed by the bike unit price. For example, a
premium bike can be sold for 10X more than an affordable bike, which can
mask the quantity buying habits.
Manipulating The Data Frame

Next, we need a data manipulation plan of attack to implement clustering on
our data. We’ll use our hypothesis to guide us. First, we’ll need to get the
data frame into a format conducive to clustering bike models to customer IDs.
Second, we’ll need to convert price into a categorical variable
representing high/premium and low/affordable. Last, we’ll need to scale the
bike model quantities purchased by customer so the k-means algorithm
weights the purchases of each customer evenly.
We’ll tackle formatting the data frame for clustering first. We need to spread
the customers by quantity of bike models purchased.
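One way to sketch this spreading step in base R is with xtabs(), which sums the quantity for every model and customer pair. The order lines below are made up, and the names orders.extended and customerTrends are used only for illustration.

```r
# Made-up order lines: one row per model purchased by a customer
orders.extended <- data.frame(
  bikeshop.name = c("ShopA", "ShopA", "ShopB", "ShopC", "ShopC", "ShopA"),
  model         = c("M1",    "M2",    "R1",    "R2",    "M1",    "M1"),
  quantity      = c(2,       5,       1,       10,      1,       3))

# Rows = bike models, columns = customers, cells = total quantity purchased
customerTrends <- xtabs(quantity ~ model + bikeshop.name, data = orders.extended)
customerTrends <- as.data.frame.matrix(customerTrends)
customerTrends
```

Pairs that never occur (for example ShopB and M1) are filled with zero, which is what the clustering step needs.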
Next, we need to convert the unit price to categorical high/low variables. One
way to do this is with the cut2() function from the Hmisc package. We’ll
segment the price into high/low by median price. Selecting g = 2 divides the
unit prices into two halves using the median as the split point.
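The median split can be illustrated without extra packages. The sketch below is a base R stand-in for the cut2() call with g = 2, not the call itself; the prices and the column name price.tier are hypothetical.

```r
# Made-up unit prices for four bike models
products <- data.frame(model = c("M1", "M2", "R1", "R2"),
                       price = c(6000, 1200, 5500, 900))

# Split at the median: prices at or above it are "high", the rest "low"
price.median <- median(products$price)   # 3350 for these four prices
products$price.tier <- ifelse(products$price >= price.median, "high", "low")
products
```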
Last, we need to scale the quantity data. Unadjusted quantities present a
problem to the k-means algorithm. Some customers are larger than others,
meaning they purchase higher volumes. Fortunately, we can resolve this
issue by converting the customer order quantities to proportions of the total
bikes purchased by a customer. The prop.table() matrix function provides a
convenient way to do this. An alternative is to use the scale() function, which
centres and scales each column of the data. However, this is less
interpretable than the proportion format.
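A small worked example shows why proportions help. The two hypothetical shops below buy the same mix of models, but one buys ten times as much; after prop.table() with margin = 2 (column-wise), their columns become identical.

```r
# Rows = models, columns = customers (quantities are made up)
qty <- matrix(c(2, 8, 0, 10,
                20, 80, 0, 100),
              nrow = 4,
              dimnames = list(c("M1", "M2", "R1", "R2"),
                              c("SmallShop", "LargeShop")))

# Convert each customer's column to proportions of that customer's total
props <- prop.table(qty, margin = 2)
props
```

Each column now sums to 1, so the distance between customers reflects what they buy rather than how much they buy.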
The final data frame (first five rows shown below) is now ready for clustering.
K-Means Clustering
Now we are ready to perform k-means clustering to segment our customer
base. Think of clusters as groups in the customer base. Prior to starting we
will need to choose the number of customer groups, k, that are to be
detected. The best way to do this is to think about the customer base and our
hypothesis. We believe that there are most likely to be at least four customer
groups because of mountain bike vs road bike and premium vs affordable
preferences. We also believe there could be more, as some customers may
not care about price but may still prefer a specific bike category. However,
we’ll limit the clusters to eight, as more is likely to overfit the segments.
Running The K-Means Algorithm on the dataset

The code below does the following:
1. Converts the customerTrends data frame into kmeansDat.t. The
model and features are dropped so the customer columns are all that
are left. The data frame is transposed to have the customers as rows
and models as columns. The kmeans() function requires this format.
2. Runs the kmeans() function to cluster the customers into
segments. We set minClust = 4 and maxClust = 8. From our
hypothesis, we expect there to be at least four and at most six groups
of customers. This is because customer preference is expected to vary
by price (high/low) and category1 (mountain vs road). There may be
other groupings as well. Going beyond eight segments may overfit the
segments.
3. Uses the silhouette() function to obtain silhouette
widths. Silhouette is a technique in clustering that validates the best
cluster groups. The silhouette() function from the cluster package
allows us to get the average width of silhouettes, which will be used to
programmatically determine the optimal number of clusters.
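The three steps can be sketched on synthetic data as follows. The 30 hypothetical customers below are generated from three made-up preference profiles, and the range of k is narrowed to 2 through 6 to suit this toy data (the analysis itself uses minClust = 4 and maxClust = 8); the names avg.width and best.k are illustrative.

```r
library(cluster)  # provides silhouette()

set.seed(11)
# Three made-up purchase profiles over six bike models (each row sums to 1)
profiles <- rbind(c(0.5, 0.4, 0.1, 0.0, 0.0, 0.0),
                  c(0.0, 0.0, 0.1, 0.5, 0.4, 0.0),
                  c(0.1, 0.0, 0.0, 0.0, 0.4, 0.5))
# 30 customers as rows, models as columns: ten noisy copies of each profile
kmeansDat.t <- profiles[rep(1:3, each = 10), ] +
  matrix(runif(180, min = 0, max = 0.05), nrow = 30)
kmeansDat.t <- kmeansDat.t / rowSums(kmeansDat.t)  # back to proportions

minClust <- 2
maxClust <- 6
d <- dist(kmeansDat.t)  # Euclidean distances between customers

# Average silhouette width for each candidate number of clusters
avg.width <- sapply(minClust:maxClust, function(k) {
  km <- kmeans(kmeansDat.t, centers = k, nstart = 50)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the largest average silhouette width is taken as the best choice
best.k <- (minClust:maxClust)[which.max(avg.width)]
best.k
```

Because the toy data were built from three profiles, the largest average width is expected at k = 3; on real order data the peak has to be read from the results.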
Next, we plot the silhouette average widths for the choice of clusters. The
best choice of k is the one with the largest silhouette average width, which
turns out to be 5 clusters.
Which customers are in each segment?

Now that we have clustered the data, we can inspect the groups to find out
which customers are grouped together. The code below groups the customer
names by cluster X1 through X5.
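A minimal sketch of this grouping is shown here on four hypothetical shops and two clusters (the analysis itself has five). It relies on kmeans() carrying the row names of its input over to the cluster vector.

```r
set.seed(3)
# Made-up purchase proportions: rows = customers, columns = two bike models
shops <- rbind(ShopA = c(0.90, 0.10),
               ShopB = c(0.85, 0.15),
               ShopC = c(0.10, 0.90),
               ShopD = c(0.20, 0.80))
km <- kmeans(shops, centers = 2, nstart = 10)

# One list element per cluster, holding the names of its customers
segments <- split(names(km$cluster), km$cluster)
segments
```

The cluster numbers themselves are arbitrary labels; only the membership of each group is meaningful.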
Determining The Preferences Of The Customer Segments

The easiest way to determine the customer preferences is by inspection of
factors related to the model (e.g. price point, category of bike, etc). Advanced
algorithms to classify the groups can be used if there are many factors, but
typically this is not necessary as the trends tend to jump out. The code below
attaches the k-means centroids to the bike models and categories for trend
inspection.
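The idea can be sketched as follows on made-up data: the centroid matrix returned by kmeans() is transposed so that models become rows, and is then bound to the model features. The names models, kmeansDat.t and customerSegmentCntrs, and the X1, X2 column labels, are assumptions for illustration.

```r
set.seed(5)
# Hypothetical model features
models <- data.frame(model      = c("M1", "M2", "R1", "R2"),
                     category1  = c("Mountain", "Mountain", "Road", "Road"),
                     price.tier = c("high", "low", "high", "low"))

# Rows = customers, columns = models (proportion of each customer's purchases)
kmeansDat.t <- rbind(ShopA = c(0.7, 0.2, 0.1, 0.0),
                     ShopB = c(0.6, 0.3, 0.1, 0.0),
                     ShopC = c(0.0, 0.1, 0.2, 0.7),
                     ShopD = c(0.1, 0.0, 0.3, 0.6))
colnames(kmeansDat.t) <- models$model
km <- kmeans(kmeansDat.t, centers = 2, nstart = 10)

# km$centers has one row per cluster; transpose so rows = models
centroids <- as.data.frame(t(km$centers))
names(centroids) <- paste0("X", seq_len(ncol(centroids)))

# Model features side by side with each cluster's centroid value
customerSegmentCntrs <- cbind(models, centroids)
customerSegmentCntrs
```

Each centroid column shows the average share of a cluster's purchases that goes to each model, which is what makes the preferences readable.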
Now, on to cluster inspection.
CLUSTER 1
We’ll order by cluster 1’s top ten bike models in descending order. We can
quickly see that the top 10 models purchased are predominantly high-end
and mountain. All but one model has a carbon frame.
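The ordering step itself is short. The sketch below uses a hypothetical four-row table of centroid values (not the results described above) to show the pattern: sort by the cluster's column in descending order and keep the first ten rows.

```r
# Hypothetical centroid table: one row per model, one column per cluster
customerSegmentCntrs <- data.frame(
  model      = c("M1", "M2", "R1", "R2"),
  category1  = c("Mountain", "Mountain", "Road", "Road"),
  price.tier = c("high", "low", "high", "low"),
  X1         = c(0.65, 0.25, 0.10, 0.00),
  X2         = c(0.05, 0.05, 0.25, 0.65))

# Top models for cluster 1: sort by X1, largest first, keep at most ten rows
top.cluster1 <- head(
  customerSegmentCntrs[order(customerSegmentCntrs$X1, decreasing = TRUE), ],
  10)
top.cluster1[, c("model", "category1", "price.tier", "X1")]
```

Swapping X1 for another cluster's column gives the same view for that cluster.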
CLUSTER 2
Next, we’ll inspect cluster 2. We can see that the top models are all low-
end/affordable models. There’s a mix of road and mountain for the primary
category and a mix of frame material as well.
CLUSTERS 3, 4 & 5
Inspecting clusters 3, 4 and 5 produces interesting results. For brevity, we
won’t display the tables. Here are the results:
• Cluster 3: Tends to prefer road bikes that are low-end.
• Cluster 4: Is very similar to Cluster 2 with the majority of bikes in the
low-end price range.
• Cluster 5: Tends to prefer road bikes that are high-end.
Reviewing Results
Once the clustering is finished, it’s a good idea to take a step back and
review what the algorithm is saying. For our analysis, we got clear trends for
four of five groups, but two groups (clusters 2 and 4) are very similar.
Because of this, it may make sense to combine these two groups or to switch
from the k = 5 results to the k = 4 results.
Conclusion & Future Scope

While this guide provides a step-by-step process for identifying, prioritizing,
and targeting your best current customer segments, simply following it does
not guarantee success. To be effective, you must prepare and plan for the
various challenges and hurdles that each step may present, and always
make sure to adapt your process to any new information or feedback that
might change its output.
Additionally, you cannot force this process on your business. If the key
stakeholders that will be impacted by the best current customer
segmentation process do not fully buy in, then the outputs produced from it
will be relatively meaningless.
If you properly manage the best current customer segmentation process,
however, the impact it can have on every part of your organization (sales,
marketing, product development, customer service, etc.) is immense. Your
business will possess stronger customer focus and market clarity, allowing it
to scale in a far more predictable and efficient manner.
Ultimately, that means no longer needing to take on every customer that is
willing to pay for your product or service, which will allow you to instead home
in on a specific subset of customers that present the most profitable
opportunities and efficient use of resources. That is critical for every
business, of course, but at the expansion stage, it can often be the difference
between incredible success and certain failure.
REFERENCES
• Dr. R. Gardener, “The Essential R Reference” (2014)
• Concepts of customer segmentation: http://www.business-science.io
• https://labs.openviewpartners.com/customer-segmentation/
• Source data related to our analysis has been collected from
https://github.com/mdancho84/orderSimulatoR/tree/master/data
• https://www.kaggle.com/
• https://www.r-project.org/
• https://www.rstudio.com/