
Machine Learning and Business Analytics Surprise Quiz


K Guhan
H20146

Overview of Business Problem


We started today's session with a data set of sales transactions of a firm. Each line item records an individual item procured by a customer against an invoice, across a time period of one year. There are several possible applications of this data, such as country-wise sales analytics or analysis of peak business times. However, we decided to use it to understand the types of customers we have (customer segmentation) through the RFM model (recency, frequency and monetary value).

Data Transformation
To do this, we had to transform the transactional sales data into customer-wise sales data, for which we used basic Excel functions and pivot tables. For recency, we considered the latest month in which a customer transacted as 0 and the earliest month as 12. For frequency, we counted the number of distinct invoices against the customer over the data range. For monetary value, we took the total amount of money spent by the customer over the data range. We have now converted a sheet with 100,000 line items of individual sales transactions into one with 1158 line items of consolidated customer data.
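
Although this transformation was done in Excel, an equivalent sketch in R (assuming the raw transactions sit in a data frame called transactions with hypothetical columns CustomerID, InvoiceNo, InvoiceMonth and Amount) would look roughly like this:

# Sketch of the RFM transformation; the actual work was done in Excel,
# and the column names below are hypothetical.

# Monetary: total amount spent per customer over the data range
monetary <- aggregate(Amount ~ CustomerID, data = transactions, FUN = sum)
names(monetary)[2] <- "Monetary"

# Frequency: number of distinct invoices per customer
frequency <- aggregate(InvoiceNo ~ CustomerID, data = transactions,
                       FUN = function(x) length(unique(x)))
names(frequency)[2] <- "Frequency"

# Recency: months since the customer's latest transaction (latest month = 0)
recency <- aggregate(InvoiceMonth ~ CustomerID, data = transactions, FUN = max)
recency$Recency <- max(transactions$InvoiceMonth) - recency$InvoiceMonth
recency$InvoiceMonth <- NULL

# Combine into one customer-wise RFM table (one row per customer)
rfm <- merge(merge(recency, frequency, by = "CustomerID"), monetary, by = "CustomerID")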

Outlier Analysis
The standard method of outlier detection is to check whether values lie within the range (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR) and remove those that do not. However, there is no one way to remove outliers from a data set; it depends heavily on the business context. In this case, for the 1158 items of customer data, the summary of quartile values was as follows.

Quartile            Frequency    Monetary Value
Q1                  1            329.6325
Q3                  5            1840.6
IQR                 4            1510.968
Acceptance Range    -5 to 11     -1936.82 to 4107.051
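
A sketch of this acceptance-range calculation in R (assuming the consolidated customer data is in a data frame called customers with hypothetical columns Frequency and Monetary):

# IQR-based acceptance range: (Q1 - 1.5*IQR) to (Q3 + 1.5*IQR)
iqr_bounds <- function(x) {
  q1  <- unname(quantile(x, 0.25))
  q3  <- unname(quantile(x, 0.75))
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

freq_bounds <- iqr_bounds(customers$Frequency)   # roughly -5 to 11
mv_bounds   <- iqr_bounds(customers$Monetary)    # roughly -1936.82 to 4107.05

# Count how many line items fall outside the acceptance range
sum(customers$Frequency < freq_bounds["lower"] | customers$Frequency > freq_bounds["upper"])
sum(customers$Monetary  < mv_bounds["lower"]   | customers$Monetary  > mv_bounds["upper"])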

For frequency, we found that 109 line items (9.4%) of the data lay outside the acceptance range. Similarly, for monetary value, 114 line items (9.8%) lay outside the acceptance range. We cannot treat all of these as outliers, since outliers should not make up more than about 1% of a data set, so we cannot eliminate them based on the IQR logic alone. Instead, we eliminate outliers graphically, using the histogram function in R, as sketched below. On this basis we eliminate 2 line items with abnormally high frequency and 2 line items with abnormally high monetary value.
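
As a sketch, the graphical check in R can be as simple as plotting histograms of the two variables and reading off the abnormally high values (same hypothetical customers data frame as above):

# Histograms used to spot abnormally high frequency and monetary values;
# the few extreme points on the right tail are the ones removed.
hist(customers$Frequency, breaks = 50, main = "Frequency", xlab = "Number of invoices")
hist(customers$Monetary,  breaks = 50, main = "Monetary value", xlab = "Total spend")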

From the business context, we also eliminate returns (negative sales), as they add nothing to our model of customer segmentation. We have eliminated 10 line items from the data set for this reason.

We now save the data in a new, separate .csv file to import into R.
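
A sketch of these final clean-up and export steps, assuming returns appear as negative monetary values and using an illustrative file name:

# Drop returns (negative sales), which do not help in segmenting customers
customers <- customers[customers$Monetary >= 0, ]

# Save the cleaned customer-wise data to a .csv file for the clustering step in R
write.csv(customers, "customer_rfm_clean.csv", row.names = FALSE)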

Clustering using R
Identifying optimal number of clusters
We perform cluster analysis in R to identify the different segments of customers, using the kmeans function.
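
A minimal sketch of this step, assuming the cleaned customer data is read back into a data frame with columns Recency, Frequency and Monetary (hypothetical names, illustrative file name):

# Read the cleaned customer data
data <- read.csv("customer_rfm_clean.csv")
rfm  <- data[, c("Recency", "Frequency", "Monetary")]

# Total within-cluster sum of squares (error) for k = 1 to 10 clusters;
# nstart gives multiple random starts for more stable results
nocl <- numeric(10)
for (k in 1:10) {
  km      <- kmeans(rfm, centers = k, nstart = 25)
  nocl[k] <- km$tot.withinss
}

# Elbow curve: error versus number of clusters
plot(1:10, nocl, type = "b", xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")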

We use km$tot.withinss to record the error (total within-cluster sum of squares) for each candidate number of clusters and store the values in a vector called nocl. We then plot this curve and apply the elbow technique to identify the optimal number of clusters. Based on the errors, we identify n = 4 as the optimal number of clusters for our customer data.

Running kmeans and appending cluster numbers to data set


Now, with k = 4, we run kmeans to develop the clusters for the customer segmentation. Each cluster is a type of customer whose characteristics we have to identify.
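
A sketch of this step, under the same assumptions as above (the random seed is illustrative, added only for reproducibility):

# Run k-means with the optimal number of clusters, k = 4
set.seed(123)
km <- kmeans(rfm, centers = 4, nstart = 25)

# Array of cluster assignments, bound to the customer data as "newdata"
clus    <- km$cluster
newdata <- cbind(data, clus)

# Sizes of the four clusters
km$size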

We create an array of the cluster assignments named clus and bind it to the customer data set using the cbind function; we call the resulting object "newdata". We have now assigned a cluster number to each line item of the customer data, and we check the size of each cluster using km$size. The sizes of the 4 clusters are given in the cluster analysis below.

Individual Cluster Analysis


To do individual cluster analysis, we have to compare the summary table of the entire data set with
the summary tables of the individual clusters generated.

The summary of the whole data set gives overall means of approximately 3.11 for recency, 4.96 for frequency and 1784 for monetary value; these serve as the baseline for comparison.

Note that a high value of recency indicates a very old customer who has not visited in the recent past.

To generate the summary of individual clusters, we run R code on the "newdata" object.
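
A sketch of these summaries, using the newdata object built earlier:

# Summary of the whole data set (baseline means for recency, frequency, monetary)
summary(newdata)

# Summary of an individual cluster, here cluster 1; repeat for clusters 2 to 4
summary(newdata[newdata$clus == 1, ])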

Cluster 1: (Cluster Size = 12)

Recency: The cluster mean is very low (0.083) compared to the overall mean (3.11) => Very Recent Customer

Frequency: The cluster mean is very high (48.83) compared to the overall mean (4.96) => Very High Volume

Monetary: The cluster mean is very high (22617) compared to the overall mean (1784) => High Monetary Value
Cluster 2: (Cluster Size = 756)

Recency: The cluster mean is very low (1.642) compared to the overall mean (3.11) => Recent Customer

Frequency: The cluster mean is slightly low (3.734) compared to the overall mean (4.96) => Average Volume Ordering

Monetary: The cluster mean is low (1221.1) compared to the overall mean (1784) => Low Monetary Value

Cluster 3: (Cluster Size = 271)

Recency: The cluster mean is very high (8.244) compared to the overall mean (3.11) => Very Old Customer

Frequency: The cluster mean is low (1.638) compared to the overall mean (4.96) => Low Volume Ordering

Monetary: The cluster mean is very low (466.7) compared to the overall mean (1784) => Low Monetary Value

Cluster 4: (Cluster Size = 105)

Recency: The cluster mean is very low (0.781) compared to the overall mean (3.11) => Recent Customer

Frequency: The cluster mean is high (17.37) compared to the overall mean (4.96) => High Volume Ordering

Monetary: The cluster mean is high (6862) compared to the overall mean (1784) => High Monetary Value

Customer Segmentation Summary

Customer Type    No. of Customers    Recency        Frequency      Monetary
Type 1           12                  Very Recent    High Volume    High Value
Type 2           756                 Recent         Average        Low Value
Type 3           271                 Very Old       Low Volume     Low Value
Type 4           105                 Recent         High Volume    High Value
