
CUSTOMER ANALYTICS RETAIL PROJECT

Anirudh Rao, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg
Anusha Krishna Prasad, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg
Arcchit Mittal, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg
Nandeep NV, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg
Premnath Prathiban, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg
Raviteja Kalidindi, Student, Singapore Management University, anirudhnr.2015@mitb.smu.edu.sg

ABSTRACT

HE Pte. Ltd. is a retail chain specializing in home entertainment, game consoles and software, Hi-Fi, MP3 players and telephony. In 2000, the company implemented a customer relationship management (CRM) system to capture information about its customers and their transaction records. The system mainly serves as a database management system, and its applications are mainly confined to simple tabular reporting of monthly and annual sales.

Recently, Mr. David Chou, the CEO of the company, attended an industry seminar organized by the MITB (SSA) programme of the School of Information Systems, Singapore Management University. At the seminar, Mr. Chou learned that the data captured in the CRM can provide actionable insights for the company to gain competitive advantage through appropriate customer analytics techniques. The analysts assigned to this project will provide an analysis of the different insights that can be achieved through data analytics. As analysts, we will apply customer analytics techniques such as RFM analysis, Market Basket Analysis (MBA), association analysis and product preference analysis.

1. MASTER DATA PREPARATION

To start, 4 datasets are provided for the POC project. The Customer dataset consists of approximately 140,000 customer records. The Transaction dataset captures 350,000 transactional records from the year 2001. These datasets are in comma-delimited text format.

Table 1: Customers File Data Fields

This is one of the most important entities of this retail dataset, as it describes the characteristics of each customer. The attributes in this dataset include ID, DOB, Registration Date and Gender. As part of our ETL logic we derived a few more attributes: Age (derived from the DOB) and the number of years the customer has been with the organization (derived from the Registration Date).

This dataset contains a total of 140,000 customer records.

Table 2: Customer Transactions Data Fields

This dataset comprises OLTP (Online Transaction Processing) data, which is recorded as part of the transactions made by the customers.
This data does not provide much useful information on its own, as it is not in normalized form, so it should be combined with other data to provide valuable information. We can also extract some additional hidden attributes from it.

Firstly, we can find the day on which each transaction was performed from the timestamp attribute. This helps us find the crowd distribution by day of the week.

Secondly, each transaction ID is associated with one customer, which lets us classify customers into categories based on the time of their transactions. In this case the categories are as follows:

9:00 a.m. - 11:59 a.m. - Morning Stars
12:00 noon - 3:00 p.m. - Afternoon Avengers
3:00 p.m. - 6:00 p.m. - Evening Evangeline's

This table covers only ~60K customers out of the whole population of 140K customers. This tells us that some 80K customers are dormant and are potential targets. Many strategic decisions can be made to make those customers active, which in turn brings revenue to the organization.

Table 3: Product Table Data Fields

This entity comprises information about the products and their characteristics, such as ID, category and unit price. As this is descriptive data, not much transformation was needed.

Table 4: Customer Campaign Data Fields

This data consists of only 2 attributes: the Card-ID and the campaign response. On its own the dataset lacks richness, but there is high potential in combining it with the other datasets to bring out hidden insights.

2. FACT TABLE

To draw insights from this retail dataset, we need to make sure that the dataset is organized and customer focused. So business intelligence concepts such as the star schema were applied to the dataset.

A star schema is a schema in which a centralized fact table, containing the factual data, is connected to dimension tables containing the descriptive data. The table was also aggregated from the customer perspective. After implementing this fact table, each record in the entity comprises a detailed representation of a customer who has made transactions.

As part of this process we implemented the following:

● Each record represents one single customer, with demographics and purchasing patterns.
● We derived the age, gender and the number of years as a customer from the Customer table.
● To find their purchasing pattern, we derived the following from the transaction dataset:
○ Payment Distribution - The number of times the customer has used each mode of payment.
○ Favorite Day - The top 3 days on which the customer visits.
○ Favorite Category - One of the important columns, which can help the organization implement upsell and cross-sell strategies. It gives the customer's top 3 favorite categories, that is, those bought the most number of times, arranged by count.
○ Classifier - We classified each customer at the transaction level, but that does not provide much valuable information on its own, so we aggregated to the customer level and kept only the classifier category with the largest value.
○ We also aggregated the total number of items and the total amount spent by each customer.

3. RFM ANALYSIS

RFM analysis is a popular customer segmentation technique that can help retailers maximize the return on their marketing investments. When planning marketing spend or formulating a new promotion, retail marketers need to be careful about how they segment and target their customers.

Under RFM analysis each customer is ranked on three factors, namely Recency, Frequency and Monetary value. It helps companies identify customers who are more likely to respond to a new offer.

3.1 Analysis Method for RFM Analysis

The RFM values had to be calculated per customer, as we did our analysis at the customer level.

For recency, we took each customer's most recent visit relative to the last date of the year available in the transaction table. We took 31st December as the cut-off date and subtracted the transaction dates from it. After getting the number of days since the most recent purchase, we divided the customers into 5 blocks. The customers with the fewest days since their last purchase are our most recent customers and are given the value 5, followed by 4, 3, 2, 1.

For frequency, we took the count of transactions for each unique customer ID and grouped the customers by number of transactions. The customers with the highest transaction counts are given the top score of 5 and the lowest are given 1. The order is 5, 4, 3, 2, 1.

For monetary value, we grouped the unique customer IDs by their sum of sales. The customers with the highest sales are given the top score of 5 and the lowest are given 1. The order is 5, 4, 3, 2, 1. The groups are based on 20% bands, so the top 20% of customers are our top customers.

For each customer we get 3 different values ranging from 1 to 5, one for each RFM criterion.

Once we had the scores, we gave more weight to recency, because the most recent customers are more likely to buy again. We multiplied the Recency column by 3 and used weights of 3:2:1 for Recency:Frequency:Monetary, so the monetary value carries the smallest weight of 1.
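The scoring steps described in this section can be sketched in a few lines. This is a minimal illustration in Python, not the authors' code from Figure 1. The function names, the input layout of (customer ID, purchase date, amount) tuples and the combination of the three weighted scores into a single sum are assumptions made for the example; the text only states the 31st December cut-off, the five 20% bands and the 3:2:1 weights.

```python
from datetime import date

CUTOFF = date(2001, 12, 31)  # cut-off date stated in the text
WEIGHTS = (3, 2, 1)          # Recency : Frequency : Monetary, as stated in the text


def quintile_scores(values, higher_is_better=True):
    """Give each customer a score from 1 to 5 using 20% bands by rank.

    `values` maps customer ID -> raw value. Ties are broken by insertion
    order, which is an arbitrary choice made for this sketch.
    """
    # Sort so that the worst customer comes first and the best comes last.
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {cust: 1 + (pos * 5) // n for pos, cust in enumerate(ordered)}


def rfm_scores(transactions):
    """Hypothetical helper: transactions is an iterable of
    (customer_id, purchase_date, amount) tuples."""
    last_seen, freq, money = {}, {}, {}
    for cust, when, amount in transactions:
        last_seen[cust] = max(last_seen.get(cust, when), when)
        freq[cust] = freq.get(cust, 0) + 1
        money[cust] = money.get(cust, 0.0) + amount

    # Recency: days between the cut-off date and the latest purchase.
    days = {c: (CUTOFF - d).days for c, d in last_seen.items()}

    r = quintile_scores(days, higher_is_better=False)  # fewer days is better
    f = quintile_scores(freq)
    m = quintile_scores(money)

    wr, wf, wm = WEIGHTS
    # Assumption: the weighted scores are added into one ranking value.
    return {
        c: {"R": r[c], "F": f[c], "M": m[c],
            "score": wr * r[c] + wf * f[c] + wm * m[c]}
        for c in days
    }
```

Sorting customers by the resulting score and cutting the sorted list into ten equal parts would give deciles of the kind used for ranking later in the paper.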
Figure 1: Code for RFM Analysis

4. MARKET BASKET ANALYSIS

Market Basket Analysis (MBA) provides insights into customer purchase patterns across products. Through this analysis, we can identify which products are purchased the most and recommend which products can be bundled and sold to customers.

4.1 Analysis Method for MBA Analysis

For the MBA calculation we created a sparse matrix.

We joined the product details table with the transaction details. We then mapped the customer details to the transaction details, giving a binary value of 0 or 1 for each category, so each customer has binary values for the 7 respective categories.

Sub-category was not taken into consideration, as it is a numerical code and we have no data on what kind of sub-category is being purchased.

We tried to create customer baskets by week and by month, but that did not give us any useful inferences, so we finally took the basket over the entire year and tried to infer customer behavior for the year. In this case the basket consists of the products bought by the customer in the entire year.

Once this was done we created a sparse matrix with all 7 product categories as variables. Before this aggregation the product-level data was in a single column, so we get 7 new columns describing the product transaction details.

For the final calculation, the MBA algorithm was run with the confidence level set at 70% and lift greater than 1.

Figure 2: MBA Analysis using R

5. CUSTOMER CAMPAIGN

Campaign management is a series of marketing tactics and programs that are all designed to achieve a specific business goal. There are different ways to define and achieve goals and objectives in a campaign.

● Increase sales revenue.
● Acquire new customers.
● Increase customer retention.
● Enhance up-sell/cross-sell opportunities.

These mainly consist of enhancing the pipeline on open opportunities, tracking customer interest, promotions on websites (e.g., free trials, demo center), tracking search engine marketing, awareness events that drive PR and brand awareness for the company, and pricing promotions for current customers.
5.1 Campaign Analysis

The campaign dataset had two variables: Card ID and Campaign response. The Card ID identifies one particular unique customer, and the Campaign response gives that customer's response to the campaign, with "T" meaning responded and "F" meaning not responded.

The campaign was targeted at 5,957 customers, which is about 10% of the unique customer IDs that were mapped to the customer table.

Figure 3: Customer Response (pie chart "Campaign Response": Responded 5%, Not Responded 95%)

The dataset did not have any other information, and it was not possible to infer any results without additional information. So the dataset was mapped to the main master table, which had all the information and was also at the customer level. The mapping of the customer master table to the campaign file was done in R. The final mapped file was placed in Excel for slicing and dicing the numbers and arriving at inferences. Once the campaign file was mapped to the master table, we were able to gain actionable insights from the customers' responses. The IDs were tracked at the customer level. After the mapping we had each customer's demographic information, including individual sales and visits.

Figure 4: Campaign Analysis using R

A total of 325 customers responded to the campaign, which corresponds to about 5.5% of the total targeted customers. Profiling these customers helped us understand which groups of customers buy during campaigns; their age group and gender classification also helped us identify the right members to target in such campaigns in the future.

6. PRODUCT PREFERENCE ANALYSIS

The overall product preference analysis determines the top 2 preferred products, the cumulative revenue generated, and the preferred day and timing, based on gender and age group.

6.1 Analysis for Product Preference

First, separate sheets were created for male and female customers from the master table using the filter option in Excel.

For male customers, the age group of up to 30 years was segregated separately for total revenue, top 2 preferences, day and timing with a simple IF function in Excel. After the data was segregated, a pivot table was used to determine the 5 parameters for the age range.

The same was repeated for the age groups of 30-40 years, 40-50 years, 50-60 years, 60-70 years, 70-80 years and 90 years, for both the Male and Female categories.

After the values were determined, an interactive word tree map was coded, with the parent node as the starting node and the child nodes as branching nodes. For example, Prod preference is the parent node of the Male and Female nodes, each Male or Female node is the parent node of the age groups, and the process goes on until the graph reaches the leaf nodes, which are the 5 parameters in this case.

After coding the interactive graph, the overall values were input in JSON format.

Figure 5: D3.js code for Product Preference Graph

Figure 6: Product Preference Graph

7. INFERENCES AND CONCLUSIONS

7.1 Master Table

a. The highest-selling category in the retail chain was MP3 players. 2001 was the year when Apple came out with its first iPod.
b. Most of the purchases were made through credit cards.
c. Credit card was the first payment preference for 40% of customers, debit card (DC) for 32% and cash for 28%.
d. MP3 and Hardware are the first preferred products.
e. Hardware and Software are the second and third preferred products.
f. 60% of the customer base is in the age group 30-50, which provides 65% of total sales.
g. 30% of the customers have spent 3 years with us. Which is off age 20 to 40 comprising
h. Both male and female customers provide equal revenue.
i. Most of the customers shopped on Mondays.

7.2 RFM & MBA Analysis

a. The customers were ranked based on deciles. The unique customers were divided into 10 equal parts and the inferences are based on these segments.
b. Of the 59,400 unique customers, when we take the top 5,000 customers based on RFM (decile 1), more than 50% of them have bought MP3 players.
c. MP3 players are the most sold and the most sought-after category.
d. An equal number of males and females bought MP3 players, with a count of 13,011 in the top decile of the RFM score.
e. From the MBA analysis we could infer that customers who bought hardware and software tend to buy MP3 players.
f. Customers who bought telephony and hardware also bought software.
g. Customers who bought hardware individually also bought software.
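Rules such as those in items e to g are the ones that pass the thresholds given in Section 4.1, a confidence of 70% and a lift above 1. The sketch below shows one way to compute these measures from yearly customer baskets. It is written in Python and is not the R code of Figure 2; it enumerates candidate rules by brute force, which is feasible for 7 categories, rather than using a dedicated Apriori implementation. The function name, the parameters and the basket layout are assumptions made for this illustration.

```python
from itertools import combinations


def association_rules(baskets, min_confidence=0.7, min_lift=1.0, max_antecedent=2):
    """Hypothetical helper.

    baskets: dict mapping customer ID -> set of categories bought in the year.
    Returns a list of (antecedent, consequent, confidence, lift) tuples.
    """
    sets = [frozenset(b) for b in baskets.values()]
    n = len(sets)
    items = sorted(set().union(*sets))

    def support(itemset):
        # Share of baskets that contain every item in `itemset`.
        return sum(1 for s in sets if itemset <= s) / n

    rules = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(items, size):
            a = frozenset(antecedent)
            sup_a = support(a)
            if sup_a == 0:
                continue  # the antecedent never occurs, so no rule can be formed
            for consequent in items:
                if consequent in a:
                    continue
                confidence = support(a | {consequent}) / sup_a
                lift = confidence / support(frozenset([consequent]))
                # Thresholds from the text: confidence of 70%, lift above 1.
                if confidence >= min_confidence and lift > min_lift:
                    rules.append((antecedent, consequent, confidence, lift))
    return rules
```

Each basket here plays the role of one row of the binary category matrix from Section 4.1: a category is in the set exactly when its column holds a 1.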
7.3 Campaign Analysis

a. The year 2001 belonged to MP3 players. Several new players emerged in the market and there was a huge spike in MP3 player sales, as people became familiar with MP3 players that year. From our analysis of the data we found that the most sold item in the campaign was MP3 players, so we assume that the campaign was targeted at MP3 players.
b. The 325 customers were ranked and segmented into deciles to infer actionable insights. From the deciles we could infer that the top-ranking 30 customers (decile 1) bought around 845 MP3 players and the top 2 decile customers bought 442 MP3 players, contributing around 73% of the total sales of MP3 players.
c. The top decile customers who bought MP3 players are in the age group of 25-50, who can be described as the working group and who have the money to buy MP3 players, which were expensive when first sold.
d. The top decile made purchases of $220,000.
e. Most of the purchases during the campaign happened in the evening, compared to morning and noon. From this it can also be inferred that the campaign mostly targeted the working group in the age group of 25-50.
f. Most of the shopping happened on Saturdays, compared to other days.
g. Most of the purchases were made through debit cards. This may be because the retail company had tie-ups with various debit card companies and offers related to purchases made with these cards.

8. RECOMMENDATIONS

a. As 60% of the sales transactions happened through credit cards, the company can partner with more credit card companies and boost its customer engagement.
b. A study of consumer shopping behavior says that people usually shop on Mondays, and our analysis infers the same. So the retail chain can target customers more on Mondays.
c. The company should send out offers and campaigns on low-selling products such as gaming consoles and Hi-Fi.
d. 30% of the customers have bought only one category, so the retail chain should concentrate on these customers and try upselling and cross-selling other products based on their purchase pattern.
e. Since evenings are a little less busy, the retail chain should try engaging customers in the evenings with attractive offers.
f. 2001 was the year when MP3 was becoming popular and was the most sold category, so the company should continue to target most of its sales on MP3 players.