Nandeep NV, Premnath Prathiban, Raviteja Kalidindi
Students, Singapore Management University
anirudhnr.2015@mitb.smu.edu.sg
ABSTRACT
HE Ptd. Ltd. is a retail chain specializing in home entertainment, game consoles and software, Hi-Fi, MP3 players and telephony. In 2000, the company implemented a customer relationship management (CRM) system to capture information about its customers and their transaction records. The system mainly serves as a database management system, and its applications are mainly confined to simple tabular reporting of monthly and annual sales.
Recently, Mr. David Chou, the CEO of the company, attended an industry seminar organized by the MITB (SSA) programme of the School of Information Systems, Singapore Management University. At the seminar, Mr. Chou learned that the data captured in the CRM can provide actionable insights for the company to gain competitive advantage through appropriate customer analytics techniques. The analysts assigned to this project will provide an analysis of the different insights that can be achieved through data analytics. As analysts, we will apply customer analytics techniques such as RFM, MBA, Association Analysis and Product Preference Analysis.

1. MASTER DATA PREPARATION
To start, 4 datasets are provided for the POC project. The Customer dataset consists of approximately 140,000 customer records. The Transaction dataset captures 350,000 transactional records that occurred in the year 2001. These datasets are in comma-delimited text format.

Table 1: Customers File Data Fields

This is one of the most important entities of this retail dataset, as it holds the characteristics of each customer. The attributes in this dataset are ID, DOB, Registration Date and Gender. As part of our ETL logic we extracted a few more attributes: Age (derived from the DOB) and the number of years the customer has been with the organization (derived from the Registration Date). This dataset contains 140,000 customer records in total.

Table 2: Customer Transactions Data Fields

This dataset comprises OLTP (Online Transaction Processing) data, recorded as part of the transactions made by the customers. On its own this data does not provide much useful information, as it is not in normalized form, so it should be combined with other data to provide valuable information. We can also extract some additional hidden data from it.

Firstly, we can find the day on which each transaction was performed from the timestamp attribute. This day helps us find the crowd distribution by day of the week.

Secondly, each transaction ID is associated with one customer, which also lets us classify customers into categories based on the timing of their transactions. The categories are as follows:

9:00 a.m. - 11:59 a.m. - Morning Stars
12:00 noon - 3:00 p.m. - Afternoon Avengers
3:00 p.m. - 6:00 p.m. - Evening Evangeline's

This table covers only ~60K customers out of the whole population of 140K customers. This tells us that some 80K customers are dormant and are a potential target. Many strategic decisions can be made to make those customers active, which in turn brings revenue to the organization.

Table 3: Product Table Data Fields

This entity comprises information about the products and their characteristics, such as ID, category and unit price. As these are descriptive data, little transformation was needed.

Table 4: Customer Campaign Data Fields

This dataset consists of only 2 attributes: the Card-ID and the campaign response. On its own the dataset lacks richness, but there is high potential in combining it with the other datasets to bring out some of the hidden insights.

2. FACT TABLE
In order to draw insights from this retail dataset, we need to make sure that the dataset is organized and more customer focused. So business intelligence concepts such as the star schema were applied to the dataset.

The star schema is defined as a schema in which a centralized fact table, containing the factual data, is connected to dimension tables containing the descriptive data. The table was also aggregated from the customer perspective. After implementing this fact table, each record in the entity comprises a detailed representation of a customer who has made transactions.

As part of this process we have implemented the following:

● Each record represents one single customer, with demographics and purchasing patterns.
● We derived the age, gender and number of years as a customer from the Customer table.
● In order to find the purchasing pattern, we derived the following from the transaction dataset:
○ Payment Distribution - the number of times the customer has used each mode of payment.
○ Favorite Day - the top 3 days on which the customer visits.
○ Favorite Category - one of the important columns, which can help the organization implement its upsell and cross-sell strategy. This column gives the customer's top 3 favorite categories, i.e. those bought the most number of times, arranged by count.
○ Classifier - we classified each customer at the transaction level, but that alone does not provide much valuable information. So we aggregated to the customer level and kept only the classifier category with the largest value.
○ We also aggregated the total number of items and the total amount spent by each customer.

3. RFM ANALYSIS
RFM analysis is a popular customer segmentation technique which can help retailers maximize the return on their marketing investments. While planning the marketing spend or formulating a new promotion, retail marketers need to be careful about how they segment and target their customers.

Under RFM analysis each customer is ranked on three factors, namely Recency, Frequency and Monetary value. It helps companies identify customers that are more likely to respond to a new offer.

For recency, we measured each customer's most recent visit against the last date of the year available in the transaction table. We took 31st December as the cut-off date and subtracted the transaction dates from it. After getting the days since the most recent purchase, we divided the customers into 5 blocks. The customers with the fewest days since their last purchase are the most recent buyers and are given the value 5, followed by 4, 3, 2 and 1.

For frequency, we took the count of transactions for each unique customer ID and grouped the customers by number of transactions. The customers with the highest transaction counts are given the top ranking of 5 and the lowest are given 1. The order is 5, 4, 3, 2, 1.

For monetary value, we grouped the unique customer IDs by their sum of sales. The customers with the highest sales are given the top ranking of 5 and the lowest are given 1. The order is 5, 4, 3, 2, 1. Customers are split into bands of 20%, so the top 20% of customers are our top customers. For each customer we get 3 different values, each ranging from 1 to 5, one for each RFM criterion.

In our case, once we get the scores, we give more weightage to recency, because the most recent customers are more likely to buy again. We multiplied the Recency column by 3, giving weights of 3:2:1 for Recency: Frequency: Monetary. Monetary value is given the lowest weight, 1.
3.1 Analysis Method for RFM Analysis
The RFM values had to be calculated for each customer, as we have done our analysis at the customer level. Once this is done, we created a sparse matrix with all 7 products as variables. Before this aggregation the product-level data was in a single variable column, so we get 7 new columns holding the product transaction details. For the final calculation, the MBA algorithm is run with the confidence level set at 70% and lift greater than 1.
Figure 1: Code for RFM Analysis
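The scoring described above (days since last purchase measured from a 31st December cut-off, scores of 1 to 5 in blocks of 20%, weights of 3:2:1) can be sketched as follows. This is a minimal illustration in Python rather than the code shown in Figure 1. The function names, the (customer, date, amount) input format and the handling of ties (broken by sort order) are assumptions made for this sketch.

```python
from datetime import date

CUTOFF = date(2001, 12, 31)  # cut-off date stated in the text
W_RECENCY, W_FREQUENCY, W_MONETARY = 3, 2, 1  # 3:2:1 weighting stated in the text


def quintile_scores(values, higher_is_better=True):
    """Map each customer to a score 1..5, with about 20% of customers per score.

    Assumption: ties are broken by sort order, which the text does not specify.
    """
    # Put the worst customer first so that position 0 receives score 1.
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {cust: (pos * 5) // n + 1 for pos, cust in enumerate(ordered)}


def rfm_scores(transactions):
    """transactions: iterable of (customer_id, purchase_date, amount) tuples."""
    last_visit, frequency, monetary = {}, {}, {}
    for cust, day, amount in transactions:
        last_visit[cust] = max(last_visit.get(cust, day), day)
        frequency[cust] = frequency.get(cust, 0) + 1
        monetary[cust] = monetary.get(cust, 0.0) + amount

    # Fewer days since the last purchase is better, so recency is scored in reverse.
    days_since = {c: (CUTOFF - d).days for c, d in last_visit.items()}
    r = quintile_scores(days_since, higher_is_better=False)
    f = quintile_scores(frequency)
    m = quintile_scores(monetary)

    return {
        c: {
            "R": r[c],
            "F": f[c],
            "M": m[c],
            "score": W_RECENCY * r[c] + W_FREQUENCY * f[c] + W_MONETARY * m[c],
        }
        for c in last_visit
    }
```

With these weights the combined score ranges from 6 (all three scores equal to 1) to 30 (all three equal to 5), and recency moves the score three times as much as monetary value.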
4. MARKET BASKET ANALYSIS
The MBA analysis provides insights into the customer purchase patterns for various products. Through this analysis, we can identify which products are purchased the most and recommend which products can be bundled and sold to customers.
Figure 2: MBA Analysis using R
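As a rough illustration of the thresholds mentioned earlier (confidence of 70% and lift greater than 1), the sketch below derives association rules from one basket of product categories per customer. It is written in Python, not the R code the figure refers to, and it enumerates antecedents of at most two categories by brute force instead of using an Apriori implementation. The function names and the category labels in the usage note are hypothetical.

```python
from itertools import combinations

MIN_CONFIDENCE = 0.70  # confidence threshold stated in the text
MIN_LIFT = 1.0         # rules are kept only when lift is greater than 1


def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def association_rules(baskets, max_antecedent=2):
    """Return (antecedent, consequent, confidence, lift) for rules passing both thresholds."""
    baskets = [set(b) for b in baskets]
    categories = sorted(set().union(*baskets))
    rules = []
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(categories, size):
            sup_antecedent = support(baskets, antecedent)
            if sup_antecedent == 0:
                continue  # no basket contains this combination
            for consequent in categories:
                if consequent in antecedent:
                    continue
                sup_both = support(baskets, antecedent + (consequent,))
                confidence = sup_both / sup_antecedent
                lift = confidence / support(baskets, [consequent])
                if confidence >= MIN_CONFIDENCE and lift > MIN_LIFT:
                    rules.append((antecedent, consequent, confidence, lift))
    return rules
```

For example, `association_rules([{"hardware", "software", "mp3"}, {"hardware", "software"}, {"telephony"}])` returns rules such as hardware implying software. With only 7 categories the brute-force enumeration stays small, which is why no pruning step is included.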
4.1 Analysis Method for MBA Analysis
For the calculation of the MBA analysis we created a sparse matrix.

We joined the table of product details with the transaction details. We mapped the customer details to the transaction details, giving a binary value of 0 or 1, so there are binary values for the 7 respective categories. Sub-category is not taken into consideration, as it is numerical and we have no data regarding what kind of sub-category is being purchased.

We tried to create a customer basket based on week and on month, but that did not give us any of the desired inferences. So finally we took the basket for the entire year and tried to infer the customer behavior for the entire year. In this case the basket consists of the products bought by the customer in the entire year.

5. CUSTOMER CAMPAIGN
Campaign management is a series of marketing tactics and programs that are all designed to achieve a specific business goal. There are different ways to define and achieve goals and objectives in a campaign:

● Increase sales revenue.
● Acquire new customers.
● Increase customer retention.
● Enhance up-sell/cross-sell opportunities.

These mainly consist of enhancing the pipeline on open opportunities, tracking customer interest, promotions on websites (e.g., free trials, demo center), tracking search engine marketing, awareness events that drive PR and brand awareness for the company, and pricing promotions for current customers.

5.1 Campaign Analysis
The campaign dataset had two variables: Card id and Campaign response. The Card id gives the customer ID of one particular unique customer, and the Campaign response gives that customer's response to the campaign, with "T" meaning responded and "F" meaning not responded.

The campaign was targeted at 5957 customers, which is about 10% of the unique customer IDs that have been mapped to the customer table. A total of 325 customers responded to the campaign, which corresponds to about 5.5% of the total targeted customers.

Figure 3: Customer Response (pie chart of campaign response: Responded 5%, Not Responded 95%)

The dataset did not have any other information, and it was not possible to infer any results without additional information. So the dataset was mapped to the main master table, which had all the information and was also at the customer level. The mapping of the master table to the campaign file was done in R. The final mapped file was placed in Excel for slicing and dicing of the numbers and to arrive at inferences. Once the campaign file was mapped to the master table, we were able to gain actionable insights from the responses of the customers. The IDs were tracked at the customer level. After the mapping we had each customer's demographic information, including individual sales and visits.

Profiling the customers who responded helped us to know which groups of customers are buying in the campaigns. Their age group and gender classification also helped us in identifying the right members to target in such campaigns in the future.

6. PRODUCT PREFERENCE ANALYSIS
The overall product preference analysis is implemented to determine the overall preference for the top 2 products, the cumulative revenue generated, and the preferred day and timing, based on gender and the respective age group.

6.1 Analysis for Product Preference
First, a separate sheet is created for male and for female customers from the master table using the filter option in Excel.

The age group of up to 30 years, for male customers, is segregated separately for total revenue, top 2 preferences, day and timing with a simple IF function in Excel. After the data is segregated, a PIVOT table is used to determine the 5 parameters for the age range.

The same is repeated for the age groups of 30-40 years, 40-50 years, 50-60 years, 60-70 years, 70-80 years and 90 years, for both the Male and Female categories.
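The gender and age-band breakdown described above was done with Excel filters, IF functions and pivot tables. The same aggregation can be sketched in Python as below. The record field names, the treatment of band boundaries (lower bound inclusive) and the final "80 and over" band are assumptions of this sketch, and timing is left out for brevity.

```python
from collections import Counter, defaultdict

# Age bands named in the text; the lower bound is treated as inclusive (assumption).
AGE_BANDS = [
    (0, 30, "up to 30"),
    (30, 40, "30-40"),
    (40, 50, "40-50"),
    (50, 60, "50-60"),
    (60, 70, "60-70"),
    (70, 80, "70-80"),
]


def age_band(age):
    for low, high, label in AGE_BANDS:
        if low <= age < high:
            return label
    return "80 and over"  # assumption: the text does not define this band clearly


def preference_summary(customers):
    """customers: iterable of dicts with hypothetical keys
    gender, age, revenue, favorite_category and favorite_day."""
    groups = defaultdict(
        lambda: {"revenue": 0.0, "categories": Counter(), "days": Counter()}
    )
    for c in customers:
        group = groups[(c["gender"], age_band(c["age"]))]
        group["revenue"] += c["revenue"]
        group["categories"][c["favorite_category"]] += 1
        group["days"][c["favorite_day"]] += 1

    # One summary row per (gender, age band), like one cell block of the pivot table.
    return {
        key: {
            "revenue": g["revenue"],
            "top_2_categories": [name for name, _ in g["categories"].most_common(2)],
            "preferred_day": g["days"].most_common(1)[0][0],
        }
        for key, g in groups.items()
    }
```

Each key of the result is a (gender, age band) pair, so the output already has the parent-to-child shape (gender, then age group, then parameters) that a tree-style chart needs.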
After the values are determined, an interactive word tree is coded, with the parent node as the starting node and the child nodes as branching nodes. For example, Prod preference is the parent node of the Male and Female nodes, and the Male and Female nodes are the parent nodes of the age groups. The process goes on until the graph reaches the leaf nodes, which are the 5 parameters in this case.

After coding the interactive graph, the overall values are input in JSON format.

Figure 4: Campaign Analysis using R
Figure 5: D3.js code for Product Preference Graph
Figure 6: Product Preference Graph

7. INFERENCES AND CONCLUSIONS

7.1 Master Table
a. The highest-selling product in the retail chain was MP3 players. 2001 was the year when Apple came up with its first iPod.
b. Most of the purchases were made through credit cards.
c. 40% of the people chose credit cards, 32% debit cards (DC) and 28% cash as their first preference.
d. MP3 and Hardware are the first preferred products.
e. Hardware and Software are the second and third preferred products.
f. 60% of the customer base is in the age group of 30-50 and provides 65% of total sales.
g. 30% of the customers have spent 3 years with us; this group is aged 20 to 40.
h. Male and female customers provide us equal revenue.
i. Most of the customers shopped on Mondays.

7.2 RFM & MBA Analysis
a. The customers have been ranked based on deciles. The unique customers have been split into 10 equal parts and the inferences are based on these segments.
b. Out of the total of 59,400 unique customers, when we take the top 5000 customers based on RFM (decile 1 based on RFM), more than 50% of the top decile customers have bought MP3 players.
c. MP3 players are the most sold and the most sought-after category.
d. An equal number of males and females bought MP3 players, with a count of 13,011 in the top decile of the RFM score.
e. From the MBA analysis we could infer that customers who bought hardware and software tend to buy MP3 players.
f. Customers who bought Telephony and hardware also bought software.
g. Customers who bought hardware individually also bought software.

7.3 Campaign Analysis
a. The year 2001 belonged to MP3 players. Several new players emerged in the market and there was a huge spike in the sales of MP3 players. People became familiar with MP3 players in the same year. From our analysis we found that the most sold item in the campaign was MP3 players, so we assume that the campaign was targeted at MP3 players.
b. The 325 customers were ranked and segmented into deciles to infer some actionable insights. From the deciles we could infer that the top-ranking 30 customers (decile 1) bought around 845 MP3 players and the top 2 decile customers bought 442 MP3 players, contributing to around 73% of the total sales of MP3 players.
c. The top decile customers who bought MP3 players are around the age group of 25-50. They can be described as the working group, and they have the money to buy MP3 players, which were expensive when they were initially sold.
d. The top decile made purchases of $220,000.
e. Most of the purchases during the campaign happened in the evening, compared to morning and noon. From this it can also be inferred that the campaign mostly reached the working group, who fall in the age group of 25-50.
f. Most of the shopping happened on Saturdays, compared to other days.
g. Most of the purchases were made through debit cards. The reason may be that the retail company had tie-ups with various debit card companies and offers related to purchases made with these cards.

a. [...] with some more credit card companies and boost its customer engagement.
b. A study on consumer shopping behavior says that people usually shop on Mondays, and our analysis shows the same. So the retail chain can target customers more on Mondays.
c. The company should send out offers and campaigns on the low-selling products like gaming consoles and Hi-Fi.
d. 30% of the customers have bought only one category, so the retail chain should concentrate on these customers and try upselling and cross-selling other products based on their purchase pattern.
e. Since evenings are a little less engaged, the retail chain should try engaging customers in the evenings by giving attractive offers.
f. 2001 was the year when MP3 was getting popular and was the most sold. So the company should continue to target most of its sales on MP3 players.
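The decile segmentation used for the rankings in Sections 7.2 and 7.3 can be sketched as follows. This is a minimal Python illustration, not the authors' code. The function names are hypothetical, decile 1 is taken to hold the highest scores, and ties are broken by sort order.

```python
def assign_deciles(scores):
    """Map each customer to a decile 1..10, with decile 1 holding the highest scores.

    Assumption: ties are broken by sort order; the text does not say how they were handled.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    return {cust: (pos * 10) // n + 1 for pos, cust in enumerate(ranked)}


def share_of_total(values, members):
    """Fraction of the overall total in `values` contributed by the given members."""
    total = sum(values.values())
    if total == 0:
        return 0.0
    return sum(values[m] for m in members) / total


def decile_members(deciles, wanted):
    """Customers that fall into the decile numbered `wanted`."""
    return [cust for cust, d in deciles.items() if d == wanted]
```

For instance, passing the RFM scores to `assign_deciles` and the per-customer MP3 purchase counts to `share_of_total` gives the share of MP3 sales contributed by the top decile.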