Marketing Analytics Project: Hanoi University of Science and Technology

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MARKETING ANALYTICS PROJECT
GROUP 6
Bùi Đức Hiển Hien.bd203215@sis.hust.edu.vn

Nguyễn Thu Nga Nga.nt213079@sis.hust.edu.vn
Trần Thị Minh Tâm Tam.ttm213096@sis.hust.edu.vn
Business Administration Major
Business Analytics Advanced Program
Supervisor Dr. Nguyễn Tiến Dũng

Supervisor’s signature
Department Business Administration
School Economics and Management
Hanoi, 12/2023
Contents
PART 0: INTRODUCTION
PART I: MARKETING ANALYTICS BY SIMULATION WITH RAND()
I. INTRODUCTION
II. ANALYSIS
III. SUMMARY
PART 2: MARKETING ANALYTICS BY SIMULATION WITH CRYSTAL BALL
I. Problem
II. Solve the problem
1. Finding optimal order using the uniform distribution
2. Finding optimal order using the triangular distribution
3. Finding optimal order using the normal distribution
4. Is the optimum number of chairs to order the average demand? Why or why not?
5. Why is it not sufficient to simply calculate the average of the simulation demand to
validate the simulation?
PART 3: MARKETING ANALYSIS
I. About data set
1. Insight of analyze the dataset
III. Visual Exploratory Data Analysis
1. Diploma distribution and income level
2. Diploma distribution by marital situation
1. Graphical methods
2. Statistical methods
Question 1: Is the correlation between the annual income and the amount of spending
statistically significant?
1. Hypothesis statement
2. Analysis plan formulation: Rank Correlation Tests
Question 2: Is there correlation between the diploma owned and the marital status?
1. Hypothesis statement
2. Analysis plan formulation: Rank Correlation Tests
V. Customer Segmentation: RFM Analysis and two-step cluster analysis algorithm using SPSS
1. RFM Analysis (Recency – Frequency – Monetary Analysis)
a. Define the Multivariate Technique to be used
2. Two-step cluster analysis algorithm
Reference
PART 0: INTRODUCTION
Throughout the 2023-1 semester at HUST, we studied the crucial role of marketing
analytics in the big data era. We learned how analyzing customer and competitive data
enhances business decision-making, using tools such as Excel, Crystal Ball, and SPSS. These
tools are widely used in statistics, and especially in marketing analysis.
However, managing and analyzing the ever-growing volume of data, both structured and
unstructured, presents substantial challenges.
Marketing Analytics bridges this gap by applying data science to marketing problems.
We explored customer data analysis techniques and their underlying principles, equipping
ourselves with analytical skills for real-world scenarios.
Through three project assignments, we honed our skills in fundamental tools like Excel,
Crystal Ball, and SPSS.
· Problem 1: Using Excel's RAND() function, we simulated the bidding process faced by
Camden Electronics' purchasing agent, Ms. Greene, based on the provided data and information,
ultimately aiming to identify optimal marketing strategies.
· Problem 2: This problem involved using both Excel's RAND() function and the Crystal
Ball tool to simulate various scenarios and find the optimal order number under different
probability distributions.
· Problem 3: We downloaded customer segmentation data from Kaggle and used Excel to
conduct descriptive analysis and inferential statistics, drawing conclusions and predictions
based on the results. Next, we performed an RFM analysis in SPSS, gaining insights into past
business performance and informing future marketing plans.
The following sections describe the methodologies and solutions we used for each
problem.
PART I: MARKETING ANALYTICS BY SIMULATION WITH RAND()
I. INTRODUCTION
Camden Electronics Inc. is a small electronics firm that produces a variety of special-
purpose analog-to-digital converters, which are used primarily for process control. Its
business has grown to a sales level of $250,000,000 per year.
Ms. Greene, a purchasing agent, has been able to purchase batches of components that
have been classified as rejects by the Components Division of Cynctron Manufacturing
Company.
A study made at Cynctron has shown that it is not economically feasible for them to
test and sort the batch in the Products Division. Cynctron always sells these rejects in the
same way. A list is kept of purchasing agents who will buy parts with specific characteristics;
these purchasing agents are emailed and informed when a batch of potential interest is
available. If interested, each purchasing agent makes a sealed bid and Cynctron sells the batch
of components to the winner. Ms. Greene has recently been informed of the availability of a
batch of 100 components identified by Cynctron as MATS314Q.
Over the past 2 years, Ms. Greene has bid on 85 previous batches of these components
and has saved the winning unit bids.
Table 1: Historical bidding data (unit bid per estimated good unit, in $)
Contract  Unit bid    Contract  Unit bid    Contract  Unit bid
1         334.10      30        317.60      59        324.30
2         282.60      31        282.70      60        267.40
3         286.80      32        326.70      61        309.60
4         263.80      33        327.30      62        286.10
5         269.00      34        270.50      63        275.40
6         323.00      35        260.20      64        317.60
7         335.00      36        328.00      65        260.80
8         281.50      37        264.40      66        305.90
9         279.00      38        334.60      67        286.90
10        281.70      39        326.70      68        310.70
11        330.00      40        268.30      69        260.10
12        270.20      41        289.30      70        309.00
13        294.10      42        318.50      71        299.40
14        270.10      43        303.20      72        294.60
15        278.70      44        321.30      73        323.10
16        261.60      45        311.80      74        313.30
17        307.40      46        300.00      75        272.40
18        273.50      47        269.80      76        270.10
19        311.30      48        299.00      77        338.70
20        316.20      49        313.20      78        264.50
21        309.60      50        306.20      79        337.20
22        301.80      51        303.80      80        318.40
23        300.20      52        332.80      81        271.80
24        296.00      53        324.50      82        273.50
25        309.70      54        332.40      83        266.20
26        275.50      55        290.40      84        335.00
27        307.10      56        278.50      85        337.70
28        331.50      57        280.50
29        327.00      58        277.60
With the information from Cynctron on the characteristics of item MATS314Q, the
engineering department has informed Ms. Greene that Camden can use the item now. They
have received a contract to supply 25 pressure sensing and control devices, each of which
would require four of these components.
For Camden to manufacture the same circuit by regular methods, the cost would be
$7500 for setup plus a cost of $310 per unit produced. Cynctron’s price for tested and
guaranteed MATS314Q’s in quantities of 200 or less is $550 each.
With this information and history data given, Ms. Greene is about to make her analysis
and determine what to bid. She estimated that other bidders are bidding about $300 per good
unit.
This study aims to determine the best bid price with which Camden Electronics
Inc. can win the bid.
II. ANALYSIS
Step 1: Calculating the Frequency, Probability and Cumulative Probability
Based on the historical data, we divide the 85 contracts into 8 groups at $10 intervals,
with upper limits from $270 to $340. We then calculate the Frequency, Probability and
Cumulative Probability of each group.
Table 2: The function to calculate Frequency
Table 3: The function to calculate Probability
Table 4: The Frequency, Probability and Cumulative Probability result
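As a cross-check on the spreadsheet tables, Step 1 can be sketched in Python. This is a minimal sketch, not the worksheet itself: it assumes the bins behave like Excel's FREQUENCY function (each bid is counted in the first bin whose upper limit is greater than or equal to it), and the variable names are ours.

```python
from bisect import bisect_left

# Winning unit bids for the 85 past contracts, copied from Table 1.
bids = [
    334.10, 282.60, 286.80, 263.80, 269.00, 323.00, 335.00, 281.50, 279.00,
    281.70, 330.00, 270.20, 294.10, 270.10, 278.70, 261.60, 307.40, 273.50,
    311.30, 316.20, 309.60, 301.80, 300.20, 296.00, 309.70, 275.50, 307.10,
    331.50, 327.00, 317.60, 282.70, 326.70, 327.30, 270.50, 260.20, 328.00,
    264.40, 334.60, 326.70, 268.30, 289.30, 318.50, 303.20, 321.30, 311.80,
    300.00, 269.80, 299.00, 313.20, 306.20, 303.80, 332.80, 324.50, 332.40,
    290.40, 278.50, 280.50, 277.60, 324.30, 267.40, 309.60, 286.10, 275.40,
    317.60, 260.80, 305.90, 286.90, 310.70, 260.10, 309.00, 299.40, 294.60,
    323.10, 313.30, 272.40, 270.10, 338.70, 264.50, 337.20, 318.40, 271.80,
    273.50, 266.20, 335.00, 337.70,
]

# Upper limits of the 8 groups: 270, 280, ..., 340.
edges = list(range(270, 350, 10))

# Frequency: count each bid in the first bin whose upper limit is >= the bid.
freq = [0] * len(edges)
for bid in bids:
    freq[bisect_left(edges, bid)] += 1

# Probability and cumulative probability of each group.
n = len(bids)
prob = [f / n for f in freq]
cum = []
running = 0.0
for p in prob:
    running += p
    cum.append(running)

for edge, f, p, c in zip(edges, freq, prob, cum):
    print(f"<= {edge}: frequency={f:2d}  probability={p:.4f}  cumulative={c:.4f}")
```

The last cumulative value should equal 1, which is a quick way to confirm that no bid fell outside the bins.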
Step 2: Calculating P(Win|Bid), Slope and Intercept
As this situation requires, we calculate P(Win|Bid), the slope and the intercept
for a bid on a batch of 100 units.
The formulas:
P(Win|Bid) = Cumulative Probability
Slope = ΔP(Win|Bid) / ΔBid
Intercept = P(Win|Bid) − Slope * Bid
Table 5: The function to calculate Slope
Table 6: The function to calculate Intercept
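The slope and intercept formulas above describe a straight line between two neighbouring breakpoints of the cumulative distribution. A minimal Python sketch follows; the function names and the two example breakpoints are hypothetical and chosen only to make the arithmetic easy to follow, they are not values from Table 4.

```python
def line_between(bid_lo, p_lo, bid_hi, p_hi):
    """Slope and intercept of P(Win|Bid) between two breakpoints."""
    slope = (p_hi - p_lo) / (bid_hi - bid_lo)   # delta P(Win|Bid) / delta Bid
    intercept = p_hi - slope * bid_hi           # P(Win|Bid) - Slope * Bid
    return slope, intercept


def p_win(bid, slope, intercept):
    """Interpolated winning probability for a bid inside the segment."""
    return intercept + slope * bid


# Hypothetical breakpoints: P(Win|270) = 0.20 and P(Win|280) = 0.40.
slope, intercept = line_between(270, 0.20, 280, 0.40)
print(slope, intercept, p_win(275, slope, intercept))
```

With these illustrative numbers, a bid halfway between the breakpoints gets a probability halfway between 0.20 and 0.40.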
Step 3: Building the cost structure
Based on the data given, the total cost is the $7,500 setup cost plus the variable cost. The
formula for the variable cost is:
Variable cost = Variable production cost per unit * Number of pressure sensing devices
Table 7: The function to calculate the cost structure
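The cost structure can be written out as a short calculation. This sketch uses only the figures stated in the introduction ($7,500 setup, $310 per unit, 25 devices); the constant names are ours.

```python
SETUP_COST = 7500        # one-off setup cost, in $
UNIT_COST = 310          # variable production cost per unit, in $
DEVICES = 25             # pressure sensing and control devices in the contract

variable_cost = UNIT_COST * DEVICES
total_cost = SETUP_COST + variable_cost
print(variable_cost, total_cost)
```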
Step 4: Calculating the EV
As calculated above, the total cost is $15,250. The higher the bid, the higher the
probability of winning, and the results also show a higher expected value (EV).
In this case, Ms. Greene will win the bidding if she offers a price of around $340/unit
($34,000 for a batch of 100 units).
Table 8: The EV result
Step 5: Calculating the P(Win|Bid), Slope and Intercept to Generate Random Bids for
Simulation
ECDF = RAND()
Slope = ΔBid / ΔECDF
Intercept = Bid − Slope * ECDF
Table 9: The P(Win|Bid), Slope and Intercept Random Bids result
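Step 5 amounts to inverse-transform sampling: a random number between 0 and 1 is treated as an ECDF value and mapped back to a bid along the straight line of its segment. The sketch below assumes that reading; the breakpoints are illustrative placeholders, not the values of Table 9.

```python
import random

# Illustrative ECDF breakpoints (hypothetical, NOT the report's Table 9).
ECDF_POINTS = [0.00, 0.25, 0.60, 1.00]
BID_POINTS = [260.0, 280.0, 310.0, 340.0]


def sample_bid(u, ecdf_points=ECDF_POINTS, bid_points=BID_POINTS):
    """Map a uniform number u in [0, 1] to a bid by linear interpolation."""
    for i in range(1, len(ecdf_points)):
        if u <= ecdf_points[i]:
            # Slope = delta Bid / delta ECDF, Intercept = Bid - Slope * ECDF
            slope = (bid_points[i] - bid_points[i - 1]) / (
                ecdf_points[i] - ecdf_points[i - 1]
            )
            intercept = bid_points[i] - slope * ecdf_points[i]
            return intercept + slope * u
    return bid_points[-1]


random.seed(0)
print([round(sample_bid(random.random()), 2) for _ in range(5)])
```

Every sampled bid falls between the lowest and highest breakpoint, and segments with a steeper ECDF receive proportionally more samples.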
Step 6: Simulation with 1,000 Trials
To calculate the competitor's bid, we use the VLOOKUP function.
Table 10: The function to calculate Competitor's bid
Finding the Status: if Ms. Greene's bid > the competitors' bid, she wins.
The result is shown in Table 11.
Table 11: The function to calculate Competitor's bid
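The 1,000-trial simulation can be sketched as a loop that draws a competitor's bid and applies the Status rule. In this sketch the competitor's bid comes from a simple stand-in sampler (uniform between $260 and $340), which is an assumption made to keep the example self-contained; the spreadsheet draws it from the empirical distribution instead.

```python
import random


def simulate_win_rate(our_bid, sample_competitor_bid, trials=1000):
    """Share of trials in which our bid is higher than the competitor's bid."""
    wins = 0
    for _ in range(trials):
        if our_bid > sample_competitor_bid():   # Status rule: higher bid wins
            wins += 1
    return wins / trials


def stand_in_competitor_bid():
    """Hypothetical stand-in sampler, not the report's empirical distribution."""
    return random.uniform(260, 340)


random.seed(42)
for bid in (280, 300, 320, 340):
    print(bid, simulate_win_rate(bid, stand_in_competitor_bid))
```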
III. SUMMARY
The study demonstrates how to convert data into support for managerial decisions. Real
historical (empirical) data does not necessarily fit a known distribution, yet the data
frequencies and rankings can be used to estimate an appropriate empirical probability
distribution. Subsequently, the empirical distributions are used in decision trees and
simulations to make optimum managerial decisions. Based on the historical bid data and a
comparison with the expected price of competitors, Camden Electronics Inc. should offer a
price of $340/unit to win the bidding.
PART 2: MARKETING ANALYTICS BY SIMULATION WITH
CRYSTAL BALL
I. Problem
A retailer orders chairs at a cost of $175 each and will sell them at $250. They forecast that
demand will be about 8000, within a range of 7000 to 9000. If they cannot sell all the chairs
by the end of the season, they must sell the rest at half of the initial price. We need to find
the optimal order number.
II. Solve the problem.
The retailer does not want to keep unsold chairs in inventory, for several reasons:
- The chairs might not be in good condition after being stored, or their designs may have
become outdated and no longer align with modern trends.
- As a retailer, they aim to avoid inventory costs, which would entail a substantial
investment in renting storage space.
1. Finding optimal order using the uniform distribution
a. Rand() function:
Demand min       7000
Demand max       9000
Order            8000
Buying price     175
Selling price    250
Recycling price  125
To simulate, we set up 10,000 trials using the series function. We use the RAND()
function, which generates approximately uniformly distributed random numbers, to simulate
the demand for chairs this season. The uniform demand formula in Excel is:
= demand min + RAND() * (demand max – demand min)
To calculate the profit or loss of each trial, we add the revenue from chairs sold
during the season (units sold * selling price) to the revenue from leftover chairs sold off
at the end of the season (inventory * recycling price), and subtract the purchase cost
(buying price * number of chairs ordered).
Units sold = MIN(order, demand)
Inventory = IF(order > demand, order – demand, 0)
Profit = units sold * selling price + inventory * recycling price – order * buying price
To find the optimal order number, we compute the average profit over all trials and use
Solver to maximize it, giving the seller the highest profit.
The objective is the average profit and the changing cell is Order (we initially set it to
8000, a value between 7000 and 9000; it affects the seller's profit because it determines
how many chairs the store can sell). The constraints are: order >= demand min,
order <= demand max, and order is an integer.
After running the model, we get the optimal order number = 8216 and average profit
= 570,708.25
To obtain the highest profit, the store should order 8,216 chairs, which limits both leftover
inventory and shortages, each of which reduces profit.
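The RAND() simulation and the Solver step can be approximated in Python. This is a sketch under stated assumptions: Solver is replaced by a plain search over candidate order quantities in steps of 20, the random seed is arbitrary, and the function names are ours, so the optimum will not match the worksheet's 8,216 exactly.

```python
import random

BUY, SELL, RECYCLE = 175, 250, 125      # prices from the problem statement
D_MIN, D_MAX = 7000, 9000


def profit(order, demand):
    """Profit of one trial for a given order quantity and demand."""
    sold = min(order, demand)
    inventory = max(order - demand, 0)
    return sold * SELL + inventory * RECYCLE - order * BUY


# 10,000 uniform demand trials: demand min + RAND() * (demand max - demand min)
random.seed(0)
demands = [D_MIN + random.random() * (D_MAX - D_MIN) for _ in range(10_000)]


def average_profit(order):
    return sum(profit(order, d) for d in demands) / len(demands)


# Stand-in for Solver: try every 20th order quantity and keep the best one.
best_order = max(range(D_MIN, D_MAX + 1, 20), key=average_profit)
print(best_order, round(average_profit(best_order), 2))
```

Because all candidates are evaluated on the same simulated demands, the comparison between order quantities is not disturbed by fresh random noise at each step.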
To check that the simulated demand is uniformly distributed, we create bins from 7000
to 9000 in steps of 100 and use the FREQUENCY function to count how many simulated demands
fall in each bin. We then obtain the probability of each bin by dividing its frequency by
the number of trials. Finally, we select the bin column and, holding down the Ctrl key, the
probability column, and draw a column chart to confirm that the distribution is even.
[Figure: column chart of Probability (0 to 0.06) by demand bin, 7000 to 9000 in steps of 100]
b. Crystal ball
Firstly, we fill out the given data:
If order < demand, all ordered chairs are sold and there is no inventory, so the number of
chairs sold during the season equals the number of chairs ordered. If order > demand, the
chairs are not sold out and the surplus equals order – demand, so the number of chairs sold
off is order – demand. Profit = no. chairs sold off * recycling price + no. chairs sold during
season * selling price – (no. chairs sold off + no. chairs sold during season) * buying price.
To use Crystal Ball to find the order quantity that gives the store the highest
profit, we first set up the variables.
We define the assumption variable (Demand), the decision variable (number of chairs
ordered) and the forecast (Profit/Loss).
Then, we use OptQuest to maximize the mean of Profit/Loss.
Set up the conditions: the objective is to maximize the mean of Profit/Loss, and the
decision variable is Order.
After running the simulation, we get the OptQuest results:
We get an average profit of $574,704 and an optimal order number of 8,219
chairs.
c. Compare the results of optimizing the order using Excel and Crystal Ball
Method 1 Method 2
No.chairs ordered 8,216 8,266
Profits $570,708.25 $570,366.59
The two methods give relatively similar results. The RAND() function is a simple way to
generate random numbers. However, the values it generates may not follow the intended
distribution exactly. Crystal Ball is more powerful software that can generate more complex
random distributions, so it can generate demand values with greater precision than the
RAND() method.
2. Finding optimal order using the triangular distribution
a. Rand() function
Demand follows a triangular distribution, so we use the following formula:
This function must be inverted for use in an Excel simulation:
Demand = ROUND(IF(RAND() < Range1 / Range, Min + Range1 * RAND()^0.5, Max – Range2 * (1 – RAND())^0.5), 0)
Units sold = MIN(demand, most likely demand)
Inventory = IF(most likely demand > demand, most likely demand – demand, 0)
Profit = selling price * units sold + recycling price * inventory – buying price * most likely demand
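The Excel demand formula above can be mirrored in Python. The sketch assumes that Range = Max – Min, Range1 = most likely – Min and Range2 = Max – most likely (the report does not define these names), and it draws a fresh random number for each RAND() call, as Excel does.

```python
import random


def triangular_demand(d_min, most_likely, d_max):
    """One simulated demand, following the structure of the Excel formula."""
    range_total = d_max - d_min          # assumed meaning of Range
    range1 = most_likely - d_min         # assumed meaning of Range1
    range2 = d_max - most_likely         # assumed meaning of Range2
    if random.random() < range1 / range_total:
        # Left side of the triangle: density rises from d_min to the mode.
        return round(d_min + range1 * random.random() ** 0.5)
    # Right side of the triangle: density falls from the mode to d_max.
    return round(d_max - range2 * (1 - random.random()) ** 0.5)


random.seed(0)
sample = [triangular_demand(7000, 8000, 9000) for _ in range(20_000)]
print(min(sample), max(sample), sum(sample) / len(sample))
```

For a symmetric triangle such as 7000, 8000, 9000 the sample mean should sit close to the most likely value, which gives a quick sanity check on the formula.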
To find the optimal order number, we compute the average profit and use Solver to
maximize it, giving the seller the highest profit.
The objective is the average profit and the changing cell is Most likely demand (we
initially set it to 8000; it affects the seller's profit because it determines how many
chairs the store can sell). The constraints are: Most likely demand >= demand min, Most
likely demand <= demand max, and Most likely demand is an integer.
To achieve the highest possible profit, the analysis recommends that the store order
8,698 chairs. This quantity balances satisfying customer demand against the risk of excess
unsold inventory or a shortage of chairs, both of which would reduce profit.
The following chart illustrates and validates the triangular shape of the simulated demand.
[Figure: Triangle distribution. Probability (0 to 0.12) by demand bin, 7000 to 9000 in steps of 100]
b. Crystal Ball
Method 1 Method 2
Profits $570,708.25 $573,472.57
The results of optimizing the number of chairs ordered using Excel and Crystal Ball
differ slightly, but the difference is not significant. Possible factors contributing to this
difference include the chosen approach, uncertainty in demand, and the sensitivity of profit
to the order quantity.
The Crystal Ball result is more precise than the Solver result.
3. Finding optimal order using the normal distribution

a. Using Rand() function
We set up the cost structure:
We estimate demand as ROUND(average demand + NORM.S.INV(RAND()) * 100, 0)
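The demand formula is an inverse-transform draw from a normal distribution with a standard deviation of 100. A minimal Python equivalent, assuming an average demand of 8000 and using the standard library's NormalDist in place of NORM.S.INV, could look like this (the function name is ours):

```python
import random
from statistics import NormalDist, mean, stdev

STANDARD_NORMAL = NormalDist()          # mean 0, standard deviation 1


def normal_demand(average_demand=8000, sd=100):
    """ROUND(average demand + NORM.S.INV(RAND()) * sd, 0) in Python."""
    u = random.random()
    while u == 0.0:                     # inv_cdf is undefined at exactly 0
        u = random.random()
    return round(average_demand + STANDARD_NORMAL.inv_cdf(u) * sd)


random.seed(0)
sample = [normal_demand() for _ in range(10_000)]
print(mean(sample), stdev(sample))
```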
Units sold, inventory and profit use the same formulas as for the uniform
distribution. We then use Solver to maximize the average profit and find how many chairs to
order. The result is a maximum average profit of 595,243, and the seller should order
8,026 chairs.
The normal distribution graph is shown below:
[Figure: Normal distribution. Probability (0 to 0.16) by demand bin, 7000 to 9000 in steps of 100]
b. Using Crystal Ball
In this step, we define the same names: the assumption (Demand), the decision variable
(No. chairs ordered) and the forecast (Profit/Loss).
The result shows that the seller should order 8,176 chairs, with a maximum profit of
572,228.
Method 1 Method 2
Profits $595,243 $572,288.31
The results of optimizing order quantities using Excel and Crystal Ball differ slightly,
but the difference is not significant. This is because both methods are only simulations, and
they can only account for a portion of the uncertainty in demand. In reality, there are many
other factors that can affect demand, such as market trends, competitor actions, and supply
chain disruptions. Therefore, it is important to consider all of these factors when making
business decisions.
4. Is the optimum number of chairs to order the average demand? Why or why
not?
The optimum number of chairs to order is not the average demand. It does not align
with the average demand because demand is variable. In this situation involving Adirondack
chairs, demand is uncertain and can fluctuate within a range of 7000 to 9000. Determining the
optimal order quantity involves considering multiple factors such as purchase costs, carrying
or holding costs, shortage costs, and the uncertainty associated with demand. Relying solely
on average demand might not adequately account for potential variations or spikes in demand,
risking stockouts or excessive inventory.
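As a rough cross-check on this answer, the standard newsvendor critical-ratio rule can be applied to the uniform case. The report does not use this rule explicitly, so it is an added assumption here, and the symbols c_u (profit lost per chair of unmet demand), c_o (loss per leftover chair), F (demand distribution function) and Q* (optimal order) are introduced only for illustration.

```latex
\begin{align*}
c_u &= 250 - 175 = 75, \qquad c_o = 175 - 125 = 50,\\
F(Q^*) &= \frac{c_u}{c_u + c_o} = \frac{75}{125} = 0.6,\\
Q^*_{\text{uniform}} &= 7000 + 0.6\,(9000 - 7000) = 8200.
\end{align*}
```

The value of about 8,200 is close to the 8,216 found with Solver, and it lies above the average demand of 8,000 because a lost sale ($75) costs more than a leftover chair ($50).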
5. Why is it not sufficient to simply calculate the average of the simulation demand
to validate the simulation?
Calculating the average simulated demand alone is insufficient for validating a
simulation due to several reasons. Firstly, the average doesn't convey the variability in
demand; it only provides a central tendency, concealing the range and distribution of demand,
which are crucial aspects to model in a simulation. Effective validation involves assessing
whether the simulated data aligns with the assumptions made about demand patterns and
relationships within the simulation. Since simulations rely on these assumptions, checking
their validity against real-world observations is essential. Additionally, the ultimate goal of a
simulation is predictive power—accurately forecasting future demand patterns. Simply
calculating the average does not evaluate the simulation's ability to achieve this predictive
accuracy. To ensure a robust and reliable simulation, it is essential to consider and assess both
central tendencies and variability, validating the model against assumptions and real-world
observations.
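A small experiment illustrates the point: two simulated demand series can share almost the same average while having very different spreads. The sketch compares a uniform demand on 7000 to 9000 with a normal demand of mean 8000 and standard deviation 100, the two settings used earlier in this part; the seed and variable names are arbitrary.

```python
import random
from statistics import mean, stdev

random.seed(1)
uniform_demand = [random.uniform(7000, 9000) for _ in range(10_000)]
normal_demand = [random.gauss(8000, 100) for _ in range(10_000)]

# The averages are nearly identical, so the mean alone cannot tell
# the two simulations apart.
print("mean:  ", round(mean(uniform_demand)), round(mean(normal_demand)))

# The spread differs by a wide margin, which is why histograms, ranges
# and standard deviations also need to be checked.
print("stdev: ", round(stdev(uniform_demand)), round(stdev(normal_demand)))
print("range: ", round(max(uniform_demand) - min(uniform_demand)),
      round(max(normal_demand) - min(normal_demand)))
```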
PART 3: MARKETING ANALYSIS

I. About data set
1. Insight of analyze the dataset
Customer segmentation is a powerful marketing technique that involves dividing a
customer base into distinct segments based on shared characteristics, behaviors, or
demographics. Its primary purpose is to understand and serve customers in a more
personalized and targeted way. Segmentation helps a company understand customer needs
better and reach the right customer with the right messaging.
The data set contains information on 2205 Ifood corporate consumers, including:
• Customer profiles
• Product preferences
• Campaign successes/failures
• Channel performance
The management wishes to identify the persona of a potential client who responds
favorably to the company's marketing efforts. In particular, we examine the following
questions in this report:
· Who are the customers who spend the most and are the most loyal?
· Which customers are most likely to leave the company?
· Who are the new customers and how can they increase the likelihood of recurring purchases?
· Which customers could potentially convert to higher spending customers?
· Which customers have left the brand and are least likely to return?
· Which customers need to be cared for and retained before becoming churn customers?
· What different customer segments require different care, incentives and marketing activities?
2. Analysis orientation
This report identifies customer groups using the RFM method, then explores customers who
have accepted offers in the campaigns to find those who respond positively to the
marketing campaigns and are likely to make large purchases.
3. Summary of the data
The data contains 2,205 observations and 39 columns. The dataset has 5 categorical variables
and 23 numerical variables. After reviewing the data columns and comparing them to the dataset
description, looking for missing values, checking column types and assessing unique values,
we conclude that there are no missing values.
The data set consists of 2205 customers of Ifood company with data on:
Customer profiles
· Age = Customer's age

· Income = Customer's yearly household income
· Customer_Days = How many times customer came to company
· Dt_Customer = Date of customer's enrollment with the company
· Recency = Number of days since customer's last purchase
· Education = Customer's education level
· Education_2n Cycle = Customer's education level is 2nd cycle (8 years)
· Education_Basic = Customer's education level is basic (5 years)
· Education_Graduation = Customer's education level is graduation (12 years)
· Education_Master = Customer's education level is master (18 years)
· Education_PhD = Customer's education level is PhD (21 years)
· Marital_Status = Customer's marital status
· Marital_Divorced = Customer's marital status is divorced
· Marital_Married = Customer's marital status is married
· Marital_Single = Customer's marital status is single
· Marital_Together = Customer's marital status is together
· Marital_Widow = Customer's marital status is widow
Has_child
· Kidhome = Number of children in customer's household

· Teenhome = Number of teenagers in customer's household
Spending - Product preferences
· MntWines = Amount spent on wine in the last 2 years

· MntFruits = Amount spent on fruits in the last 2 years
· MntMeatProducts = Amount spent on meat in the last 2 years
· MntFishProducts = Amount spent on fish in the last 2 years
· MntSweetProducts = Amount spent on sweets in the last 2 years
· MntGoldProds = Amount spent on gold in the last 2 years
Channel performance
· NumDealsPurchases = Number of purchases made with a discount

· NumWebPurchases = Number of purchases made through the company's web site
· NumCatalogPurchases = Number of purchases made using a catalogue
· NumStorePurchases = Number of purchases made directly in stores
· NumWebVisitsMonth = Number of visits to company's web site in the last month
Campaign successes/failures
· AcceptedCmp3 = 1 if customer accepted the offer in the 3rd campaign, 0 otherwise

· AcceptedCmp4 = 1 if customer accepted the offer in the 4th campaign, 0 otherwise
· AcceptedCmp5 = 1 if customer accepted the offer in the 5th campaign, 0 otherwise
· AcceptedCmp1 = 1 if customer accepted the offer in the 1st campaign, 0 otherwise
· AcceptedCmp2 = 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
· Complain = 1 if customer complained in the last 2 years, 0 otherwise
Added new variable:
· Educational_years = Number of education year for each diploma
· Marital_Situation = 0 means “Alone”- Divorced, Single, Widow, 1 means “In relationship”
· Has_child = 0 means “No child” and 1 means “Has child”
· Spending = Total of amount spent on products.
· AcceptedCmpOverall = Count how many times customer accepted the offer in campaigns.
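The derived variables can be expressed as a small transformation of one customer record. The sketch below uses the column names listed above; the helper name, the assumption that Education holds the diploma label, and the sample customer are hypothetical and only illustrate the rules.

```python
EDUCATION_YEARS = {"Basic": 5, "2n Cycle": 8, "Graduation": 12, "Master": 18, "PhD": 21}
SPENDING_COLUMNS = ["MntWines", "MntFruits", "MntMeatProducts",
                    "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
CAMPAIGN_COLUMNS = ["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3",
                    "AcceptedCmp4", "AcceptedCmp5"]


def add_derived_variables(row):
    """Return a copy of one customer record with the new variables added."""
    out = dict(row)
    out["Educational_years"] = EDUCATION_YEARS[row["Education"]]
    # 1 = "In relationship" (Married, Together), 0 = "Alone" (all other statuses)
    out["Marital_Situation"] = 1 if row["Marital_Status"] in ("Married", "Together") else 0
    out["Has_child"] = 1 if row["Kidhome"] + row["Teenhome"] > 0 else 0
    out["Spending"] = sum(row[c] for c in SPENDING_COLUMNS)
    out["AcceptedCmpOverall"] = sum(row[c] for c in CAMPAIGN_COLUMNS)
    return out


# Hypothetical customer, for illustration only (not a row of the dataset).
customer = {
    "Education": "Master", "Marital_Status": "Single", "Kidhome": 1, "Teenhome": 0,
    "MntWines": 100, "MntFruits": 10, "MntMeatProducts": 50,
    "MntFishProducts": 20, "MntSweetProducts": 5, "MntGoldProds": 15,
    "AcceptedCmp1": 0, "AcceptedCmp2": 1, "AcceptedCmp3": 0,
    "AcceptedCmp4": 0, "AcceptedCmp5": 1,
}
print(add_derived_variables(customer))
```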
Statistic summary
                     Income         Age       Spending
Mean                 51622.09       51.10     606.82
Standard Error       441.10         0.25      12.81
Median               51287          50        397
Mode                 7500           44        46
Standard Deviation   20713.06       11.71     601.68
Sample Variance      429031013.05   137.03    362013.15
Kurtosis             -0.85          -0.80     -0.34
Skewness             0.01           0.09      0.86
Range                112004         56        2520
Minimum              1730           24        5
Maximum              113734         80        2525
Sum                  113826719      112666    1338042
Count                2205           2205      2205
Total product consumed by customers in the survey
Wines    Fruits   Meat     Fish    Sweet   Gold
675093   58219    364513   83253   59818   97146
Channel Performance
According to the Channel Performance chart, clients mostly purchase items online
(9042 purchases) and in-store (11768 purchases). The high incidence of in-store transactions
might be attributed to the company's fresh items such as meat, fish, etc.
Last month's website visits were also extremely high: 11,768 visits out of a total of 2,205
consumers. Each user visits the website 5 to 6 times each month on average.
III. Visual Exploratory Data Analysis

1. Diploma distribution and income level
PhD holders had the highest average salary, at $56,161.
Basic degree holders had the lowest median earnings, at $20,306. More than half of our
consumers hold a Graduation-level diploma.
According to the chart of average income by education level, the higher the diploma,
the greater the average wage. People with a basic education, on the other hand, earn about
half as much as those with a 2nd cycle education.
In this section, we explore whether the average wage of persons with a PhD differs
statistically from that of those with a Master's degree.
2. Diploma distribution by marital situation
The degree distribution appears to be the same in the "In couple" and "Alone" populations.
· In couple: Married, Together (1422 obs)
· Alone: Divorced, Single, Widowed (783 obs)
In both plots, the Graduation level dominates with about 50%, followed by PhD, Master's,
and 2nd cycle.
We could be tempted to assume that there is no relationship between education and
marital status.
3. Spending by income
Spending appears to be positively related to income. According to the chart, persons
with greater incomes tend to spend more.
We later use a statistical test to check whether the association between annual
income and amount spent is statistically significant.
4. Spending by parental status
We classified customers' spending into 39 levels (from 0 to over 2000, in steps of
50).
Customers with children are heavily concentrated at the lowest spending level, below $50.
Customers with children spend more than those without children in the $50-1200 range.
Above $1,200, however, customers without children spend more.
IV. Statistical Hypothesis Testing
We have 3 business problems to answer using Hypothesis Testing:
1. Is the average salary of PhD owners statistically different from Master owners?
2. Is the correlation between the annual income and the amount of spending statistically significant?
3. Is there correlation between the diploma owned and the marital status?
First, normality tests must be done. If the data are normally distributed, parametric
statistical methods will be used; otherwise, nonparametric statistical methods should be used.
The data will be tested by graphical methods and statistical methods.
1. Graphical methods
From the graph, we can immediately see which variables seem to be Normal or Gaussian-like:
“Age” and “Income” have Normal-like distributions
“Spending” has a Log-normal distribution
“Educational_years” has a Multinomial distribution
2. Statistical methods
The tests assume that the sample was drawn from a Normal distribution.
The significance threshold is alpha = 5%.
· H0: Sample looks Gaussian
· H1: Sample does not look Gaussian
By SPSS, we have results:
Variable            Statistic   p       Decision
Age                 0.976       0.000   reject H0
Income              0.976       0.000   reject H0
Spending            0.865       0.000   reject H0
Educational_years   0.833       0.000   reject H0
These variables are not Gaussian at a 5% significance level
From this point we have two options:
Normalizing our data to use parametric statistical methods
Using directly nonparametric statistical methods.
Next, we answer 2 of the business problems using hypothesis testing:
Question 1: Is the correlation between the annual income and the amount of spending
statistically significant?
Numerical variables: Spearman Rank Correlation test
This corresponds to our second business problem: finding whether there is a statistically
significant correlation between income and the spending amount. Spearman rank correlation is
a non-parametric test that measures the degree of association between two variables.
1. Hypothesis statement
· H0: There is no monotonic association between income and spending amount
· H1: There is a monotonic association between income and spending amount
2. Analysis plan formulation: Rank Correlation Tests
· Significance level: We test our hypothesis at the 5% significance level.
· Test method: We use the Spearman rank correlation test to determine whether the two variables
are correlated. This statistical method quantifies the degree to which ranked variables are
associated by a monotonic function, meaning an increasing or decreasing relationship.
We treat “Income” as the independent variable x and “Spending” as the dependent variable y.
We use the Spearman rank correlation test to determine whether the two variables are correlated
positively or negatively; in other words, we test whether higher “Income” means higher “Spending”.
The positive “rs” value of 0.8466 suggests a strong positive relationship.
Therefore, H0 (“There is no monotonic association between income and spending amount”) is
rejected, and we conclude that there is a positive monotonic association between “Income” and
“Spending”.
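As a minimal sketch of how the Spearman coefficient is obtained, the Python below ranks both variables (tied values share the mean of their ranks) and then takes the Pearson correlation of the ranks. The helper names and the synthetic income and spending values are assumptions for illustration; the project itself ran the test in SPSS on the real data.

```python
import math
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rs(x, y):
    """Spearman's rs is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

rng = random.Random(1)
# Synthetic data: spending rises monotonically (not linearly) with income, plus noise
income = [rng.uniform(10_000, 100_000) for _ in range(500)]
spending = [(inc / 1000) ** 2 * rng.uniform(0.8, 1.2) for inc in income]

rs = spearman_rs(income, spending)
n = len(income)
# Large-sample t approximation for H0: no monotonic association
t_stat = rs * math.sqrt((n - 2) / (1 - rs ** 2))
print(f"rs={rs:.3f}, t={t_stat:.1f} (df={n - 2})")
```

Because only ranks enter the calculation, the coefficient is unaffected by the skewed, non-Gaussian shape of “Spending”, which is why a rank-based test fits after the normality tests rejected H0.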
Question 2: Is there a correlation between the diploma owned and the marital status?
This corresponds to our third business problem: finding whether there is a statistically
significant correlation between the diploma and the marital situation.
              Marital situation
Diploma       Alone   In couple   Total
Basic         20      34          54
2n cycle      62      136         198
Graduation    401     712         1113
Master        125     239         364
PhD           175     301         476
Total         783     1422        2205
1. Hypothesis statement
· H0: Education and Marital_Situation are independent
· Ha: Education and Marital_Situation are not independent
2. Analysis plan formulation: Chi-square test for independence
· Significance level: We test our hypothesis at the 5% significance level.
· Test method: We use the Chi-square test for independence to determine whether there is a
significant relationship between our two categorical variables.
The p-value in the above test is larger than 0.05, so we cannot reject H0: Education and
Marital_Situation are independent.
Conclusion: at the 5% significance level, there is no evidence that Education and
Marital_Situation are dependent.
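The test can be reproduced from the contingency table above with a short Python sketch. It computes the Pearson chi-square statistic from the observed counts and uses a closed-form tail probability that holds only for an even number of degrees of freedom (4 here). The helper `chi2_sf_even_dof` is a name introduced for this illustration; the project ran the test in SPSS.

```python
import math

# Observed counts from the table above: rows = diploma, columns = (Alone, In couple)
observed = {
    "Basic": (20, 34),
    "2n cycle": (62, 136),
    "Graduation": (401, 712),
    "Master": (125, 239),
    "PhD": (175, 301),
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total under independence
chi2 = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        expected = row_totals[r] * col_totals[c] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(col_totals) - 1)

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability of a chi-square variable; valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("the closed form used here requires an even dof")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(dof // 2))

p_value = chi2_sf_even_dof(chi2, dof)
ALPHA = 0.05
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
print("reject H0" if p_value < ALPHA else "fail to reject H0")
```

On these counts the p-value comes out well above 0.05, which agrees with the decision not to reject H0.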
V. Customer Segmentation: RFM Analysis and two-step cluster analysis algorithm using SPSS
1. RFM Analysis (Recency – Frequency – Monetary Analysis)

a. Define the Multivariate Technique to be used
Understanding the different types of variables is important when doing and reading research.
There are four main types:
1. Independent variable: a variable thought to be the cause of some effect. This term is
usually used in experimental research to describe a variable that the experimenter has
manipulated.
2. Dependent variable: a variable thought to be affected by changes in an independent
variable. You can think of this variable as an outcome.
3. Predictor variable: a variable thought to predict an outcome variable. This term is basically
another way of saying ‘independent variable’.
4. Outcome variable: a variable thought to change as a function of changes in a predictor
variable. The term is also synonymous with ‘dependent variable’.
Variables can also be split into categorical and continuous, and within these types there are
different levels of measurement:
· Categorical (entities are divided into distinct categories):
i. Binary variable: There are only two categories (e.g., dead or alive).
ii. Nominal variable: There are more than two categories (e.g., whether someone is an omnivore,
vegetarian, vegan, or fruitarian).
iii. Ordinal variable: The same as a nominal variable but the categories have a logical order (e.g.,
whether people got a fail, a pass, a merit or a distinction in their exam).
· Continuous (entities get a distinct score):
i. Interval variable: Equal intervals on the variable represent equal differences in the property
being measured (e.g., the difference between 6 and 8 is equivalent to the difference between
13 and 15).
ii. Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also
make sense (e.g., a score of 16 on an anxiety scale means that the person is, in reality, twice
as anxious as someone scoring 8). For this to be true, the scale must have a meaningful zero
point.
b. Dataset
After cleaning the data, we obtained 2205 transactions. We focused on consumers of the ABC
company only; each transaction represents one customer and contains the set of items
purchased by that customer. Clustering the customers so that those with similar buying
patterns fall in the same cluster is useful for:
1. Characterizing different customer groups.
2. Targeted marketing.
3. Predicting the buying patterns of new customers based on their profile.
Figure 1: The clean data set and variables view in SPSS.
With the above data, to gain a deeper understanding of customer behavior and
pinpoint the most valuable segments within our dataset, we employ RFM analysis prior
to clustering. This technique uses three key metrics (Recency, Frequency, and Monetary
value) to create more informative variables based on customers' past purchase patterns. By
identifying the most profitable customer segments, RFM analysis enables us to tailor our
segmentation strategy effectively.
Figure 2: RFM analysis for customer segmentation using SPSS (Direct Marketing > RFM
Analysis function).
Traditional "one-size-fits-all" marketing approaches fall short. Recognizing the power
of customer segmentation, we prioritize understanding distinct customer groups. Instead of
relying solely on broad demographics like age or location, we look deeper to uncover the
unique characteristics of each segment. This data-driven approach, exemplified by our
analysis of the ABC company's customers, allows us to craft relevant campaigns that
address each segment's specific needs and preferences.
Figure 3: RFM analysis: choosing format
Figure 4: RFM analysis from customer data.

For this analysis, “Recency” measures how recently a customer made a purchase,
counted in days since their last one. “AcceptCmpOverall” serves as Frequency: it tracks
customer engagement with campaigns, showing how many times they accepted an offer across
five campaigns. Finally, “Spending” serves as Monetary: it captures the total amount a
customer has spent on products.
Figure 5: Variables view from the RFM analysis.

· Recency_Score: the fewer days between a customer's last purchase and the reference date,
the higher the recency score, reflecting more recent activity
· Frequency_Score: shows how often a customer makes repeat purchases, with higher numbers
meaning they buy more often
· Monetary_Score: shows how much a customer spends, with higher numbers meaning they
spend more
· RFM_Score: the combined RFM score; customers with high RFM scores, indicating recent
activity, frequent purchases, and significant spending, tend to be the most vocal
advocates for the brand
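The scoring described above can be sketched in Python as follows. The report does not spell out the exact SPSS binning rule, so this sketch makes three assumptions: five bins per measure, tied values scored by their shared rank position so they land in the same bin, and the three scores combined as 100 x R + 10 x F + M. The customer rows are made up for illustration.

```python
import bisect

def quintile_scores(values, higher_is_better=True):
    """Score each value 1..5 by rank position; equal values get equal scores."""
    sorted_vals = sorted(values)
    n = len(values)
    scores = []
    for v in values:
        # Number of customers with a value <= v (ties share this count)
        count_le = bisect.bisect_right(sorted_vals, v)
        score = (5 * count_le + n - 1) // n  # integer ceiling of 5 * count_le / n
        scores.append(score if higher_is_better else 6 - score)
    return scores

# (customer id, Recency in days, AcceptCmpOverall, Spending): illustrative rows only
customers = [
    ("C1", 5, 3, 1500),
    ("C2", 40, 0, 120),
    ("C3", 12, 1, 640),
    ("C4", 90, 0, 35),
    ("C5", 25, 2, 980),
]

ids = [c[0] for c in customers]
# Fewer days since the last purchase means a higher recency score
r_scores = quintile_scores([c[1] for c in customers], higher_is_better=False)
f_scores = quintile_scores([c[2] for c in customers])
m_scores = quintile_scores([c[3] for c in customers])

# Assumed combination rule: the three digits are concatenated as R, F, M
rfm = {cid: 100 * r + 10 * f + m
       for cid, r, f, m in zip(ids, r_scores, f_scores, m_scores)}

for cid in ids:
    print(cid, rfm[cid])
```

In this toy data, the two customers who accepted no campaign offers receive the same Frequency score, mirroring the effect of assigning ties to the same bin.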
Because we chose "Assign ties to the same bin" in the binning section, duplicate values are
assigned to the same bin. Our results are as follows:
Output: The heat map of mean monetary distribution shows the average monetary
value for categories defined by recency and frequency scores. Darker (darker blue) areas
indicate a higher average monetary value. In other words, customers with recency and
frequency scores in the darker areas tend to spend more on average than those with recency
and frequency scores in the lighter areas.
On the vertical axis, 'Recency' scores tell us how long it has been since a customer's last
visit. Higher scores indicate recent visits, while lower scores mean longer gaps between
purchases.
The horizontal axis tracks 'Frequency', with lower values reflecting infrequent visits
and higher values signifying frequent market visits.
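The quantity behind each cell of such a heat map is simply the mean spending of the customers sharing that pair of scores. A minimal Python sketch, using made-up scored rows rather than the project's data:

```python
from collections import defaultdict

# (Recency_Score, Frequency_Score, Spending): illustrative rows, not project data
scored = [
    (5, 5, 1500.0), (5, 5, 1300.0), (5, 2, 400.0),
    (3, 4, 900.0), (3, 4, 700.0), (1, 1, 40.0), (1, 1, 60.0),
]

totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of spending, customer count]
for r, f, spend in scored:
    totals[(r, f)][0] += spend
    totals[(r, f)][1] += 1

# Mean monetary value per (recency score, frequency score) cell
mean_monetary = {cell: s / k for cell, (s, k) in totals.items()}

# Print as a grid: rows = recency score (high to low), columns = frequency score 1..5
for r in range(5, 0, -1):
    row = [mean_monetary.get((r, f)) for f in range(1, 6)]
    print(r, ["-" if m is None else f"{m:.0f}" for m in row])
```

Cells with no customers stay empty, which is how blank squares arise in the heat map when ties are assigned to the same bin.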
So, we can see that the target customer persona is someone who shops frequently, spends
generously, and visits the market regularly. This insight guides our segmentation strategy
towards those with higher overall RFM scores.
Our next phase in the case design involves implementing the chosen multivariate
analysis. We've opted for the two-step cluster algorithm in SPSS, and we will elaborate on
this in the following section.
2. Two-step cluster analysis algorithm:
Two-step cluster analysis offers distinct advantages over K-means clustering for
identifying meaningful groups within data. It works in two stages: it first groups records
into many small pre-clusters, and then merges those pre-clusters using a hierarchical method.
This hybrid strategy is well suited to uncovering naturally occurring segments, and it can
reveal relationships between individuals who share multiple distinct characteristics.
Notably, it handles categorical and continuous variables together and does not require us to
standardize the variables manually beforehand, a common prerequisite for K-means. Finally,
it scales to large datasets, making it well suited to contemporary data-rich
environments.
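The two-stage idea can be illustrated with a deliberately simplified Python sketch. It is not the SPSS algorithm: SPSS TwoStep builds a cluster feature tree, can use a log-likelihood distance for mixed data, and can choose the number of clusters automatically, whereas this sketch assumes numeric inputs, Euclidean distance, a fixed distance threshold, and a fixed number of final clusters. All function names and data points are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(sub):
    return [v / sub["n"] for v in sub["sum"]]

def precluster(points, threshold):
    """Stage 1: a single pass over the data, building small sub-clusters."""
    subs = []  # each sub-cluster keeps a coordinate sum, a count, and member indices
    for idx, p in enumerate(points):
        best, best_d = None, threshold
        for s in subs:
            d = dist(p, centroid(s))
            if d <= best_d:
                best, best_d = s, d
        if best is None:
            # No existing sub-cluster is close enough: start a new one
            subs.append({"sum": list(p), "n": 1, "members": [idx]})
        else:
            best["sum"] = [a + b for a, b in zip(best["sum"], p)]
            best["n"] += 1
            best["members"].append(idx)
    return subs

def merge_subclusters(subs, k):
    """Stage 2: repeatedly merge the two closest sub-clusters until k remain."""
    subs = list(subs)
    while len(subs) > k:
        pairs = [(i, j) for i in range(len(subs)) for j in range(i + 1, len(subs))]
        i, j = min(pairs, key=lambda ij: dist(centroid(subs[ij[0]]),
                                              centroid(subs[ij[1]])))
        a, b = subs[i], subs[j]
        merged = {
            "sum": [x + y for x, y in zip(a["sum"], b["sum"])],
            "n": a["n"] + b["n"],
            "members": a["members"] + b["members"],
        }
        subs = [s for pos, s in enumerate(subs) if pos not in (i, j)] + [merged]
    return subs

# Illustrative, already-scaled 2-D points forming three visible groups
points = [
    (0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
    (0.9, 0.8), (0.85, 0.9), (0.95, 0.85),
    (0.1, 0.9), (0.2, 0.95),
]

subs = precluster(points, threshold=0.12)
clusters = merge_subclusters(subs, k=3)
labels = sorted(sorted(c["members"]) for c in clusters)
print(len(subs), "pre-clusters ->", labels)
```

The first pass touches each record once, and only the much smaller set of pre-clusters enters the costlier hierarchical merge, which is what lets the approach cope with large datasets.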
Figure 6: Feeding the variables into the algorithm
Figure 7: Silhouette measure of cohesion and separation
Figure 8: Cluster size
Figure 9: Predictor Importance
· Results:
In this section we present our final results and readings from the cluster segments.
Cluster 1 (30.5%): This is the largest customer group, accounting for 30.5% of the total
customer base. This group has low activity, average value, average loyalty, and low risk. These
could be new customers or customers who are in the early stages of their relationship.
Cluster 4 (23.1%): This is a medium-sized customer group, accounting for 23.1% of the total
customer base. This group has medium activity, low value, high loyalty, and high risk. These
could be loyal customers who do not generate much value for the business.
customer base. This group has medium activity, high value, average loyalty, and high risk.
This group could be potential customers with the potential to generate a lot of value for the
business.
customer base. This group has high activity, average value, low loyalty, and medium risk.
This group could be new customers with the potential to generate a lot of value for the
business.
Cluster 3 (9.2%): This is the smallest customer group, accounting for 9.2% of the total
customer base. This group has high activity, low value, high loyalty, and high risk. These
could be loyal customers who do not generate much value for the business.
· Summary
For the group of new customers or customers who are in the early stages of their relationship:
Goal: Stimulate the activity of this customer group.
Strategies:
+ Send attractive promotions or offers to attract this customer group to shop more.
+ Optimize the online shopping experience to make it easier for customers to find the
products or services they are interested in.
+ Strengthen customer service to solve customer problems and build relationships with them.
For the group of loyal customers who do not generate much value for the business:
Goal: Expand the market scale.
Strategies:
+ Develop new marketing programs to attract potential customers.
+ Strengthen the company's presence on social media channels.
+ Collaborate with other businesses to reach new markets.
For the group of potential customers who could generate a lot of value for the business:
Goal: Increase the loyalty of this customer group.
Strategies:
+ Build a loyalty program to encourage this customer group to shop more and receive special
benefits.
+ Strengthen customer service to make customers feel cared for and valued.
+ Send marketing messages that are relevant to the needs and preferences of customers in this
cluster.
· Conclusion:
RFM analysis, a data-driven method, serves as both a customer segmentation technique and a
basis for strategic marketing decisions. Our research employs RFM values (recency,
frequency, and monetary) to segment customers into cohesive groups, allowing us to tailor
marketing strategies based on their purchasing behaviors.
Reference:
Jonathan P. Pinder. Introduction to Business Analytics Using Simulation. School of Business,
Wake Forest University, Winston-Salem, NC, United States.