
Assignment 8

BIA 7 – IIMB

Priyanka Sindhwani

4/25/2017
Question 1: What strategy should Vijaya Kumar adopt for developing a forecasting model for
demand estimation of 20,000 spare parts?

The spare parts are categorized as fast-, medium-, and slow-moving, and are further
classified by the revenue they generate.

The items that contribute most to sales, are higher in value, and are fast moving should be
targeted first, so forecasting for those items should be taken up first. Items whose demand
fluctuates heavily and cannot be stationarized should not be picked up initially, as we will not
be able to build a good forecasting model for them.

Question 2: Develop forecasting models for data provided in the Excel sheet titled ‘‘L&T
Spare Parts Forecasting’’ and discuss the choice for using a particular forecasting model.

Choosing Data set 1: “L&T Spare Parts Forecasting”

The following steps were followed:

 Take the data set and check for stationarity.

 As the data is non-stationary, difference it (difference = 2) to achieve stationarity.

 ACF and PACF plots along with adf.test to confirm stationarity.

Augmented Dickey-Fuller Test

data: difference
Dickey-Fuller = -10.305, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
The data can thus be stationarized by differencing. We fit a seasonal ARIMA model on the
log-transformed series; the order c(2, 1, 2) below applies one order of differencing:
Call:
arima(x = log(ITEM1), order = c(2, 1, 2), seasonal = list(order = c(1, 0, 1)))

Coefficients:
ar1 ar2 ma1 ma2 sar1 sma1
-0.1756 -0.7045 -0.3018 0.9442 0.3896 -0.0741
s.e. 0.1377 0.1402 0.1148 0.1763 0.5946 0.6401

sigma^2 estimated as 0.02693: log likelihood = 16.13, aic = -18.26
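This fit can be reproduced with a minimal sketch along these lines (assuming ITEM1 is the
monthly sales series for item 1 read from the Excel sheet, stored as a ts with frequency = 12;
the names are illustrative):

library(tseries)    # adf.test()
library(forecast)   # accuracy()

# check stationarity of the differenced log series
adf.test(diff(log(ITEM1)))

# seasonal ARIMA on the log scale, as in the output above
Model1 <- arima(log(ITEM1), order = c(2, 1, 2),
                seasonal = list(order = c(1, 0, 1)))

# in-sample accuracy (MAPE etc.) on the original scale
fitted_vals <- exp(log(ITEM1) - residuals(Model1))
accuracy(fitted_vals, ITEM1)

# 5-month-ahead forecast, back-transformed, with approximate 95% bounds
fc <- predict(Model1, n.ahead = 5)
cbind(point = exp(fc$pred),
      lo95  = exp(fc$pred - 1.96 * fc$se),
      hi95  = exp(fc$pred + 1.96 * fc$se))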

Checked for accuracy


MAPE: 6.8

Residuals plot:

Prediction:

        Point Forecast    Lo 95      Hi 95
May-13        387.9071   280.861   535.7529
Jun-13        360.8679   250.6561  519.5391
Jul-13        348.5916   221.3342  549.0157
Aug-13        315.7873   176.7107  564.3209
Sep-13        324.2867   170.6734  616.159
Observed vs fitted plot and forecast plot

H0: The data are independently distributed.
Ha: The data are not independently distributed; they exhibit serial correlation.

The Ljung-Box (or Box-Pierce) test:

Box-Ljung test

data: resid(Model1)
X-squared = 6.2612, df = 10, p-value = 0.7929

As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
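In R this check is a one-liner (Model1 as fitted above; lag = 10 matches df = 10 in the output):

Box.test(resid(Model1), lag = 10, type = "Ljung-Box")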
Choosing Data set 2: “L&T Spare Parts Forecasting”

 Take the data set and check for stationarity.

 As the data is non-stationary, difference it (difference = 1) to achieve stationarity.

 ADF test to confirm stationarity:

Augmented Dickey-Fuller Test

data: difference
Dickey-Fuller = -14.185, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

As the p-value < 0.05, we reject the null hypothesis: the differenced series is stationary.

Model:
Series: log(ITEM1)
ARIMA(2,1,3)(1,0,1)[12]

Coefficients:
ar1 ar2 ma1 ma2 ma3 sar1 sma1
-1.3387 -0.9938 0.5009 -0.2139 -0.8754 0.8697 -0.5095
s.e. 0.0531 0.0182 0.2918 0.2230 0.3509 0.2575 0.5084

sigma^2 estimated as 0.02462: log likelihood=18.94


AIC=-21.88 AICc=-18.19 BIC=-6.91
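Output in this “Series: / AIC / AICc / BIC” format comes from the forecast package; a sketch
of an equivalent fit (assuming the same series name as before):

library(forecast)
Model1 <- Arima(log(ITEM1), order = c(2, 1, 3),
                seasonal = list(order = c(1, 0, 1), period = 12))
summary(Model1)          # coefficients, sigma^2, AIC/AICc/BIC and training MAPE
forecast(Model1, h = 5)  # forecasts on the log scale; back-transform with exp(),
                         # or fit Arima(ITEM1, ..., lambda = 0) to forecast on the
                         # original scale directly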

Accuracy:
MAPE: 8.83
Month Lower Point Forecast Upper
May 2013 220.9346 302.2483 413.4890
Jun 2013 184.0721 252.4793 346.3088
Jul 2013 175.8134 241.3509 331.3185
Aug 2013 164.6951 227.1128 313.1862
Sep 2013 154.5706 213.4543 294.7696

Residual plot and test

H0: The data are independently distributed.
Ha: The data are not independently distributed; they exhibit serial correlation.

The Ljung-Box (or Box-Pierce) test:

Box-Ljung test

data: resid(Model1)
X-squared = 7.0564, df = 10, p-value = 0.7201

As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.

Observed vs fitted plot and forecast plot


Choosing Data set 3: “L&T Spare Parts Forecasting”

The following steps were followed:

 Take the data set and check for stationarity.

For this item we use Holt-Winters (HW) exponential smoothing, due to its good fit:


model <- hw(ITEM1, initial = "optimal", h = 12, alpha = NULL, beta = NULL, gamma = NULL,
            seasonal = c("additive"), level = c(.95))

Model Information:
Holt-Winters' additive method

Call:
hw(x = ITEM1, h = 12, seasonal = c("additive"), level = c(0.95),
   initial = "optimal", beta = NULL, gamma = NULL, alpha = NULL)
Smoothing parameters
alpha = 1e-04
beta = 1e-04
gamma = 0.0072
Initial states
l = 370.8672
b = 3.7651
s=113.2595 -43.1134 3.3791 41.3779 2.2251 -72.807
11.0039 -39.5464 -26.1081 62.2355 6.7286 -58.6345
sigma: 57.4596
AIC AICc BIC
619.7052 636.7052 649.9743
Error measure
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.2049926 57.45958 42.62411 -1.284119 9.578407 0.5958036 -0.01930456
Forecasts:
Point Forecast Lo 95 Hi 95
May 2013 565.8691 453.2504 678.4878
Jun 2013 624.6372 512.0185 737.2559
Jul 2013 540.6159 427.9972 653.2346
Aug 2013 531.3781 418.7594 643.9969
Sep 2013 584.6043 471.9856 697.2231
Oct 2013 505.2144 392.5956 617.8332
Nov 2013 583.4437 470.8249 696.0625
Dec 2013 626.7479 514.1290 739.3667
Jan 2014 592.6612 480.0423 705.2801
Feb 2014 551.0512 438.4322 663.6702
Mar 2014 709.0731 596.4540 821.6922
Apr 2014 543.0636 430.4404 655.6867

MAPE: 9.5

Observed vs fitted and residual plots


H0: The data are independently distributed.
Ha: The data are not independently distributed; they exhibit serial correlation.

The Ljung-Box (or Box-Pierce) test:

Box-Ljung test

data: resid(model)
X-squared = 6.1255, df = 10, p-value = 0.8046

As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 4: “L&T Spare Parts Forecasting”

The following steps were followed:

 Take the data set and check for stationarity.

ACF and PACF plots (the data can also be stationarized with a difference of 1):

Augmented Dickey-Fuller Test

data: ITEM1
Dickey-Fuller = -7.7298, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
Model:

Series: ITEM1
ARIMA(0,1,4)(1,1,1)[12]

Coefficients:
ma1 ma2 ma3 ma4 sar1 sma1
-0.9629 -0.0397 -0.0776 0.3997 -0.0628 -0.9997
s.e. 0.1660 0.2086 0.3390 0.2748 0.2569 0.9762

sigma^2 estimated as 6333: log likelihood=-215.59


AIC=445.18 AICc=449.18 BIC=456.26

MAPE: 8.5
Point Forecast Lo 95 Hi 95
May 2013 850.1776 674.8568 1025.4984
Jun 2013 598.4989 423.0542 773.9437
Jul 2013 803.1488 627.8436 978.4541
Aug 2013 807.7244 631.9686 983.4803
Sep 2013 616.4064 431.9856 800.8272

Observed vs fitted, and residual plots

Box-Ljung test

data: resid(Model1)
X-squared = 10.55, df = 10, p-value = 0.3936
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 5: “L&T Spare Parts Forecasting”

The following steps were followed:

Take the data set and check for stationarity.

Stationarize the data to remove the seasonal component.

ACF and PACF plots:

Augmented Dickey-Fuller Test

data: ITEM1
Dickey-Fuller = -5.5116, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

As the p-value < 0.05, we reject the null hypothesis: the series is stationary.
Model
ARIMA(0,1,2)(1,1,1)[12]

Coefficients:
ma1 ma2 sar1 sma1
-0.9401 0.1894 -0.328 -0.7643
s.e. 0.2113 0.2172 0.392 1.4232

sigma^2 estimated as 21712: log likelihood=-236.86


AIC=483.71 AICc=485.71 BIC=491.63

MAPE: 14.69

Forecast:
Point Forecast Lo 95 Hi 95
May 2013 675.0444 372.4832 977.6056
Jun 2013 563.3441 261.3652 865.3230
Jul 2013 566.6540 255.5933 877.7148
Aug 2013 560.3258 240.4409 880.2107
Sep 2013 480.4308 151.9588 808.9028

Forecast plot, residual plot and observed vs fitted:


Forecast plot
Box-Ljung test

data: resid(Model1)
X-squared = 6.8745, df = 10, p-value = 0.7372

As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.

Choosing Data set 6: “L&T Spare Parts Forecasting”

The following steps were followed:

Take the data set and check for stationarity.


Model:

Model1
Series: ITEM1
ARIMA(2,2,1)(1,1,1)[12]

Coefficients:
ar1 ar2 ma1 sar1 sma1
-0.5983 -0.5790 -0.9942 -0.0728 -0.5472
s.e. 0.1403 0.1471 0.1084 0.5797 0.7806

sigma^2 estimated as 386: log likelihood=-159.67


AIC=331.34 AICc=334.34 BIC=340.67
> acc<-accuracy(Model1)
> acc
ME RMSE MAE MPE MAPE MASE ACF1
Training set -1.68527 15.37225 9.934065 -13.32776 29.17808 0.6831978 -0.1452887

Point Forecast Lo 95 Hi 95
May 2013 48.82538 9.200454 88.45031
Jun 2013 30.28432 -13.234616 73.80325
Jul 2013 56.92338 12.241527 101.60523
Aug 2013 73.82233 20.312473 127.33218
Sep 2013 46.75125 -12.056058 105.55855

Residual plot, forecast and observed vs fitted


Box-Ljung test

data: resid(Model1)
X-squared = 13.157, df = 10, p-value = 0.215

As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.

Question 3. Which forecasting techniques should L&T use to forecast different spare items?

Answer 3. For most of the spare parts, seasonal ARIMA predicts well, with less than 10% error
(MAPE), so that methodology can be used. For items it fits poorly, Holt-Winters smoothing (as
used for Data set 3) is a good alternative.
PART A – CLUSTERING

Question 1: List and derive the metrics that can be used in ‘‘hierarchical clustering’’ and
‘‘partition around medoids’’ clustering algorithms.

Answer 1:

Only numeric variables were used; the variables derived for the clustering algorithms were as follows:

 Total sales in INR


 Total cost in INR
 Total discount
 Markdown sensitivity
 Profit by area

Q2. Do you find outliers in the derived data from Q1? If yes, how can the same be treated for
use in cluster modeling?

Answer 2. Yes, there were outliers in the data derived in Q1.

First we tried to combine the outliers into a cluster of their own, but the dendrogram showed
this was not possible, so we decided to remove the outliers for this exercise.

Q3. Develop a hierarchical clustering model with the modified data from Q2. How many
clusters seem appropriate? Justify

Answer 3.

Hierarchical Clustering:

Reading the data set and selecting the numeric variables mentioned above.

Removing outliers: the top 3 outliers were removed.

Elbow plot: to determine the optimum number of clusters.


Based on the dendrogram and the elbow plot, the optimum number of clusters is 3; therefore it
is ideal to have 3 clusters.
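A minimal sketch of this workflow (assuming df is the data frame of the five derived variables
with the three outliers already removed; the Ward linkage is an assumption):

df_scaled <- scale(df)               # standardize so no variable dominates the distance
d  <- dist(df_scaled)                # Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")  # hierarchical clustering (assumed linkage)
plot(hc)                             # dendrogram, used to inspect outliers / cluster structure
clusters <- cutree(hc, k = 3)        # cut into the 3 clusters chosen above
table(clusters)                      # cluster sizes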
Q4. Develop a partition around medoids clustering model with the modified data from Q2.
What are the advantages of using partitioning around medoids (PAM) over K-means?
How do you decide on the appropriate number of clusters in this scenario?
     Total.sales.in.INR  Total.discount  Total.cost  Markdown.sensitivty  profitbyarea
87           -0.2507675      -0.4011870  -0.2730098           -0.1195077   -0.08538677
161          -1.0164227      -0.9291651  -1.0128436           -0.1195077   -0.86135776
123           1.6491207       1.6081255   1.7261156           -0.1195077    1.52198171
141           0.4894637       0.9023874   0.6192578           -0.1195077   -0.14081327

Advantage:

PAM (k-medoids) chooses its centres (medoids) by minimizing the absolute distance between the
points and the selected centre, rather than minimizing the squared distance. As a result, it is
more robust to noise and outliers than k-means.

K-medoids is more flexible

First of all, you can use k-medoids with any similarity measure. K-means, however, may fail to
converge; it should only be used with distances that are consistent with the mean. For example,
absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.

Robustness of the medoid

Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is
also k-medians, which is like k-means but for Manhattan distance). The literature on the median
gives plenty of explanations and examples of why the median is more robust to outliers than the
arithmetic mean; essentially, these also hold for the medoid. It is a more robust estimate of a
representative point than the mean used in k-means.

The appropriate number of clusters is decided based on the distance from the centre, which we
can determine with an elbow/scree plot.
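A sketch of the PAM fit (cluster package; df as above):

library(cluster)
pam_fit <- pam(scale(df), k = 3)   # k-medoids on the standardized variables
pam_fit$medoids                    # the representative observations (medoids)
plot(silhouette(pam_fit))          # silhouette widths per cluster; refitting for several
                                   # k and comparing average widths is one way to choose k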
5. Validate the goodness of resulting clusters from hierarchical and PAM models obtained in
Q3 and Q4. Which is a better model as per validation measures?

Validating results
Clustering Methods:
hierarchical kmeans pam

Cluster sizes:
3 4 5

Validation Measures:

                           3        4        5
hierarchical Connectivity  8.0829   8.9758  12.8647
             Dunn          0.2461   0.2461   0.2089
             Silhouette    0.5567   0.5459   0.5350
kmeans       Connectivity  7.6516  22.2611  27.8167
             Dunn          0.1391   0.1269   0.1051
             Silhouette    0.5558   0.5313   0.5294
pam          Connectivity  0.3778   4.2667  31.9274
             Dunn          0.1470   0.1620   0.0181
             Silhouette    0.4960   0.5101   0.4073

Optimal Scores:

              Score   Method        Clusters
Connectivity  0.3778  pam           3
Dunn          0.2461  hierarchical  3
Silhouette    0.5567  hierarchical  3

Connectivity should be minimized, while both the Dunn index and the silhouette width should be
maximized.

Thus hierarchical clustering scores best on the Dunn index and silhouette width, while PAM
achieves the best (lowest) connectivity; overall, hierarchical clustering with 3 clusters is the
better model on the majority of the validation measures.
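The validation output above matches the clValid package; a sketch (assuming df as before):

library(clValid)
cv <- clValid(scale(df), nClust = 3:5,
              clMethods = c("hierarchical", "kmeans", "pam"),
              validation = "internal")
summary(cv)   # Connectivity, Dunn and Silhouette per method and k, plus optimal scores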

PART B – TIME SERIES FORECASTING

7. Conduct exploratory data analysis on “Cluster=1, Brand=CRESCENT SET, Brick=CKD”


combination to identify the following relationships. Give a short description about the
relationships observed.

Answer 7

a. Relationship between Sales units(sales_units) & Discount % (discount_per)


[Chart: "Sales Unit vs discount per": weekly sales_units (left axis, 0 to 3500) and
discount_per (right axis, 0 to 0.5), from week 9.2013 to week 13.2015]

As expected, a higher discount increases the number of units sold, although at the beginning and
end of every year sales drop even when there is a discount.

This shows it is a seasonally selling product, and thus not in high demand in the winter.

b. Relationship between Sales units & Net Price (per_unit_netprice)

[Chart: "Sales Unit vs per_unit_netprice": weekly sales_units (left axis, 0 to 3500) and
per_unit_netprice (right axis, 0 to 1800), from week 9.2013 to week 9.2015]

As can be seen, net price is inversely related to sales units: when the net price of the product
increases, sales go down, which is clearly visible above.

c. Relationship between Sales units & Age (age)

Correlation: -0.5842883. Age and sales are negatively correlated; the older the product, the
lower its sales are likely to be.
8. What is over-fitting and under-fitting in the context of regression models? What are the
consequences of over-fitting?

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the
underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm
does not fit the data well enough.

Overfitting occurs when a statistical model or algorithm captures the noise of the data, i.e.
when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the
model or algorithm shows low bias but high variance.

Consequences of overfitting

 The patterns learned from noise do not apply to new data and negatively impact the model's
ability to generalize.
 An overfit model is one that is too complicated for the data set. When this happens, the
regression model becomes tailored to fit the quirks and random noise of the specific sample
rather than reflecting the overall population.

9. Explain why we have to partition the time series data before building the forecasting
model. Use data for “Cluster=2, Brand=BLINK, Brick=HAREMS” and partition this time
series data as explained below.

i. Consider all weeks until 51st week (including) of 2014 as training data.
ii. Consider weeks from 52nd week of 2014 to 3rd week of 2015 as test data.

Answer 9.
 Data partitioning is a necessary step: the basic idea is to separate the available data into
a training set and a testing (or validation) set.
 This is primarily because we want evidence that the model not only fits the "seen" (training)
data well but also predicts held-out data, so that we have some level of confidence in its
predictive power on genuinely unseen data.
 For cross-sectional data analysis, we usually take great care to make sure that the training
and testing samples are randomly chosen, or, in the case of unbalanced data, carefully chosen.
For time series data, the split must instead respect time order, as below.
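A sketch of the partition for this series (the start week, 9 of 2013, is an assumption based on
the plots above; sales_units is the weekly series for Cluster=2, Brand=BLINK, Brick=HAREMS):

TS <- ts(sales_units, start = c(2013, 9), frequency = 52)

TS_train <- window(TS, end = c(2014, 51))                      # training: through 51st week of 2014
TS_test  <- window(TS, start = c(2014, 52), end = c(2015, 3))  # test: 52nd week 2014 to 3rd week 2015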

10. Develop a time series forecast model using regression on the training data to forecast
sales units for “Cluster=2, Brand=BLINK, Brick=HAREMS” combination using the below
variables as predictors.
a. Lag 1 (i.e. immediate previous week) of sales units
b. Discount %
c. Lag 1 (i.e. immediate previous week) of discount %
d. Promotion week flag
e. Age
Apply appropriate transformations and evaluate the model fit.

Answer 10.
As the data was not stationary, we applied a log transformation to sales_units.
The model developed was as follows:
Call:
lm(formula = log(TS_train) ~ as.factor(promo_week_flg) + age +
log(salenew) + disnew + discount_per, data = train_dataset)

Residuals:
Min 1Q Median 3Q Max
-1.24444 -0.15463 0.01581 0.18425 1.14370

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.31963 0.41381 5.606 2.35e-07 ***
as.factor(promo_week_flg)1 0.34650 0.12500 2.772 0.006800 **
age -0.01392 0.00346 -4.022 0.000121 ***
log(salenew) 0.53851 0.08338 6.458 5.64e-09 ***
disnew -1.90192 0.43924 -4.330 3.94e-05 ***
discount_per 2.25105 0.44093 5.105 1.89e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3747 on 88 degrees of freedom


(1 observation deleted due to missingness)
Multiple R-squared: 0.8581, Adjusted R-squared: 0.8501
F-statistic: 106.5 on 5 and 88 DF, p-value: < 2.2e-16
The model has an adjusted R² of 0.85 and an overall F-test p-value below 0.05; thus the model is
statistically significant.

11. Perform checks to ensure that the model is valid and assumptions of regression are
met. Conduct appropriate statistical tests and back the findings by visual examination.
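A hedged sketch of such checks in R (assuming fit is the lm object from Answer 10):

par(mfrow = c(2, 2))
plot(fit)                      # residuals vs fitted, normal Q-Q, scale-location, leverage

Box.test(residuals(fit), lag = 10, type = "Ljung-Box")  # no serial correlation in residuals
shapiro.test(residuals(fit))                            # normality of residuals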
12. Based on the model result, explain the following:
a. How is this forecasting model able to account for trend and seasonality?
b. What is price elasticity and determine the price elasticity value (a proxy representing
price elasticity is enough) from the model output?
c. How do you interpret the coefficient of promotion week flag variable?

Answer 12.

a. Trend is partly carried by the age and lagged sales-unit predictors. Seasonality is difficult
to account for with this training data, as the data does not cover one complete year of overlap.

b. Price elasticity is a measure of the effect of a price change (or a change in the quantity
supplied) on the demand for a product or service. As a proxy:

elasticity ≈ change in sales units / change in price

week  sales_units  per_unit_discountprice  discount_per  promo_week_flg  dis_lag  sale_lag
47    117          148.5085                0.64          1               0.62     119
48    56           109.8427                0.65          1               0.63     117
49    80           121.0075                0.66          1               0.64     124
50    34           101.0529                0.67          1               0.61     96
Let the base price be 450, with all discounts applied to the base price. In week 49 the discount
is 0.66; in week 48 it is 0.65.

Change in sales units = (sales in week 49 − sales in week 48) = (80 − 56) = 24

Change in price = (base price × discount of week 49 − base price × discount of week 48)
= (450 × 0.66 − 450 × 0.65) = 297 − 292.5 = 4.5

Elasticity proxy = 24 / 4.5 ≈ 5.3
c. The promotion week flag is a factor variable: it is 1 if a promotion ran that week and 0
otherwise. Its coefficient means that, holding the other predictors fixed, log(sales units) is
higher by 0.3465 in a promotion week; equivalently, sales are about exp(0.3465) ≈ 1.41 times
higher than in a non-promotion week.

13.How do you check if the forecasting model is able to explain most of the important
features of the time series? Explain white noise in the context of time series.

To check this, the following questions should be covered:

 Is there a trend, meaning that, on average, the measurements tend to increase (or
decrease) over time?
 Is there seasonality, meaning a regularly repeating pattern of highs and lows related
to calendar time, such as seasons, quarters, months, days of the week, and so on?
 Are there outliers? In regression, outliers are far away from your line; with time series data,
outliers are far away from your other data.
 Is there a long-run cycle or period unrelated to seasonality factors?
 Is there constant variance over time, or is the variance non-constant?
 Are there any abrupt changes to either the level of the series or the variance?

White Noise:

A white noise process is one with mean zero and no correlation between its values at different
times.

Consider a time series {w_t : t = 1, ..., n}. If the elements of the series, w_i, are
independent and identically distributed (i.i.d.) with a mean of zero, variance σ², and no serial
correlation (i.e. Cor(w_i, w_j) = 0 for all i ≠ j), then we say that the time series is discrete
white noise (DWN).

In particular, if the values w_t are drawn from a normal distribution, i.e. w_t ∼ N(0, σ²),
then the series is known as Gaussian white noise.
14. Using the forecast model built, generate sales units forecast for test period (52nd week
of 2014 to 3rd week of 2015).
a. Assess the forecast model accuracy on the test time period which is not used for modeling
by calculating MAPE for the test period.

Answer 14. The forecast sales units for the test period are as follows:


fit lwr upr
1 36.36765 28.19258 46.91326
2 29.80780 22.56843 39.36937
3 33.35176 25.24706 44.05821
4 25.76207 19.22271 34.52605

The MAPE for this is computed below:

week     Lag1_Sales_Units  Discount_per  Lag1_Discount_per  Promotion_week_flag  Age  Actual sales  % error
52.2014  48                0.569002728   0.579304707        1                    96   45            -0.06667
1.2015   36                0.493558044   0.569002728        1                    97   44             0.173463
2.2015   30                0.491279893   0.493558044        1                    98   36             0.172006
3.2015   33                0.428838745   0.491279893        1                    99   30            -0.11173

MAPE: 4% (this is with lagged actuals as inputs)

Without lag, the MAPE will be:

week     Actual_Sales_Units  Predicted  (Actual − Predicted) / Actual
52.2014  45                  36.3677    0.19183
1.2015   44                  29.8078    0.32255
2.2015   36                  33.3518    0.073562222
3.2015   30                  25.7621    0.141264333

MAPE: 18%
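A sketch of how these test-period forecasts and the MAPE can be produced (assuming fit is the
regression from Answer 10 and test_dataset holds the four test weeks; predictions are
back-transformed because the model is on the log scale):

pred <- exp(predict(fit, newdata = test_dataset, interval = "prediction"))
pred                                        # fit / lwr / upr, as in the table above

actual <- c(45, 44, 36, 30)                 # actual sales in the four test weeks
mean(abs(actual - pred[, "fit"]) / actual)  # MAPE, approx. 0.18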
PART C – OPTIMIZATION
15. Formulate an optimization model and solve it to determine the optimal discount % to
be given for “Cluster=2, Brand=BLINK, Brick=HAREMS” combination for each of the 4 weeks
of EOSS

Objective Function (606 is read as the full unit price, 2476 as the available stock, and unsold
units are valued at 40% of full price):

Max [ (606 − 606·D1)·S1 + (606 − 606·D2)·S2 + (606 − 606·D3)·S3 + (606 − 606·D4)·S4
      + (2476 − S1 − S2 − S3 − S4) × (0.4 × 606) ]

Decision variables
Weekly sales units: S1, S2, S3, S4
Weekly discounts: D1, D2, D3, D4

Constraints
Sales constraints (cumulative sales cannot exceed available stock):
S1 <= 2476
S1 + S2 <= 2476
S1 + S2 + S3 <= 2476
S1 + S2 + S3 + S4 <= 2476

Discount constraints (the discount starts at 57.9% and cannot decrease from week to week):
D1 >= 0.579
D2 − D1 >= 0
D3 − D2 >= 0
D4 − D3 >= 0
D1, D2, D3, D4 <= 0.6
Excel Solver output:
Objective function value: 602,000.4
Sales: S1 = 63, S2 = 33, S3 = 29, S4 = 28
Discounts: D1 = 0.579, D2 = 0.6, D3 = 0.6, D4 = 0.6
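A sketch encoding this objective in R (the actual optimization was done in Excel Solver; the
interpretation of 606, 2476, and 0.4 × 606 follows the formulation above):

revenue <- function(D, S) {
  sum((606 - 606 * D) * S) +       # revenue from units sold at discounted prices
    (2476 - sum(S)) * 0.4 * 606    # leftover stock valued at 40% of full price
}

# evaluating at (roughly) the Solver optimum reported above; the result is close to
# 602,000.4, with the exact value depending on the unrounded Solver decision values
revenue(D = c(0.579, 0.60, 0.60, 0.60), S = c(63, 33, 29, 28))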
16. What are the weekly forecasted sales units if the optimal discounts identified are
implemented for the EOSS?

Answer 16. S1 = 63, S2 = 33, S3 = 29, S4 = 28.

17. We know that the actual revenue realized by the retailer for the “Cluster=2, Brand=BLINK,
Brick=HAREMS” combination during the 4 weeks of EOSS is INR 41,320. What is the incremental lift
in revenue the retailer would have achieved in these 4 weeks if he/she had implemented our
analytics solution instead?

Answer 17.
Revenue under the optimized discounts: 602,000.4
Actual revenue realized: 41,320
Incremental lift in revenue: 560,680

Question 3: Read the case titled, “Machine Learning Algorithms to Drive CRM in the Online
E-Commerce Site at VMWare”, and answer the following questions:

Problem definition

1. Outline the business problem and how it can be converted to an analytics problem.

Answer 1.

Business problem

Large revenue is generated every year from customers upgrading to the newer version of
Workstation. As VMWare is not launching a new version of Workstation this year, it wants to tap
the untapped customer base: target new customers, upsell to existing customers, and cross-sell
to customers who do not yet have Workstation.

Data analytics problem

Classify users into those likely to make a purchase and those who will not, by studying consumer
data collected offline and online and linking it through email addresses, and rank them in order
of purchase propensity.

Sampling

2. What is the right cross-validation strategy for this problem? What would happen if we
choose random sampling in this scenario or stratified sampling scenario?
Answer 2

In the current case we should ideally be doing time-based cross-validation: we simulate the real
world by aggregating data up to a period and then predicting for the next period.

 Simple random sampling involves the random selection of data from the entire population, so
that each possible sample is equally likely to occur.
 Stratified random sampling divides the population into smaller groups, or strata, based on
shared characteristics.

In the current problem, we should go ahead with stratified sampling for the splits, as we need
to target an audience that meets only certain behavioural characteristics.

3. What could be the training data and validation datasets for the model? How should we
go about choosing that and what should be the reasons for the same?

Answer 3.
For training, we would aggregate data up to September 2015 and predict the Workstation buyers
during Oct–Dec 2015.
For validation, we could aggregate data up to December 2015 and compare the predictions against
actual Workstation buyers from Jan–Mar 2016.
For scoring, we can aggregate the data up to March 2016.

Choice of Evaluation Metric

4. What are the pros and cons of using accuracy as a metric vs. precision vs. recall vs. F-
score vs. Area-under-curve? Which is better and why? What is area-under-curve?

Accuracy: Accuracy simply measures how often the classifier makes the correct prediction. It’s
the ratio between the number of correct predictions and the total number of predictions (the
number of test data points)

Precision and recall are actually two metrics. But they are often used together. Precision answers
the question: Out of the items that the classifier predicted to be true, how many are actually true?
Whereas, recall answers the question: Out of all the items that are true, how many are found to be
true by the classifier?

The precision score quantifies the ability of a classifier to not label a negative example as
positive. It can be interpreted as the probability that a positive prediction made by the
classifier is correct. The score is in the range [0, 1], with 0 being the worst and 1 being
perfect.

F-score:
The F1 score, commonly used in information retrieval, measures accuracy using the statistics
precision p and recall r. Precision is the ratio of true positives (tp) to all predicted
positives (tp + fp); recall is the ratio of true positives to all actual positives (tp + fn).
The F1 score is given by

F1 = 2pr / (p + r)

The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize
both precision and recall simultaneously. Thus, moderately good performance on both is favored
over extremely good performance on one and poor performance on the other.

Among these, accuracy is preferred here, as we need the predictions to be as correct as
possible: the marketing campaign will be targeted according to them.

AUC:
The area under the ROC curve (AUC) measures how well the classifier ranks positive cases above
negative ones across all decision thresholds; 0.5 corresponds to random ranking and 1.0 to
perfect ranking. The AUC score can also be defined when the target classes are of type string:
for binary classification, the labels are sorted alphanumerically and the largest label is
considered the "positive" label.

6. How do you define lift? Plot the lift curve for the different techniques.

Lift = confidence / expected confidence

Lift is the ratio of confidence to expected confidence. In the area of association rules, a lift
ratio larger than 1.0 implies that the relationship between the antecedent and the consequent is
more significant than would be expected if the two sets were independent.
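For a classifier, the lift curve is built by sorting customers by predicted probability and
comparing the response rate in each top slice with the overall rate; a sketch (p = predicted
probabilities and y = 0/1 actuals, both hypothetical):

lift_by_decile <- function(p, y, bins = 10) {
  ord  <- order(p, decreasing = TRUE)               # rank customers by score
  bin  <- ceiling(seq_along(p) * bins / length(p))  # assign deciles in rank order
  rate <- tapply(y[ord], bin, mean)                 # response rate per decile
  rate / mean(y)                                    # lift: decile rate vs. overall rate
}

plot(lift_by_decile(p, y), type = "b",
     xlab = "Score decile", ylab = "Lift")  # a good model has lift well above 1 in early deciles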

Feature Selection

7. What feature selection techniques could be used to reduce the number of features?
What other feature selection techniques could be used in this scenario?

The selection technique used in this scenario was the odds ratio of the target variable against
each of the features. An odds ratio greater than 1 indicates that the feature is favourable
towards purchase, and an odds ratio less than 1 indicates the opposite; the higher the odds
ratio, the higher the favourability.
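A sketch of this odds-ratio screen for one binary (0/1) feature x against the purchase flag y
(names hypothetical):

odds_ratio <- function(x, y) {
  tab <- table(x, y)                    # 2x2 table: feature value vs. purchased
  (tab["1", "1"] / tab["1", "0"]) /     # odds of purchase when the feature is present
    (tab["0", "1"] / tab["0", "0"])     # odds of purchase when it is absent
}
# odds_ratio(x, y) > 1: feature is favourable towards purchase; < 1: the opposite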

OTHER TECHNIQUES
Filter Methods:
Filter feature selection methods apply a statistical measure to assign a scoring to each feature.
The features are ranked by the score and either selected to be kept or removed from the dataset.
The methods are often univariate and consider the feature independently, or with regard to the
dependent variable.
Other feature selection methods are as follows.

Wrapper Methods

Wrapper methods consider the selection of a set of features as a search problem, where different
combinations are prepared, evaluated, and compared to other combinations. A predictive model is
used to evaluate each combination of features and assign a score based on model accuracy.

Embedded Methods

Embedded methods learn which features best contribute to the accuracy of the model while the
model is being created. The most common type of embedded feature selection methods are
regularization methods.

Regularization methods are also called penalization methods that introduce additional
constraints into the optimization of a predictive algorithm (such as a regression algorithm) that
bias the model toward lower complexity (fewer coefficients).

Modeling Techniques

8. How is random forest different from gradient boosting?

The biggest difference between random forests (RF) and gradient-boosted trees (GBT) is how they
optimize the bias-variance tradeoff. Boosting is based on weak learners (high bias, low
variance); in terms of decision trees, weak learners are shallow trees, sometimes even as small
as decision stumps (trees with two leaves). Boosting reduces error mainly by reducing bias (and
also, to some extent, variance, by aggregating the output from many models).

Random Forest, on the other hand, uses fully grown decision trees (low bias, high variance). It
tackles the error reduction task in the opposite way: by reducing variance. The trees are
decorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which
is slightly higher than the bias of an individual tree in the forest). Hence the need for large,
unpruned trees, so that the bias is initially as low as possible.

9. How could clustering be combined with classification for this problem?

Clustering before classification helps us understand the audience better through the distinct
segments it identifies.

It is therefore ideal to run clustering first, to understand the audience, and then run
classification to gain more precision in the marketing activity and target only the most
promising potential buyers.

13. List a few uses of the propensity model in an online store.


Propensity modeling correlates customer characteristics with anticipated behaviors or
propensities. It tracks buying habits as well as other actions such as a customer’s propensity to
open a marketing email, sign up to a loyalty program, or participate in feedback surveys.

 For example, suppose you have three customer segments defined by their shopping frequency:
the frequent shoppers, the slow-and-steady customers, and the at-risk customers. Applying a
propensity-modeling predictive tool to each of these segments will allow you to develop a far
more successful, long-term sales strategy.
 What is the retention probability of your frequent shoppers? Is the frequency between their
shopping trips, or the amount of money they spend on each shop, increasing or declining, and if
so, why? Why do your frequent shoppers prefer to shop with you, and how can you leverage this
knowledge to influence your slow-and-steady and at-risk customers?

14. How can the model be white-boxed to explain the importance of various features in the
model?

White-box models are models whose workings can be explained to the sales teams. For example:
customer X is more likely to upgrade if support for the older version is coming to an end, or if
a compelling newer version is being launched.

Q4: Use the Naïve Bayes algorithm and calculate the probability of positive sentiment for the
following comment: "Good location but the staff was unfriendly".

The probability was calculated in Python, using the following steps.

Text pre-processing:

 Loading the documents and classifying them into positive or negative sentiment:
TP = documents classified as positive sentiment in the training data set
TN = documents classified as negative sentiment in the training data set
 Tokenization: the process of breaking a stream of text up into words, phrases, symbols, or
other meaningful elements called tokens.
 Stemming: transforming each word into its stem, so that different forms of the same word are
merged.
 Feature extraction: a feature extractor is used to convert each comment to a feature set;
feature_words stores the vocabulary words used as features.
 N-gram analysis: an n-gram is a sequence of n consecutive words (e.g. "machine learning" is a
2-gram).

After text processing, the following vocabulary set was used:

V = {Beautiful, Good Service, Good Location, Superb, Cleanliness, Mosquitoes, Unfriendly, bad
experience}

The model was trained on the following comments:

Positive Comments:

1. Service was very good. Excellent breakfast in beautiful restaurant included in price. I was
happy there and extended my stay for extra two days.

2. Really helpful staff, the room was clean, beds really comfortable. Great roof top restaurant
with yummy food and very friendly staff.

3. Good location. The Cleanliness part was superb.

4. I stayed for two days in deluxe A/C room (Room no. 404). I think it is renovated recently.
Staff behaviour, room cleanliness all are fine.

Negative Comments

1. The room and public spaces were infested with mosquitoes. I killed a dozen or so in my room
prior to sleeping but still woke up covered in bites.

2. Unfriendly staff with no care for guests.

3. Very worst and bad experience, Service I got from the hotel reception is too worst and
typical.

Then we test it on the following sentence:

Good location but the staff was unfriendly

The Python code gives a probability of positive sentiment of 0.36.
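A minimal sketch of the same computation, written in R rather than Python, with add-one
(Laplace) smoothing and a per-class vocabulary for simplicity; this simplified preprocessing
(no stemming or n-grams) will not reproduce 0.36 exactly:

# log P(class) + sum of log P(word | class), with Laplace smoothing
nb_logprob <- function(doc, corpus, prior) {
  words  <- unlist(strsplit(tolower(paste(corpus, collapse = " ")), "[^a-z]+"))
  counts <- table(words)
  toks   <- unlist(strsplit(tolower(doc), "[^a-z]+"))
  n      <- sapply(toks, function(w) if (w %in% names(counts)) counts[[w]] else 0)
  log(prior) + sum(log((n + 1) / (length(words) + length(counts))))
}

pos <- c("Service was very good. Excellent breakfast in beautiful restaurant included in price. I was happy there and extended my stay for extra two days.",
         "Really helpful staff, the room was clean, beds really comfortable. Great roof top restaurant with yummy food and very friendly staff.",
         "Good location. The Cleanliness part was superb.",
         "I stayed for two days in deluxe A/C room (Room no. 404). I think it is renovated recently. Staff behaviour, room cleanliness all are fine.")
neg <- c("The room and public spaces were infested with mosquitoes. I killed a dozen or so in my room prior to sleeping but still woke up covered in bites.",
         "Unfriendly staff with no care for guests.",
         "Very worst and bad experience, Service I got from the hotel reception is too worst and typical.")

test   <- "Good location but the staff was unfriendly"
lp_pos <- nb_logprob(test, pos, prior = 4/7)   # 4 of the 7 training comments are positive
lp_neg <- nb_logprob(test, neg, prior = 3/7)
exp(lp_pos) / (exp(lp_pos) + exp(lp_neg))      # P(positive | comment)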


…………………………………………End of Assignment…………………………………………....
