
Action Learning Project—Team: DATA ROCKS

Executive Summary
The aim of this project is two-fold: analyzing customer behavior and predicting each
customer's transaction revenue. A generalized linear model (GLM) is used for the customer
behavior analysis, and a two-stage model consisting of logistic regression and a GLM is adopted
for revenue prediction. The logistic model achieves 98% accuracy in separating customers who
are likely to make a purchase from those who are not, and outputs purchase probabilities; the
GLM estimates the value of each transaction a customer makes, with an RMSE of 1.119.

From the interpretation model, several customer-behavior insights have been identified: the
ideal time for sales is around 4 PM to 6 PM; the cities that generate the most revenue include
New York, San Francisco, and Chicago; and the second quarter generates the most sales.

Based on the analysis, several recommendations are provided: (1) reconsider the budget
allocated to each channel; (2) improve the interface design, especially for the Firefox and
Chrome browsers; (3) engage regular visitors by offering promotions; and (4) ensure sufficient
inventory for peak visit times, along with regular maintenance of the website.

Objective of Analysis
The first step of this project is to understand customer characteristics and behavior, such as
how many pages customers usually view, through which channel they come to the online store,
and how many times they visit before making a purchase, and to determine how these
characteristics influence the purchase decision.

The second step is to identify the customers who are likely to make a purchase based on the
training data provided, and to estimate the transaction revenue for each customer in the testing
dataset, in order to make predictions and shed light on the further decision-making process.

Data and Methodology


Business Understanding
To predict the transaction revenue each customer will generate, the different stages of the
purchasing journey need to be understood. The first stage is to predict the probability of a user
making a purchase, P(Purchase). The second stage is to predict the purchase amount for each
user; finally, the probability and the purchase amount are multiplied to give the final predicted
transaction revenue.
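As a minimal sketch of this two-stage combination (the values and object names below are illustrative, not model output):

```r
# Stage 1: predicted purchase probability from the logistic model.
# Stage 2: predicted purchase amount (log(revenue + 1) scale) from the GLM.
# Final prediction is the element-wise product of the two stages.
p_purchase  <- c(0.02, 0.75, 0.40)   # illustrative stage-1 probabilities
amount_pred <- c(3.1, 5.6, 4.2)      # illustrative stage-2 amounts
final_pred  <- p_purchase * amount_pred
final_pred
```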

Data Understanding
An online-store dataset split into training and testing sets was provided, containing
451,626 and 452,027 records respectively. Twelve variables were presented, including 4 JSON
columns (Appendix I Variable Description). R is used to develop the models.

Data Preparation
The data provided had missing values (NAs), variables containing multiple fields, and variables
with only one level. Data cleaning therefore involved replacing NA values with zeros, flattening
the JSON fields into individual variables, and eliminating variables with only one level. R and
the RStudio platform were used for this process and for the following analysis.
For categorical variables, values that occurred frequently (according to frequency tables) were
kept as separate levels, whereas infrequent values were aggregated into a level called 'Others'.
For continuous variables, missing data were replaced by the median value; missing values in
transaction revenue, however, were replaced with 0.

Variables with only one level ("not available in demo dataset") were deleted, along with
variables with more than 60% missing data, as they were not informative for prediction or
user-behavior analysis. After cleaning the data, dummy coding was applied to all the
categorical variables for the convenience of analysis and modeling.
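A minimal sketch of the cleaning rules above, on a toy data frame (the column names and values here are illustrative): infrequent categorical levels are collapsed into "Others", continuous missing values are replaced by the median, and missing transaction revenue is set to 0.

```r
df <- data.frame(
  browser            = c("Chrome", "Chrome", "Safari", "Opera", "UC Browser"),
  pageviews          = c(3, NA, 7, 2, 5),
  transactionRevenue = c(NA, 12.5, NA, NA, 3.0)
)

# Collapse levels that occur only once into "Others"
freq <- table(df$browser)
rare <- names(freq)[freq < 2]
df$browser[df$browser %in% rare] <- "Others"

# Median imputation for continuous predictors
df$pageviews[is.na(df$pageviews)] <- median(df$pageviews, na.rm = TRUE)

# A missing transaction revenue means no purchase, so replace with 0
df$transactionRevenue[is.na(df$transactionRevenue)] <- 0
df
```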

Modeling
Interpretation Generalized Linear Model
For interpretation, a generalized linear regression model (a conventional model for a continuous
response variable with categorical and continuous predictors) was fitted to understand how
customer behavior and geographical factors influence transaction revenue. The dependent
variable is log(transaction revenue + 1).

Based on the results of the correlation procedure and their importance to business decisions, ten
variables were included in the model: channel grouping, browser, device category, hour of visit
start time, new visit, visit number, quarter, country, city, and pageviews. To gain greater insight
into customer behavior, six channel groups and four browsers were retained based on their
frequency, and ten countries and four cities based on their total transaction revenue. Moreover,
considering the diminishing value of pageviews (beyond a turning point, additional pageviews
have a smaller marginal effect on transaction revenue), the square of pageviews was also
included in the model.

For the features included in the model, the following levels were set as baselines: users from
other countries and cities, arriving through other channels, using other browsers, visiting from
desktop, and visiting during the fourth quarter of the year. The interpretation model is therefore
(Appendix II Complete GLM for Interpretation):

Log(transaction revenue + 1) = β0 + β1(ChannelGrouping) + β2(Browser) + β3(DeviceCategory)
+ β4(Country) + β5(City) + β6(NewVisits) + β7(VisitNumber)
+ β8(Quarter) + β9(Hour) + β10(Pageviews) + β11(Pageviews²)

Prediction Models
Logistic Model
In the first-stage logistic model, out of the 451,626 visitors, 70% of the data was used for
development of the model and the remaining 30% for validation. Some dummy variables were
created to explore the data further, such as the top 20 countries that generated the most
transaction revenue and the top 5 most frequent browsers. In addition, a dummy variable called
Transrev was generated to distinguish visitors who generated transaction revenue from those
who did not.

Logistic regression is carried out first to estimate the probability of purchase for each visitor.
Logistic regression is a probabilistic statistical classification model: it predicts a binary
outcome (target = 0 or 1) of a categorical dependent variable based on one or more predictor
variables.

Variable reduction is a crucial step for simplifying the model without losing its predictive
power. Here, Information Value has been used to measure the predictive strength of different
variables. A scorecard (Appendix III Scorecard for Variables) was generated, from which key
variables were identified and selected for modeling. The logistic regression model that was
used is as follows:

Y(Transrev) = β0 + β1(Country) + β2(DeviceCategory) + β3(ChannelGrouping) + β4(Hits)
+ β5(PageViews)

where Y = Log(P / (1 − P)) and P = e^Y / (1 + e^Y)
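A sketch of this first-stage fit in R on simulated data (the simulated data, the seed, and the 0.5 classification cutoff are illustrative assumptions; the report's actual predictors follow the scorecard):

```r
set.seed(42)
n <- 1000
toy <- data.frame(
  pageviews = rpois(n, 4),
  hits      = rpois(n, 6)
)
# Simulated purchase indicator: more pageviews -> higher purchase odds
toy$Transrev <- rbinom(n, 1, plogis(-4 + 0.5 * toy$pageviews))

# 70/30 development / validation split, as in the report
idx   <- sample(seq_len(n), size = 0.7 * n)
dev   <- toy[idx, ]
valid <- toy[-idx, ]

fit <- glm(Transrev ~ pageviews + hits, data = dev, family = binomial)

# Predicted purchase probabilities and simple accuracy at a 0.5 cutoff
p_hat <- predict(fit, newdata = valid, type = "response")
acc   <- mean((p_hat > 0.5) == valid$Transrev)
```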

Generalized Linear Model


In the second stage, the dependent variable, transaction revenue, is continuous, so a
generalized linear model was applied for prediction. After filtering out the records where
transaction revenue equals zero, the remaining data, referred to as the conditional training data
set, were used to train the predictive model. To improve prediction accuracy, forward,
backward, and stepwise selection were conducted during variable selection to identify the
variables giving the best model fit. Compared with backward and stepwise selection, the
forward-selection model gives the lowest AIC (an estimator of the relative quality of statistical
models), 589.76, so the following forward-selection model was chosen:

ln(transactionRevenue + 1) = β0(hits) + β1(newVisits) + β2(isMobile) + β3(source_direct) +
β4(visitNumber) + β5(Chrome) + β6(OS_Windows) + β7(OS_Linux) + β8(metro_NY) +
β9(OtherChannel) + β10(Referral) + β11(pageviews) + β12(country_Ukraine) + β13(source_mall)
+ β14(country_Venezuela) + β15(qrt3) + β16(country_Indonesia) + β17(OS_Android) +
β18(source_youtube) + β19(OS_ChromeOS) + β20(country_Japan) + β21(country_Chile) +
β22(IE) + β23(mobile)
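The forward-selection step can be reproduced with R's step() using AIC, sketched here on simulated data (the variables below are illustrative, not the report's actual predictors):

```r
set.seed(1)
n <- 500
d <- data.frame(
  pageviews = rpois(n, 5),
  hits      = rpois(n, 8),
  noise     = rnorm(n)
)
# True model uses pageviews and hits; "noise" is irrelevant
d$log_rev <- 0.3 * d$pageviews + 0.1 * d$hits + rnorm(n)

null_fit <- lm(log_rev ~ 1, data = d)
full_fit <- lm(log_rev ~ pageviews + hits + noise, data = d)

# Forward selection by AIC, starting from the intercept-only model
fwd <- step(null_fit, scope = formula(full_fit),
            direction = "forward", trace = 0)
names(coef(fwd))
```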

Key Findings
Prediction Findings
For the logistic model, after training on the development data, the model was applied to the
validation and testing datasets to identify the customers likely to make a purchase. The model
correctly classifies 98.5% of the validation data set. For model fit, the McFadden R² (based on
the maximum likelihood estimate of the logistic regression) is 0.44. A probability value is
predicted for each customer; over 674 customers have more than a 50% probability of purchase
and are identified as high-value customers (Appendix III Model Summary). For the generalized
linear model, after splitting the conditional training data set into development (60%) and
validation (40%), a GLM was built on the development set using the variables resulting from
forward selection, and predictions were made for the validation set. The resulting RMSE of
1.119 on the validation set indicates good predictive performance of the generalized linear
model (Appendix IV).
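The reported RMSE is the root of the mean squared prediction error on the log(revenue + 1) scale; a minimal helper (the actual/predicted vectors below are illustrative):

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative check on the log(revenue + 1) scale
actual    <- c(4.2, 0.0, 3.1, 5.0)
predicted <- c(3.9, 0.5, 3.3, 4.4)
rmse(actual, predicted)
```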

Interpretation Findings
A frequency procedure was used to profile customer behavior across purchase records.
In terms of user behavior, the most popular channel is organic search, through which 42.14%
of customers come to the website, followed by Social and Direct. Among 41 types of browsers,
68.57% of customers use Chrome, far exceeding the sum of all other browsers. Geographically,
customers from the United States generated 94.2% of the total transaction revenue among
215 countries. After the US, customers from Canada, Japan, and Venezuela also generate larger
transaction revenue than other countries, although customers from Venezuela account for only
0.23% of total customers. Within the US, New York, Mountain View, San Francisco, and
Chicago are the four cities that generate the largest transaction revenue (Appendix V).

To further understand how marketing channels contribute in different scenarios, the Markov
chain method (Appendix VII) is used for attribution analysis. Transaction revenue is recorded
as 1 or 0; one transaction is counted as one successful conversion, and one fullVisitorId is one
unique customer. After arranging the data by date in ascending order, customers who generated
revenue were grouped into first-purchase customers and returning customers. For example,
customers 1 and 2 reached the website through several channels before finally making a
purchase (1: social > referral > organic search > first transaction; 2: display > referral > first
transaction). All such purchase paths, and each point (channel) along each path, are taken into
consideration. The Markov chain method is then applied to calculate how many weighted
transactions each channel contributed. The results show that for new customers, referral and
display are the most effective channels, contributing 319 (18% of total) and 324 (21% of total)
first-time transactions respectively. For returning customers, the social channel contributes the
most transactions: 735 (19.1% of total).

For the interpretation model, the model fit can be assessed by the adjusted R-squared, which is
0.18. The findings are as follows (Appendix II Model Summary):
● Among channels, organic search, referral, direct, and social have significant relationships
with transaction revenue; for organic search in particular, transaction revenue increases by 6%
compared with other channels. Paid search, however, is not significant.
● Two browsers have a significant negative effect on purchasing: Firefox decreases transaction
revenue by 3.74% and Chrome by 2.67% compared with other browsers.
● In terms of device category, desktop contributes most of the transaction revenue; transaction
revenue decreases by around 9% when the device is mobile or tablet rather than desktop.
● Within the 10 countries whose users have the largest purchase revenue, customers from the
United States purchase 10.8% more than those from other countries, while customers from
Japan purchase 8% less. Moreover, users from cities within the United States show significant
differences in purchase revenue compared with other cities: Chicago users purchase 55.35%
more and Mountain View users 19.84% less.
● New visits have a significantly negative relationship with transaction revenue: new customers
are less likely to purchase than existing customers, and transaction revenue decreases by
21.11% for a new customer. For existing customers, a one-unit increase in visit number
decreases transaction revenue by 0.047%.
● Within a year, the second and third quarters have significant relationships with transaction
revenue: compared with the fourth quarter, people purchase about 8% more in the second
quarter and 3.7% less in the third. Within a day, people buy the most in the afternoon around
4 PM and stay at a similarly low level during the midnight hours.
● Pageviews have a significantly positive impact on purchase revenue: a one-unit increase in
pageviews raises purchase revenue by 13.50%. However, according to the quadratic term on
pageviews (Appendix VI), beyond about 200 pageviews each additional pageview diminishes
this positive impact.
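The ~200-pageview turning point follows directly from the fitted quadratic: with the coefficients reported in Appendix II (pageviews ≈ 0.1266, pageviews² ≈ −3.153e-04), the vertex of a quadratic ax² + bx is at x = −b/(2a):

```r
# Coefficients from the interpretation model (Appendix II)
b <- 1.266e-01    # pageviews
a <- -3.153e-04   # pageviews^2

turning_point <- -b / (2 * a)
turning_point  # roughly 200 pageviews
```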

Recommendations
Based on the results of the study, there are several recommendations for the company.
● Website design could be made more user-friendly so that customers can find the information
they need within 200 page views, especially on the Firefox and Chrome browsers given their
negative effect on transaction revenue. The recommendation system could also be improved
to help customers find relevant information efficiently.
● The marketing department is recommended to focus on turning new visitors into regular users.
To engage new visitors, follow-up advertising such as a frequent email newsletter could be used.
● More advertisements could be displayed through different channels around 4 PM daily, and
site outages should be avoided during these peak hours. Moreover, inventory management
and logistics should be improved to guarantee sufficient supply in the second quarter each year.
● Within the one-year time frame of this dataset, channels work differently for first-time
purchasers and regular customers. To attract new customers, the company should focus on a
combination of the referral and display channels, which are useful for building product
awareness. To maintain the loyalty of returning customers, the social channel is highly
recommended.
● Because paid search contributes insignificantly to transaction revenue, this channel's
performance deserves further investigation. The company could optimize its existing paid
search or switch to another search engine, and re-allocate marketing budget to more profitable
marketing channels.

Limitations
Some detailed information is not accessible, which limits the recommendations.
● The dataset does not contain any information about the company or its products; therefore,
specific recommendations on industry, products, and service are not possible.
● No detailed customer profiles, such as demographics like age and income, were provided,
which would normally serve as a useful reference.
● Without knowing the exact spending on each channel, accurate evaluations such as ROI
analysis are impossible.

Appendices
I. Variables Description

Variable names Description


channelGrouping The channel via which the user came to the Store
date The date on which the user visited the Store
device The specifications for the device used to access the Store
fullVisitorId A unique identifier for each user in the dataset
geoNetwork This section contains information about the geography of the
user.
sessionId The unique identifier for the session
socialEngagementType Engagement type, either "Socially Engaged" or "Not Socially
Engaged".
trafficSource This section contains information about the Traffic Source
from which the session originated
totals This section contains aggregate values across the session.
The element "transactionRevenue" contains the purchase
amount for this visit session. This is the key dependent
variable for this project.
visitId An identifier for this session. This is only unique to the user.
For a completely unique ID, use a combination of
fullVisitorId and visitId.
visitNumber The session number for this user. If this is the first session,
then this is set to 1.
visitStartTime The timestamp (expressed as POSIX time)

II. Interpretation Model: Generalized Linear Model

Complete GLM for Interpretation


lm(formula = transactionRevenue ~ channelGrouping_Organic_Search +
channelGrouping_Referral + channelGrouping_Direct + channelGrouping_Affiliates +
channelGrouping_Paid_Search + channelGrouping_Social +
qrt1 + qrt2 + qrt3 + factor(hour) + browser_Firefox + browser_Chrome +
browser_Safari + browser_Internet_Explorer + deviceCategory_mobile +
deviceCategory_tablet + country_Australia + country_Indonesia +
country_Mexico + country_United_States + country_China_Taiwan +
country_Canada + country_China_Hong_Kong + country_Japan + country_Kenya +
country_Venezuela + city_Chicago + city_Mountain_View +
city_New_York + city_San_Francisco + newVisits + pageviews2 +
visitNumber + pageviews, data = train)

Model Summary
Residuals
Min 1Q Median 3Q Max
-12.8418 -0.2542 0.0403 0.1808 26.6318

Estimate Std Error t value Pr(>|t|)


(Intercept) -2.477e-01 3.609e-02 -6.862 6.79e-12 ***
channelGrouping1_OrganicSearch 5.924e-02 3.012e-02 1.966 0.049257 *
channelGrouping1_Referral 3.439e-01 3.052e-02 11.269 < 2e-16 ***
channelGrouping1_Direct 1.496e-01 3.043e-02 4.915 8.87e-07 ***
channelGrouping1_Affiliates 5.529e-02 3.545e-02 1.560 0.118818
channelGrouping1_Paid_Search -1.395e-02 3.334e-02 -0.418 0.675664
channelGrouping1_Social 2.492e-01 3.070e-02 8.116 4.84e-16 ***
qrt1 4.487e-03 7.255e-03 0.619 0.536222
qrt2 7.980e-02 7.351e-03 10.856 < 2e-16 ***
qrt3 -3.755e-02 6.908e-03 -5.436 5.44e-08 ***
factor(hour)1 -2.673e-02 1.999e-02 -1.337 0.181162
factor(hour)2 -3.712e-02 1.996e-02 -1.860 0.062844
factor(hour)3 -3.318e-03 1.970e-02 -0.168 0.866239
factor(hour)4 5.316e-03 1.985e-02 0.268 0.788882
factor(hour)5 2.475e-02 1.989e-02 1.245 0.213298
factor(hour)6 3.251e-02 2.001e-02 1.625 0.104135
factor(hour)7 4.338e-02 1.964e-02 2.209 0.027209 *
factor(hour)8 4.892e-02 1.906e-02 2.567 0.010257 *
factor(hour)9 6.600e-02 1.842e-02 3.583 0.000340 ***
factor(hour)10 6.272e-02 1.810e-02 3.465 0.000530 ***
factor(hour)11 6.908e-02 1.791e-02 3.857 0.000115 ***
factor(hour)12 4.910e-02 1.786e-02 2.749 0.005970 **
factor(hour)13 7.789e-02 1.773e-02 4.392 1.12e-05 ***
factor(hour)14 6.479e-02 1.781e-02 3.637 0.000276 ***
factor(hour)15 7.536e-02 1.809e-02 4.166 3.10e-05 ***
factor(hour)16 8.310e-02 1.805e-02 4.604 4.15e-06 ***
factor(hour)17 4.921e-02 1.838e-02 2.677 0.007424 **
factor(hour)18 3.452e-02 1.889e-02 1.828 0.067612
factor(hour)19 4.156e-02 1.918e-02 2.166 0.030284 *
factor(hour)20 2.597e-02 1.942e-02 1.338 0.180983
factor(hour)21 3.223e-02 1.949e-02 1.653 0.098300
factor(hour)22 9.363e-03 1.962e-02 0.477 0.633162
factor(hour)23 -1.882e-02 1.973e-02 -0.954 0.340095


browser1_Firefox -3.814e-02 1.750e-02 -2.180 0.029244 *
browser1_Chrome -2.705e-02 1.232e-02 -2.195 0.028166 *
browser1_Safari -1.307e-02 1.295e-02 -1.009 0.312846
browser1_Internet_Explorer 5.047e-03 2.115e-02 0.239 0.811430
deviceCategory_mobile -9.761e-02 6.776e-03 -14.406 < 2e-16 ***
deviceCategory_tablet -1.002e-01 1.462e-02 -6.853 7.22e-12 ***
country1_Australia -1.004e-02 2.201e-02 -0.456 0.648175
country1_Indonesia 5.569e-02 2.553e-02 2.181 0.029166 *
country1_Mexico -2.334e-02 2.158e-02 -1.082 0.279396
country1_United_States 8.110e-02 6.793e-03 11.938 < 2e-16 ***
country1_China_Taiwan -1.438e-01 2.183e-02 -6.587 4.49e-11 ***
country1_Canada -1.860e-01 1.568e-02 -11.863 < 2e-16 ***
country1_China_HongKong -4.270e-02 3.523e-02 -1.212 0.225418
country1_Japan -9.903e-02 1.780e-02 -5.565 2.62e-08 ***
country1_Kenya 3.535e-02 9.230e-02 0.383 0.701752
country1_Venezuela -2.003e-03 5.286e-02 -0.038 0.969772
city1_Chicago 4.405e-01 2.833e-02 15.547 < 2e-16 ***
city1_Mountain_View -2.211e-01 1.348e-02 -16.407 < 2e-16 ***
city1_New_York 3.260e-01 1.574e-02 20.720 < 2e-16 ***
city1_San_Francisco 1.061e-01 1.772e-02 5.991 2.08e-09 ***
newVisits1 -2.371e-01 6.715e-03 -35.315 < 2e-16 ***
pageviews2 -3.153e-04 4.004e-06 -78.745 < 2e-16 ***
visitNumber -4.710e-04 2.737e-04 -1.721 0.085262 .
pageviews 1.266e-01 4.889e-04 258.940 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.708 on 451569 degrees of freedom
Multiple R-squared: 0.1804, Adjusted R-squared: 0.1803
F-statistic: 1775 on 56 and 451569 DF, p-value: < 2.2e-16

III. Prediction Model: Logistic Model

Scorecard of Variables
It is advisable to choose variables with information value over 0.5.
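Information Value for a binned predictor is conventionally computed as IV = Σ (%events − %non-events) × WOE, where WOE = ln(%events / %non-events). A minimal sketch on toy bin counts (not the report's actual scorecard data):

```r
# Toy bin counts: events = purchasers, non_events = non-purchasers per bin
events     <- c(50, 30, 20)
non_events <- c(100, 300, 600)

pct_e  <- events / sum(events)          # share of events per bin
pct_ne <- non_events / sum(non_events)  # share of non-events per bin
woe    <- log(pct_e / pct_ne)           # weight of evidence per bin
iv     <- sum((pct_e - pct_ne) * woe)   # total information value
iv
```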

Model Summary

Prediction for Probability Values



McFadden Index (R2 Value)

IV. Prediction Model: GLM



V. Statistic Description

Channel Frequency Percentage


Organic Search 190316 42.14%
Social 113301 25.09%
Direct 71553 15.84%
Referral 52095 11.53%
Paid Search 12759 2.83%
Affiliates 8222 1.82%

Browser Frequency Percentage


Chrome 309691 68.57%
Safari 91299 20.22%
Firefox 18676 4.14%
Internet Explorer 9653 2.14%

Device Frequency Percentage


desktop 331901 73.49%
mobile 104289 23.09%
tablet 15436 3.42%

VI. Calculation Equations for Turning Point

Quadratic function f(x) = ax² + bx + c

Axis of symmetry: x = -b / (2a)

If a > 0, the parabola opens upward and has a minimum point;
if a < 0, it opens downward and has a maximum point.

VII. Markov Chain Method

Explanation

1 Assume there are only three shopping journeys

Start -> C1 -> C2 -> C3 -> purchase
Start -> C1 -> null
Start -> C2 -> C3 -> null

2 Split them into pairs, then calculate the probability of the transition from state to state

a. (start) -> C1, C1 -> C2, C2 -> C3, C3 -> (conversion)
b. (start) -> C1, C1 -> (null)
c. (start) -> C2, C2 -> C3, C3 -> (null)

3 Calculate the removal effect of each channel (remove each channel from the graph in turn and
measure how many conversions are left)

Attribution of each channel = weighted removal effect for the channel × total number of conversions
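A hand-rolled sketch of the removal-effect idea on the three toy journeys above (first-order chain; state names and the 50-step horizon are illustrative, and a package such as ChannelAttribution automates this on real paths):

```r
# Transition probabilities estimated from the three toy journeys
# States: start, C1, C2, C3, conv (absorbing), null (absorbing)
states <- c("start", "C1", "C2", "C3", "conv", "null")
P <- matrix(0, 6, 6, dimnames = list(states, states))
P["start", "C1"] <- 2/3; P["start", "C2"] <- 1/3
P["C1", "C2"]    <- 1/2; P["C1", "null"]  <- 1/2
P["C2", "C3"]    <- 1
P["C3", "conv"]  <- 1/2; P["C3", "null"]  <- 1/2
P["conv", "conv"] <- 1;  P["null", "null"] <- 1

# Long-run probability of reaching "conv" starting from "start"
conv_prob <- function(P, steps = 50) {
  s <- c(1, 0, 0, 0, 0, 0)
  for (i in seq_len(steps)) s <- s %*% P
  s[1, "conv"]
}

base <- conv_prob(P)   # 1/3 for this toy graph

# Removal effect: redirect a channel's inbound transitions to "null"
removal_effect <- function(P, ch) {
  Q <- P
  Q[, "null"] <- Q[, "null"] + Q[, ch]
  Q[, ch] <- 0
  1 - conv_prob(Q) / base
}

removal_effect(P, "C1")  # half the conversions pass through C1
removal_effect(P, "C2")  # every converting path needs C2
```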

Results
channel_name2 : Affiliates      channel_name4 : Display
channel_name5 : Organic Search  channel_name6 : Paid Search
channel_name7 : Referral        channel_name8 : Social

train_uniq_paths
channelGrouping conversions
<int> <dbl>
1 4 2
2 5 21
3 6 2
4 7 32
5 8 8

VIII. Code

## ------ 1. JSON: break down columns ------


library(tidyverse)
library(jsonlite)
library(dplyr)
train <- read.csv("train_forclass.csv", header = T)
test <- read.csv("test_forclass.csv", header = T)
#JSON columns are "device", "geoNetwork", "totals", "trafficSource"
tr_device <- paste("[", paste(train$device, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_geoNetwork <- paste("[", paste(train$geoNetwork, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_totals <- paste("[", paste(train$totals, collapse = ","), "]") %>% fromJSON(flatten = T)
tr_trafficSource <- paste("[", paste(train$trafficSource, collapse = ","), "]") %>% fromJSON(flatten = T)

te_device <- paste("[", paste(test$device, collapse = ","), "]") %>% fromJSON(flatten = T)


te_geoNetwork <- paste("[", paste(test$geoNetwork, collapse = ","), "]") %>% fromJSON(flatten = T)
te_totals <- paste("[", paste(test$totals, collapse = ","), "]") %>% fromJSON(flatten = T)
te_trafficSource <- paste("[", paste(test$trafficSource, collapse = ","), "]") %>% fromJSON(flatten = T)
#Check to see if the training and test sets have the same column names
setequal(names(tr_device), names(te_device))
setequal(names(tr_geoNetwork), names(te_geoNetwork))
setequal(names(tr_totals), names(te_totals))
setequal(names(tr_trafficSource), names(te_trafficSource))
#As expected, tr_totals and te_totals are different as the train set includes the target, transactionRevenue
names(tr_totals)
names(te_totals)
#Apparently tr_trafficSource contains an extra column as well - campaignCode
#It actually has only one non-NA value, so this column can safely be dropped later
table(tr_trafficSource$campaignCode, exclude = NULL)
names(tr_trafficSource)
names(te_trafficSource)
#Combine to make the full training and test sets
train <- train %>%
cbind(tr_device, tr_geoNetwork, tr_totals, tr_trafficSource) %>%
select(-device, -geoNetwork, -totals, -trafficSource)
test <- test %>%
cbind(te_device, te_geoNetwork, te_totals, te_trafficSource) %>%
select(-device, -geoNetwork, -totals, -trafficSource)
#Number of columns in the new training and test sets.
ncol(train)
ncol(test)
#Remove temporary tr_ and te_ sets
rm(tr_device); rm(tr_geoNetwork); rm(tr_totals); rm(tr_trafficSource)
rm(te_device); rm(te_geoNetwork); rm(te_totals); rm(te_trafficSource)
#Save the flattened datasets
write.csv(train, "train_flat.csv", row.names = F)
write.csv(test, "test_flat.csv", row.names = F)

## ------ 2. Cleaning dataset ------


train <- read.csv("train_flat.csv", header = T)
test <- read.csv("test_flat.csv", header = T)
length(unique(test$fullVisitorId))

training.data.raw <- read.csv('train_flat.csv',header=T,na.strings=c(""))


sapply(training.data.raw,function(x) sum(is.na(x)))
sapply(training.data.raw, function(x) length(unique(x)))
train$socialEngagementType <- NULL
train$browserVersion <- NULL
train$operatingSystemVersion <- NULL
train$mobileDeviceBranding <- NULL
train$mobileDeviceModel <- NULL
train$mobileInputSelector <- NULL
train$mobileDeviceInfo <- NULL
train$mobileDeviceMarketingName <- NULL
train$flashVersion <- NULL
train$language <- NULL
train$screenColors <- NULL
train$screenResolution <- NULL
train$latitude <- NULL
train$longitude <- NULL
train$networkLocation <- NULL
train$adwordsClickInfo.criteriaParameters <- NULL


train$adwordsClickInfo.adNetworkType <- NULL
train$adwordsClickInfo.gclId <- NULL
train$adwordsClickInfo.isVideoAd <- NULL
train$adwordsClickInfo.page <- NULL
train$adwordsClickInfo.slot <- NULL
train$keyword <- NULL
train$adContent <- NULL
train$campaignCode <- NULL
train$sessionId <- NULL

#Test - Repeat the Above


testing.data.raw <- read.csv('test_flat.csv',header=T,na.strings=c(""))
sapply(testing.data.raw,function(x) sum(is.na(x)))
sapply(testing.data.raw, function(x) length(unique(x)))
test$socialEngagementType <- NULL
test$browserVersion <- NULL
test$operatingSystemVersion <- NULL
test$mobileDeviceBranding <- NULL
test$mobileDeviceModel <- NULL
test$mobileInputSelector <- NULL
test$mobileDeviceInfo <- NULL
test$mobileDeviceMarketingName <- NULL
test$flashVersion <- NULL
test$language <- NULL
test$screenColors <- NULL
test$screenResolution <- NULL
test$latitude <- NULL
test$longitude <- NULL
test$networkLocation <- NULL
test$adwordsClickInfo.criteriaParameters <- NULL
test$adwordsClickInfo.adNetworkType <- NULL
test$adwordsClickInfo.gclId <- NULL
test$adwordsClickInfo.isVideoAd <- NULL
test$adwordsClickInfo.page <- NULL
test$adwordsClickInfo.slot <- NULL
test$keyword <- NULL
test$adContent <- NULL
test$campaignCode <- NULL
test$sessionId <- NULL
test$isTrueDirect <- NULL
test$referralPath <- NULL

# Factor Variables (sanity checks; train_upload is created in a later step)
is.factor(train_upload$channelGrouping)
contrasts(train_upload$channelGrouping)
is.factor(train_upload$browser)
table(train_upload$BrowserChrome)

sapply(train, function(x) length(unique(x)))


train$visits <- NULL
train$cityId <- NULL


train$browserSize <- NULL
train$isTrueDirect <- NULL
train$bounces <- NULL
train$visits <- NULL
train$isMobile <- NULL

train$transrev <- ifelse(is.na(train$transactionRevenue), 0, 1) #1 if any revenue was recorded; comparing to the string 'NA' would miss real NAs


train$newVisits[is.na(train$newVisits)] <- 0
train$referralPath[is.na(train$referralPath)] <- 0
train$transrev[is.na(train$transrev)] <- 0
train$referralPath <- NULL

write.csv(train, "train_new.csv", row.names = F)


train_upload <- read.csv("train_new.csv",header = T)
sum(is.na(train_upload$newVisits))
train_upload$newVisits[is.na(train_upload$newVisits)] <- 0
train_upload$transrev[is.na(train_upload$transrev)] <- 0
train_upload$transactionRevenue[is.na(train_upload$transactionRevenue)] <- 0
train_upload$isTrueDirect[is.na(train_upload$isTrueDirect)] <- 0
train_upload$referralPath[is.na(train_upload$referralPath)] <- 0
train_upload$referralPath <- NULL
train_upload$CtryOthers [is.na(train_upload$CtryOthers)] <- 0

head(train_upload)
sum(is.na(train_upload$bounces))
train_upload$bounces[is.na(train_upload$bounces)] <- 0
sum(is.na(train_upload$isTrueDirect))
train_upload$isTrueDirect[is.na(train_upload$isTrueDirect)] <- 0
sum(is.na(train_upload$transactionRevenue))
train_upload$transactionRevenue[is.na(train_upload$transactionRevenue)] <- 0
sum(is.na(train_upload$transrev))

# Operating System and Browser Dummy Coding


sum(is.na(train_upload$operatingSystem))#No Missing
table(train_upload$operatingSystem)
train_upload$OS_Windows <- ifelse(train_upload$operatingSystem=="Windows",1,0)
train_upload$OS_Macintosh <- ifelse(train_upload$operatingSystem=="Macintosh",1,0)
train_upload$OS_Android <- ifelse(train_upload$operatingSystem=="Android",1,0)
train_upload$OS_iOS <- ifelse(train_upload$operatingSystem=="iOS",1,0)
train_upload$OS_Linux <- ifelse(train_upload$operatingSystem=="Linux",1,0)
train_upload$OS_Chrome <- ifelse(train_upload$operatingSystem=="Chrome OS",1,0)

#Flag visits whose OS is none of the six coded above
#(vectorized; an if() with || would only test the first row, and the original
#condition was inverted)
train_upload$OS_others <- ifelse(train_upload$OS_Windows + train_upload$OS_Macintosh +
train_upload$OS_Android + train_upload$OS_iOS +
train_upload$OS_Linux + train_upload$OS_Chrome == 0, 1, 0)

train_upload$operatingSystem <- NULL

train_upload$BrowserChrome[train_upload$browser=="Chrome"] <- 1
train_upload$BrowserChrome[train_upload$browser!="Chrome"] <- 0
train_upload$BrowserSafari[train_upload$browser=="Safari"] <- 1
train_upload$BrowserSafari[train_upload$browser!="Safari"] <- 0
train_upload$BrowserFirefox[train_upload$browser=="Firefox"] <- 1
train_upload$BrowserFirefox[train_upload$browser!="Firefox"] <- 0
train_upload$BrowserIE[train_upload$browser=="Internet Explorer"] <- 1
train_upload$BrowserIE[train_upload$browser!="Internet Explorer"] <- 0
train_upload$OtherBrowser[train_upload$browser!="Chrome" & train_upload$browser!="Safari" &
train_upload$browser!="Firefox"] <- 1
train_upload$OtherBrowser[train_upload$browser=="Chrome" | train_upload$browser=="Safari" |
train_upload$browser=="Firefox"] <- 0
train_upload$browser <- NULL
train_upload$BrowserIE <- NULL

#deviceCategory
#3 level:desktop mobile tablet
train$desktop[train$deviceCategory=="desktop"] <- 1
train$desktop[train$deviceCategory!="desktop"] <- 0
train$mobile[train$deviceCategory=="mobile"] <- 1
train$mobile[train$deviceCategory!="mobile"] <- 0
train$tablet[train$deviceCategory=="tablet"] <- 1
train$tablet[train$deviceCategory!="tablet"] <- 0
train$deviceCategory <- NULL

test$desktop[test$deviceCategory=="desktop"] <- 1
test$desktop[test$deviceCategory!="desktop"] <- 0
test$mobile[test$deviceCategory=="mobile"] <- 1
test$mobile[test$deviceCategory!="mobile"] <- 0
test$tablet[test$deviceCategory=="tablet"] <- 1
test$tablet[test$deviceCategory!="tablet"] <- 0
test$deviceCategory <- NULL
length(unique(test$fullVisitorId))
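The per-level dummy coding above can also be generated in one step with base R's model.matrix; a minimal sketch on a toy data frame (illustrative only, not the project data):

```r
# Sketch: one 0/1 column per factor level, in a single call
toy <- data.frame(deviceCategory = factor(c("desktop", "mobile", "tablet", "desktop")))
dummies <- model.matrix(~ deviceCategory - 1, toy)  # "-1" drops the intercept, keeping all levels
toy <- cbind(toy, dummies)
colnames(toy)
```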

#continent
library(dummies)
train$continent[train$continent=="(not set)"] <- "Americas"
temp <- dummy(train$continent, sep = "_")
train <- cbind(temp,train)
train$continent <- NULL

test$continent[test$continent=="(not set)"] <- "Americas"


temp <- dummy(test$continent, sep = "_")
test <- cbind(temp,test)
test$continent <- NULL
length(unique(test$fullVisitorId))

#subcontinent

train$subContinent[train$subContinent=="(not set)"]<-"Northern America"


temp <- dummy(train$subContinent, sep = "_")
train <- cbind(temp,train)
train$subContinent <- NULL

test$subContinent[test$subContinent=="(not set)"]<-"Northern America"


temp <- dummy(test$subContinent, sep = "_")
test <- cbind(temp,test)
test$subContinent <- NULL
length(unique(test$fullVisitorId))

#country
library(dummies)
levels(train$country) <- c(levels(train$country),"Others","US","HK","UK","PuertoRico",
                           "SouthKorea","SaudiArabia","SouthAfrica","NewZealand","St.Lucia")
train$country[train$country=="(not set)"] <- "Others"
train$country[is.na(train$country)] <- "Others"

train$country[train$country=="United States"] <- "US"
train$country[train$country=="United Kingdom"] <- "UK"
train$country[train$country=="Hong Kong"] <- "HK"
train$country[train$country=="South Korea"] <- "SouthKorea"
train$country[train$country=="Puerto Rico"] <- "PuertoRico"
train$country[train$country=="Saudi Arabia"] <- "SaudiArabia"
train$country[train$country=="New Zealand"] <- "NewZealand"
train$country[train$country=="St. Lucia"] <- "St.Lucia"

# collapse all remaining countries into "Others"; a single %in% test replaces the
# long != chain (note the comparison must use the recoded level "St.Lucia")
keep_countries <- c("US","Canada","Japan","Venezuela","Kenya","Australia","HK","Taiwan",
                    "Indonesia","Mexico","India","UK","Belgium","PuertoRico","Singapore",
                    "SouthKorea","Ukraine","Spain","Brazil","Germany","Colombia","France",
                    "Kuwait","Ecuador","China","Argentina","Greece","Chile","Italy","Peru",
                    "SaudiArabia","Turkey","Romania","Sweden","Cyprus","Switzerland",
                    "Nicaragua","Russia","Guatemala","Finland","Philippines","Pakistan",
                    "Poland","SouthAfrica","Egypt","Ireland","NewZealand","Israel",
                    "Malaysia","Armenia","Netherlands","Thailand","Kazakhstan","St.Lucia")
train$country[!(train$country %in% keep_countries)] <- "Others"
temp <- dummy(train$country, sep = "_")
train <- cbind(temp,train)
train$country <- NULL

levels(test$country) <- c(levels(test$country),"Others","US","HK","UK","PuertoRico",
                          "SouthKorea","SaudiArabia","SouthAfrica","NewZealand","St.Lucia")
test$country[test$country=="(not set)"] <- "Others"
test$country[is.na(test$country)] <- "Others"

test$country[test$country=="United States"] <- "US"
test$country[test$country=="United Kingdom"] <- "UK"
test$country[test$country=="Hong Kong"] <- "HK"
test$country[test$country=="South Korea"] <- "SouthKorea"
test$country[test$country=="Puerto Rico"] <- "PuertoRico"
test$country[test$country=="Saudi Arabia"] <- "SaudiArabia"
test$country[test$country=="New Zealand"] <- "NewZealand"
test$country[test$country=="St. Lucia"] <- "St.Lucia"

# collapse all remaining countries into "Others", as for the training data
keep_countries <- c("US","Canada","Japan","Venezuela","Kenya","Australia","HK","Taiwan",
                    "Indonesia","Mexico","India","UK","Belgium","PuertoRico","Singapore",
                    "SouthKorea","Ukraine","Spain","Brazil","Germany","Colombia","France",
                    "Kuwait","Ecuador","China","Argentina","Greece","Chile","Italy","Peru",
                    "SaudiArabia","Turkey","Romania","Sweden","Cyprus","Switzerland",
                    "Nicaragua","Russia","Guatemala","Finland","Philippines","Pakistan",
                    "Poland","SouthAfrica","Egypt","Ireland","NewZealand","Israel",
                    "Malaysia","Armenia","Netherlands","Thailand","Kazakhstan","St.Lucia")
test$country[!(test$country %in% keep_countries)] <- "Others"
temp <- dummy(test$country, sep = "_")
test <- cbind(temp,test)
temp <- NULL
test$country <- NULL
length(unique(test$fullVisitorId))

#region
train$region_California[train$region=="California"] <- 1
train$region_California[train$region!="California"] <- 0
train$region_NewYork[train$region=="New York"] <- 1
train$region_NewYork[train$region!="New York"] <- 0
train$region_England[train$region=="England"] <- 1
train$region_England[train$region!="England"] <- 0
train$region_Others[train$region!="California" & train$region!="New York" & train$region!="England"] <- 1
train$region_Others[train$region=="California" | train$region=="New York" | train$region=="England"] <- 0
train$region <- NULL

test$region_California[test$region=="California"] <- 1
test$region_California[test$region!="California"] <- 0
test$region_NewYork[test$region=="New York"] <- 1
test$region_NewYork[test$region!="New York"] <- 0
test$region_England[test$region=="England"] <- 1
test$region_England[test$region!="England"] <- 0
test$region_Others[test$region!="California" & test$region!="New York" & test$region!="England"] <- 1
test$region_Others[test$region=="California" | test$region=="New York" | test$region=="England"] <- 0
test$region <- NULL
length(unique(test$fullVisitorId))

#metro
train$metro_CA[train$metro=="San Francisco-Oakland-San Jose CA"] <- 1
train$metro_CA[train$metro!="San Francisco-Oakland-San Jose CA"] <- 0
train$metro_NY[train$metro=="New York NY"] <- 1
train$metro_NY[train$metro!="New York NY"] <- 0
train$metro_London[train$metro=="London"] <- 1
train$metro_London[train$metro!="London"] <- 0
train$metro_others[train$metro=="London" | train$metro=="New York NY" | train$metro=="San Francisco-Oakland-San Jose CA"] <- 0
train$metro_others[train$metro!="London" & train$metro!="New York NY" & train$metro!="San Francisco-Oakland-San Jose CA"] <- 1
train$metro <- NULL

test$metro_CA[test$metro=="San Francisco-Oakland-San Jose CA"] <- 1


test$metro_CA[test$metro!="San Francisco-Oakland-San Jose CA"] <- 0
test$metro_NY[test$metro=="New York NY"] <- 1
test$metro_NY[test$metro!="New York NY"] <- 0
test$metro_London[test$metro=="London"] <- 1
test$metro_London[test$metro!="London"] <- 0
test$metro_others[test$metro=="London" | test$metro=="New York NY" | test$metro=="San Francisco-Oakland-San Jose CA"] <- 0
test$metro_others[test$metro!="London" & test$metro!="New York NY" & test$metro!="San Francisco-Oakland-San Jose CA"] <- 1
test$metro <- NULL

length(unique(test$fullVisitorId))

#city
train$city_MountainView[train$city=="Mountain View"] <- 1
train$city_MountainView[train$city!="Mountain View"] <- 0
train$city_NewYork[train$city=="New York"] <- 1
train$city_NewYork[train$city!="New York"] <- 0
train$city_SanFrancisco[train$city=="San Francisco"] <- 1
train$city_SanFrancisco[train$city!="San Francisco"] <- 0
train$city_others[train$city=="Mountain View" | train$city=="New York" | train$city=="San Francisco"] <- 0
train$city_others[train$city!="Mountain View" & train$city!="New York" & train$city!="San Francisco"] <- 1
train$city <- NULL

test$city_MountainView[test$city=="Mountain View"] <- 1


test$city_MountainView[test$city!="Mountain View"] <- 0
test$city_NewYork[test$city=="New York"] <- 1
test$city_NewYork[test$city!="New York"] <- 0
test$city_SanFrancisco[test$city=="San Francisco"] <- 1
test$city_SanFrancisco[test$city!="San Francisco"] <- 0
test$city_others[test$city=="Mountain View" | test$city=="New York" | test$city=="San Francisco"] <- 0
test$city_others[test$city!="Mountain View" & test$city!="New York" & test$city!="San Francisco"] <- 1
test$city <- NULL
length(unique(test$fullVisitorId))

#pageviews
#replacing NA with 1 in train and test dataset
train$pageviews[is.na(train$pageviews)==T] <- 1
test$pageviews[is.na(test$pageviews)==T] <- 1
length(unique(test$fullVisitorId))

#newVisits#
train$newVisits[train$newVisits==1]<-1
train$newVisits[train$newVisits!=1 | is.na(train$newVisits)] <- 0
test$newVisits[test$newVisits==1]<-1
test$newVisits[test$newVisits!=1|is.na(test$newVisits)==T]<-0
length(unique(test$fullVisitorId))

## ------ 3. Logistic Model ------


# IV Value
library(dplyr)    # %>%, mutate(), as_tibble()
library(stringr)  # str_replace_all()
data = train_upload %>%
as_tibble()

#replace '.' in variable names not compatible with f_train_lasso


vars = names(data) %>%
str_replace_all( '\\.', '_')

names(data) <- vars



# convert response factor variable to dummy variable


data = data %>%
mutate( transrev = ifelse( transrev == 'bad', 1, 0 )
, transrev = as.factor(transrev) )

summary(data)
install.packages('scorecard')
library(scorecard)
scorecard::iv(data,y='transrev')
iv = iv(data, y = 'transrev') %>%
as_tibble() %>%
mutate( info_value = round(info_value, 3) ) %>%
arrange( desc(info_value) )
install.packages('knitr')
library(knitr)
iv %>%
knitr::kable()
#Weight of Evidence – Variable wise
install.packages('caroline')
library(caroline)
train$transrev<-as.factor(ifelse(train$transrev == 0, "Bad", "Good"))
pct(train$transrev)
op<-par(mfrow=c(1,2), new=TRUE)
plot(as.numeric(train$transrev), ylab="Good-Bad", xlab="n", main="Good ~ Bad")
hist(as.numeric(train$transrev), breaks=2,
xlab="Good(1) and Bad(2)", col="blue")
par(op)
pct <- function(x){
tbl <- table(x)
tbl_pct <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(tbl_pct) <- c('Count','Percentage')
kable(tbl_pct)
}
gbpct <- function(x, y=train$transrev){
mt <- as.matrix(table(as.factor(x), as.factor(y))) # x -> independent variable (vector), y -> dependent variable (vector)
Total <- mt[,1] + mt[,2] # Total observations
Total_Pct <- round(Total/sum(mt)*100, 2) # Total PCT
Bad_pct <- round((mt[,1]/sum(mt[,1]))*100, 2) # PCT of Bad or event or response
Good_pct <- round((mt[,2]/sum(mt[,2]))*100, 2) # PCT of Good or non-event
Bad_Rate <- round((mt[,1]/(mt[,1]+mt[,2]))*100, 2) # Bad rate or response rate
grp_score <- round((Good_pct/(Good_pct + Bad_pct))*10, 2) # score for each group
WOE <- round(log(Good_pct/Bad_pct)*10, 2) # Weight of Evidence for each group
g_b_comp <- ifelse(mt[,1] == mt[,2], 0, 1)
IV <- ifelse(g_b_comp == 0, 0, (Good_pct - Bad_pct)*(WOE/10)) # Information value for each group
Efficiency <- abs(Good_pct - Bad_pct)/2 # Efficiency for each group
otb<-as.data.frame(cbind(mt, Good_pct, Bad_pct, Total,
Total_Pct, Bad_Rate, grp_score,
WOE, IV, Efficiency ))
otb$Names <- rownames(otb)

rownames(otb) <- NULL


otb[,c(12,2,1,3:11)] # return IV table
}

A1 <- gbpct(train$browser)

op1 <- par(mfrow = c(1,2))


plot(train$browser, train$transrev,
main = "Browser",
xlab="Type",
ylab="Good-Bad")

barplot(A1$WOE, col="brown", names.arg=A1$Names,
main="Browser",
xlab="Category",
ylab="WOE")
par(op1)

new <- kable(A1, caption = 'Browser ~ Good-Bad')

A2 <- gbpct(train$deviceCategory)

op2 <- par(mfrow = c(1,2))


plot(train$deviceCategory, train$transrev,
main = "device category",
xlab="Type",
ylab="Good-Bad")

barplot(A2$WOE, col="brown", names.arg=A2$Names,
main="device category",
xlab="Category",
ylab="WOE")
par(op2)

kable(A2, caption = 'Device Category ~ Good-Bad')

A3 <- gbpct(train$country)

op2 <- par(mfrow = c(1,2))


plot(train$country, train$transrev,
main = "Country",
xlab="Type",
ylab="Good-Bad")

barplot(A3$WOE, col="brown", names.arg=A3$Names,
main="Country",
xlab="Category",
ylab="WOE")
par(op2)

kable(A3, caption = 'Country ~ Good-Bad')



A4 <- gbpct(train$pageviews)

op2 <- par(mfrow = c(1,2))


plot(train$pageviews, train$transrev,
main = "Page Views",
xlab="Type",
ylab="Good-Bad")

barplot(A4$WOE, col="brown", names.arg=A4$Names,
main="Page Views",
xlab="Category",
ylab="WOE")
par(op2)

kable(A4, caption = 'Page Views ~ Good-Bad')

A5 <- gbpct(train$channelGrouping)

op2 <- par(mfrow = c(1,2))


plot(train$channelGrouping, train$transrev,
main = "channelGrouping",
xlab="Type",
ylab="Good-Bad")

barplot(A5$WOE, col="brown", names.arg=A5$Names,
main="channelGrouping",
xlab="Category",
ylab="WOE")
par(op2)

kable(A5, caption = 'Channel Grouping ~ Good-Bad')

# Development and Validation Dataset


Development <- train_upload[1:300000,]
Validation <- train_upload[430000:451626,]

#Model Fitting
memory.limit(size=3900000)
memory.limit(size=35000)

model <- glm(transrev ~ country+deviceCategory+channelGrouping+hits+pageviews,
             family = binomial(link='logit'), data = Development)
summary(model)
head(Development)

# To find R2 of the Logistic Regression model (Use McFadden R2 value)


install.packages("pscl")
library(pscl)
pR2(model)

# Analyzing the Predicting Power of the Model


model <- glm(transrev ~ deviceCategory+channelGrouping+newVisits+hits+pageviews,family =
binomial(link='logit'),data = Development)
fitted.results <-
predict.glm(model,newdata=subset(Validation,select=c(1,27,24,40,38,37,110)),type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != Validation$transrev)
print(paste('Accuracy',1-misClasificError))
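With purchasers being rare, overall accuracy is dominated by the non-purchaser class; a confusion table splits the error types. A sketch reusing the fitted.results and Validation objects from above (indexing by "0"/"1" assumes both classes occur among the predictions):

```r
# Sketch: predicted vs. actual classes on the validation set
conf <- table(Predicted = fitted.results, Actual = Validation$transrev)
conf
# share of actual purchasers the model flags (sensitivity)
conf["1", "1"] / sum(conf[, "1"])
# share of non-purchasers correctly passed over (specificity)
conf["0", "0"] / sum(conf[, "0"])
```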

# Test Dataset
test$BrowserChrome[test$browser=="Chrome"] <- 1
test$BrowserChrome[test$browser!="Chrome"] <- 0
test$BrowserSafari[test$browser=="Safari"] <- 1
test$BrowserSafari[test$browser!="Safari"] <- 0
test$BrowserFirefox[test$browser=="Firefox"] <- 1
test$BrowserFirefox[test$browser!="Firefox"] <- 0
test$BrowserIE[test$browser=="Internet Explorer"] <- 1
test$BrowserIE[test$browser!="Internet Explorer"] <- 0
test$OtherBrowser[test$browser!="Chrome" & test$browser!="Safari" & test$browser!="Firefox" &
test$browser!="Internet Explorer"] <- 1
test$OtherBrowser[test$browser=="Chrome" | test$browser=="Safari" | test$browser=="Firefox" |
test$browser=="Internet Explorer"] <- 0
test$browser <- NULL

sum(is.na(test$operatingSystem))#No Missing
table(test$operatingSystem)
test$OS_Windows <- ifelse(test$operatingSystem=="Windows",1,0)
test$OS_Macintosh <- ifelse(test$operatingSystem=="Macintosh",1,0)
test$OS_Android <- ifelse(test$operatingSystem=="Android",1,0)
test$OS_iOS <- ifelse(test$operatingSystem=="iOS",1,0)
test$OS_Linux <- ifelse(test$operatingSystem=="Linux",1,0)
test$OS_Chrome <- ifelse(test$operatingSystem=="Chrome OS",1,0)

# OS_others = 1 when the session's OS is none of the systems dummy-coded above
# (a vectorized ifelse; a plain if() would only test the first row)
test$OS_others <- ifelse(test$OS_Windows==1 | test$OS_Macintosh==1 | test$OS_Android==1 |
                         test$OS_iOS==1 | test$OS_Linux==1 | test$OS_Chrome==1, 0, 1)

test$operatingSystem <- NULL


# Predict Probabilities of Test Dataset


model<- glm(transrev ~ deviceCategory+channelGrouping+newVisits+hits+pageviews,family =
binomial(link='logit'),data = Development)
test$transrev[is.na(test$transrev)] <- 0
sum(is.na(test$transrev))
test$fitted.result2 <- predict.glm(model,newdata=subset(test,select=c(1,9,19,20,22)),type='response')
fittedresult_Dummycoding <- ifelse(test$fitted.result2 > 0.5,1,0)
test$transrev <- fittedresult_Dummycoding
head(test)
test$fittedresult <- NULL
write.csv(test, "test_dummycodedfinal.csv", row.names = F)
test_dummycoded <- read.csv("test_dummycodedfinal.csv",header=T)
length(unique(test$fullVisitorId))
length(unique(test_dummycoded$fullVisitorId))
#length(unique(testing$fullVisitorId)) # 'testing' is not defined at this point

## ------ 4. GLM for prediction ------


#Split train into validation & train#
set.seed(2244)
split<-sample(nrow(train.con),floor(0.6*nrow(train.con)))
train <- train.con[split, ]
validation <- train.con[-split, ]
train$logtr <- log(train$transactionRevenue+1)
validation$logtr <- log(validation$transactionRevenue+1)
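Because logtr = log(transactionRevenue + 1), the exact inverse is exp(logtr) - 1; base R's log1p/expm1 pair implements this transform and its inverse without precision loss near zero. A small self-contained check:

```r
# Sketch: log1p/expm1 are an exact transform/inverse pair for log(x + 1)
rev <- c(0, 12.5, 1e6)
lt <- log1p(rev)                      # identical to log(rev + 1)
stopifnot(all.equal(expm1(lt), rev))  # recovers the original revenues
```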

train$transactionRevenue <- NULL


train$fullVisitorId <- NULL
validation$transactionRevenue <- NULL
validation$fullVisitorId <- NULL
#names(train)

#------backward selection------#
fitall <- lm(logtr~.,data=train)
summary(fitall) #NAs in Fitall

step(fitall,direction = "backward")
reg_back <- lm(formula = logtr ~ country_Chile + country_Indonesia + country_Japan +
country_Ukraine + country_Venezuela + visitNumber + hits +
pageviews + newVisits + OrganicSearch + Social + PaidSearch +
Affiliates + Chrome + OS_Macintosh + OS_Windows + OS_iOS +
OS_ChromeOS + mobile + region_California + metro_CA + metro_NY +
city_SanFrancisco + source_youtube + source_mall, data = train) #AIC=592
summary(reg_back)
library(caret)
prediction_back <- predict(reg_back, validation)
rmse_back <- postResample(validation$logtr, prediction_back)

#RMSE=1.1194331 Rsquared=0.1652101 MAE=0.8445481
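postResample returns RMSE, R-squared, and MAE in that order; the RMSE figure can also be reproduced directly from its definition (a sketch using the validation objects defined above):

```r
# Sketch: RMSE on the log scale, computed by hand
rmse_manual <- sqrt(mean((validation$logtr - prediction_back)^2, na.rm = TRUE))
rmse_manual  # should match the first element of rmse_back
```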

#------forward selection------#
fitstart=lm(logtr~1,data=train)
step(fitstart,direction = "forward", scope = formula(fitall))

reg_fore <- lm(formula = logtr ~ hits + newVisits + isMobile + source_direct +
visitNumber + Chrome + OS_Windows + OS_Linux + metro_NY +
OtherChannel + Referral + pageviews + country_Ukraine + source_mall +
country_Venezuela + qrt3 + country_Indonesia + OS_Android +
source_youtube + OS_ChromeOS + country_Japan + country_Chile +
IE + mobile, data = train) #AIC=589.76
prediction_fore <- predict(reg_fore, validation)
rmse_fore <- postResample(validation$logtr, prediction_fore)
#RMSE 1.1190123 Rsquared 0.1658659 MAE 0.8433874

#------stepwise selection------#
step(fitstart,direction = "both", scope = formula(fitall))

reg_both <- lm(formula = logtr ~ hits + newVisits + Direct + visitNumber +
Chrome + OS_Windows + OS_Linux + metro_NY + OtherChannel +
Referral + pageviews + country_Ukraine + source_mall + country_Venezuela +
qrt3 + country_Indonesia + OS_Android + source_youtube +
OS_ChromeOS + country_Japan + country_Chile + IE + mobile,
data = train)#AIC=588.53
prediction_both <- predict(reg_both,validation)
rmse_both <- postResample(validation$logtr,prediction_both)
#RMSE 1.1190867 Rsquared 0.1657435 MAE 0.8432084
valerror <- rbind(rmse_back,rmse_fore,rmse_both)

#reg_fore is the best model among three


reg_fore <- lm(formula = logtr ~ hits + newVisits + isMobile + source_direct +
visitNumber + Chrome + OS_Windows + OS_Linux + metro_NY +
OtherChannel + Referral + pageviews + country_Ukraine + source_mall +
country_Venezuela + qrt3 + country_Indonesia + OS_Android +
source_youtube + OS_ChromeOS + country_Japan + country_Chile +
IE + mobile, data = train)

testcl <- read.csv("testcl_0228.csv", header = T)


testcl$logtr_prediction <- predict(reg_fore, testcl)
#sum(is.na(testcl$logtr))
#summary(testcl$logtr)#predict
#summary(train$logtr)#actual
#userpd = testcl[c("fullVisitorId","logtr_prediction")]

testcl$pd <- exp(testcl$logtr_prediction)


testpd <- testcl[,c("pd","fullVisitorId")]
distinct_userpd <- aggregate(pd ~ fullVisitorId, data=testpd, sum)

distinct_userpd$lntr <- log(distinct_userpd$pd + 1)


distinct_userpd$pd <- NULL
length(unique(distinct_userpd$fullVisitorId))#No repeat record 356867
write.csv(distinct_userpd,"lntr_pd.csv",row.names=F)

#test_forclass <- read.csv("test_forclass.csv",header = T)


#test <- read.csv("test_forclass.csv",header = T)
#length(unique(test$fullVisitorId))#356867
#glmpd <- read.csv("glmpd.csv", header = T)
#length(unique(glmpd$fullVisitorId))#356867
#test2 <- read.csv("test_dummycoded.csv", header = T)
#length(unique(test2$fullVisitorId))

lntrpd <- read.csv("lntr_pd.csv", header = T)


propd <- read.csv("test_dummycodedfinal.csv", header = T)
length(unique(propd$fullVisitorId))

#keep only two columns


names(propd)
propd <-propd[,c("fullVisitorId","fitted.result2")]

#aggregate distinct user


distinct_propd <- aggregate(fitted.result2 ~ fullVisitorId, data=propd, sum)

# don't have to sort (aggregate already orders by fullVisitorId); order the whole
# data frames rather than a single column so rows stay aligned
distinct_propd <- distinct_propd[order(distinct_propd$fullVisitorId), ]
lntrpd <- lntrpd[order(lntrpd$fullVisitorId), ]

# find out if there are any differences


setdiff(propd$fullVisitorId,lntrpd$fullVisitorId)
setdiff(lntrpd$fullVisitorId,propd$fullVisitorId)

#putting two columns together


finalpd <- merge(distinct_propd,lntrpd,by="fullVisitorId")
names(finalpd)
finalpd$prediction <- (finalpd$fitted.result2)*(finalpd$lntr)
summary(finalpd$prediction)
plot(finalpd)

## ------ 5. GLM for interpretation ------

#constrast hours
class(train$visitStartTime) = c('POSIXt','POSIXct')
head(train$visitStartTime)
train$hour<-as.POSIXlt(train$visitStartTime)$hour
summary(train$hour)

train.reg <- lm(transactionRevenue1~
channelGrouping1_Organic_Search+channelGrouping1_Referral+channelGrouping1_Direct
+channelGrouping1_Affiliates+channelGrouping1_Paid_Search+channelGrouping1_Social
+qrt1+qrt2+qrt3+factor(hour)
+browser1_Firefox+browser1_Chrome+browser1_Safari+browser1_Internet_Explorer
+deviceCategory_mobile+deviceCategory_tablet
+country1_Australia+country1_Indonesia+country1_Mexico+country1_United_States
+country1_Taiwan+country1_Canada+country1_Hong_Kong+country1_Japan+country1_Kenya
+country1_Venezuela+city1_Chicago+city1_Mountain_View+city1_New_York+city1_San_Francisco
+newVisits+visitNumber+pageviews+I(pageviews^2)
,data=train)
summary(train.reg)

## ------ 6. Markov Chain Method ------

install.packages("ChannelAttribution")
install.packages("tidyverse")
install.packages("reshape2")
install.packages("ggthemes")
install.packages("ggrepel")
install.packages("RColorBrewer")
install.packages("markovchain")
install.packages("visNetwork")
install.packages("expm")
install.packages("stringr")
install.packages("Matrix")

library(tidyverse)
library(reshape2)
library(ggthemes)
library(ggrepel)
library(RColorBrewer)
library(ChannelAttribution)
library(markovchain)
library(visNetwork)
library(expm)
library(stringr)

train <-read.csv("traincl2.csv",header = T)

# assign (other) channel as NA


train <- subset(train, channelGrouping %in% c("Social", "Affiliates", "Direct", "Display",
                                              "Organic Search", "Paid Search", "Referral"))

# arrange dates in ascending order ('date' is the date column of train)
train <- train %>%
mutate(date = as.Date(date, "%Y-%m-%d")) %>%
arrange(date)

# convert revenue to 0/1 as the count of conversions


train$conversion <-ifelse(is.na(train$transactionRevenue),0,1)

#1 Split paths based on the purchase number/time


train_path <- train %>%
group_by(fullVisitorId) %>%
mutate(path_no = ifelse(is.na(lag(cumsum(conversion))), 0, lag(cumsum(conversion))) + 1) %>%
ungroup()
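The path_no logic starts a new path after each conversion, so post-purchase visits fall into the next path. On toy data (made-up visitor ID, illustrative only):

```r
# Sketch: visits after the first purchase get path_no = 2
library(dplyr)
toy <- data.frame(fullVisitorId = c(1, 1, 1, 1), conversion = c(0, 1, 0, 0))
toy %>%
  group_by(fullVisitorId) %>%
  mutate(path_no = ifelse(is.na(lag(cumsum(conversion))), 0, lag(cumsum(conversion))) + 1) %>%
  ungroup()
# path_no comes out 1, 1, 2, 2
```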

#2 customer life cycle: non-first-purchase paths for returning customers


train_path_non_1 <- train_path %>%
filter(path_no > 1) %>%
select(-path_no)
#3 replace direct touchpoint
## adding order of channels in the path
train_path_non_1 <- train_path_non_1 %>%
group_by(fullVisitorId) %>%
mutate(ord = c(1:n()),
is_non_direct = ifelse(channelGrouping == "Direct", 0, 1),
is_non_direct_cum = cumsum(is_non_direct)) %>%
## removing Direct when it is the first in the path
filter(is_non_direct_cum != 0) %>%
## replacing Direct with the previous touch point
mutate(channelGrouping = ifelse(channelGrouping == "Direct",
channelGrouping[which(channelGrouping != "Direct")][is_non_direct_cum], channelGrouping)) %>%
ungroup() %>%
select(-ord, -is_non_direct, -is_non_direct_cum)

#4 split a unique channel and multi-channel paths.


##### one- and multi-channel paths #####
train_path_non_1<- train_path_non_1 %>%
group_by(fullVisitorId) %>%
mutate(uniq_channel_tag = ifelse(length(unique(channelGrouping)) == 1, TRUE, FALSE)) %>%
ungroup()

train_path_non_1_uniq <- train_path_non_1 %>%


filter(uniq_channel_tag == TRUE) %>%
select(-uniq_channel_tag)

train_path_non_1_multi <- train_path_non_1 %>%


filter(uniq_channel_tag == FALSE) %>%
select(-uniq_channel_tag)

# for returning customers: attribution analysis for multi- and unique-channel purchase paths
train_multi_paths <- train_path_non_1_multi %>%
group_by(fullVisitorId) %>%
summarise(path = paste(channelGrouping, collapse = ' > '),
conversion = sum(conversion)) %>%
ungroup() %>%
filter(conversion > 1)

mod_attrib_alt <- markov_model(train_multi_paths,

var_path = 'path',
var_conv = 'conversion',
out_more = TRUE)
mod_attrib_alt$removal_effects
mod_attrib_alt$result

# adding unique paths


train_uniq_paths <- train_path_non_1_uniq %>%
filter(conversion > 1) %>%
group_by(channelGrouping) %>%
summarise(conversions = sum(conversion)) %>%
ungroup()

d_multi <- data.frame(mod_attrib_alt$result)

## add conversions generated from unique paths to conversions generated from multi-channel
## paths to get the total conversions after channels are weighted by the Markov chain method

