Customer Analytics and Revenue Prediction
Executive Summary
The aim of this project is twofold: analyzing customer behavior and predicting each customer's transaction revenue. A generalized linear model (GLM) is used for the customer behavior analysis, and a two-stage model consisting of logistic regression followed by a GLM is adopted for revenue prediction. In the prediction stage, the logistic model separates customers who are likely to make a purchase from those who are not with 98% accuracy and outputs a purchase probability for each visitor, while the GLM estimates the value of each transaction and achieves an RMSE of 1.119.
Interpreting the behavior model produced several insights: the ideal time for sales is around 4 PM to 6 PM; the countries and cities that generate the most revenue were identified (e.g., New York, San Francisco and Chicago); and the second quarter produced the highest sales.
Based on this analysis, several recommendations are provided: (1) reconsider the budget allocated to each channel; (2) improve the interface design, especially for the Firefox and Chrome browsers; (3) engage regular visitors with promotional offers; and (4) ensure sufficient inventory for peak visit times, along with ongoing maintenance of the website.
Objective of Analysis
The first step of this project is to understand customer characteristics and behavior, such as how many pages customers usually view, through which channel they arrive at the online store, and how many times they visit before making a purchase, and to determine how these characteristics influence the purchase decision.
The second step is to identify the customers who are likely to make a purchase based on the training data provided, and to estimate the transaction revenue for each customer in the testing dataset, in order to make predictions and shed light on the subsequent decision-making process.
Data Understanding
An online-store dataset split into training and testing sets was provided, containing 451,626 and 452,027 records respectively. Twelve variables were present, including four JSON columns (Appendix I, Variable Description). R is used for developing the models.
Action Learning Project—Team: DATA ROCKS
Data Preparation
The data provided had missing values (NAs), variables containing multiple fields, and variables with only one level. Data cleaning was therefore performed: NA values were replaced, the JSON fields were flattened into individual variables, and single-level variables were eliminated. R and the RStudio platform are used in this step and throughout the following analysis.
For categorical variables, values that occurred frequently (according to frequency tables) were kept as separate levels, while infrequent values were aggregated into a level called 'Others'. For continuous variables, missing values were replaced by the median, except for transaction revenue, where missing values were replaced with 0.
Variables with only one level ("not available in demo dataset") were deleted, along with variables missing more than 60% of their values, as they were not informative for prediction or user-behavior analysis. After cleaning, all categorical variables were dummy-coded for convenience of analysis and modeling.
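The cleaning steps above can be sketched as follows. This is an illustrative sketch in Python/pandas rather than the project's actual R code; the column names, the `clean` helper, and the frequency threshold are assumptions made for illustration.

```python
import pandas as pd

def clean(df, cat_cols, num_cols, min_freq=100):
    """Median-impute continuous columns, zero-fill transaction revenue,
    collapse rare categorical levels into 'Others', then dummy-code.
    Assumes a 'transactionRevenue' column exists (hypothetical schema)."""
    df = df.copy()
    # continuous variables: impute with the median
    for col in num_cols:
        df[col] = df[col].fillna(df[col].median())
    # transaction revenue is the exception: a missing value means no purchase
    df["transactionRevenue"] = df["transactionRevenue"].fillna(0)
    # categorical variables: merge infrequent values into an 'Others' level
    for col in cat_cols:
        freq = df[col].value_counts()
        rare = freq[freq < min_freq].index
        df[col] = df[col].where(~df[col].isin(rare), "Others")
    # dummy-code the remaining categorical levels
    return pd.get_dummies(df, columns=cat_cols)
```

The same effect is achieved in R with `ifelse`/indexing plus the `dummies` package, as shown in the code appendix.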
Modeling
Interpretation Generalized Linear Model
For interpretation, a generalized linear regression model (a conventional model for a continuous response variable with categorical and continuous predictors) was fitted to understand how customer behavior and geographical factors influence transaction revenue. The dependent variable is log(transaction revenue + 1).
Based on the correlation analysis and their importance to business decisions, ten variables were included in the model: channel grouping, browser, device category, hour of visit start time, new visit, visit number, quarter, country, city and pageviews. To gain greater insight into customer behavior, six channel groups and four browsers were retained based on their frequency, and ten countries and four cities were retained based on their total transaction revenue. Moreover, because pageviews has a diminishing effect (beyond a turning point, additional pageviews contribute less to transaction revenue), the square of pageviews was also included in the model.
For the features included in the model, the following levels were set as baselines: users from other countries and cities, other channels, other browsers, visits from desktop, and visits during the fourth quarter of the year. The resulting interpretation model is given in Appendix II (Complete GLM for Interpretation).
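The role of the squared pageviews term can be illustrated with a small synthetic example. This is a hedged sketch in Python/NumPy (the project itself used R's lm/glm); the coefficients and data below are invented purely to show how the turning point is recovered from a quadratic fit, while the report's Appendix VI places the actual turning point near 200 pageviews.

```python
import numpy as np

rng = np.random.default_rng(0)
pageviews = rng.integers(1, 300, size=500).astype(float)
# synthetic log(revenue + 1) with a diminishing pageview effect (illustrative only)
log_rev = 0.135 * pageviews - 0.0003 * pageviews**2 + rng.normal(0, 0.5, size=500)

# least-squares fit of log_rev ~ 1 + pageviews + pageviews^2
X = np.column_stack([np.ones_like(pageviews), pageviews, pageviews**2])
b0, b1, b2 = np.linalg.lstsq(X, log_rev, rcond=None)[0]

# the marginal effect is b1 + 2*b2*p; it turns negative past p* = -b1 / (2*b2)
turning_point = -b1 / (2 * b2)
```

A positive linear coefficient with a negative quadratic coefficient is exactly the "diminishing value of pageviews" pattern described above.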
Prediction Models
Logistic Model
In the first stage, a logistic model was built: of the 451,626 visitors, 70% were used for developing the model and the remaining 30% for validation. Some variables were transformed into dummies to explore the data further, such as the top 20 countries by transaction revenue and the top 5 most frequent browsers. In addition, a binary variable called Transrev was generated to distinguish visitors who generated transaction revenue from those who did not.
Logistic regression, a probabilistic classification model that predicts a binary outcome (target = 0 or 1) from one or more predictor variables, is carried out to estimate the probability of purchase for each visitor.
Variable reduction is a crucial step for simplifying the model without losing predictive power. Here, Information Value (IV) was used to measure the predictive strength of each variable, and a scorecard (Appendix III, Scorecard for Variables) was generated, from which the key variables were selected for modeling. The logistic regression model that was used is as follows:
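Separately from the fitted model referenced above, the snippet below is a toy illustration of the first stage: a logistic model fitted by plain gradient descent on the log-loss. It is a NumPy stand-in for R's glm(..., family = binomial), not the project's code, and the function names are ours; IV-based variable selection happens before this step.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=5000):
    """Gradient descent on the average logistic log-loss (intercept added internally)."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted purchase probability
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the average log-loss
    return w

def predict_prob(w, X):
    """Purchase probability for new visitors under fitted weights w."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
    return 1.0 / (1.0 + np.exp(-X @ w))
```

Visitors whose predicted probability exceeds a cut-off (0.5 in the report) are passed to the second-stage revenue model.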
Key Findings
Prediction Findings
For the logistic model, after development and validation, the model is applied to the testing dataset to identify customers who are likely to make a purchase. The model correctly classifies 98.5% of the validation set. For model fit, the McFadden pseudo-R² value is 0.44. A purchase probability is predicted for each customer: 674 customers have more than a 50% probability of purchase and are identified as high-value customers (Appendix III, Model Summary).
For the generalized linear model, the conditional training dataset was split 60/40 into development and validation sets, the model was built on the development set using the variables selected by forward selection, and predictions were made for the validation set. The resulting RMSE of 1.119 on the validation set indicates good predictive performance (Appendix IV).
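The two-stage logic and the RMSE metric quoted above can be sketched as follows. This is an illustrative Python sketch (the project used R); the 0.5 threshold mirrors the report's cut-off, and the function names are ours.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, as reported for the validation set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def two_stage_predict(purchase_prob, glm_log_revenue, threshold=0.5):
    """Stage 1 gates stage 2: visitors below the purchase-probability
    threshold are predicted to generate zero (log-)revenue."""
    purchase_prob = np.asarray(purchase_prob, float)
    glm_log_revenue = np.asarray(glm_log_revenue, float)
    return np.where(purchase_prob > threshold, glm_log_revenue, 0.0)
```

Because most visitors never purchase, gating the GLM's estimate on the classifier keeps the predicted revenue at zero for the vast majority of visitors.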
Interpretation Findings
A frequency procedure was used to profile customer behavior across purchase records. Regarding user behavior, the most popular channel is organic search, which brings 42.14% of customers to the website, followed by Social and Direct. Among 41 browsers, 68.57% of customers use Chrome, far exceeding all other browsers combined. Geographically, customers from the United States generated 94.2% of total transaction revenue across 215 countries. After the US, customers from Canada, Japan and Venezuela also generate more transaction revenue than other countries, although Venezuelan customers are only 0.23% of the total. Within the US, New York, Mountain View, San Francisco and Chicago are the four cities generating the most transaction revenue (Appendix V).
For the interpretation model, fit can be assessed by the adjusted R², which is 0.18. The findings are as follows (Appendix II, Model Summary):
● Among channels, organic search, referral, direct and social have significant relations with transaction revenue; in particular, organic search increases transaction revenue by 6% compared with other channels. Paid search, however, is not significant.
● Two browsers have a significant negative effect on purchasing: Firefox decreases transaction revenue by 3.74% and Chrome by 2.67% compared with other browsers.
● Regarding device category, desktop contributes most of the transaction revenue; revenue decreases by around 9% when the device is mobile or tablet rather than desktop.
● Among the 10 countries with the largest purchase revenue, customers from the United States purchase 10.8% more than other countries, and customers from Japan purchase 8% less. Moreover, users from cities within the United States show significant differences in purchase revenue compared with other cities: Chicago users purchase 55.35% more and Mountain View users 19.84% less.
● New visits have a significantly negative relation with transaction revenue: new customers are less likely to purchase than existing customers, and transaction revenue decreases by 21.11% for a new customer. For existing customers, each one-unit increase in visit number decreases transaction revenue by 0.047%.
● Within a year, the second and third quarters have significant relations with transaction revenue: compared with the fourth quarter, people purchase 8% more in the second quarter and 3.7% less in the third. Within a day, people buy the most in the afternoon around 4 PM and purchase at a uniformly low level around midnight.
● Pageviews has a significantly positive effect on purchase revenue: each additional pageview increases purchase revenue by 13.50%. However, according to the quadratic term in the square of pageviews (Appendix VI), beyond roughly 200 pageviews each additional pageview diminishes this positive effect.
Recommendations
Based on the results of this study, there are several recommendations for the company.
● The website design could be made more user friendly so that customers can find the information they need within 200 pageviews, especially in Firefox and Chrome given their negative effect on transaction revenue. The recommendation system could also be improved to help customers find relevant information efficiently.
● The marketing department should focus on converting new visitors into regular users. To bring new visitors back, follow-up advertising such as a regular email newsletter could be used.
● More advertising could be displayed through the various channels around 4 PM daily, and site outages should be avoided during these peak hours. Moreover, inventory management and logistics should be strengthened to guarantee sufficient supply in the second quarter of each year.
● Within the one-year time frame of this dataset, channels work differently for first-time purchasers and regular customers. To attract new customers, the company should focus on a combination of the referral and display channels, which are useful for building product awareness. To maintain the loyalty of returning customers, the social channel is recommended.
● Because paid search contributes insignificantly to transaction revenue, this channel's performance deserves further investigation. The company could optimize its existing paid-search campaigns or switch to another search engine, and could also reallocate part of the marketing budget to more profitable channels.
Limitations
Some detailed information was not accessible, which limits the recommendations.
● The dataset contains no information about the company or its products, so recommendations specific to the industry, products or services cannot be made.
● No detailed customer profiles, such as demographics, age and income, were provided; these would generally be useful for reference.
● Without the exact spend per channel, it is impossible to evaluate channels accurately using measures such as ROI.
Appendices
I. Variables Description
Model Summary
Residuals
Min 1Q Median 3Q Max
-12.8418 -0.2542 0.0403 0.1808 26.6318
Scorecard of Variables
It is advisable to choose variables with information value over 0.5.
Model Summary
V. Statistic Description
Explanation
2. Split the paths into pairs, then calculate the probability of each state-to-state transition:
a. (start) -> C1, C1 -> C2, C2 -> C3, C3 -> (conversion)
b. (start) -> C1, C1 -> (null)
c. (start) -> C2, C2 -> C3, C3 -> (null)
3. Calculate the removal effect of each channel (remove each channel from the graph in turn and measure how many conversions are left).
Attribution of each channel = weighted removal effect of that channel × total number of conversions
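The removal-effect calculation in steps 2–3 can be sketched as below. This is an illustrative Python implementation over the toy paths listed above, not the ChannelAttribution R package the project used; the function names are ours.

```python
from collections import defaultdict

def transition_probs(paths):
    """Estimate state-to-state transition probabilities from observed paths
    (each path starts at 'start' and ends in 'conv' or 'null')."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def conversion_prob(probs, iters=500):
    """Fixed-point iteration for the probability of being absorbed in 'conv'."""
    states = set(probs) | {b for nxt in probs.values() for b in nxt}
    p = {s: 0.0 for s in states}
    p["conv"] = 1.0
    for _ in range(iters):
        for s, nxt in probs.items():
            p[s] = sum(w * p[b] for b, w in nxt.items())
    return p.get("start", 0.0)

def removal_effect(paths, channel):
    """Relative drop in conversion probability when `channel` is removed
    (all transitions into it are redirected to 'null')."""
    probs = transition_probs(paths)
    removed = {}
    for s, nxt in probs.items():
        if s == channel:
            continue
        merged = defaultdict(float)
        for b, w in nxt.items():
            merged["null" if b == channel else b] += w
        removed[s] = dict(merged)
    return 1.0 - conversion_prob(removed) / conversion_prob(probs)
```

Attribution for each channel then weights its removal effect against the sum of all removal effects and multiplies by the total number of conversions.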
Results
channel_name2: Affiliates
channel_name4: Display
channel_name5: Organic Search
channel_name6: Paid Search
channel_name7: Referral
channel_name8: Social
train_uniq_paths
channelGrouping conversions
<int> <dbl>
1 4 2
2 5 21
3 6 2
4 7 32
5 8 8
VIII. Code
#As expected, tr_totals and te_totals are different as the train set includes the target, transactionRevenue
names(tr_totals)
names(te_totals)
#Apparently tr_trafficSource contains an extra column as well - campaignCode
#It actually has only one non-NA value, so this column can safely be dropped later
table(tr_trafficSource$campaignCode, exclude = NULL)
names(tr_trafficSource)
names(te_trafficSource)
#Combine to make the full training and test sets
train <- train %>%
cbind(tr_device, tr_geoNetwork, tr_totals, tr_trafficSource) %>%
select(-device, -geoNetwork, -totals, -trafficSource)
test <- test %>%
cbind(te_device, te_geoNetwork, te_totals, te_trafficSource) %>%
select(-device, -geoNetwork, -totals, -trafficSource)
#Number of columns in the new training and test sets.
ncol(train)
ncol(test)
#Remove temporary tr_ and te_ sets
rm(tr_device); rm(tr_geoNetwork); rm(tr_totals); rm(tr_trafficSource)
rm(te_device); rm(te_geoNetwork); rm(te_totals); rm(te_trafficSource)
#Write the flattened training and test sets to CSV
write.csv(train, "train_flat.csv", row.names = F)
write.csv(test, "test_flat.csv", row.names = F)
# Factor Variables
is.factor(train_upload$channelGrouping)
contrasts(train_upload$channelGrouping)
is.factor(train_upload$browser)
table(train_upload$BrowserChrome)
head(train_upload)
sum(is.na(train_upload$bounces))
train_upload$bounces[is.na(train_upload$bounces)] <- 0
sum(is.na(train_upload$isTrueDirect))
train_upload$isTrueDirect[is.na(train_upload$isTrueDirect)] <- 0
sum(is.na(train_upload$transactionRevenue))
train_upload$transactionRevenue[is.na(train_upload$transactionRevenue)] <- 0
sum(is.na(train_upload$transrev))
train_upload$BrowserChrome[train_upload$browser=="Chrome"] <- 1
train_upload$BrowserChrome[train_upload$browser!="Chrome"] <- 0
train_upload$BrowserSafari[train_upload$browser=="Safari"] <- 1
train_upload$BrowserSafari[train_upload$browser!="Safari"] <- 0
train_upload$BrowserFirefox[train_upload$browser=="Firefox"] <- 1
train_upload$BrowserFirefox[train_upload$browser!="Firefox"] <- 0
train_upload$BrowserIE[train_upload$browser=="Internet Explorer"] <- 1
train_upload$BrowserIE[train_upload$browser!="Internet Explorer"] <- 0
train_upload$OtherBrowser[train_upload$browser!="Chrome" & train_upload$browser!="Safari" &
train_upload$browser!="Firefox"] <- 1
train_upload$OtherBrowser[train_upload$browser=="Chrome" | train_upload$browser=="Safari" |
train_upload$browser=="Firefox"] <- 0
train_upload$browser <- NULL
train_upload$BrowserIE <- NULL
#deviceCategory
#3 level:desktop mobile tablet
train$desktop[train$deviceCategory=="desktop"] <- 1
train$desktop[train$deviceCategory!="desktop"] <- 0
train$mobile[train$deviceCategory=="mobile"] <- 1
train$mobile[train$deviceCategory!="mobile"] <- 0
train$tablet[train$deviceCategory=="tablet"] <- 1
train$tablet[train$deviceCategory!="tablet"] <- 0
train$deviceCategory <- NULL
test$desktop[test$deviceCategory=="desktop"] <- 1
test$desktop[test$deviceCategory!="desktop"] <- 0
test$mobile[test$deviceCategory=="mobile"] <- 1
test$mobile[test$deviceCategory!="mobile"] <- 0
test$tablet[test$deviceCategory=="tablet"] <- 1
test$tablet[test$deviceCategory!="tablet"] <- 0
test$deviceCategory <- NULL
length(unique(test$fullVisitorId))
#continent
library(dummies)
train$continent[train$continent=="(not set)"] <- "Americas"
temp <- dummy(train$continent, sep = "_")
train <- cbind(temp,train)
train$continent <- NULL
#subcontinent
#country
library(dummies)
levels(train$country) <- c(levels(train$country),"Others","US", "HK",
"UK","PuertoRico","SouthKorea","SaudiArabia","SouthAfrica","NewZealand", "St.Lucia")
train$country[train$country=="(not set)"] <- "Others"
train$country[is.na(train$country)==T] <- "Others"
#region
train$region_California[train$region=="California"] <- 1
train$region_California[train$region!="California"] <- 0
train$region_NewYork[train$region=="New York"] <- 1
train$region_NewYork[train$region!="New York"] <- 0
train$region_England[train$region=="England"] <- 1
train$region_England[train$region!="England"] <- 0
train$region_Others[train$region!="California" & train$region!="New York" & train$region!="England"] <- 1
train$region_Others[train$region=="California" | train$region=="New York" | train$region=="England"] <- 0
train$region <- NULL
test$region_California[test$region=="California"] <- 1
test$region_California[test$region!="California"] <- 0
test$region_NewYork[test$region=="New York"] <- 1
test$region_NewYork[test$region!="New York"] <- 0
test$region_England[test$region=="England"] <- 1
test$region_England[test$region!="England"] <- 0
test$region_Others[test$region!="California" & test$region!="New York" & test$region!="England"] <- 1
test$region_Others[test$region=="California" | test$region=="New York" | test$region=="England"] <- 0
test$region <- NULL
length(unique(test$fullVisitorId))
#metro
train$metro_CA[train$metro=="San Francisco-Oakland-San Jose CA"] <- 1
train$metro_CA[train$metro!="San Francisco-Oakland-San Jose CA"] <- 0
train$metro_NY[train$metro=="New York NY"] <- 1
train$metro_NY[train$metro!="New York NY"] <- 0
train$metro_London[train$metro=="London"] <- 1
train$metro_London[train$metro!="London"] <- 0
train$metro_others[train$metro=="London" | train$metro=="New York NY" | train$metro=="San Francisco-Oakland-San Jose CA"] <- 0
train$metro_others[train$metro!="London" & train$metro!="New York NY" & train$metro!="San Francisco-Oakland-San Jose CA"] <- 1
train$metro <- NULL
length(unique(test$fullVisitorId))
#city
train$city_MountainView[train$city=="Mountain View"] <- 1
train$city_MountainView[train$city!="Mountain View"] <- 0
train$city_NewYork[train$city=="New York"] <- 1
train$city_NewYork[train$city!="New York"] <- 0
train$city_SanFrancisco[train$city=="San Francisco"] <- 1
train$city_SanFrancisco[train$city!="San Francisco"] <- 0
train$city_others[train$city=="Mountain View" | train$city=="New York" | train$city=="San Francisco"] <- 0
train$city_others[train$city!="Mountain View" & train$city!="New York" & train$city!="San Francisco"] <- 1
train$city <- NULL
#pageviews
#replacing NA with 1 in train and test dataset
train$pageviews[is.na(train$pageviews)==T] <- 1
test$pageviews[is.na(test$pageviews)==T] <- 1
length(unique(test$fullVisitorId))
#newVisits#
train$newVisits[train$newVisits==1] <- 1
train$newVisits[train$newVisits!=1 | is.na(train$newVisits)==T] <- 0
test$newVisits[test$newVisits==1]<-1
test$newVisits[test$newVisits!=1|is.na(test$newVisits)==T]<-0
length(unique(test$fullVisitorId))
summary(data)
install.packages('scorecard')
library(scorecard)
scorecard::iv(data,y='transrev')
iv = iv(data, y = 'transrev') %>%
as_tibble() %>%
mutate( info_value = round(info_value, 3) ) %>%
arrange( desc(info_value) )
install.packages('knitr')
library(knitr)
iv %>%
knitr::kable()
#Weight of Evidence – Variable wise
install.packages('caroline')
library(caroline)
#helper must be defined before it is called below
pct <- function(x){
  tbl <- table(x)
  tbl_pct <- cbind(tbl, round(prop.table(tbl)*100, 2))
  colnames(tbl_pct) <- c('Count','Percentage')
  kable(tbl_pct)
}
train$transrev <- as.factor(ifelse(train$transrev == 0, "Bad", "Good"))
pct(train$transrev)
op <- par(mfrow=c(1,2), new=TRUE)
plot(as.numeric(train$transrev), ylab="Good-Bad", xlab="n", main="Good ~ Bad")
hist(as.numeric(train$transrev), breaks=2,
     xlab="Good(1) and Bad(2)", col="blue")
par(op)
gbpct <- function(x, y=train$transrev){
  # x -> independent variable (vector), y -> dependent variable (vector)
  mt <- as.matrix(table(as.factor(x), as.factor(y)))
  Total <- mt[,1] + mt[,2]                                  # Total observations
  Total_Pct <- round(Total/sum(mt)*100, 2)                  # Total PCT
  Bad_pct <- round((mt[,1]/sum(mt[,1]))*100, 2)             # PCT of Bad (event/response)
  Good_pct <- round((mt[,2]/sum(mt[,2]))*100, 2)            # PCT of Good (non-event)
  Bad_Rate <- round((mt[,1]/(mt[,1]+mt[,2]))*100, 2)        # Bad rate (response rate)
  grp_score <- round((Good_pct/(Good_pct + Bad_pct))*10, 2) # score for each group
  WOE <- round(log(Good_pct/Bad_pct)*10, 2)                 # Weight of Evidence for each group
  g_b_comp <- ifelse(mt[,1] == mt[,2], 0, 1)
  IV <- ifelse(g_b_comp == 0, 0, (Good_pct - Bad_pct)*(WOE/10)) # Information Value for each group
  Efficiency <- abs(Good_pct - Bad_pct)/2                   # Efficiency for each group
  otb <- as.data.frame(cbind(mt, Good_pct, Bad_pct, Total,
                             Total_Pct, Bad_Rate, grp_score,
                             WOE, IV, Efficiency))
  otb$Names <- rownames(otb)
  otb                                                       # return the group table
}
A1 <- gbpct(train$browser)
A2 <- gbpct(train$deviceCategory)
A3 <- gbpct(train$country)
A4 <- gbpct(train$pageviews)
A5 <- gbpct(train$channelGrouping)
#Model Fitting
memory.limit(size=3900000)
memory.limit(size=35000)
# Test Dataset
test$BrowserChrome[test$browser=="Chrome"] <- 1
test$BrowserChrome[test$browser!="Chrome"] <- 0
test$BrowserSafari[test$browser=="Safari"] <- 1
test$BrowserSafari[test$browser!="Safari"] <- 0
test$BrowserFirefox[test$browser=="Firefox"] <- 1
test$BrowserFirefox[test$browser!="Firefox"] <- 0
test$BrowserIE[test$browser=="Internet Explorer"] <- 1
test$BrowserIE[test$browser!="Internet Explorer"] <- 0
test$OtherBrowser[test$browser!="Chrome" & test$browser!="Safari" & test$browser!="Firefox" &
test$browser!="Internet Explorer"] <- 1
test$OtherBrowser[test$browser=="Chrome" | test$browser=="Safari" | test$browser=="Firefox" |
test$browser=="Internet Explorer"] <- 0
test$browser <- NULL
sum(is.na(test$operatingSystem))#No Missing
table(test$operatingSystem)
test$OS_Windows <- ifelse(test$operatingSystem=="Windows",1,0)
test$OS_Macintosh <- ifelse(test$operatingSystem=="Macintosh",1,0)
test$OS_Android <- ifelse(test$operatingSystem=="Android",1,0)
test$OS_iOS <- ifelse(test$operatingSystem=="iOS",1,0)
test$OS_Linux <- ifelse(test$operatingSystem=="Linux",1,0)
test$OS_Chrome <- ifelse(test$operatingSystem=="Chrome OS",1,0)
#------backward selection------#
fitall <- lm(logtr~.,data=train)
summary(fitall) #NAs in Fitall
step(fitall,direction = "backward")
reg_back <- lm(formula = logtr ~ country_Chile + country_Indonesia + country_Japan +
country_Ukraine + country_Venezuela + visitNumber + hits +
pageviews + newVisits + OrganicSearch + Social + PaidSearch +
Affiliates + Chrome + OS_Macintosh + OS_Windows + OS_iOS +
OS_ChromeOS + mobile + region_California + metro_CA + metro_NY +
city_SanFrancisco + source_youtube + source_mall, data = train) #AIC=592
summary(reg_back)
library(caret)
prediction_back <- predict(reg_back, validation)
rmse_back <- postResample(validation$logtr, prediction_back)
#------forward selection------#
fitstart=lm(logtr~1,data=train)
step(fitstart,direction = "forward", scope = formula(fitall))
#------stepwise selection------#
step(fitstart,direction = "both", scope = formula(fitall))
#construct the hour-of-day variable from visitStartTime
class(train$visitStartTime) = c('POSIXt','POSIXct')
head(train$visitStartTime)
train$hour<-as.POSIXlt(train$visitStartTime)$hour
summary(train$hour)
train.reg <- lm(transactionRevenue1 ~
  channelGrouping1_Organic_Search + channelGrouping1_Referral + channelGrouping1_Direct
  + channelGrouping1_Affiliates + channelGrouping1_Paid_Search + channelGrouping1_Social
  + qrt1 + qrt2 + qrt3 + factor(hour)
  + browser1_Firefox + browser1_Chrome + browser1_Safari + browser1_Internet_Explorer
  + deviceCategory_mobile + deviceCategory_tablet
  + country1_Australia + country1_Indonesia + country1_Mexico + country1_United_States
  + country1_Taiwan + country1_Canada + country1_Hong_Kong + country1_Japan + country1_Kenya
  + country1_Venezuela + city1_Chicago + city1_Mountain_View + city1_New_York + city1_San_Francisco
  + newVisits + visitNumber + pageviews + I(pageviews^2),
  data = train)
summary(train.reg)
install.packages("ChannelAttribution")
install.packages("tidyverse")
install.packages("reshape2")
install.packages("ggthemes")
install.packages("ggrepel")
install.packages("RColorBrewer")
install.packages("markovchain")
install.packages("visNetwork")
install.packages("expm")
install.packages("stringr")
install.packages("Matrix")
library(tidyverse)
library(reshape2)
library(ggthemes)
library(ggrepel)
library(RColorBrewer)
library(ChannelAttribution)
library(markovchain)
library(visNetwork)
library(expm)
library(stringr)
train <- read.csv("traincl2.csv", header = T)
train_path_non_1 <- train_path_non_1 %>%
  group_by(fullVisitorId) %>%
  mutate(ord = c(1:n()),
         is_non_direct = ifelse(channelGrouping == "Direct", 0, 1),
         is_non_direct_cum = cumsum(is_non_direct)) %>%
  ## removing Direct when it is the first in the path
  filter(is_non_direct_cum != 0) %>%
  ## replacing Direct with the previous touch point
  mutate(channelGrouping = ifelse(channelGrouping == "Direct",
         channelGrouping[which(channelGrouping != "Direct")][is_non_direct_cum], channelGrouping)) %>%
  ungroup() %>%
  select(-ord, -is_non_direct, -is_non_direct_cum)
# for first-purchase customers: attribution analysis for multi- and unique-channel purchase paths
train_multi_paths <- train_path_non_1_multi %>%
  group_by(fullVisitorId) %>%
  summarise(path = paste(channelGrouping, collapse = ' > '),
            conversion = sum(conversion)) %>%
  ungroup() %>%
  filter(conversion > 1)
mod_attrib_alt <- markov_model(train_multi_paths,
                               var_path = 'path',
                               var_conv = 'conversion',
                               out_more = TRUE)
mod_attrib_alt$removal_effects
mod_attrib_alt$result
## add conversions generated from unique paths to the conversions generated from multi-channel paths