Prediction of Online Shopper’s Purchasing Intention Using Binary Logistic Regression, Decision Tree, and Random Forest

Muhammad Adlansyah Muda[1], Radha Ayu Aswari[2], Muhammad Ahsan[3]
Department of Statistics, Faculty of Science and Data Analytics, Sepuluh Nopember Institute of Technology
Arief Rahman Hakim Street, Surabaya 60111, Indonesia
adlansyahmuda@gmail.com[1], radhaayu@gmail.com[2]

ABSTRACT

Online shopping is becoming a trend in recent years. Compared to physical stores, customers prefer online shopping, mainly because people find it convenient and easy to bargain shop from the comfort of their home or office. Online shop platforms can provide many products which fulfill our needs. People only need an internet service and digital payment, after which the product is delivered to their house or office. Nowadays online shopping is one of people's daily activities and helps them save time and energy. The information provided by the visits of users is fed to machine learning classification algorithms to build a predicting model. We use three classification methods: binary logistic regression, decision tree, and random forest. In the process of refining the model and making it provide more insightful results, oversampling and feature selection pre-processing steps are employed. The biggest probabilities of gaining revenue come from new visitors, visitors with a low bounce rate, sales in the months of November and May, and visitors who spent more time on product-related pages. The best model in this paper is the random forest using 5-fold cross-validation, with 88,35% accuracy.

Keywords—Cross Validation, Feature Selection, Online Shopping, Oversampling, Pre-processing

I. INTRODUCTION

Online shopping is becoming a trend in recent years. Burdened by expensive store leases, unprofitable markets, and continuing disruption from online shopping, Forever 21 will close 200 stores [1]. Compared to physical stores, customers prefer online shopping, mainly because people find it convenient and easy to bargain shop from the comfort of their home or office. Online shop platforms can provide many products which fulfill our needs. People only need an internet service and digital payment, after which the product is delivered to their house or office. Consumers can get full information about a product from the reviews passed on by existing users. If they want to buy a product, they are no longer limited to asking their friends and families, because there are many product reviews on the web which give the opinions of the product's existing users.

The advantages of online shopping include saving time, saving energy, 24/7 availability, and price comparison. In 2018, an estimated 1.8 billion people worldwide purchased goods online. During the same year, global e-retail sales amounted to 2.8 trillion U.S. dollars, and projections show growth of up to 4.8 trillion U.S. dollars by 2021 [2]. Examples of online shopping platforms are Amazon, Alibaba, Tokopedia, Rakuten, etc. On special days such as Christmas, Valentine's Day, summer holidays, online shopping days, etc., e-commerce platforms offer discounts, flash sales, and other promotions to persuade customers to buy products. Such promotion days increase customers' intention to shop online.

In this paper, we propose to analyze the factors behind online shoppers' intentions and to predict online shopping decisions. The dataset describes the behavior of online shop customers and whether a session ends with a purchase. We use three classification methods: binary logistic regression, decision tree, and random forest. For the model we define 0 for false revenue and 1 for true revenue. The performance of the classification algorithms used in this paper is compared using accuracy. We hope this paper will help online shop platforms find strategies to increase their revenue.

II. LITERATURE REVIEW

A number of topics from the literature are reviewed to achieve the purposes of this research. The literature review used in this research is as follows.

A. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

1. Missing Values
Missing values are endemic in real-world datasets. There are six methods to resolve missing values, as follows [3].
a. Ignore the tuple. This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective unless the tuple contains several attributes with missing values.
b. Fill in the missing value manually. In general, this approach is time consuming and may not be feasible given a large data set with many missing values.
c. Use a global constant to fill in the missing value. Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.
d. Use a measure of central tendency for the attribute. For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median.
e. Use the attribute mean or median for all samples belonging to the same class as the given tuple. If the data distribution for a given class is normal, the mean can be used; but if the data distribution for a given class is skewed, the median value is a better choice.
f. Use the most probable value to fill in the missing value. This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

2. Outlier Detection
One-class classification is often called outlier (or novelty) detection because the learning algorithm is being used to differentiate between data that appear normal and abnormal with respect to the distribution of the training data [4]. One method of outlier detection is to inspect the outlier plot of a boxplot. It should be noted that the boxplot rule for detecting outliers has been criticized on the grounds that it might declare too many points outliers when there is skewness. More precisely, if a distribution is skewed to the right, among the larger values that are observed, too many might be declared outliers [5].
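To make methods (d)-(f) and the boxplot rule concrete, the following sketch shows one possible implementation with pandas; the toy column is hypothetical (the dataset used later in this paper contains no missing values).

```python
import pandas as pd

# Hypothetical toy column; the paper's dataset has no missing values.
df = pd.DataFrame({"duration": [12.0, None, 7.5, 300.0, 9.1, None, 8.4]})

# Method (d): impute with a measure of central tendency. The median is
# the safer choice here because page-duration data are typically skewed.
df["duration"] = df["duration"].fillna(df["duration"].median())

# Boxplot (IQR) rule for outlier detection: flag points outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. As noted in [5], this rule can flag
# too many points when the distribution is skewed to the right.
q1, q3 = df["duration"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["duration"] < q1 - 1.5 * iqr) |
              (df["duration"] > q3 + 1.5 * iqr)]
print(outliers)  # the 300.0 observation is flagged
```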
B. Descriptive Statistics
Descriptive statistics is a primary step to organize the information and extract a descriptive summary that highlights its salient features.

1. Measures of Center
The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The two most commonly used indicators of center are the mean and the median [6].
The sample mean of a set of n measurements x_1, x_2, ..., x_n is the sum of these measurements divided by n. The sample mean is denoted by \bar{x}.

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}    (1)

The sample median of a set of n measurements x_1, x_2, ..., x_n is the middle value when the measurements are arranged from smallest to largest.

2. Measures of Variation
Besides locating the center of the data, any descriptive study of data must numerically measure the extent of variation around the center. Two data sets may exhibit similar positions of center but may be remarkably different with respect to variability. To obtain a measure of spread, the sample variance is constructed by adding the squared deviations and dividing the total by the number of observations n minus one [6].

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}    (2)

To obtain a measure of variability in the same unit as the data, we take the positive square root of the variance, called the sample standard deviation. The standard deviation rather than the variance serves as a basic measure of variability.

s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}    (3)

C. Feature Selection
Roughly speaking, feature selection methods are applied in one of three conceptual frameworks: the filter model, the wrapper model, and embedded methods. These three basic families differ in how the learning algorithm is incorporated in evaluating and selecting features. In the filter model the selection of features is done as a preprocessing activity, without trying to optimize the performance of any specific data mining technique directly. This is usually achieved through an (ad hoc) evaluation function, using a search method to select a subset of features that maximizes this function. Performing an exhaustive search is usually intractable due to the large number of initial features; therefore, different methods may apply a variety of search heuristics. Wrapper methods select features by "wrapping" the search around the selected learning algorithm and evaluate feature subsets based on the learning performance of the data mining technique for each candidate feature subset. The main drawback of this approach is its computational complexity. Finally, embedded methods incorporate feature search and the learning algorithm into a single optimization problem formulation. When the number of samples and dimensions becomes very large, the filter approach is usually the choice due to its computational efficiency and neutral bias toward any learning methodology [7].

D. Chi-Squared Test of Independence
The Chi-Squared test is used to check the relationship between a predictor variable and the response variable. The I × J contingency table that cross-classifies n subjects is as follows.

Table 1 Contingency Table for I × J.
Variable 1      Variable 2: 1    2       ⋯    J       Total
1               n_11             n_12    ⋯    n_1J    n_1.
2               n_21             n_22    ⋯    n_2J    n_2.
⋮               ⋮                ⋮            ⋮       ⋮
I               n_I1             n_I2    ⋯    n_IJ    n_I.
Total           n_.1             n_.2    ⋯    n_.J    n_..

Hypothesis:
H_0: There is no relationship between the two variables
H_1: There is a relationship between the two variables

Statistical test:

\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}    (4)

where

e_{ij} = \frac{n_{i.} \, n_{.j}}{n_{..}}

Information:
n_{i.} = total observed frequency for row i
n_{.j} = total observed frequency for column j
n_{ij} = observed value in row i and column j
e_{ij} = expected value in row i and column j

Rejection area: Reject H_0 if \chi^2 is greater than \chi^2_{\alpha, (I-1)(J-1)} [8].
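As an illustration of Eq. (4), the test statistic and the expected frequencies e_ij can be computed with scipy; the 2 × 2 table below is made up for the example, not taken from the paper's data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: visitor type (rows) vs revenue (columns).
observed = np.array([[85, 15],
                     [60, 40]])

# chi2_contingency computes e_ij = n_i. * n_.j / n.. internally and returns
# the statistic of Eq. (4), its p-value, the degrees of freedom (I-1)(J-1),
# and the matrix of expected frequencies.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p_value, dof)
print(expected)  # the e_ij values
```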
E. Multicollinearity Test
Multicollinearity testing serves to minimize the error in the resulting estimates. The multicollinearity test is an assumption that must be fulfilled in regression. If there is a high correlation between the predictor variables, it will produce a very large error in the resulting estimates. Multicollinearity can be detected through the value of the Variance Inflation Factor (VIF) [9]. VIF values greater than 5 indicate a correlation between the predictor variables [10]. The VIF value can be calculated with the following formula.

VIF_k = \frac{1}{1 - R_k^2}    (5)

where

R_k^2 = \frac{SSR}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}

Information:
k = number of predictor variables
R_k^2 = coefficient of determination between the k-th predictor variable and the other predictor variables
SSR = regression sum of squares
SST = total sum of squares

F. Decision Tree
In principle, decision trees are used to predict the membership of objects in different categories (classes), taking into account the values that correspond to their attributes (predictor variables) [11]. The flexibility of this technique makes it particularly attractive, especially because it presents the advantage of a very suggestive visualization (a 'tree' which synthetically summarizes the classification). Although decision trees are not so widespread in the pattern recognition field from a probabilistic statistical point of view, they are widely used in other domains such as, for instance, medicine (diagnosis), computer science (data structures), botany (classification), psychology (behavioral decision theory), etc. The general way of inducing a decision tree according to Hunt's algorithm lies in the following steps [12]:
1. Denote by D_t the set of training objects (data) that reach node t.
2. If D_t is an empty set, then t is a terminal node (a leaf node), labeled by the class Φ_t.
3. If D_t contains objects that belong to the same class C_t, then t is also a leaf node, labeled as C_t.
4. If D_t contains objects that belong to more than one class, then we use an attribute test to split the objects into smaller subsets.
The impurity measures most commonly used in decision tree methods are:
1. The GINI (impurity) index, used mainly in the CART (C&RT) and SPRINT algorithms, represents a measure of how often a randomly chosen object from the training dataset would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the dataset. As an impurity measure, it reaches a value of zero when only one class is present at a node. Conversely, it reaches its maximum value when class sizes at the node are equal [13].
2. The entropy (impurity), used mainly in ID3, C4.5 and C5.0, is based on selecting the splitting point that maximizes the information gain (i.e., maximum entropy reduction) (idem). Similarly, the minimum value is zero, when all records belong to one class, implying most information.
3. The misclassification measure, based on classification error as its name suggests, is sometimes used to measure the node 'impurity' (idem). Once again, the minimum value is zero, when all records belong to one class, implying most information.
4. The Chi-square measure, which is similar to the standard Chi-square value computed for the expected and observed classifications [14].
5. The G-square measure, which is similar to the maximum-likelihood Chi-square (idem).
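To illustrate the GINI and entropy measures in points 1 and 2, the sketch below computes both for hypothetical class counts at a node; it is a numerical illustration, not the paper's implementation.

```python
import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum(p_c^2). Zero when the node is pure,
    # maximal when the class sizes at the node are equal.
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Entropy impurity: -sum(p_c * log2(p_c)). Zero when the node is pure.
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    p = p[p > 0]  # avoid log(0)
    return -np.sum(p * np.log2(p))

# Hypothetical node with 80 "no revenue" and 20 "revenue" records.
print(gini([80, 20]))     # 0.32
print(entropy([80, 20]))  # about 0.72 bits
print(gini([50, 50]))     # 0.50, the two-class maximum
```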
G. Random Forest
Random forest (RF) is an ensemble learner, a method that generates many classifiers and aggregates their results. RF creates multiple classification and regression trees (CART), each trained on a bootstrap sample of the original training data, and searches across a randomly selected subset of input variables to determine the splits. CARTs are binary decision trees that are constructed by repeatedly splitting the data in a node into child nodes, starting with the root node that contains the whole learning sample [15]. Each tree in the RF casts a vote for a given input x, and the output of the classifier is determined by majority voting of the trees. RF can handle high-dimensional data and uses a large number of trees in the ensemble. Some important features of RF are [16]:
1. It has an effective method for estimating missing data.
2. It has a method, weighted random forest (WRF), for balancing error in imbalanced data.
3. It estimates the importance of the variables used in the classification.

H. Logistic Regression
The difference between simple linear regression and logistic regression lies in the response variable. Logistic regression is a method that can be used to find the relationship between a response variable that is dichotomous (nominal or ordinal scale with two categories) or polychotomous (nominal or ordinal scale with more than two categories) and one or more predictor variables, where the predictor variables may be continuous or categorical [8].

1. Binary Logistic Regression
Binary logistic regression is a data analysis method used to find the relationship between a binary or dichotomous response variable (y) and polychotomous predictor variables (x) [9]. The outcome of the response variable consists of 2 categories, namely "success" and "failure", denoted by y = 1 (success) and y = 0 (failure). In such circumstances, the response variable follows the Bernoulli distribution for every single observation. The probability function for each observation is given as follows.

f(y) = \pi^{y} (1 - \pi)^{1-y}    (6)

where if y = 0 then f(y) = 1 − π and if y = 1 then f(y) = π. The logistic regression function can be written as follows.

\pi(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}    (7)

A transformation of π(x) that is central to logistic regression is the logit transformation, defined in terms of π(x) as follows.

g(x) = \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p    (8)

2. Parameter Estimation
Parameter estimation in logistic regression is carried out using the Maximum Likelihood method. The method estimates the parameters β by maximizing the likelihood function and requires that the data follow a certain distribution. In logistic regression, each observation follows the Bernoulli distribution, so its likelihood function can be determined. If x_i and y_i are the pair of predictor variables and response for the i-th observation, and it is assumed that each pair of observations is independent of the other observation pairs, with i = 1, 2, ..., n, then the probability function for each pair is as follows.

f(x_i) = \pi(x_i)^{y_i} (1 - \pi(x_i))^{1-y_i}    (9)

where

\pi(x_i) = \frac{e^{\sum_{j=0}^{p} \beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}}

Each pair of observations is assumed to be independent, so that the likelihood function is the product of the distribution functions of the pairs:

l(\beta) = \prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} \pi(x_i)^{y_i} (1 - \pi(x_i))^{1-y_i}

The likelihood function is more easily maximized in the form \ln l(\beta), expressed as L(\beta):

L(\beta) = \ln l(\beta) = \sum_{i=1}^{n} y_i \left( \sum_{j=0}^{p} \beta_j x_{ij} \right) - \sum_{i=1}^{n} \ln\left(1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}\right)

The maximum of β is obtained by taking the derivative of L(β) with respect to β and setting the result equal to zero:

\frac{\partial L(\beta)}{\partial \beta_j} = \sum_{i=1}^{n} y_i x_{ij} - \sum_{i=1}^{n} x_{ij} \frac{e^{\sum_{j=0}^{p} \beta_j x_{ij}}}{1 + e^{\sum_{j=0}^{p} \beta_j x_{ij}}} = 0

Then

\sum_{i=1}^{n} y_i x_{ij} = \sum_{i=1}^{n} \hat{\pi}(x_i) \, x_{ij}    (10)

Estimates of the variances and covariances are developed through the MLE (Maximum Likelihood Estimation) theory of the parameter coefficients [9]. The theory states that the estimated variance-covariance matrix is obtained through the second derivative of L(β):

\frac{\partial^2 L(\beta)}{\partial \beta_j \, \partial \beta_u} = -\sum_{i=1}^{n} x_{ij} x_{iu} \pi(x_i) (1 - \pi(x_i))

The variance-covariance matrix of the parameter estimates is obtained through the matrix inverse and is given as follows.

\mathrm{Cov}(\hat{\beta}) = \left( \mathbf{X}^{T} \, \mathrm{Diag}\left[\hat{\pi}(x_i)(1 - \hat{\pi}(x_i))\right] \mathbf{X} \right)^{-1}

where

\mathbf{X}^{T} = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ \vdots & \vdots & & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{bmatrix}

and \mathrm{Diag}\left[\hat{\pi}(x_i)(1 - \hat{\pi}(x_i))\right] is an (n × n) diagonal matrix whose main diagonal is \hat{\pi}(x_i)(1 - \hat{\pi}(x_i)). The SE(\hat{\beta}_j) estimator will be used in the parameter estimation testing stage. To get the estimated value of β from the first derivative of the non-linear L(β) function, the Newton-Raphson iteration method is used. The equation used is as follows.

\beta^{(t+1)} = \beta^{(t)} - \left( \mathbf{H}^{(t)} \right)^{-1} \mathbf{q}^{(t)}    (11)

where

\mathbf{q}^{(t)} = \left[ \frac{\partial L(\beta)}{\partial \beta_0}, \frac{\partial L(\beta)}{\partial \beta_1}, \ldots, \frac{\partial L(\beta)}{\partial \beta_k} \right]^{T}, \qquad \mathbf{H}^{(t)} = \begin{bmatrix} h_{11} & h_{12} & \cdots & h_{1k} \\ h_{21} & h_{22} & \cdots & h_{2k} \\ \vdots & \vdots & & \vdots \\ h_{k1} & h_{k2} & \cdots & h_{kk} \end{bmatrix}

with elements h_{ju} = \partial^2 L(\beta) / \partial \beta_j \partial \beta_u evaluated at \beta^{(t)}.

3. Parameter Estimation Testing
After the parameter estimates are obtained, the significance of the coefficients β is tested simultaneously with the following hypothesis.
H_0: β_1 = β_2 = ⋯ = β_p = 0
H_1: at least one β_j ≠ 0; j = 1, 2, ..., p
Statistical test:

G = -2 \ln \left[ \frac{\left( \frac{n_1}{n} \right)^{n_1} \left( \frac{n_0}{n} \right)^{n_0}}{\prod_{i=1}^{n} \hat{\pi}_i^{\,y_i} (1 - \hat{\pi}_i)^{1-y_i}} \right]    (12)

where n_1 = \sum_{i=1}^{n} y_i and n_0 = \sum_{i=1}^{n} (1 - y_i) = n - n_1.
The G statistic is a likelihood ratio test statistic whose value follows the Chi-Squared distribution, so H_0 is rejected if G > \chi^2_{\alpha, v}, where the degrees of freedom v equal the number of parameters in the model excluding β_0.
When the simultaneous test does not produce significant results, the significance of the coefficients β is tested partially on the response variable by comparing the maximum likelihood parameter estimates with their standard errors. The partial testing hypothesis is as follows.
H_0: β_j = 0
H_1: β_j ≠ 0; j = 1, 2, ..., p
Statistical test:

W = \left( \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \right)^2    (13)

The W statistic follows the Chi-Squared distribution, so H_0 is rejected if W > \chi^2_{\alpha, v}, with v degrees of freedom equal to the number of predictors.

I. Performance Evaluation of Classifiers
According to [17], the evaluation of a classification procedure is an evaluation that looks at the possibility of misclassification made by a classification function. The measure used is the Apparent Error Rate (APER), the fraction of observations misclassified by the classification function. The determination of the classification can be seen from Table 2, in which subjects are classified into two groups, namely π_1 and π_2, as follows.

Table 2 Classification Table.
Observation Results    Estimate π_1    Estimate π_2
y_1                    n_11            n_12
y_2                    n_21            n_22

Information:
n_11 = number of subjects from y_1 correctly classified as π_1
n_12 = number of subjects from y_1 incorrectly classified as π_2
n_21 = number of subjects from y_2 incorrectly classified as π_1
n_22 = number of subjects from y_2 correctly classified as π_2
The formula for determining the classification error is as follows.

APER(\%) = \frac{n_{12} + n_{21}}{n_{11} + n_{12} + n_{21} + n_{22}} \times 100\%    (14)

In addition, confusion matrices were formed for the classifiers, and from these matrices the measures of accuracy and Area Under the Curve (AUC) were calculated using Eqs. (15) and (16) [18].

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (15)

AUC = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)    (16)

where TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative measures, respectively. The accuracy value is a measure of how well a classifier distinguishes both positive and negative cases [19]. The AUC shows the ability of a classifier to prevent false classification. It measures the area under the Receiver Operating Characteristic (ROC) curve, which is a two-dimensional plot of the sensitivity (the measure of how well a classifier distinguishes positive cases) versus the specificity (the measure of how well a classifier distinguishes negative cases) at various discrimination thresholds [20].

J. Online Shop
Online shopping is a form of electronic commerce which allows consumers to directly buy goods or services from a seller over the internet using a web browser. There are many reasons why people shop online. Empirical research shows that the convenience of the internet is one of the impacts on consumers' willingness to buy online [21]. The internet has made accessing data easier; since customers rarely have a chance to touch and feel products and services online before making a decision, online sellers normally provide more product information that customers can use when making a purchase [22]. E-commerce has made transactions easier than they were, and online stores offer consumers benefits by providing a greater variety of products and services to choose from [23].

III. METHODOLOGY
The methodology explains the steps of the analysis and the variables that we use for the model.

A. Data Source
The data describe online shoppers' intention. There are 12330 observations and 18 variables; we split the data into 70% training data and 30% testing data to make the prediction. The data were obtained from https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.

B. Variable
In this paper we use the variables shown in Table 3.

Table 3 Variables for The Analysis.
Variable    Variable Name                        Scale
Y           1 = Revenue True, 0 = Revenue False  Nominal
X1          Administrative                       Ratio
X2          Administrative Duration              Ratio
X3          Informational                        Ordinal
X4          Informational Duration               Ratio
X5          Product Related                      Ratio
X6          Product Related Duration             Ratio
X7          Bounce Rates                         Ordinal
X8          Exit Rate                            Interval
X9          Page Values                          Ratio
X10         Special Date                         Ordinal
X11         Month                                Nominal
X12         Operating Systems                    Nominal
X13         Browser                              Nominal
X14         Region                               Nominal
X15         Traffic Type                         Nominal
X16         Visitor Type                         Nominal
X17         Weekend                              Nominal
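A minimal sketch of how the data source and the 70/30 split described above could be set up in Python. The file name online_shoppers_intention.csv is the name used on the UCI page, and the random seed and stratification are our own assumptions, since the paper does not report them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# UCI "Online Shoppers Purchasing Intention" dataset:
# 12330 sessions, 17 predictors plus the Revenue target.
df = pd.read_csv("online_shoppers_intention.csv")

X = df.drop(columns=["Revenue"])
y = df["Revenue"].astype(int)  # 1 = revenue true, 0 = revenue false

# 70% training / 30% testing, as in Section III.A.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)  # (8631, 17), (3699, 17)
```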
C. Steps Analysis
The steps performed during the analysis are:
1. Data cleaning (missing value detection and treatment)
2. Feature selection
3. Multicollinearity test
4. Exploratory data analysis
5. Splitting the data into training and testing sets
6. Building the model using binary logistic regression
7. Building the model using a decision tree
8. Building the model using a random forest
9. Comparing the accuracy of the models
IV. RESULTS AND DISCUSSION
In this research several analytical methods are used to achieve the research objectives. The results of the analysis and discussion in predicting online shoppers' purchasing intention are given as follows.

A. Pre-Processing
This dataset does not contain any missing values. However, the target value is imbalanced. The proportion of the categories is shown in Figure 1.

Figure 1 Bar Chart of Revenue.

The No Revenue class is much larger than the Revenue class. This makes the data unbalanced, which might affect the performance of the prediction models. Because of that, we oversample the minority class using the SMOTE method to treat the imbalanced data; the resulting proportions are shown in Figure 2.

Figure 2 Bar Chart of Revenue After Oversampling.
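The SMOTE step above could be applied with the imbalanced-learn package as sketched below, assuming the categorical predictors have already been encoded numerically and using X_train, y_train from the split in Section III.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class (Revenue = 1) samples by
# interpolating between a minority sample and its nearest minority
# neighbours, instead of merely duplicating rows.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))  # imbalanced, as in Figure 1
print("after: ", Counter(y_res))    # balanced, as in Figure 2
```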
B. Feature Selection
Effective feature selection eliminates redundant variables and keeps only the best subset of predictors in the model, which also gives shorter training times. In this paper we used the chi-square test for feature selection. The chi-square test is used in statistics to test the independence of two events.

Table 4 Chi-Square Test.
Variable                   Chi-square Score    Result
Administrative             1133.97             Significant
Administrative Duration    41754.84            Significant
Informational              357.98              Significant
Informational Duration     35059.78            Significant
Product Related            19317.29            Significant
Product Related Duration   877404.34           Significant
Bounce Rates               29.65               Significant
Exit Rate                  28.99               Significant
Page Values                175126.81           Significant
Special Date               53.80               Significant
Month                      86.16               Significant
Operating Systems          1.04                Significant
Browser                    8.87                Significant
Region                     3.04                Not Significant
Traffic Type               1.28                Significant
Visitor Type               37.55               Significant
Weekend                    8.12                Significant

In Table 4 we compare each chi-square score with the critical value from the chi-square distribution table. As the result of the independence test, we eliminate the variable Region from the model.

C. Multicollinearity Test
Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. The VIF is widely used as a measure of the degree of multicollinearity of an independent variable with the other independent variables in a model.

Table 5 VIF Test.
Variable                   VIF       Result
Administrative             2.832     Significant
Administrative Duration    2.044     Significant
Informational              2.120     Significant
Informational Duration     1.779     Significant
Product Related            6.667     Not Significant
Product Related Duration   6.015     Not Significant
Bounce Rates               7.345     Not Significant
Exit Rate                  11.721    Not Significant
Page Values                1.142     Significant
Special Date               1.132     Significant
Month                      4.579     Significant
Operating Systems          5.664     Not Significant
Browser                    2.940     Significant
Traffic Type               2.120     Significant
Visitor Type               6.155     Not Significant
Weekend                    1.287     Significant

Following Ringle's rule to avoid multicollinearity, we require a VIF of less than 5. We eliminate the variables one by one, starting from the largest VIF. After the eliminations, the remaining variables all satisfy the rule, as shown in Table 6.

Table 6 VIF Test After Elimination.
Variable                   VIF      Result
Administrative             2.574    Significant
Administrative Duration    1.968    Significant
Informational              2.105    Significant
Informational Duration     1.761    Significant
Product Related Duration   1.905    Significant
Bounce Rates               1.330    Significant
Page Values                1.126    Significant
Special Date               1.128    Significant
Month                      4.255    Significant
Browser                    2.551    Significant
Traffic Type               1.999    Significant
Visitor Type               4.777    Significant
Weekend                    1.278    Significant
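The one-variable-at-a-time elimination described above can be sketched with statsmodels as below; the threshold of 5 follows Ringle's rule cited in the text, and the helper function is ours, not code from the paper.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    # Repeatedly drop the predictor with the largest VIF until every
    # remaining VIF is below the threshold (Ringle's rule).
    X = X.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() < threshold:
            return X
        # With Table 5, the first variable removed would be Exit Rate (VIF 11.721).
        X = X.drop(columns=[vifs.idxmax()])
```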
D. Exploratory Data Analysis
Exploratory data analysis gives more insight and information about the features.

Figure 3 Revenue by Visitor Type.

From Figure 3 we can see that new visitors give the biggest impact on revenue, at 25%, while other visitors give an impact of 19% and returning visitors an impact of 14%.

Figure 4 Revenue by Month.

Figure 4 shows that the biggest sales occur in May and November. To increase revenue, we suggest that e-commerce platforms give more discounts and promotions in May and November.

Figure 5 Revenue by Duration.

Figure 5 shows that the total amount of time (in seconds) spent by the visitor on product-related pages gives the biggest impact on the visitor's decision to shop online.

Figure 6 KDE Plot of Bounce Rate.

Referring to Figure 6, a low bounce rate makes the probability of the customer shopping online bigger. The user interface is important to make visitors comfortable while shopping online.

E. Classification Analysis
The methods used to predict online shoppers' purchasing intention in this research are binary logistic regression, decision tree, and random forest, with 5- and 10-fold cross validation as the evaluation methods.
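A sketch of how the three classifiers and the 5- and 10-fold evaluations could be run with scikit-learn, using the SMOTE-balanced data from the pre-processing step; hyperparameters are left at their defaults because the paper does not report them.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score

models = {
    "Binary Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# X_res, y_res: the SMOTE-balanced training data from Section IV.A.
for folds in (5, 10):
    for name, model in models.items():
        # Out-of-fold predictions at the default 0,5 threshold.
        pred = cross_val_predict(model, X_res, y_res, cv=folds)
        acc = accuracy_score(y_res, pred)
        # On hard 0/1 predictions, roc_auc_score reduces to
        # (sensitivity + specificity) / 2, i.e. Eq. (16).
        auc = roc_auc_score(y_res, pred)
        print(f"{name}, {folds}-fold: accuracy={acc:.4f}, AUC={auc:.4f}")
```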
1. Binary Logistic Regression
The prediction of online shoppers' purchasing intention with binary logistic regression is as follows.

a. Binary Logistic Regression with 5-Fold Cross Validation
The results of the classification using the training data from 5-fold cross validation with the binary logistic regression method are shown in Table 7 below.

Table 7 Confusion Matrix of Training Data from 5-Fold Cross Validation with The Binary Logistic Regression.
Actual      Predict Positive    Predict Negative
Positive    6631                714
Negative    1655                5690

After the confusion matrix is formed as in Table 7, the next step is to calculate the accuracy of the classification of the training data from 5-fold cross validation using equations (15) and (16), with the results summarized in Table 8 as follows.
Table 8 The Accuracy of Training Data Classification from 5-Fold Cross Validation with The Binary Logistic Regression.
Accuracy    Sensitivity    Specificity    AUC
83,87%      77,47%         90,28%         83,87%

The measure of classification accuracy used is the accuracy value, because after oversampling the categories of online shoppers' purchasing intention are balanced. Table 8 shows that with the binary logistic regression method and 5-fold cross validation, the accuracy obtained on the training data is 83,87%, with a 77,47% rate of correctly classifying customers who agree to buy and a 90,28% rate of correctly classifying customers who do not. The ROC curve for the training data from 5-fold cross validation is shown in Figure 7 below.

Figure 7 ROC Curve of Training Data from 5-Fold Cross Validation with Binary Logistic Regression.

Because the categories are balanced, we use a threshold of 0,5. Figure 7 shows that the area under the ROC curve, the AUC value, is 83,87%. The model is then evaluated using the testing data with 5-fold cross validation. The evaluation results are shown in Table 9 below.

Table 9 Confusion Matrix of Testing Data from 5-Fold Cross Validation with The Binary Logistic Regression.
Actual      Predict Positive    Predict Negative
Positive    2756                321
Negative    180                 442

After the confusion matrix is formed as in Table 9, the accuracy of the classification of the testing data from 5-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 10 as follows.

Table 10 The Accuracy of Testing Data Classification from 5-Fold Cross Validation with The Binary Logistic Regression.
Accuracy    Sensitivity    Specificity    AUC
86,46%      71,06%         89,57%         80,31%

Table 10 shows that with the binary logistic regression method and 5-fold cross validation, the accuracy obtained on the testing data is 86,46%, with a 71,06% rate of correctly classifying customers who agree to buy and an 89,57% rate of correctly classifying customers who do not. The accuracy value is not too different from the classification results on the training data, which shows that the classification results are quite good for predicting online shoppers' purchasing intention. The ROC curve for the testing data from 5-fold cross validation is shown in Figure 8 below.

Figure 8 ROC Curve of Testing Data from 5-Fold Cross Validation with Binary Logistic Regression.

With the threshold of 0,5, Figure 8 shows that the AUC value is 80,31%. The results also show that the evaluation results on the testing data are not too far from the classification results on the training data, so it can be concluded that there is no overfitting or underfitting in the prediction model.

b. Binary Logistic Regression with 10-Fold Cross Validation
The classification results are then compared with 10-fold cross validation to get the best accuracy value. The results of the classification using the training data from 10-fold cross validation with the binary logistic regression method are shown in Table 11 below.

Table 11 Confusion Matrix of Training Data from 10-Fold Cross Validation with The Binary Logistic Regression.
Actual      Predict Positive    Predict Negative
Positive    6636                709
Negative    1655                5690

After the confusion matrix is formed as in Table 11, the accuracy of the classification of the training data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 12 as follows.

Table 12 The Accuracy of Training Data Classification from 10-Fold Cross Validation with The Binary Logistic Regression.
Accuracy    Sensitivity    Specificity    AUC
83,91%      77,47%         90,35%         83,91%
Table 12 shows that with the binary logistic regression method and 10-fold cross validation, the accuracy obtained on the training data is 83,91%, with a 77,47% rate of correctly classifying customers who agree to buy and a 90,35% rate of correctly classifying customers who do not. The ROC curve for the training data from 10-fold cross validation is shown in Figure 9 below.

Figure 9 ROC Curve of Training Data from 10-Fold Cross Validation with The Binary Logistic Regression.

With the threshold of 0,5, Figure 9 shows that the AUC value is 83,91%. Compared to the 5-fold cross validation results, the training accuracy with 10-fold cross validation is better. The model is then evaluated using the testing data with 10-fold cross validation. The evaluation results are shown in Table 13 below.

Table 13 Confusion Matrix of Testing Data from 10-Fold Cross Validation with The Binary Logistic Regression.
Actual      Predict Positive    Predict Negative
Positive    2758                319
Negative    178                 444

After the confusion matrix is formed as in Table 13, the accuracy of the classification of the testing data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 14 as follows.

Table 14 The Accuracy of Testing Data Classification from 10-Fold Cross Validation with The Binary Logistic Regression.
Accuracy    Sensitivity    Specificity    AUC
86,56%      71,38%         89,63%         80,51%

Table 14 shows that with the binary logistic regression method and 10-fold cross validation, the accuracy obtained on the testing data is 86,56%, with a 71,38% rate of correctly classifying customers who agree to buy and an 89,63% rate of correctly classifying customers who do not. The accuracy value is not too different from the classification results on the training data, which shows that the classification results are quite good for predicting online shoppers' purchasing intention. The ROC curve for the testing data from 10-fold cross validation is shown in Figure 10 below.

Figure 10 ROC Curve of Testing Data from 10-Fold Cross Validation with Binary Logistic Regression.

With the threshold of 0,5, Figure 10 shows that the AUC value is 80,51%. Compared to the 5-fold cross validation results, the testing accuracy with 10-fold cross validation is better. The evaluation results on the testing data are also not too far from the classification results on the training data, so it can be concluded that there is no overfitting or underfitting in the prediction model.

2. Decision Tree
The prediction of online shoppers' purchasing intention with the decision tree is as follows.

a. Decision Tree with 5-Fold Cross Validation
The results of the classification using the training data from 5-fold cross validation with the decision tree method are shown in Table 15 below.

Table 15 Confusion Matrix of Training Data from 5-Fold Cross Validation with The Decision Tree.
Actual      Predict Positive    Predict Negative
Positive    6949                396
Negative    325                 7020

After the confusion matrix is formed as in Table 15, the accuracy of the classification of the training data from 5-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 16 as follows.

Table 16 The Accuracy of Training Data Classification from 5-Fold Cross Validation with The Decision Tree.
Accuracy    Sensitivity    Specificity    AUC
95,09%      95,58%         94,61%         95,09%
Table 16 shows that with the decision tree method and 5-fold cross validation, the accuracy obtained on the training data is 95,09%, with a 95,58% rate of correctly classifying customers who agree to buy and a 94,61% rate of correctly classifying customers who do not. The ROC curve for the training data from 5-fold cross validation is shown in Figure 11 below.

Figure 11 ROC Curve of Training Data from 5-Fold Cross Validation with Decision Tree.

With the threshold of 0,5, Figure 11 shows that the AUC value is 95,09%. The model is then evaluated using the testing data with 5-fold cross validation. The evaluation results are shown in Table 17 below.

Table 17 Confusion Matrix of Testing Data from 5-Fold Cross Validation with The Decision Tree.
Actual      Predict Positive    Predict Negative
Positive    2805                272
Negative    187                 435

After the confusion matrix is formed as in Table 17, the accuracy of the classification of the testing data from 5-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 18 as follows.

Table 18 The Accuracy of Testing Data Classification from 5-Fold Cross Validation with The Decision Tree.
Accuracy    Sensitivity    Specificity    AUC
87,59%      69,94%         91,16%         80,55%

Table 18 shows that with the decision tree method and 5-fold cross validation, the accuracy obtained on the testing data is 87,59%, with a 69,94% rate of correctly classifying customers who agree to buy and a 91,16% rate of correctly classifying customers who do not. The ROC curve for the testing data from 5-fold cross validation is shown in Figure 12 below.

Figure 12 ROC Curve of Testing Data from 5-Fold Cross Validation with Decision Tree.

With the threshold of 0,5, Figure 12 shows that the AUC value is 80,55%. The testing accuracy is quite different from the classification results on the training data, which shows that the classification results are not good enough to predict online shoppers' purchasing intention, so overfitting can be suspected in the prediction model. One solution to this problem is to change the ratio of training data to testing data. In this research, however, no further analysis was carried out on this matter and it was assumed that overfitting did not occur.

b. Decision Tree with 10-Fold Cross Validation
The classification results are then compared with 10-fold cross validation to get the best accuracy value. The results of the classification using the training data from 10-fold cross validation with the decision tree method are shown in Table 19 below.

Table 19 Confusion Matrix of Training Data from 10-Fold Cross Validation with The Decision Tree.
Actual      Predict Positive    Predict Negative
Positive    6897                448
Negative    374                 6971

After the confusion matrix is formed as in Table 19, the accuracy of the classification of the training data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 20 as follows.

Table 20 The Accuracy of Training Data Classification from 10-Fold Cross Validation with The Decision Tree.
Accuracy    Sensitivity    Specificity    AUC
94,40%      94,91%         93,90%         94,40%
Table 20 shows that with the decision tree method and 10-fold cross validation, the accuracy obtained on the training data is 94,40%, with a 94,91% rate of correctly classifying customers who agree to buy and a 93,90% rate of correctly classifying customers who do not. The ROC curve for the training data from 10-fold cross validation is shown in Figure 13 below.

Figure 13 ROC Curve of Training Data from 10-Fold Cross Validation with The Decision Tree.

With the threshold of 0,5, Figure 13 shows that the AUC value is 94,40%. Compared to the 5-fold cross validation results, the training accuracy with 10-fold cross validation is smaller. The model is then evaluated using the testing data with 10-fold cross validation. The evaluation results are shown in Table 21 below.

Table 21 Confusion Matrix of Testing Data from 10-Fold Cross Validation with The Decision Tree.
Actual      Predict Positive    Predict Negative
Positive    2810                267
Negative    174                 448

After the confusion matrix is formed as in Table 21, the accuracy of the classification of the testing data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 22 as follows.

Table 22 The Accuracy of Testing Data Classification from 10-Fold Cross Validation with The Decision Tree.
Accuracy    Sensitivity    Specificity    AUC
88,08%      72,03%         91,32%         81,67%

Table 22 shows that with the decision tree method and 10-fold cross validation, the accuracy obtained on the testing data is 88,08%, with a 72,03% rate of correctly classifying customers who agree to buy and a 91,32% rate of correctly classifying customers who do not. The ROC curve for the testing data from 10-fold cross validation is shown in Figure 14 below.

Figure 14 ROC Curve of Testing Data from 10-Fold Cross Validation with Decision Tree.

With the threshold of 0,5, Figure 14 shows that the AUC value is 81,67%. Compared to the 5-fold cross validation results, the testing accuracy with 10-fold cross validation is better. The testing accuracy is still quite different from the classification results on the training data, which shows that the classification results are not good enough to predict online shoppers' purchasing intention, so overfitting can be suspected in the prediction model. One solution is to change the ratio of training data to testing data; in this research, however, no further analysis was carried out on this matter and it was assumed that overfitting did not occur.

3. Random Forest
The prediction of online shoppers' purchasing intention with the random forest is as follows.

a. Random Forest with 5-Fold Cross Validation
The results of the classification using the training data from 5-fold cross validation with the random forest method are shown in Table 23 below.

Table 23 Confusion Matrix of Training Data from 5-Fold Cross Validation with The Random Forest.
Actual      Predict Positive    Predict Negative
Positive    7345                0
Negative    0                   7345

After the confusion matrix is formed as in Table 23, the accuracy of the classification of the training data from 5-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 24 as follows.

Table 24 The Accuracy of Training Data Classification from 5-Fold Cross Validation with The Random Forest.
Accuracy    Sensitivity    Specificity    AUC
100%        100%           100%           100%

Table 24 shows that with the random forest method and 5-fold cross validation, the accuracy obtained on the training data is 100%, with 100% rates of correctly classifying both customers who agree to buy and customers who do not. The ROC curve for the training data from 5-fold cross validation is shown in Figure 15 below.

Figure 15 ROC Curve of Training Data from 5-Fold Cross Validation with Random Forest.
With the threshold of 0,5, Figure 15 shows that the AUC value is 100%. The model is then evaluated using the testing data with 5-fold cross validation. The evaluation results are shown in Table 25 below.

Table 25 Confusion Matrix of Testing Data from 5-Fold Cross Validation with The Random Forest.
Actual      Predict Positive    Predict Negative
Positive    2848                229
Negative    202                 420

After the confusion matrix is formed as in Table 25, the accuracy of the classification of the testing data from 5-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 26 as follows.

Table 26 The Accuracy of Testing Data Classification from 5-Fold Cross Validation with The Random Forest.
Accuracy    Sensitivity    Specificity    AUC
88,35%      67,52%         92,56%         80,04%

Table 26 shows that with the random forest method and 5-fold cross validation, the accuracy obtained on the testing data is 88,35%, with a 67,52% rate of correctly classifying customers who agree to buy and a 92,56% rate of correctly classifying customers who do not. The ROC curve for the testing data from 5-fold cross validation is shown in Figure 16 below.

Figure 16 ROC Curve of Testing Data from 5-Fold Cross Validation with Random Forest.

With the threshold of 0,5, Figure 16 shows that the AUC value is 80,04%. The testing accuracy is quite different from the classification results on the training data, which shows that the classification results are not good enough to predict online shoppers' purchasing intention, so overfitting can be suspected in the prediction model. One solution is to change the ratio of training data to testing data; in this research, however, no further analysis was carried out on this matter and it was assumed that overfitting did not occur.

b. Random Forest with 10-Fold Cross Validation
The classification results are then compared with 10-fold cross validation to get the best accuracy value. The results of the classification using the training data from 10-fold cross validation with the random forest method are shown in Table 27 below.

Table 27 Confusion Matrix of Training Data from 10-Fold Cross Validation with The Random Forest.
Actual      Predict Positive    Predict Negative
Positive    7345                0
Negative    0                   7345

After the confusion matrix is formed as in Table 27, the accuracy of the classification of the training data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 28 as follows.

Table 28 The Accuracy of Training Data Classification from 10-Fold Cross Validation with The Random Forest.
Accuracy    Sensitivity    Specificity    AUC
100%        100%           100%           100%

Table 28 shows that with the random forest method and 10-fold cross validation, the accuracy obtained on the training data is 100%, with 100% rates of correctly classifying both customers who agree to buy and customers who do not. The ROC curve for the training data from 10-fold cross validation is shown in Figure 17 below.

Figure 17 ROC Curve of Training Data from 10-Fold Cross Validation with The Random Forest.

With the threshold of 0,5, Figure 17 shows that the AUC value is 100%. Compared to the 5-fold cross validation results, the training accuracies with 10-fold and 5-fold cross validation are the same.
The model is then evaluated using the testing data with 10-fold cross validation. The evaluation results are shown in Table 29 below.

Table 29 Confusion Matrix of Testing Data from 10-Fold Cross Validation with The Random Forest.
Actual      Predict Positive    Predict Negative
Positive    2846                231
Negative    205                 417

After the confusion matrix is formed as in Table 29, the accuracy of the classification of the testing data from 10-fold cross validation is calculated using equations (15) and (16), with the results summarized in Table 30 as follows.

Table 30 The Accuracy of Testing Data Classification from 10-Fold Cross Validation with The Random Forest.
Accuracy    Sensitivity    Specificity    AUC
88,21%      67,04%         92,49%         79,77%

Table 30 shows that with the random forest method and 10-fold cross validation, the accuracy obtained on the testing data is 88,21%, with a 67,04% rate of correctly classifying customers who agree to buy and a 92,49% rate of correctly classifying customers who do not. The ROC curve for the testing data from 10-fold cross validation is shown in Figure 18 below.

Figure 18 ROC Curve of Testing Data from 10-Fold Cross Validation with Random Forest.

With the threshold of 0,5, Figure 18 shows that the AUC value is 79,77%. Compared to the 5-fold cross validation results, the testing accuracy with 10-fold cross validation is smaller. The testing accuracy is quite different from the classification results on the training data, which shows that the classification results are not good enough to predict online shoppers' purchasing intention, so overfitting can be suspected in the prediction model. One solution is to change the ratio of training data to testing data; in this research, however, no further analysis was carried out on this matter and it was assumed that overfitting did not occur.

F. Determine Best Model
After analyzing the classifications using binary logistic regression, decision tree, and random forest, the accuracy results with the 5-fold and 10-fold cross validation methods are compared and summarized in Table 31.

Table 31 Summary of the Results of Classification Analysis.
Method                        Cross Validation    Accuracy    AUC
Binary Logistic Regression    5-fold              86,46%      80,31%
                              10-fold             86,56%      80,51%
Decision Tree                 5-fold              87,59%      80,55%
                              10-fold             88,08%      81,67%
Random Forest                 5-fold              88,35%      80,04%
                              10-fold             88,21%      79,77%

Table 31 shows the accuracy and AUC values for each method and each fold of cross validation. Because the categories of online shoppers' purchasing intention are balanced, the best method is determined from the accuracy value, with a threshold of 0,5 for the AUC value. Based on Table 31, it can be concluded that the random forest with 5-fold cross validation gives the best accuracy value for predicting online shoppers' purchasing intention.

V. CONCLUSION
The summary of the analysis and discussion in this research is as follows.
1. The biggest probabilities of gaining revenue come from new visitors, visitors with a low bounce rate, sales in the months of November and May, and visitors who spent more time on product-related pages.
2. The best model for predicting online shoppers' purchasing intention is the random forest using 5-fold cross-validation, with 88,35% accuracy.

VI. SUGGESTION
A suggestion from this research for further work: to perform a classification analysis for predicting online shoppers' purchasing intention with a better accuracy value, trials can be done with classification methods other than the binary logistic regression, decision tree, and random forest analyzed in this research.

REFERENCES

[1] E. Nedlund, "Bankrupt Forever 21 is closing 200 stores," 29 October 2019. [Online]. Available: https://edition.cnn.com.
[2] J. Clement, "E-commerce worldwide - Statistics & Facts," 12 March 2019. [Online]. Available: https://www.statista.com/.
[3] J. Han, M. Kamber and J. Pei, Data Mining Concepts and Techniques Third Edition, United States of America: Elsevier Inc., 2012.
[4] I. H. Witten, E. Frank and M. A. Hall, Data Mining Practical Machine Learning Tools and Techniques Third Edition, United States: Elsevier Inc., 2011.
[5] M. Hubert and E. Vandervieren, "An Adjusted Boxplot for
Skewed Distributions," Computational Statistics & Data
Analysis, vol. 52, pp. 5186-5201, 2008.
[6] R. A. Johnson and G. K. Bhattacharyya, Statistics Principles and
Methods Sixth Edition, United States of America: John Wiley &
Sons, Inc., 2010.
[7] M. Kantardzic, Data Mining: Concepts, Models, Methods, and
Algorithms, Hoboken, New Jersey: John Wiley & Sons, Inc,
2011.
[8] A. Agresti, Categorical Data Analysis Third Edition, New
Jersey: John Wiley & Sons, Inc., 2013.
[9] D. W. Hosmer, Applied Logistic Regression, United States of
America: John Wiley & Sons, Inc., 2013.
[10] J. F. Hair, G. T. M. Hult, C. M. Ringle and M. Sarstedt, A
Primer on Partial Least Squares Structural Equations Modeling
(PLS- SEM), Bönningstedt: SAGE Publications, 2014.
[11] F. Gorunescu, Data Mining: Concepts, Models and Techniques,
Romania: Department of Computer Science Faculty of
Mathematics and Computer Science University of Craiova,
2011.
[12] E. B. Hunt, J. Marin and P. T. Stone, Experiments in Induction,
New York: Academic Press, 1966.
[13] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, 2nd
edn, Hoboken: Wiley Interscience, 2001.
[14] T. Hill and P. Lewicki, Statistics, methods and applications: A
comprehensive reference for science, industry, and data mining,
StatSoft, 2006.
[15] L. Breiman, Classification and Regression Trees, Belmont, CA:
Wadsworth, Inc., 1984.
[16] L. Breiman, "Random forests," Machine learning, vol. 45, no. 1,
pp. 5-32, 2001.
[17] R. A. Johnson and D. W. Wichern, Applied Multivariate
Statistical Analysis Sixth Edition, United States of America:
Pearson Education, Inc, 2007.
[18] C. Riccioli, D. Pérez-Marín and A. Garrido-Varo, "Identifying
animal species in NIR hyperspectral images of processed animal
proteins (PAPs): comparison of multivariate techniques,"
Chemommetrics and Intelligent Laboratory Systems, 2018.
[19] N. Aggarwal and R. Agrawal, First and second order statistics
features for classification of magnetic resonance brain images,
vol. 2, J. Signal Inf. Process, 2012, p. 146.
[20] F. R. A. et al., Application of machine vision for classification of
soil aggregate size, Soil Tillage Res, 2016, pp. 8-17.
[21] C. L. Wang, L. R. Ye, Y. Zhang and D. D. Nguyen,
"Subscription to fee-based online services: What makes consumer
pay for online content?," Journal of Electronic Commerce
Research, vol. 6, no. 4, pp. 301-311, 2005.
[22] H. Lim and A. J. Dubinsky, "Consumer's perceptions of e-
shopping characteristics: An expectancy-value approach," The
Journal of Services Marketing, vol. 18, no. 6, pp. 500-513, 2004.
[23] C. Prasad and A. Aryasari, "Determinants of shopper behavior in
e-tailing: An empirical analysis," Paradigm, vol. 13, no. 1, pp.
73-83, 2009.