All content following this page was uploaded by Muhammad Adlansyah Muda on 08 June 2020.
Online shopping is becoming a trend in recent years. Burdened by expensive store leases, unprofitable markets, and continuing disruption from online shopping, Forever 21 will close 200 stores [1]. Compared to physical stores, customers prefer to shop online, mainly because people find it convenient and easy to bargain shop from the comfort of their home or office. Online shopping platforms can provide many products that fulfill our needs. People only need internet service and digital payment; after that, the product will be delivered to their house or office. Consumers can get full information about a product from the reviews posted by its existing users. If they want to buy a product, they are no longer limited to asking their friends and families, because there are many product reviews on the web that give the opinions of the product's existing users.

The advantages of online shopping include saving time, saving energy, 24/7 availability, and easy comparison of prices. In 2018, an estimated 1.8 billion people worldwide purchased goods online. During the same year, global e-retail sales amounted to 2.8 trillion U.S. dollars, and projections show growth of up to 4.8 trillion U.S. dollars by 2021 [2]. Examples of online shopping platforms are Amazon, Alibaba, Tokopedia, and Rakuten. On special days like Christmas, Valentine's Day, summer ...

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

1. Missing Values
Missing values are endemic in real-world datasets. There are six methods to resolve missing values, as follows [3].

a. Ignore the tuple
This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values.

b. Fill in the missing value manually
In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

c. Use a global constant to fill in the missing value
Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞.

d. Use a measure of central tendency for the attribute
For normal (symmetric) data distributions the mean can be used, while skewed data distributions should employ the median.

e. Use the attribute mean or median for all samples belonging to the same class as the given tuple
If the data distribution for a given class is normal, the mean can be used; but if the data distribution for a given class is skewed, the median value should be employed.

f. Use the most probable value to fill in the missing value
This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.
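The central-tendency strategies (methods d and e above) can be sketched in a few lines of Python; the `impute` helper and the session-duration numbers below are illustrative, not from the paper.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Fill None entries with a measure of central tendency:
    the mean for roughly symmetric data, the median for skewed data."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# One missing duration among five sessions; 50.0 skews the data right.
durations = [12.0, None, 8.0, 10.0, 50.0]
print(impute(durations, "mean"))    # fills with 20.0 (pulled up by 50.0)
print(impute(durations, "median"))  # fills with 11.0 (robust to the skew)
```

The gap between the two fills is exactly why method d recommends the median for skewed distributions.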
2. Outlier Detection
One-class classification is often called outlier (or novelty) detection, because the learning algorithm is being used to differentiate between data that appear normal and abnormal with respect to the distribution of the training data [4]. One method of outlier detection is to look for outlying points in a boxplot. It should be noted that the boxplot rule for detecting outliers has been criticized on the grounds that it might declare too many points outliers when there is skewness. More precisely, if a distribution is skewed to the right, then among the larger values that are observed, too many might be declared outliers [5].

B. Descriptive Statistics
Descriptive statistics is a primary step to organize the information and extract a descriptive summary that highlights its salient features.

1. Measures of Center
The most important aspect of studying the distribution of a sample of measurements is locating the position of a central value about which the measurements are distributed. The two most commonly used indicators of center are the mean and the median [6]. The sample mean of a set of n measurements x₁, x₂, ..., xₙ is the sum of these measurements divided by n. The sample mean is denoted by x̄.

x̄ = (Σᵢ₌₁ⁿ xᵢ) / n    (1)

The sample variance measures the spread of the measurements around x̄:

s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)    (2)

C. Feature Selection
Roughly speaking, feature selection methods are applied in one of three conceptual frameworks: the filter model, the wrapper model, and embedded methods. These three basic families differ in how the learning algorithm is incorporated in evaluating and selecting features. In the filter model the selection of features is done as a preprocessing activity, without trying to optimize the performance of any specific data mining technique directly. This is usually achieved through an (ad hoc) evaluation function, using a search method to select a subset of features that maximizes this function. Performing an exhaustive search is usually intractable due to the large number of initial features, so different methods may apply a variety of search heuristics. Wrapper methods select features by "wrapping" the search around the selected learning algorithm and evaluate feature subsets based on the learning performance of the data mining technique for each candidate feature subset. The main drawback of this approach is its computational complexity. Finally, embedded methods incorporate feature search and the learning algorithm into a single optimization problem formulation. When the number of samples and dimensions becomes very large, the filter approach is usually the choice due to its computational efficiency and neutral bias toward any learning methodology [7].

D. Chi-Squared Test of Independence
The Chi-Squared test is used to check the relationship between a predictor variable and the response variable, based on the I × J contingency table that cross-classifies the n subjects.

In binary logistic regression, the probability of success π(x) given predictors x₁, ..., xₚ is modeled as

π(x) = exp(β₀ + β₁x₁ + ⋯ + βₚxₚ) / (1 + exp(β₀ + β₁x₁ + ⋯ + βₚxₚ))

and its logit transformation is linear in the parameters:

g(x) = ln(π(x) / (1 − π(x))) = β₀ + β₁x₁ + ⋯ + βₚxₚ    (8)
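The model and its logit can be checked numerically; the coefficients below are hypothetical, chosen only to make the arithmetic visible.

```python
import math

def pi_x(beta, x):
    """pi(x) = exp(b0 + b1*x1 + ... + bp*xp) / (1 + exp(...))."""
    eta = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))  # linear predictor g(x)
    return math.exp(eta) / (1.0 + math.exp(eta))

beta = [-1.0, 0.5, 2.0]                 # hypothetical b0, b1, b2
p = pi_x(beta, [2.0, 0.0])              # g(x) = -1 + 0.5*2 + 2*0 = 0, so pi = 0.5
print(round(p, 3))                      # 0.5
print(round(math.log(p / (1 - p)), 3))  # the logit recovers g(x) = 0.0
```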
2. Parameter Estimation
Parameter estimation in logistic regression is carried out using the Maximum Likelihood method. The method estimates the parameter β by maximizing the likelihood function, and it requires that the data follow a certain distribution. In logistic regression each observation follows the Bernoulli distribution, so its likelihood function can be determined. If xᵢ and yᵢ are pairs of independent variables bound to the i-th observation, and it is assumed that each pair of observations is independent of the other observation pairs, with i = 1, 2, ..., n, then the probability function for each pair is as follows.

f(xᵢ) = π(xᵢ)^yᵢ [1 − π(xᵢ)]^(1 − yᵢ)    (9)

where

π(xᵢ) = exp(Σⱼ₌₀ᵖ βⱼxᵢⱼ) / (1 + exp(Σⱼ₌₀ᵖ βⱼxᵢⱼ))

Since the observation pairs are independent, the likelihood function is the product

l(β) = Πᵢ₌₁ⁿ π(xᵢ)^yᵢ [1 − π(xᵢ)]^(1 − yᵢ)

The likelihood function is more easily maximized in the form ln l(β), expressed as L(β):

L(β) = ln l(β) = Σᵢ₌₁ⁿ yᵢ (Σⱼ₌₀ᵖ βⱼxᵢⱼ) − Σᵢ₌₁ⁿ ln(1 + exp(Σⱼ₌₀ᵖ βⱼxᵢⱼ))

The maximum of the β value is obtained by setting the first derivative of L(β) with respect to each βⱼ equal to zero:

∂L(β)/∂βⱼ = Σᵢ₌₁ⁿ yᵢxᵢⱼ − Σᵢ₌₁ⁿ xᵢⱼ exp(Σⱼ₌₀ᵖ βⱼxᵢⱼ) / (1 + exp(Σⱼ₌₀ᵖ βⱼxᵢⱼ)) = 0

Because L(β) is non-linear in β, the Newton-Raphson iteration method is used to obtain the estimate. Each iteration updates

β⁽ᵗ⁺¹⁾ = β⁽ᵗ⁾ − (H⁽ᵗ⁾)⁻¹ g⁽ᵗ⁾

where g⁽ᵗ⁾ is the vector of first derivatives of L(β) and H⁽ᵗ⁾ is the Hessian matrix of second derivatives,

    ⎡ h₁₁  h₁₂  ⋯  h₁ₖ ⎤
H = ⎢ h₂₁  h₂₂  ⋯  h₂ₖ ⎥
    ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
    ⎣ hₖ₁  hₖ₂  ⋯  hₖₖ ⎦

The covariance matrix based on the parameter estimation is obtained through the inverse matrix and given as follows.

Cov(β̂) = (Xᵀ Diag[π̂(xᵢ)(1 − π̂(xᵢ))] X)⁻¹

where

     ⎡  1    1   ⋯   1  ⎤
Xᵀ = ⎢ x₁₁  x₂₁  ⋯  xₙ₁ ⎥
     ⎢  ⋮    ⋮   ⋱   ⋮  ⎥
     ⎣ x₁ₖ  x₂ₖ  ⋯  xₙₖ ⎦

and Diag[π̂(xᵢ)(1 − π̂(xᵢ))] is an (n × n) diagonal matrix whose main diagonal entries are π̂(xᵢ)(1 − π̂(xᵢ)). The standard errors SE(β̂ⱼ) obtained from this matrix are used in the parameter estimation testing stage.

3. Parameter Estimation Testing
After the parameter estimates are obtained, the significance of the coefficients β is tested simultaneously, with the following hypotheses.
H₀ : β₁ = β₂ = ⋯ = βₚ = 0
H₁ : at least one βⱼ ≠ 0 ; j = 1, 2, ..., p
Statistical Test:

G = −2 ln{ [(n₁/n)^n₁ (n₀/n)^n₀] / [Πᵢ₌₁ⁿ π̂ᵢ^yᵢ (1 − π̂ᵢ)^(1 − yᵢ)] }

where n₁ = Σᵢ₌₁ⁿ yᵢ, n₀ = Σᵢ₌₁ⁿ (1 − yᵢ), and n = n₁ + n₀. The G statistical test is a Likelihood Ratio Test, and the G value follows the Chi-Squared distribution, so H₀ is rejected if G > χ²(α, v), where the v degrees of freedom equal the number of parameters in the model without β₀.
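The Newton-Raphson scheme above can be sketched with NumPy. This is a minimal illustration on simulated data, not the paper's implementation; `fit_logistic_newton` is a hypothetical helper.

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Maximum likelihood for binary logistic regression.
    Each step solves (X^T W X) delta = X^T (y - pi), i.e. the
    Newton-Raphson update with W = Diag(pi_i * (1 - pi_i))."""
    Xd = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-Xd @ beta))    # pi(x_i) at current beta
        W = pi * (1.0 - pi)
        grad = Xd.T @ (y - pi)                   # first derivative of L(beta)
        info = Xd.T @ (Xd * W[:, None])          # X^T W X (negative Hessian)
        beta = beta + np.linalg.solve(info, grad)
    cov = np.linalg.inv(info)                    # Cov(beta_hat) = (X^T W X)^-1
    return beta, np.sqrt(np.diag(cov))           # estimates and SE(beta_hat_j)

# Simulated data: true intercept 0.5, true slope 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
prob = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.random(200) < prob).astype(float)
beta_hat, se = fit_logistic_newton(x[:, None], y)
```

The Wald statistic of equation (13) then follows directly as `(beta_hat / se) ** 2`.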
When the simultaneous test does not produce significant results, the significance of each coefficient βⱼ is tested partially on the response variable by comparing the maximum likelihood parameter estimate β̂ⱼ with its standard error. The partial testing hypothesis is as follows.
H₀ : βⱼ = 0
H₁ : βⱼ ≠ 0 ; j = 1, 2, ..., p
Statistical Test:

W = β̂ⱼ² / [SE(β̂ⱼ)]²    (13)

The W statistical test follows the Chi-Squared distribution, so H₀ is rejected if W > χ²(α, v), with the v degrees of freedom equal to the number of predictors.

The accuracy value is a measure of how well a classifier distinguishes both positive and negative cases [19]. The AUC shows the ability of a classifier to prevent false classification. It measures the area under the Receiver Operating Characteristic (ROC) curve, which is a two-dimensional plot of the sensitivity (the measure of how well a classifier distinguishes positive cases) versus the specificity (the measure of how well a classifier distinguishes negative cases) at various discrimination thresholds [20].

G. Online Shop
Online shopping is a form of electronic commerce which allows consumers to buy goods or services directly from a seller over the internet using a web browser. There are many reasons why people shop online. Empirical research shows that the convenience of the internet is one of the factors influencing consumers' willingness to buy online [21]. The internet has made accessing data easier; since customers rarely have a chance to touch and feel a product or service online before they make a decision, online sellers normally provide extensive product information that customers can use when making a purchase [22]. E-commerce has made transactions easier than they were, and online stores offer consumers benefits by providing a greater variety of products and services to choose from [23].
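The accuracy and AUC measures described above can be computed without any plotting; the rank-based form of the AUC below is equivalent to the area under the ROC curve, and the scores are made up for illustration.

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of positive and negative cases classified correctly
    at a given discrimination threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    return float(np.mean(y_pred == y_true))

def auc(y_true, y_prob):
    """Area under the ROC curve via its rank (Mann-Whitney) form:
    the probability that a random positive outscores a random negative."""
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

y = np.array([1, 1, 1, 0, 0, 0])              # three positives, three negatives
p = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2])  # hypothetical fitted probabilities
print(accuracy(y, p))  # 4 of 6 cases correct at threshold 0.5
print(auc(y, p))       # 8 of the 9 positive-negative pairs ordered correctly
```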
I. Performance Evaluation of Classifiers
According to [17], evaluation of a classification procedure is an evaluation that looks at the possibility of misclassification made by a classification function. The measure used is the Apparent Error Rate (APER), the fraction of observations misclassified by the classification function. The determination of the classification can be seen from Table 2, in which subjects are classified into two groups, namely π₁ and π₂.

Table 2 Classification Table.
Observation Results   Estimated π₁   Estimated π₂
y₁                    n₁₁            n₁₂
y₂                    n₂₁            n₂₂

Information:
n₁₁ = number of subjects from y₁ correctly classified as π₁.
n₁₂ = number of subjects from y₁ incorrectly classified as π₂.
n₂₁ = number of subjects from y₂ incorrectly classified as π₁.
n₂₂ = number of subjects from y₂ correctly classified as π₂.
The formula for determining the classification error is as follows.

APER = (n₁₂ + n₂₁) / (n₁₁ + n₁₂ + n₂₁ + n₂₂) × 100%    (14)

III. METHODOLOGY
The methodology explains the steps of the analysis and the variables that we will use for the model.

A. Data Source
The data describe online shoppers' intention. There are 12,330 observations and 18 variables; we will split the data into 70% training data and 30% testing data to make the prediction. The data were obtained from https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.

B. Variable
In this paper we use the variables shown in Table 3.

Table 3 Variable for The Analysis.
Variable   Variable Name                   Scale
Y          Revenue (1 = True, 0 = False)   Nominal
X1         Administrative                  Ratio
X2         Administrative Duration         Ratio
X3         Informational                   Ordinal
X4         Informational Duration          Ratio
X5         Product Related                 Ratio
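Equation (14) reduces to a one-line computation over the counts of Table 2; the counts below are hypothetical.

```python
def aper(n11, n12, n21, n22):
    """Apparent Error Rate: misclassified subjects (n12 + n21)
    as a percentage of all classified subjects, per equation (14)."""
    return (n12 + n21) / (n11 + n12 + n21 + n22) * 100.0

# 100 subjects, 15 of them misclassified.
print(aper(n11=50, n12=10, n21=5, n22=35))  # 15.0
```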
Figure 18 ROC Curve of Testing Data from 10-Fold Cross Validation with Random Forest.

Because of the possibility that online shoppers' purchasing intention has balanced categories, we use a threshold of 0.5. Figure 18 shows that the area under the ROC curve, which can be referred to as the AUC value, is 79.77%. Compared to the results of 5-fold cross validation, the accuracy of the testing data with 10-fold cross validation is smaller. The accuracy value is also quite different from the classification results on the training data. This shows that the classification results are not good enough to predict online shoppers' purchasing intention, so it can be suspected that there is overfitting in the prediction models. One solution to this problem is to change the ratio of training data to testing data, but in this research no further analysis was carried out.

VI. SUGGESTION
A suggestion from this research for further work is that, to perform a classification analysis that predicts online shoppers' purchasing intention with a better accuracy value, trials can be done with classification methods other than the binary logistic regression, decision tree, and random forest analyzed in this research.

REFERENCES
[1] E. Nedlund, "Bankrupt Forever 21 is closing 200 stores," 29 October 2019. [Online]. Available: https://edition.cnn.com.
[2] J. Clement, "E-commerce worldwide - Statistics & Facts," 12 March 2019. [Online]. Available: https://www.statista.com/.
[3] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Third Edition, United States of America: Elsevier Inc., 2012.
[4] I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, United States: Elsevier Inc., 2011.
[5] M. Hubert and E. Vandervieren, "An Adjusted Boxplot for
Skewed Distributions," Computational Statistics & Data
Analysis, vol. 52, pp. 5186-5201, 2008.
[6] R. A. Johnson and G. K. Bhattacharyya, Statistics Principles and
Methods Sixth Edition, United States of America: John Wiley &
Sons, Inc., 2010.
[7] M. Kantardzic, Data Mining: Concepts, Models, Methods, and
Algorithms, Hoboken, New Jersey: John Wiley & Sons, Inc,
2011.
[8] A. Agresti, Categorical Data Analysis Third Edition, New
Jersey: John Wiley & Sons, Inc., 2013.
[9] D. W. Hosmer, Applied Logistic Regression, United States of
America: John Wiley & Sons, Inc., 2013.
[10] J. F. Hair, G. T. M. Hult, C. M. Ringle and M. Sarstedt, A
Primer on Partial Least Squares Structural Equations Modeling
(PLS- SEM), Bönningstedt: SAGE Publications, 2014.
[11] F. Gorunescu, Data Mining: Concepts, Models and Techniques,
Romania: Department of Computer Science Faculty of
Mathematics and Computer Science University of Craiova,
2011.
[12] E. B. Hunt, J. Marin and P. T. Stone, Experiments in Induction,
New York: Academic Press, 1966.
[13] R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, 2nd ed., Hoboken: Wiley Interscience, 2001.
[14] T. Hill and P. Lewicki, Statistics, methods and applications: A
comprehensive reference for science, industry, and data mining,
StatSoft, 2006.
[15] L. Breiman, Classification and regression trees, Belmont, CA:
Wadsworth, Inc., 1984.
[16] L. Breiman, "Random forests," Machine learning, vol. 45, no. 1,
pp. 5-32, 2001.
[17] R. A. Johnson and D. W. Wichern, Applied Multivariate
Statistical Analysis Sixth Edition, United States of America:
Pearson Education, Inc, 2007.
[18] C. Riccioli, D. Pérez-Marín and A. Garrido-Varo, "Identifying
animal species in NIR hyperspectral images of processed animal
proteins (PAPs): comparison of multivariate techniques,"
Chemometrics and Intelligent Laboratory Systems, 2018.
[19] N. Aggarwal and R. Agrawal, First and second order statistics
features for classification of magnetic resonance brain images,
vol. 2, J. Signal Inf. Process, 2012, p. 146.
[20] F. R. A. et al., Application of machine vision for classification of
soil aggregate size, Soil Tillage Res, 2016, pp. 8-17.
[21] C. L. Wang, L. R. Ye, Y. Zhang and D. D. Nguyen,
"Subscription to fee-based online services: What makes consumer
pay for online content?," Journal of Electronic Commerce
Research, vol. 6, no. 4, pp. 301-311, 2005.
[22] H. Lim and A. J. Dubinsky, "Consumer's perceptions of e-
shopping characteristics: An expectancy-value approach," The
Journal of Services Marketing, vol. 18, no. 6, pp. 500-513, 2004.
[23] C. Prasad and A. Aryasari, "Determinants of shopper behavior in
e-tailing: An empirical analysis," Paradigm, vol. 13, no. 1, pp.
73-83, 2009.