
Knowledge Discovery from Online Travel Agent Transaction Databases in Indonesia


Agung Bayu Aji (1806152861)
Graduate Program for Data and Quality Engineering
Industrial Engineering, Universitas Indonesia
Jakarta, Indonesia
agung.bayu@ui.ac.id

Abstract: In recent years, many service industries have widely adopted information technology (IT) to improve their operations and their communication with customers and other industries. One example is the tourism industry. Because Online Travel Agents (OTAs) hold large volumes of transactional data, they can support continuous improvement of tourism services based on technology and traveler behavior. This paper uses OTA transactional data for one point (Surabaya city) on one day in April 2019. Using the Knowledge Discovery in Databases (KDD) process, we find that locations can be associated into routes, some of which include transits. With the Amoeba algorithm, we obtain the 5 best route combinations centered on Surabaya: JKT-SUB, SUB-JR, SUB-ML, JKT-SUB-JKT, and JKT-SUB-JR.

Keywords: Amoeba algorithm, Association rule mining, Data mining, Knowledge discovery in databases, Online travel agent, Transaction

I. INTRODUCTION
Nowadays, many service industries have widely adopted information technology (IT) to improve their operations and their communication with customers and other industries. One example is the tourism industry [1], [2]. The World Tourism Organization defines tourists as people "traveling to and staying in places outside their usual environment for not more than one consecutive year for leisure, business and other purposes" [3]. Travelers who seek leisure or recreation away from the boredom of their usual environment generally want to plan and prepare their trips easily, for example by searching for accommodation and destinations on a laptop or smartphone. IT is used in five areas of the tourism industry: marketing, accommodation and booking systems, delivery of visitor experiences, customer relationships and follow-up, and the Digital Coach Program [4].

One type of tourism firm that uses IT for accommodation and booking systems is the Online Travel Agent (OTA). As technology has evolved, people have changed their behavior from face-to-face dealings with conventional travel agents to using online travel agents on their smartphones. As a consequence, travelers must enter their personal data into the OTA system to search for and book their travel and accommodation plans. OTAs therefore accumulate big data that capture the locations and accommodation plans of many travelers.

In Indonesia, there are many OTAs, such as Traveloka, Tiket.com, and Airy Rooms. They already extract knowledge from their transactional data to support continuous improvement. In this paper, we obtain 48 transaction records from only one day, one point (Surabaya city), and 2 OTAs. Since Indonesia has many OTAs and cities, the volume of transaction data grows with the number of OTAs, cities, and days covered.

Previous research reports many service innovations that OTAs have developed for their e-services based on their transactional databases, such as electronic payment schemes [5], mobile bus ticketing services [6], ticket home delivery and cash-on-delivery systems [7], and a card-based general ticket for public transport (bus and train) in Switzerland [8]. This paper aims to discover what knowledge can be obtained from one-day, one-point OTA transactional data using the KDD process.

II. LITERATURE REVIEW & METHODOLOGY
This paper uses several techniques from the Knowledge Discovery in Databases (KDD) process in data mining, as shown in Fig. 1.

Fig. 1. The Knowledge Discovery in Databases (KDD) Process [2]

A. Data Selection
First, we select the data related to the analysis task from the raw database. In this paper, we use raw data from Online Travel Agent transactions and analyze only domestic routes from/to Surabaya. We can then carry out statistical analysis using Exploratory Data Analysis (EDA).

B. Data Preprocessing
Second, we clean the selected data, for example by replacing missing data and by removing outliers, extreme values, noise, and inconsistent data.

C. Data Transformation
Third, we transform the cleaned data into a form suitable for mining. In this paper, we use Principal Component Analysis (PCA) to reduce the dimension of the selected data.

D. Data Mining
Next, after the data have been processed, we carry out the data mining process using a classification, clustering, or association rule algorithm.
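The PCA step named in subsection C can be illustrated with a short sketch. Assuming the selected fields have already been coded as numbers, the plain-Python code below standardizes each field, builds the correlation matrix, and extracts its eigenvalues with a small Jacobi routine; components with an eigenvalue above 1 are then counted, which is the retention rule applied later in Section III. The records, the function names, and the Jacobi solver are illustrative assumptions, not the Minitab procedure used in this paper.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """Z-score a field: subtract its mean, divide by its sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

def correlation_matrix(z_columns):
    """Correlation matrix of fields that are already standardized."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in z_columns]
            for ci in z_columns]

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical numeric-coded records (rows) for four fields (columns);
# the last field is constant, like a date column holding a single value.
records = [
    [1, 2, 10, 19],
    [2, 4, 9, 19],
    [3, 6, 7, 19],
    [4, 8, 8, 19],
    [5, 9, 3, 19],
    [6, 13, 2, 19],
]
columns = [list(col) for col in zip(*records)]
usable = [col for col in columns if stdev(col) > 0]  # drop constant fields
z_columns = [standardize(col) for col in usable]
eigenvalues = jacobi_eigenvalues(correlation_matrix(z_columns))
n_components = sum(1 for ev in eigenvalues if ev > 1.0)
print(eigenvalues, n_components)
```

A constant field has zero variance and cannot be standardized, so it is dropped first. In practice a numerical library would replace the hand-written eigenvalue routine; it is written out here only to keep the sketch free of dependencies.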



E. Interpretation / Evaluation
Last, after the data mining process, we interpret or evaluate the result and obtain new knowledge from the data.

III. DATA MINING PROCESS
After obtaining transaction data from 2 (two) different OTAs, we start the knowledge discovery process as follows:

A. Data Selection
We combined the raw transaction data from the 2 (two) OTAs into 4 (four) tables. The relationship between these tables is shown in Fig. 2.

Fig. 2. Transactional Data Relationship Table

Once the relationship structure of the data is known, we combine the tables into 1 table and select only the domestic routes from/to Surabaya to develop the model. This gives 48 transaction records for one day, 19 April 2019, with the fields below:

TABLE I. LIST OF FIELDS

Field Number   Field Name         Note
1              Transaction_ID     Primary Key
2              Transaction_Date
3              User_ID
4              Pax_Name
5              Pax_Sex
6              Pax_Age
7              Product_Modal
8              Product_Brand
9              Origin
10             Destination
11             Travel_Date
12             Payment_Method
13             Payment_Bank
14             OTA_Code

We hide the “Price” field for data confidentiality reasons. This step yields a dataset with 14 fields and 48 records (48x14) that can be processed in the next step.

B. Data Preprocessing
Next, we clean and prepare the data before the data mining process. This step involves several activities:
1) Replacing missing data : there are no missing data, because the records are generated automatically by a system in which every field is mandatory.
2) Removing outliers and extreme values : we did not remove outliers from the data for this analysis.
3) Repairing noise and inconsistent data : because the dataset combines 2 OTA data sources and many points, we need to unify differing labels, such as:
- The two phrases “airline” and “flight” are replaced by “flight”.
- The three location codes for Surabaya, SUB (Surabaya Juanda Airport), SGU (Surabaya Gubeng Railway Station), and BUN (Purabaya Bus Terminal), are replaced by “SUB”.
- Differing airline names are unified; for example, the first OTA states “Sriwijaya” and the other states “Sriwijaya Air”, so all are replaced by “Sriwijaya”.
After cleaning the data and before the data mining process, we first analyze the data to understand it, using statistical analysis or the Exploratory Data Analysis (EDA) method. This gives the following information:
1) The dataset contains 48 transaction records with 14 fields.
2) The properties of each field are as follows:
a) Transaction_ID : Numeric data, primary key.
b) Transaction_Date : Date type data; all records have the same date, 19-04-2019.
c) User_ID : Numeric data, summarized in a histogram. There are 13 distinct users, and user number 1002512137 has the highest number of transactions: 15.
d) Pax_Name : Text data. The number of people making transactions on 19 April 2019 is 26.
e) Pax_Sex : Binary text data. M for Male (13 persons) and F for Female (13 persons).
f) Pax_Age : Binary text data. Adult (23 persons) and Infant (3 children).
g) Product_Modal : Text data, summarized in a histogram. The highest count is flight, with 28 transactions.
h) Product_Brand : Text data, summarized in a histogram. The highest count is Garuda, with 9 transactions.
i) Origin and Destination : Text data showing the route of each traveler. From these fields, 35 of the 48 records are connecting travel, and 22 of them are connecting flights (22 of 48, about 46%).
j) Travel_Date : Date type data.
k) Payment_Method : Text data, summarized in a histogram. The highest count is Transfer, with 23 transactions.
l) Payment_Bank : Text data, summarized in a histogram. The highest count is Mandiri, with 16 transactions.
m) OTA_Code : Numeric code identifying the data source. There are 33 transaction records from OTA 1 and 15 transaction records from OTA 2.

C. Data Transformation
In the next step, we transform the data using PCA. First, we normalize the data using this equation:

\[ z = \frac{x - \bar{x}}{s} \tag{1} \]

where: $x$ = vector n × 1, with n the number of records
$\bar{x}$ = mean of $x$
$s$ = standard deviation of $x$
$z$ = normalized vector n × 1

After normalizing the data, we compute the eigenvalues to determine the number of principal components to extract:

Fig. 3. Eigenvalue (Minitab output)

Based on the Minitab output for the eigenvalue analysis, the first four components have eigenvalues greater than 1. So, we reduce the 14 fields to 4 principal components. The principal component matrix for the 4 retained components is shown in Fig. 4.

Fig. 4. Principal Component Matrix (Minitab output)

Based on the principal component matrix in Fig. 4, the field Transaction_Date is excluded from the reduced data because it is flat (all records are dated 19 April 2019). We name the new transformed variables as follows:
PC1 : Interaction of Product modal vs Payment
PC2 : Interaction of Pax vs Product
PC3 : Interaction of Pax vs Origin-Destination vs Payment
PC4 : Interaction of Pax vs Origin-Destination vs Product

D. Data Mining Process (Association Rule)
After obtaining the transformed data, we proceed to the data mining process. In this research, we want to know the sequence of connecting travel points (origin and destination). We therefore apply association rule mining using the Amoeba algorithm. The pseudocode of the Amoeba algorithm is given in Fig. 5.
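Association rules over routes are scored by support and confidence, the two measures reported for each route in Table II. As a minimal sketch, assuming each booking is stored as an ordered tuple of stops, the brute-force Python below computes both measures for a candidate route. The itineraries, the function names, and the choice of the route prefix as the rule antecedent are illustrative assumptions; this is not the Amoeba algorithm, and it does not reproduce the MATLAB output of this paper.

```python
def contains(itinerary, route):
    """True if `route` occurs as a contiguous run of stops inside `itinerary`."""
    n = len(route)
    return any(tuple(itinerary[i:i + n]) == tuple(route)
               for i in range(len(itinerary) - n + 1))

def support(itineraries, route):
    """Share of itineraries that contain the route."""
    return sum(contains(it, route) for it in itineraries) / len(itineraries)

def confidence(itineraries, route):
    """support(route) divided by support(route without its last stop)."""
    antecedent = support(itineraries, route[:-1])
    return support(itineraries, route) / antecedent if antecedent else 0.0

# Hypothetical itineraries, one per booking (not the paper's data)
itineraries = [
    ("JKT", "SUB"),
    ("JKT", "SUB", "JKT"),
    ("JKT", "SUB", "JR"),
    ("SUB", "JR"),
    ("SUB", "MLA"),
    ("JKT", "SUB", "JR"),
    ("SUB", "JKT"),
    ("JKT", "SUB"),
]

# Thresholds of 0.1, matching the values stated for the Amoeba run
min_support = min_confidence = 0.1
candidates = [("JKT", "SUB"), ("SUB", "JR"), ("JKT", "SUB", "JR"), ("JKT", "SUB", "JKT")]
for route in candidates:
    s, c = support(itineraries, route), confidence(itineraries, route)
    if s >= min_support and c >= min_confidence:
        print("-".join(route), round(s, 3), round(c, 3))
```

Treating a route as a contiguous, ordered run of stops keeps JKT-SUB and SUB-JKT distinct, which matters when the goal is the sequence of connecting travel points rather than an unordered set of cities.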


Fig. 5. Amoeba algorithm pseudocode

The Amoeba algorithm was introduced by Tirumalasetty and Edara in 2015. It is inspired by amoeba movement and is used to discover association rules. An amoeba is a unicellular organism that has no definite shape and belongs to the phylum Protozoa. It moves using pseudopodia, or "false feet", in no specific direction [9]. The Amoeba algorithm works on two main principles:
- Functional dependency: an attribute value in a data set determines, or is determined by, another attribute value.
- Probability: the probability that an attribute value is determined by another attribute value.

Fig. 6. Amoeba movement to right and left

Using the Amoeba algorithm for association rule mining with minimum support and confidence of 0.1, we get the following results:

TABLE II. AMOEBA ALGORITHM OUTPUT (MATLAB)

City 1   City 2   City 3   Support   Confidence
JKT      SUB               0.188     1
SUB      JKT               0.208     1
SUB      JR                0.146     1
SUB      MLA               0.125     1
JKT      SUB      JKT      0.167     0.421
JKT      SUB      JR       0.208     0.833

The algorithm was run in Matlab R2015a, with a computation time of 0.9118 seconds.

E. Interpretation / Evaluation
Last, after the data mining process, we interpret or evaluate the result and obtain new knowledge from the data. After applying the Amoeba algorithm for association rule mining, we find that for the 19 April 2019 transactions there are 5 favorite routes centered on Surabaya: JKT-SUB, SUB-JR, SUB-ML, JKT-SUB-JKT, and JKT-SUB-JR.

IV. RESULT & CONCLUSION

This study shows that information can be extracted from day-to-day OTA transaction data. From knowledge discovery on the 19 April 2019 transactional data, we obtain 5 favorite routes centered on Surabaya: JKT-SUB, SUB-JR, SUB-ML, JKT-SUB-JKT, and JKT-SUB-JR. Future research could strengthen this conclusion by using larger transaction data covering many days and many center points, for example monthly transactional data for airports in Indonesia.

ACKNOWLEDGMENT

This paper is submitted as a prerequisite for completing the Data Mining course in the 2nd semester of the Post Graduate Program in Data and Quality Engineering. I would like to thank both lecturers, Mrs. Prof. Ir. Isti Surjandari Prajitno, MT, MA, Ph.D. and Mr. Zulkarnain, S.T., M.T., D.Sc.(Tech.), and also my colleagues from the Post Graduate Program for Data and Quality Engineering, Industrial Engineering, Universitas Indonesia, who provided insight and expertise that greatly assisted this paper.

REFERENCES

[1] J. A. Fitzsimmons and M. J. Fitzsimmons, Service Management: Operations, Strategy, Information Technology, New York: The McGraw-Hill Companies, 2011.
[2] P. Juwattanasamran, S. Supattranuwong and S. Sinthupinyo, "Applying Data Mining to Analyze Travel Pattern in Searching Travel Destination Choices," The International Journal of Engineering and Sciences (IJES), vol. 2, no. 4, pp. 38-44, 2013.
[3] "UNWTO technical manual: Collection of Tourism Expenditure Statistics (PDF)," 1995. [Online]. Available: http://pub.unwto.org/WebRoot/Store/Shops/Infoshop/Products/1034/1034-1.pdf. [Accessed 5 June 2019].
[4] V. S. Jadhav and S. D. Mundhe, "Information technology in Tourism," International Journal of Computer Science and Information Technologies (IJCSIT), vol. 2, no. 6, pp. 2822-2825, 2011.
[5] V. Majstrovic, "E-Ticketing Systems in Culture and Tourism: Experience in Croatia," Recent advances in economics, management & marketing, pp. 59-63, 2013.
[6] A. M. A. Akouni, "Mobile based applications for bus ticketing services," College of Arts and Sciences Universiti of Malaysia, 2009.
[7] M. S. Jalil, "E-Service Innovation: A Case Study of Shohoz.com," Procedia: Social and Behavioral Sciences, pp. 531-539, 2016.
[8] A. Wittmer and B. Riegler, "Purchasing a general ticket for public transport – A means end approach," Travel Behaviour and Society, vol. 1, pp. 106-112, 2014.
[9] S. Tirumalasetty and S. R. Edara, "A New Algorithm in Association Mining, Amoeba for Finding Frequent Patterns Using Functional Dependency and Probability," Procedia Computer Science, pp. 31-36, 2015.
