
Big Data Implementation Using Machine Learning on Question and Answer Platform
(case study ngepo.com)

Sridhatu Sridhatu
Information Systems Management Department,
BGP-Master of Information Systems Management
Bina Nusantara University
Jakarta, Indonesia 11480
sridhatuyoshi@binus.edu

Gunawan Wang
Information Systems Management Department,
BGP-Master of Information Systems Management
Bina Nusantara University
Jakarta, Indonesia 11480
gwang@binus.edu

Abstract— Ngepo.com is a web-based question and answer platform created by an Indonesian start-up and accessible to people from all walks of life. Users of the website can post questions and broaden their knowledge. The system automatically detects which topics match a submitted question, which is then answered by an expert in that field. The website keeps growing as more people join. Because users and data increase every day, the ngepo.com system must be able to process, store, and analyze large amounts of data. Big data refers to technologies and initiatives that involve data that is so diverse, rapidly changing, or large that conventional technology, expertise, and infrastructure cannot handle it effectively. Therefore, a big data implementation using machine learning is required on the question and answer platform to make the website easier and more effective to use, and to analyze past data in order to generate ideas for the present and the future.

Keywords— information and communication technology, big data, machine learning, web API

I. INTRODUCTION

The fast growth of internet technology and online business has had a tremendous impact on our lives. A study by Google and Temasek found that, as a result of the growth of internet technology in Southeast Asia, and especially in Indonesia, the online economy is growing faster than predicted [1]. Both companies estimate that the number of internet users in Indonesia will increase by 19% each year from 2015 to 2020. This suggests that online information is becoming easier and faster to obtain. The fast growth of the Internet has also changed the paradigm of product transactions and lifestyles. The change in lifestyle has influenced how the young generation uses social media. Currently, interaction among young people is dominated by social media.

Several college students recognized this widespread use of social media among young people and created a startup, known as Ngepo.com, in early January 2018. The site claims fast growth in followers.

Ngepo.com is a website that develops an intelligent system in question and answer form. When someone asks a question on ngepo.com, the system displays the question only to people who are interested in it or able to answer it. Indirectly, the system brings together a person's problem and a problem solver. Everyone's basic knowledge differs: some people like science and are less interested in art, or vice versa. Ngepo.com helps users share their knowledge with other users who have difficulty in areas they have not mastered. One obstacle is that users are reluctant to enter the desired topic and then choose categories. In a system that still relies on manual input, such as the Yahoo Answers website, the user must enter the question and is then also asked to enter its topic. By utilizing big data, the topic can be generated automatically instead of being chosen by the user.

Less than a year after its launch, ngepo.com already has 140 active users spread across Indonesian cities, and it is expected to grow consistently across various races and generations. Frequent database and infrastructure updates are needed to keep up with this increasing trend. Big data and advanced machine learning become necessary tools to support this growth and the needs of users. This article examines the role of both technologies in increasing the accuracy of the system and user satisfaction.

II. LITERATURE REVIEW

A. Advanced Technology Growth
Recent advances in technology enhance people's ability to solve their problems. Information processing plays a growing role in identifying, manipulating, and organizing or structuring groups of data and converting them into valuable information for users [6]. That information is expected to be more accurate and available in real time to all stakeholders, for example for personal use, business, government, and strategic decision making.

978-1-5386-5821-5/18/$31.00 ©2018 IEEE 3-5 September 2018, Bina Nusantara University, Jakarta, Indonesia
2018 International Conference on Information Management and Technology (ICIMTech)
Page 184
The essence of information processing can be defined as a device-to-device technology consisting of hardware, software, processes, and systems that assist the communication process. It supports effective communication with a timely response. Information and communication technology (ICT) is defined as electronic devices consisting of hardware and software, together with all activities related to the processing, manipulation, management, and transfer of information between media [6].

Current ICT trends include the introduction of big data and machine learning, combined with internet technology. The growth of mobile devices drives the exponential growth of big data and machine learning. Scholars have noted that the use of mobile devices and internet technology has significantly increased both the volume and the variety of data in cyberspace [5]. This includes text, images or photos, video, and other data types that computerized systems can identify.

Big data provides a timely solution: its emergence offers an effective way to manage data growth that, over time, has exceeded the capacity of storage media and database systems [4].

Fig. 1. Characteristics of Big Data [17].

B. Big Data
Figure 1 summarizes the three characteristics of big data:
(1) Volume: the volume of data continues to increase over time. Many factors rapidly increase data volumes, including the fact that almost all business transactions involve data, the growing amount of unstructured data flowing from social media, and the increasing amount of data generated by machines and mobile devices. The data size determines its value and potential insight, and whether it can actually be considered big data [17].
(2) Velocity: data flows at an unprecedented pace and must be handled in a timely manner. The use of RFID, sensor devices, and smart measuring devices in business activities drives the need to handle such large data flows in real time. Reacting quickly to such swift data flows is a challenge that organizations must address. Knowing the type and nature of the data helps the people who analyze it use the in-depth results effectively [17].
(3) Variety: the data has various formats and types and exists as both structured and unstructured data. Information is generated from the business applications used by organizations and involves structured or unstructured data. Managing and combining data of various formats and types is a challenge that organizations must answer. To generate knowledge from large amounts of data, it is also necessary to connect all data of various types and formats [17].

C. Apache Hadoop
Hadoop is a Java-based, open source software framework that processes large distributed data sets and runs on a cluster of multiple connected computers (parallel computing). Hadoop can process data volumes up to petabytes (1 petabyte = 1024^5 bytes) and run over hundreds or even thousands of computers. Hadoop was created by Doug Cutting and was originally a subproject of Nutch, which is used for search engines. Hadoop is open source and is under the banner of the Apache Software Foundation [18].
Core Hadoop has two main components: (1) Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage; a reliable, redundant, distributed file system optimized for large files; (2) MapReduce: fault-tolerant distributed processing; a programming model for processing sets of data by mapping inputs to outputs and reducing the output of multiple mappers to one (or a few) answers [19].

D. Machine Learning
Machine learning is a discipline of computer science that studies how to make computers or machines intelligent. To create an intelligent system, the computer system must have the capability to learn. In other words, machine learning is the scientific field concerned with how computers or machines learn to be smart [4].
In general, the learning process in machine learning is divided into six types: (1) supervised learning: generates functions that map inputs to outputs; (2) unsupervised learning: automatically models inputs without guidance; (3) semi-supervised learning: combines supervised and unsupervised learning; (4) reinforcement learning: learns a policy on how to act based on observations of the environment; (5) transduction: predicts new outputs based on the training inputs, training outputs, and test inputs available during the training (learning) process; (6) learning to learn: learns its own inductive bias based on experience.
Machine learning optimizes a performance criterion or grouping by using sample data or past experience. A model is defined by several parameters, and learning is carried out by running computer programs that optimize the model parameters using training data or past experience. Machine learning uses statistical theory to construct mathematical models, because the core task is to draw conclusions from a sample.
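As a rough illustration of supervised learning as "optimizing model parameters using training data", the sketch below fits the two parameters of a straight line to a handful of input/output pairs by ordinary least squares. It is a generic textbook example, not part of the ngepo.com system; the function name `fit_line` and the toy data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and co-deviation of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy training data (assumed for illustration): outputs follow y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(train_x, train_y)


def predict(x):
    """Map a new input to an output using the learned parameters."""
    return slope * x + intercept


print(slope, intercept, predict(5.0))
```

The learned function maps unseen inputs to outputs, which is the sense in which supervised learning "generates functions that map inputs to outputs".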

E. Latent Semantic Analysis
LSA is an automated mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. It is not a traditional natural language processing or artificial intelligence program; LSA uses no humanly constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, morphologies, or the like, and takes as its input only raw text parsed into words defined as unique character strings. It grew out of the idea that syntax and style alone are not sufficient to assess an essay [20].

III. RESEARCH METHOD

Widyahapsari [1] defines the research method as "a scientific way of obtaining data with a specific purpose and usefulness". She further describes methodological research as the means that regulate research procedures in general, as well as the implementation of science in particular. The use of a scientific procedure is expected to meet the research objectives.
This article applies a qualitative approach with the objective of exploring the data more comprehensively. Data collection is not limited to numerical data but also draws on multiple sources such as interviews, focus group discussions, observation, literature study, and other relevant articles from well-known sources such as IEEE, the Association for Information Systems (AIS), and the Association for Computing Machinery (ACM).
The outcome of the study is expected to illustrate how big data and machine learning can be applied to the common question and answer needs raised by users. The use of big data and machine learning is also expected to address users' needs and improve the user experience in accessing ngepo.com.

IV. DISCUSSION

Fig. 2. Ngepo.com.

The ngepo.com site is an interactive website that allows users to ask and answer questions. Current users range in age from teenagers to seniors. Any user can get validated answers through a Q&A session. Currently, the ngepo system covers a wide range of popular topics such as technology, internet, computers, games, android, lifestyle, business, opinion, health, relationships, and social life.
Before accessing the content, the user is expected to register. When registration is completed, the user can see questions and previous answers posted by other users.
Figure 3 shows the main page after the user logs in or registers. User membership is recorded in the Hadoop data system. Once logged in, the user only needs to enter the question(s) about the desired topic, and the topic is generated automatically by the system.

Fig. 3. User Login Page.

The more interested users are in the system, the more questions they tend to post and the more replies they get. A large data processing system is required to accommodate the needs of Q&A sessions. Current statistics show that the data has grown tremendously and can reach thousands of lines. The more visitors access the system, the more data is generated. The Hadoop data system is used to manage this growth of data. The implementation of big data at ngepo.com therefore provides great benefits to the system.

Fig. 4. Architecture of Ngepo.com with Hadoop.

Figure 4 illustrates the current Hadoop architecture together with the API system, which lets clients on multiple platforms access the data. The API flow starts with authentication; only after authenticating can a client obtain a token to access the data. The API acts as a bridge in front of the database that is needed in Hadoop data processing.
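The authenticate-first, token-second flow described for the API can be sketched as follows. This is only a minimal in-memory illustration under stated assumptions: the names `USERS`, `TOKENS`, `QUESTIONS`, `authenticate`, and `get_questions` are hypothetical, and the paper does not describe the actual endpoints, credential storage, or token format used by ngepo.com.

```python
import secrets

# Hypothetical in-memory stores standing in for the real user table and database.
USERS = {"alice": "s3cret"}
TOKENS = {}
QUESTIONS = [{"id": 1, "text": "How do I root an Android phone?"}]


def authenticate(username, password):
    """Step 1: check the credentials and issue a token on success."""
    if username not in USERS or USERS[username] != password:
        return None
    token = secrets.token_hex(16)  # random, hard-to-guess token
    TOKENS[token] = username
    return token


def get_questions(token):
    """Step 2: serve data only to callers that present a valid token."""
    if token not in TOKENS:
        raise PermissionError("invalid or missing token")
    return QUESTIONS


token = authenticate("alice", "s3cret")
print(get_questions(token))
```

Because every data request passes through the token check, the API layer sits between clients and the stored data, which is the "bridge" role described above.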
Hadoop data processing consists of several stages. The first stage is tokenization, which indexes the document by converting all characters to lowercase and deleting all punctuation. The process continues by identifying the unique words and their frequencies and assigning a weight to each word. See the following diagram for an illustration.
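The tokenization stage can be sketched in a few lines. The example below is a minimal sketch, assuming ASCII text and using the raw word frequency as the weight; the function names `tokenize` and `word_weights` and the sample question are illustrative and not taken from the ngepo.com code base.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, delete punctuation, and split it into words."""
    lowered = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return cleaned.split()


def word_weights(text):
    """Return each unique word with its frequency, used here as a simple weight."""
    return Counter(tokenize(text))


question = "How do I root an Android phone? Is rooting Android safe?"
weights = word_weights(question)
print(sorted(weights.items()))
```

A weighting scheme other than raw counts could be substituted in `word_weights` without changing the tokenization step.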
Figure 5 shows that all punctuation is deleted and the text is broken down into unique words. The second stage, stop-word removal, eliminates words that appear very frequently after the tokenization process.

Fig. 5. Tokenization Results.

The objective of this stage is to improve the quality of the indexing. The third stage, stemming, reduces multiple word forms to a single base word and arranges the results in the form of a matrix. It is illustrated as follows:

Fig. 6. Basic Words Matrix.

After stemming is done, word frequencies are calculated with a statistical tool. Each incoming question is compared with all the data stored in the database system. An example of an incoming question is shown in the following figure.

Fig. 7. Sample Question.

Figure 7 shows a question sent by the user together with the closest topics, which are set by the system itself without the user entering a topic manually. The system performs this self-selection with a statistical method and produces the following table.
The LSA method is used to find similarities or to group topics (topic modelling). The ngepo system calculates the proximity of content A to other content, where the content is the question. Content A is the newly submitted content, which has no topic yet, and it is compared with past content that already has a topic. The resulting output is a score, for example A with B: 0.3737, A with C: 0.888, and so on. The higher the score, the closer or more similar the content is considered to be. If the content is the same, it is considered to have the same topic as well. From the results of this calculation, the system automatically determines which topic the new content belongs to.

Fig. 8. Self-selection with statistical method.

Figure 8 shows the machine learning task, which examines pre-existing data using a certain method and outputs a score for every existing document. The five documents with the highest scores are taken, and from those five documents the top three topics are then selected. One document can have more than one topic. Once the closest topics are found, they are entered into the database automatically. The implementation of big data in this system therefore has a large impact, making the website more helpful and effective for the users who access it.
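The scoring and topic selection steps can be sketched as follows. To keep the example self-contained, it scores documents with plain cosine similarity over raw term counts rather than a full LSA decomposition, which would additionally project the vectors into a reduced space via singular value decomposition. The rule for ranking topics (summing the scores of the top documents that carry each topic), the function names, and the sample data are all assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_topics(new_terms, past, top_docs=5, top_topics=3):
    """Score past questions against the new one and pick the closest topics."""
    new_vec = Counter(new_terms)
    scored = [(cosine(new_vec, Counter(terms)), topics) for terms, topics in past]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # Assumed ranking rule: add up the scores of the top documents per topic.
    totals = defaultdict(float)
    for score, topics in scored[:top_docs]:
        if score <= 0:
            continue  # ignore documents that share no terms with the question
        for topic in topics:
            totals[topic] += score

    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:top_topics]]


# Hypothetical past questions: (preprocessed terms, topics already assigned).
past_questions = [
    (["root", "android", "phone"], ["android", "technology"]),
    (["install", "game", "android"], ["games", "android"]),
    (["healthy", "diet", "plan"], ["health"]),
    (["start", "online", "business"], ["business"]),
    (["laptop", "battery", "drain"], ["computers", "technology"]),
    (["phone", "battery", "drain"], ["technology"]),
]

suggested = suggest_topics(["android", "phone", "battery"], past_questions)
print(suggested)
```

Because one past question can carry several topics, a topic that appears in many high-scoring documents rises to the top even if no single document is a perfect match.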
V. CONCLUSION
Implementing big data provides many benefits in the field of technology, one of which is seen in the question and answer website ngepo.com. The website serves users from all walks of life who like to interact and are curious to find solutions or answers to a variety of simple everyday problems. Ngepo.com is a question and answer portal that offers social media features and currently popular topics. Big data implementation using machine learning provides an effective solution for this Indonesian startup: the site is easy to access from social media, and users do not need to select the topic of a question manually because the system does so automatically. Furthermore, big data makes it possible to analyze existing historical data to obtain ideas that can be applied now and in the future to improve the convenience and satisfaction of system users.
REFERENCES
[1] T. Widyahapsari, “Pengaruh penguasaan kosakata terhadap
kemampuan menyimak bahasa Jepang Universitas Pendidikan
Indonesia | repository.upi.edu | perpustakaan.upi.edu,” pp. 42–48,
2014.
[2] H. Toba, “Big Data : Menuju Evolusi Era Informasi Big Data :
Menuju Evolusi Era Informasi Selanjutnya,” no. October, 2015.
[3] E. Alpaydin, Introduction to Machine Learning, Second Edition.
[4] G. Bollmer, “Big Data , Small Media,” Cult. Stud. Rev., vol. 20, no. 2
(Sept.), pp. 266–277, 2014.
[5] B. D. Emirul, “빅 데이터 , Big Data,” vol. 46, no. 8, pp. 53–61,
2012.
[6] D. Deni, “TEKNOLOGI INFORMASI DAN KOMUNIKASI,”
Teknol. Inf. DAN Komun., pp. 1–25, 2009.
[7] R. A. Prayogo, “Pengertian Yii Framework,” Slideshare.Net, p. 1,
2013.
[8] M. Ceceng and K. Iwan, “PEMANFAATAN BIG DATA MELALUI
MEDIA SOSIAL,” pp. 1–14, 2015.
[9] D. Rizky, “Penjelasan Tentang Web API , Web Services , SOAP ,
REST.”
[10] R. Faisal, “Seri Belajar ASP . NET : Pengenalan ASP . NET Web
API,” no. December 2014, 2015.
[11] F. Megantara and H. L. H. S. Warnars, “Implementasi Big Data
untuk pencarian pattern data gudang pada PT . Bank Mandiri (
Persero ) TBK,” vol. 6, no. November 2016, 2017.
[12] A. P. Sujana, “Memanfaatkan Big Data Untuk Mendeteksi Emosi,”
Tek. Komput. Unikom, vol. 2, no. 2, pp. 1–4, 2013.
[13] B. A. Sanjaya and S. Sulistyo, “Big Data: Inkonsistensi Data Dan
Solusinya,” Semnasteknomedia Online, vol. 3, no. 1, pp. 1–2, 2015.
[14] Y. H. Partogi, A. Bhawiyuga, and A. Basuki, “Rancang Bangun
Infrastruktur Pemrosesan Big Data Menggunakan Apache Drill (
Studi Kasus : SIRCLO ),” vol. 2, no. 3, pp. 951–957, 2018.
[15] E. N. Jannah, M. Masrur, and S. Asiyah, “Penerapan
Framework Yii dalam Pembangunan Sistem Informasi Asrama Santri
Pondok Pesantren sebagai Media Pencarian Asrama Berbasis Web,”
J. Inf. Syst. Eng. Bus. Intell., vol. 1, no. 2, pp. 49–58, 2015.
[16] E. R. E. Sirait, “Implementasi Teknologi Big Data Di Lembaga
Pemerintahan Indonesia,” J. Penelit. Pos dan Inform., vol. 6, no. 2, p.
113, 2016.
[17] P. Géczy, “Big Data Characteristics,” Macrotheme Rev., vol. 3, no. 6,
pp. 94–104, 2014.
[18] Apache.org, “Welcome to Apache TM Hadoop ® !,”
Http://Hadoop.Apache.Org/, no. November 2008, pp. 2010–2013,
2013.
[19] R. Peglar, “Introduction to Analytics and Big Data,” pp. 1–47, 2012.
[20] J. Steinberger and K. Ježek, “Using Latent Semantic Analysis in Text
Summarization,” Proc. ISIM 2004, pp. 93–100, 2004.
